China already uses voice-cloning tool as OpenAI unveils Voice Engine

AI Photo:VCG

San Francisco-based OpenAI unveiled its Voice Engine tool, which can replicate people’s voices, on Friday. But in the small commodity hub of Yiwu, East China’s Zhejiang Province, people had already adopted a similar domestic artificial intelligence (AI) application as early as October 2023 to help engage with foreign traders in 36 languages.

Voice Engine, a model for creating custom voices, uses text input and a single 15-second audio sample to generate natural-sounding speech that closely resembles that of the original speaker, said the company in a statement released on Friday.

It also outlined application scenarios from some early cases, such as providing reading assistance, translating content, reaching global communities by improving essential service delivery in remote settings and helping patients recover their voices. 

Notably, when used for translation, the input text does not need to be in the user’s native language. For example, English speakers can have their voice reproduced in Spanish, French, Chinese, or other languages.

In a recent case, business owners in Yiwu adopted the domestic Chinagoods AI Smart Service Platform, which provides digital human anchors for product marketing and sales demos.

“We use Chinagoods for video translation and it can generate speech into multiple languages and reach global clients from all over the world,” Yiwu toy seller Sun Lijuan told the Global Times on Sunday. 

According to Sun, such AI technology has revolutionized the way she conducts business, facilitating communication with customers and saving costs.

With the development of generative AI, digital avatars have begun to appear frequently in short videos, customer service and other fields, and livestreaming for sales is a scenario that cannot be ignored, experts said.

In addition to AI translation, Yiwu also provides other AI functionalities, such as AI-generated online livestreaming hosts known as “digital humans,” who are capable of 24/7 online livestreaming at a significantly lower cost than real human hosts.

China’s market for digital avatars is rich and diverse, with application scenarios ranging from livestreaming sales and education to marketing, Wang Peng, an associate research fellow at the Beijing Academy of Social Sciences, told the Global Times on Sunday.

China is in a leading position in the field of digital human technology applications, Luan Qing, general manager of digital entertainment and culture business at SenseTime’s digital world group, said in a recent interview with the Global Times.

“Compared with its international counterparts, the development of China’s livestreaming and short-video industry has driven the rapid development of digital human technology, leading to an earlier and more explosive growth of digital avatar applications compared with overseas markets.”

Many enterprises and factories are achieving profitability with the help of digital human anchors, driving a large number of new job opportunities in the industry chain, and virtual technology is feeding back into the real economy, Wang said.

As of December 2023, there were at least 2,805 companies in Beijing engaged in digital human-related businesses. Among them, 217 had digital humans as their core business, with total revenue of 5.1 billion yuan ($706 million) in 2023, media outlets reported.

Virtual human industry experiencing rapid growth, with China taking a lead in technology’s commercialization

Photo: VCG

“Look forward, never look back… Let’s drive forward on the path of artificial intelligence (AI).” The speech, delivered by the “digital Tang Xiao’ou,” an avatar of SenseTime’s founder, at the Chinese AI company’s online annual meeting on March 1, brought his colleagues to tears.

Tang passed away last year. The “digital Tang,” featuring Tang in business attire, replicated Tang’s natural voice and gestures during a 9-minute video, and the avatar even imitated Tang in taking a sip of water.

The “digital Tang” also largely mimics the personality and expressions of the real Tang, who was born in Northeast China, a region that has played an important role in Chinese comedy and that gave him an inborn, unique sense of humor.

The “digital Tang” was so real that some of his colleagues thought the video had been recorded before Tang passed away. It was not until the digital avatar spoke about a Chinese movie that premiered in February this year that they realized they were communicating with their beloved friend through a “digital bridge.”

Behind the creation of “digital Tang,” as well as an array of other “digital humans,” is the fast-track development of AI technologies, especially large language models (LLMs), which have been on an unprecedented rise since the beginning of 2023.

Teaching digital avatars like ‘tutoring children’

With the approach of the Qingming Festival, a time for Chinese people to pay respects to their deceased family members, the “digital human” industry has been seeing rising demand as people crave the opportunity to “communicate” with their deceased loved ones. In addition to emotional companionship, industry insiders believe that China could also lead the world in commercializing the technology in other application scenarios such as live broadcasting and short videos.

According to SenseTime, many descendants of deceased celebrities have approached the company after watching the 9-minute footage of “digital Tang,” hoping to create digital avatars of their own deceased relatives.

“Digital avatar technology is a gem we all wish to claim in the field of AI research,” Luan Qing, general manager of digital entertainment and culture business at SenseTime’s digital world group, told the Global Times. She stressed the technological difficulty, saying the 9-minute video embodies nearly a decade of research and accumulated technology.

With regard to the standards for digital humans, Luan referred to the Turing Test, a method of determining whether a machine can demonstrate human intelligence. Achieving this level of realism in digital avatars is challenging, requiring precise reproduction of image, movement, expression and voice, as well as the ability to convey the person’s thoughts.

The process of creating the digital human begins with image training, covering clothes, movements and facial expressions, followed by voice training, according to Luan. The first step was done using SenseTime’s self-developed AI model “SenseNova,” which was launched in April 2023. The model includes the AI avatar video generation platform “SenseAvatar.”

“The second step to mimic Tang’s language style is more complicated. We select about four to five voice clips featuring Tang’s different talking styles as prompts, each was three to five seconds. Although it took some time for us to select the voice samples, the training was completed quite fast thanks to our voice large models,” Luan said.

SenseTime made an important breakthrough in its voice large models in 2023, a banner year for generative AI. The company also plans to unveil larger voice models in the first half of this year.

Luan described the whole training process as akin to teaching a child: various video clips are “fed” to the AI models so that they learn and mimic the moves. For example, the impressive move of “digital Tang” taking a sip of water was generated after footage of Tang drinking water was fed into the large AI model, with the timing of the action pre-scripted in a prompt.

Making a ‘lifelike’ digital avatar

While the technology is not yet mature enough to create a “lifelike digital avatar” with complicated interactive features, demand has already experienced “explosive growth,” some industry insiders told the Global Times.

A report from iMedia Research estimates that the core market for virtual digital humans will reach 48.06 billion yuan ($6.65 billion) by 2025, up from 12 billion yuan in 2022.

Chinese tech company 360 last year launched a group of AI-powered digital humans built on LLMs, which the company said have a “soul” that differentiates them from the traditional “repeater” form of digital humans. The “soul” is developed through LLM training that equips the digital avatar with a personality and memory, enabling it to think like a human being.

In a statement sent to the Global Times on Monday, 360 listed a wide range of application scenarios for digital humans, including news broadcasting, knowledge sharing, product marketing and serving as digital company spokespersons.

“Compared with its international counterparts, China is ‘way ahead’ in the field of digital human technology application,” Luan stated, adding that this is because the rapid development of China’s livestreaming and short-video industry has driven the progress of digital human technology, which in turn fueled an earlier and more explosive growth of digital avatar applications compared to overseas markets.

Tian Feng, dean of SenseTime Intelligent Industry Research Institute, told the Global Times that high-quality AI replication technology can be put into practical use in some specific scenarios. For example, if a scientist’s papers and speeches are integrated into an LLM, then relevant scientific research and science education can still continue after the scientist’s death.

Industry insiders also said that Chinese companies need to make more technological breakthroughs and invest more in research and development to achieve real-time interactions with AI-driven digital humans.

For example, while a person can verbally ask a digital human to walk toward them, the digital human still cannot perform small “unconscious” actions and expressions along the way, such as flinging hair back from the forehead.