Before WaveNet, speech generation had been stuck for a long time in the uncanny valley of “good enough to be understood but not good enough to sound pleasant.” Everyone is familiar with the robotic voices of the parametric speech synthesis of old. WaveNet has changed all this. First published in a research paper by DeepMind in 2016, it was launched in Google Assistant in September 2017.

(Source: https://deepmind.com/blog/article/wavenet-generative-model-raw-audio)

When Google Assistant replies to you, it uses a voice generated by WaveNet. The generated speech sounds much more pleasant now, and it has become harder to distinguish it from a real human voice. As resource costs decrease and the technology behind WaveNet advances, it will both pose risks and offer opportunities for businesses and our society alike.

This article covers predictions for the future, opportunities, and risks (in this order).

What will the future bring?

Given the current state of the art, we can think about the advances of the next 1–2 years. (Since I began working on this article, the future has already arrived in some ways; see further remarks below.)

More than just words

First, we can expect research in speech generation to take into account more than just the words. By this, I mean everything that adds to speech: pitch, speed, intonation, and stress, for example.

This differs between languages: Mandarin Chinese uses tone to distinguish between words, while English does not. In general, however, the meaning of what you say changes if you shout or whisper, or if you speak with a sad or an angry voice.

Just as research in speech recognition has been able to extract this additional information, speech generation will be able to take into account emotional state and mode of speech. For example, there is already published research by Microsoft on how to extract emotions from speech.

Automated voice transfer

More recent research by DeepMind already shows how to perform voice transfer: a recording of one voice can be transferred to another voice without any additional work. Check out these two examples from DeepMind’s voice transfer:

(Audio examples from https://avdnoord.github.io/homepage/vqvae/)

Funnily, but not surprisingly, it also keeps accents intact.

Convincing generated voices with little training data

Second, the amount of training data necessary to learn a given voice will decrease. Currently, technologies like WaveNet still need 50 hours of recordings of a single speaker to sound convincing. However, a model trained on many voices will need less data to learn and store an additional voice.

In the limit, it will be possible to use just a few minutes of someone’s voice to be able to imitate it. These recordings will need to cover all idiosyncrasies of that person’s way of speaking to be truly accurate, but nothing beyond that will be needed.

To see why this is possible, imagine someone training to become a voice actor. The first new voice will be the hardest to learn. As one gains experience by imitating more voices, picking up a new voice or accent will become easier.

The reason for this is that we learn more effective ways to remember and reproduce voices, and our internal representation of the qualities of a voice improves. For example, when you have a friend with a very particular accent, you are more likely to identify them by their accent than by other qualities of their voice. The first time you hear someone else with the same accent, you might mistake them for your friend.

As you meet more people with the same kind of accent, you will learn to differentiate better between them. (This is based on personal anecdotes with people from Leytonstone, London.)

The same applies to neural networks and deep learning. A neural network will learn a better representation of a voice as it learns more than one voice and forms a more compact model. If the network recognizes a British English speaker and it already knows what many British English speakers sound like, it will need fewer additional voice samples to learn the differences between this particular speaker and the “stereotype” of a British English speaker that it has learned before.

Putting this into context: a lot of recordings are needed to “seed” a model, and then it will be able to pick up a new voice from just a few short recordings of that voice. Companies like Google or Amazon store millions of hours of voice search requests in their logs thanks to Google Home/Assistant and Alexa, so data will not be a problem.
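To make the “shared knowledge plus small per-speaker delta” idea concrete, here is a minimal sketch in PyTorch. The model, dimensions, and data are purely illustrative assumptions, not DeepMind’s actual WaveNet code: a shared acoustic model is conditioned on a small speaker vector, and learning a new voice amounts to fitting only that vector from a few minutes of audio while the shared weights stay frozen.

```python
# A sketch (not DeepMind's code) of a shared acoustic model conditioned on a
# small per-speaker vector. Learning a new voice = fitting only that vector.
import torch
import torch.nn as nn

class SharedVocoder(nn.Module):
    """Predicts the next quantized audio sample from past samples + a speaker vector."""
    def __init__(self, emb_dim=64, hidden=256, n_classes=256):
        super().__init__()
        self.rnn = nn.GRU(input_size=1 + emb_dim, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, samples, speaker_vec):
        # samples: (batch, time, 1) past waveform; speaker_vec: (1, emb_dim)
        cond = speaker_vec.unsqueeze(1).expand(samples.size(0), samples.size(1), -1)
        h, _ = self.rnn(torch.cat([samples, cond], dim=-1))
        return self.out(h)  # (batch, time, n_classes) logits over the next sample

# Pretend this model was already pre-trained on hours of audio from many speakers.
shared = SharedVocoder()
for p in shared.parameters():
    p.requires_grad = False  # the shared "voice knowledge" stays frozen

# The new voice is just a 64-dimensional vector fitted to a few minutes of audio.
new_voice = torch.zeros(1, 64, requires_grad=True)
optimizer = torch.optim.Adam([new_voice], lr=1e-2)

# Dummy stand-in for a few minutes of the new speaker's recordings:
# (past waveform values, quantized next-sample targets).
samples = torch.randn(8, 400, 1)
targets = torch.randint(0, 256, (8, 400))

for _ in range(10):  # a handful of adaptation steps
    logits = shared(samples, new_voice)
    loss = nn.functional.cross_entropy(logits.transpose(1, 2), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of the sketch is only the division of labor: the expensive, data-hungry part is trained once on many voices, while adding a voice later touches only a tiny amount of new parameters.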

Cost reduction and “democratization”

Third, the technology will spread beyond Google and DeepMind. Speech recognition and generation as a service have been around for a while already, and other companies like Baidu are making advances as well. Hardware costs have been steadily declining, so all in all, general costs for speech generation and voice model training will go down, and it will become easier and easier to emulate a particular person’s voice in a convincing way.

Depending on how far along we are, maybe just an hour of someone’s voice will be necessary to train a voice model that can create new speech that sounds just like them, or maybe 5 minutes will already be enough to train a rough model that will fool an acquaintance but probably not a close friend.

It is important to note here that the quality of the generated voice will remain high regardless, because there will be enough other data available to fill any gaps. It will sound like a real voice and show realistic emotions; it will just not match the target voice accurately.

As with many other deep learning technologies, we can expect that what only big companies with huge data centers and lots of data can do now, an individual will soon be able to do with their computer at home, at a quality that can convince the casual observer. DeepFakes are just the most recent evidence of this, and the simplicity of automated voice transfer could do to speech generation what DeepFakes have done to face swapping in videos.

Opportunities abound

There are several industries that can benefit from improved voice generation technology like WaveNet. I want to name a few to show what could be of interest and why these ideas make more sense now.

Audiobooks

Given a pleasant enough generated voice, audiobooks that use generated speech will be just as attractive as ones recorded by regular voice actors, but cheaper.

Various levels of automation seem plausible: from fully automated text-to-speech to a human-directed production with hand-picked generated voices, assigned speech roles, and manual corrections of tone to convey specific emotions.

This will not cut into the existing market right away. Demand for high-quality voice actors will remain, but it will open the market for audiobooks that would not have been lucrative so far.

Realtime voice-to-voice translation

Imagine you are skyping with someone whose language you don’t speak. They speak with you and you hear them speak English with a short delay in the video and audio transmission. Likewise, you are recorded speaking English, but they hear you speaking their local language.

This can be achieved by training a voice model on your voice and using it together with real-time translation. Such a voice model could be trained locally on your phone by listening to your regular calls, without compromising user privacy.

Such real-time voice-to-voice translation could be particularly attractive for customer support and services that are offered only via phone.

Google has recently launched translating earbuds, so this idea is already becoming reality. With just 50 milliseconds needed for generating 1 second of speech, there is a lot of time available for speech recognition and translation with acceptable latency.
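As a sketch of how the pieces described above could fit together, here is a rough pipeline in Python. Every component is a placeholder, not a real API: speech recognition, machine translation, and a WaveNet-style synthesizer conditioned on the caller’s voice model. Only the ~50 ms synthesis figure comes from the text above; the other latency numbers are guesses for illustration.

```python
# A rough sketch of the voice-to-voice translation pipeline; every function
# here is a placeholder, not a real service or API.

def recognize(audio_chunk):
    """Speech -> text (placeholder for a speech recognition service)."""
    ...

def translate(text, target_lang):
    """Text -> text (placeholder for a machine translation service)."""
    ...

def synthesize(text, voice_model):
    """Text -> audio in the caller's own voice (placeholder for a WaveNet-style model)."""
    ...

def translate_chunk(audio_chunk, voice_model, target_lang="de"):
    text = recognize(audio_chunk)
    translated = translate(text, target_lang)
    return synthesize(translated, voice_model)

# Rough latency budget per second of speech. Only the ~50 ms synthesis figure
# comes from the article; the other numbers are illustrative guesses.
recognition_ms, translation_ms, synthesis_ms = 200, 100, 50
print(f"added delay ≈ {recognition_ms + translation_ms + synthesis_ms} ms per second of speech")
```

Even with generous guesses for recognition and translation, the overall added delay stays well below a second, which is why the idea seems feasible with today’s building blocks.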

Movie dubbing

Dubbing movies is big in non-English-speaking countries, and it does not come cheap: it costs about $100,000 per language per movie. Instead of simply using the voices of regional voice actors, one could train voice models on the original actors’ voices and make the dubbing actors sound like the originals, or use speech generation for everything, assuming we can model everything well enough automatically.

The latter might be particularly attractive for low-cost productions and could be combined with existing translation and auto-captioning services on platforms like YouTube. Channels for children or for people with limited reading ability would benefit in particular, as more content would become accessible to their audiences.

Again, in the beginning, technology like WaveNet will benefit the long tail and breadth of users before making its way into the high end as it matures.

Voice sponsorships

Celebrities often lend their voices to sponsorship deals and endorsements in radio or TV ads. This could be streamlined by using speech generation for endorsements. A celebrity would only have to sign off on their voice being used. Usually, there is already enough voice material available from interviews or other similar recordings to train a voice model. Moreover, it would be easy to translate the voice into many languages, which would increase the brand recognition of celebrities.

Perfect announcer voice

Having a voice model that can be changed easily would allow for large-scale A/B testing. One could optimize for perceived attractiveness and persuasiveness to find the perfect announcer voice for ads and infomercials, which could then be reused and licensed over and over again. (Reinforcement learning techniques might be especially useful here, as this setting comes with very sparse reward signals.)

Imagine you could target people with the voice that is most convincing to them specifically. Right now, choosing the voice for a voice over is based on generic studies and simple choices: male or female, for example. Even an improvement of one percent in the performance of ads would mean many billions of dollars in added revenue. Since this can be done at scale and in an automated fashion, it is very likely it will happen.
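A minimal sketch of how such automated voice optimization could work: treat each candidate announcer voice as an arm of a multi-armed bandit and favor whichever one converts best. The voice names and conversion rates below are made up purely for illustration.

```python
# Epsilon-greedy sketch: each candidate announcer voice is a bandit arm and we
# favor the one with the best observed conversion rate so far.
import random

voices = ["warm_low", "bright_high", "neutral", "energetic"]
plays = {v: 0 for v in voices}
wins = {v: 0 for v in voices}

def conversion_rate(v):
    return wins[v] / plays[v] if plays[v] else 0.0

def choose_voice(epsilon=0.1):
    # Mostly exploit the best voice so far, occasionally explore the others.
    if random.random() < epsilon:
        return random.choice(voices)
    return max(voices, key=conversion_rate)

def record_result(voice, converted):
    plays[voice] += 1
    wins[voice] += int(converted)

# Simulated ad impressions with invented "true" per-voice conversion rates.
true_rates = {"warm_low": 0.031, "bright_high": 0.029, "neutral": 0.027, "energetic": 0.033}
for _ in range(100_000):
    v = choose_voice()
    record_result(v, random.random() < true_rates[v])

best = max(voices, key=conversion_rate)
print(f"best voice so far: {best} ({conversion_rate(best):.3%} conversion)")
```

In practice the “arms” could be continuous voice parameters rather than a handful of presets, which is where the reinforcement learning angle mentioned above becomes relevant.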

Artificial pop-stars

Likewise, singing voices could be sampled to be used for new pop hits, or a voice model could be created for a deceased singer, and their estate could “record” new songs with their voice. Michael Jackson could release new songs, and 2Pac could return and rap about social issues and current presidents. Singers could be insured against death or loss of voice, in a similar vein to what happened with Carrie Fisher in Star Wars: The Last Jedi.

Low-bandwidth voice communication

A different avenue is using a voice model for compression. Normal audio compression formats like MP3 take into account how we hear in order to achieve better lossy compression. Phone conversations and video chats are usually purely speech-based, so voice models like WaveNet could provide much better compression.

To allow for low-bandwidth communication, each speaker could have their own voice recognition and generation model. The generative models are exchanged between calling partners at the beginning of the conversation, and the actual conversation is converted to phonemes (or a more fine-grained representation) and transmitted as such. The voice model is then used to convert the phonemes back to speech in the speaker’s voice.
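A back-of-the-envelope calculation shows how large the savings could be. The codec bitrate, phoneme rate, and prosody budget below are rough assumptions, not measured figures.

```python
# Rough comparison: bitrate of a conventional speech codec versus sending only
# a phoneme stream and re-synthesizing speech with a voice model on the other
# end. All numbers are assumptions for illustration.
codec_bps = 24_000            # a typical speech codec, ~24 kbit/s

phonemes_per_second = 15      # rough rate of conversational speech
bits_per_phoneme = 7          # enough to index ~100 phoneme symbols
prosody_bps = 200             # assumed budget for timing, pitch, and stress

phoneme_bps = phonemes_per_second * bits_per_phoneme + prosody_bps

print(f"codec stream:   {codec_bps} bit/s")
print(f"phoneme stream: {phoneme_bps} bit/s")
print(f"roughly {codec_bps / phoneme_bps:.0f}x less bandwidth,"
      " plus a one-off exchange of the voice models")
```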

This is a very technical use-case for this technology, but the data savings could be massive. Especially in areas with bad data connection, this could make a real difference. This could also be combined with optimized voice models for ads, for example, to reduce the lag of ads. Ad lag is a major issue for publishers as it is very important that sites load quickly. Otherwise, users might just switch to something else.

Moreover, the line between text and voice messaging can be blurred even more. You could listen to text messages in your friend’s voice on the one hand, and on the other hand your voice messages could be converted to text for easier reading and indexing. There would be more freedom on how to consume communication.

Risks

The opportunities are interesting and plentiful. However, many readers might already feel some unease. While the possibilities are great and are becoming reality (Google’s new earbuds), the chances for abuse also increase. As always, with great power comes great responsibility, for users, researchers, and regulators.

At the moment, the technology and know-how are, and for now will remain, mainly available to big companies and state actors. But soon, smaller groups and individuals will have access to this technology. As with the internet as a whole, abuse and fraud will become easier.

The biggest difference from before is that generated speech has now actually become hard to distinguish from real speech. Moreover, especially with the voice transfer outlined in DeepMind’s latest research, it will become very easy to impersonate another person in an automated way. As with DeepFakes, it is the democratization of this capability that should concern us. Soon enough, everybody will be able to do this without needing much technical expertise.

Being able to model someone’s voice and make them say things they might not agree with is the premise of all the risks I am going to cover. The ideas are not new, and watchers of the highly recommended series Black Mirror will probably recognize some scenarios.

Erosion of trust

The biggest threat - let’s call it a meta-threat - is a further erosion of our trust in publicly available information, and the further division of our society into separate belief systems, which will hinder discourse and compromise.

Social hacking

Pretending to be someone else to commit fraud or espionage comes to mind quickly.

Propaganda

Real fake news and fake leaks are another danger. Fake recordings could be disseminated to damage people’s reputations and to influence elections. A follow-up effect would be that people trust the news less and less.

Actual whistle-blowing could be discredited and become more difficult: autocratic governments already discredit damaging leaks as fake by default, and now they will have good arguments on their side to question their validity. Conspiracy theorists will have more weight behind their claims. Using recordings as evidence will become impossible.

Leaks could also be sprinkled with more-damaging but fake statements that will make them hard to disprove.

Fraud

It will become easier to run scams like the grandparent scam. Unprotected institutions will be defrauded through social hacking. Evidence at trials could be faked, whether to persecute political opponents in autocratic states or as personal revenge.

Bullying

How would a teen feel if they suddenly heard their own voice making false admissions, and then saw that recording being spread online?

Possible remedies

Trusted devices and trusted communication channels are needed more than ever now. Moreover, laws need to be updated to deter and prevent the abuse of someone’s voice. As with all recent technological advances, one can only hope that law enforcement and regulation will be able to keep up and find the right balance.

The big difference

Now you might wonder: why haven’t we seen any of this before with Photoshop and other CGI tools? Right now, editing photos and videos convincingly requires skill. Every instance requires custom, specialized work and it does not generalize or scale easily.

But faking someone’s voice (and someone’s image while speaking) requires only computing power, as general solutions are available and easily adaptable. As computing power becomes cheaper and more generally available, we will have to find strategies to avoid these negative effects.

We don’t want to be caught off-guard.


This article was originally published as a Medium Post on FreeCodeCamp. It is still on Medium as an unsyndicated article.