OpenVoice: Versatile Instant Voice Cloning - Approach

30 May 2024

Authors:

(1) Zengyi Qin, MIT & MyShell.ai (email: qinzy@mit.edu);

(2) Wenliang Zhao, Tsinghua University;

(3) Xumin Yu, Tsinghua University;

(4) Xin Sun, MyShell.ai.

Abstract and Introduction
Approach
Experiment
Discussion and References

2 Approach

The technical approach is simple to implement but surprisingly effective. We first present the intuition behind OpenVoice, then elaborate on the model structure and training.

2.1 Intuition

The Hard. Simultaneously cloning the tone color of any speaker, enabling flexible control over all other styles, and adding new languages with little effort is very challenging. It would require a huge combinatorial dataset in which the controlled parameters intersect, including well-labeled pairs of samples that differ in only one attribute, as well as a relatively large-capacity model to fit the dataset.

The Easy. We also notice that in regular single-speaker TTS, as long as voice cloning is not required, it is relatively easy to add control over other style parameters and to add a new language. For example, recording a single-speaker dataset of 10K short audio samples with labeled emotions and intonation is sufficient to train a single-speaker TTS model that provides control over emotion and intonation. Adding a new language or accent is also straightforward: simply include another speaker for that language or accent in the dataset.

The intuition behind OpenVoice is to decouple the instant voice cloning (IVC) task into separate subtasks, where every subtask is much easier to achieve than the coupled task. The cloning of tone color is fully decoupled from the control over all remaining style parameters and languages. We propose to use a base speaker TTS model to control the style parameters and languages, and a tone color converter to embody the reference tone color into the generated voice.

2.2 Model Structure

We illustrate the model structure in Figure 1. The two main components of OpenVoice are the base speaker TTS model and the tone color converter. The base speaker TTS model is a single-speaker or multi-speaker model that allows control over the style parameters (e.g., emotion, rhythm, pauses and intonation), accent and language. The voice generated by this model is passed to the tone color converter, which changes the tone color of the base speaker into that of the reference speaker.

Base Speaker TTS Model. The choice of the base speaker TTS model is very flexible. For example, the VITS [6] model can be modified to accept style and language embeddings in its text encoder and duration predictor. Other choices such as InstructTTS [17] can also accept style prompts. It is also possible to use commercially available (and cheap) models such as Microsoft TTS, which accepts Speech Synthesis Markup Language (SSML) that specifies the emotion, pauses and articulation. One can even skip the base speaker TTS model and read the text oneself in whatever style and language one desires. In our OpenVoice implementation, we use the VITS [6] model by default, but other choices are completely feasible. We denote the output of the base model as $X(L_I, S_I, C_I)$, where the three parameters represent the language, styles and tone color, respectively. Similarly, the speech audio from the reference speaker is denoted as $X(L_O, S_O, C_O)$.
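To make the decoupling concrete, the sketch below traces the inference path in plain Python. The names (`base_tts`, `extractor`, `converter`) are hypothetical placeholders for whichever base TTS model and converter one plugs in; this is an illustration of the framework above, not the released OpenVoice API.

```python
# Hypothetical sketch of the decoupled pipeline described above. `base_tts`,
# `extractor` and `converter` are placeholders, not the actual OpenVoice API.

def clone_voice(text, style, language, reference_audio,
                base_tts, extractor, converter):
    # 1) Base speaker TTS: controls language and style, but speaks with the
    #    base speaker's tone color C_I -> X(L_I, S_I, C_I).
    x_base = base_tts.synthesize(text, style=style, language=language)

    # 2) Tone color vectors for the base output (C_I) and the reference (C_O).
    v_ci = extractor(x_base)
    v_co = extractor(reference_audio)

    # 3) Tone color converter: swaps C_I for C_O while keeping language and
    #    style -> X(L_I, S_I, C_O).
    return converter.convert(x_base, v_ci, v_co)
```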

Tone Color Converter. The tone color converter is an encoder-decoder structure with an invertible normalizing flow [12] in the middle. The encoder is a 1D convolutional neural network that takes the short-time Fourier transform spectrum of $X(L_I, S_I, C_I)$ as input. All convolutions are single-strided. The feature maps output by the encoder are denoted as $Y(L_I, S_I, C_I)$. The tone color extractor is a simple 2D convolutional neural network that operates on the mel-spectrogram of the input voice and outputs a single feature vector that encodes the tone color information. We apply it to $X(L_I, S_I, C_I)$ to obtain the vector $v(C_I)$, and to $X(L_O, S_O, C_O)$ to obtain the vector $v(C_O)$.
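As a rough illustration of the tone color extractor described above, the following PyTorch module applies a small 2D CNN to a mel-spectrogram and mean-pools the result into a single tone color vector. The layer sizes, embedding dimension and pooling choice are our own assumptions for this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ToneColorExtractor(nn.Module):
    """Illustrative tone color extractor: a small 2D CNN over a mel-spectrogram,
    mean-pooled into one vector. Layer sizes are assumptions for this sketch."""
    def __init__(self, embedding_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, embedding_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> add a channel axis for Conv2d
        h = self.conv(mel.unsqueeze(1))   # (batch, dim, n_mels', frames')
        return h.mean(dim=(2, 3))         # one tone color vector v(C) per sample
```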

The normalizing flow layers take $Y(L_I, S_I, C_I)$ and $v(C_I)$ as input and output a feature representation $Z(L_I, S_I)$ that eliminates the tone color information but preserves all remaining style properties. The feature $Z(L_I, S_I)$ is aligned with the International Phonetic Alphabet (IPA) [1] along the time dimension. Details on how such a feature representation is learned are explained in the next section. We then apply the normalizing flow layers in the inverse direction, which take $Z(L_I, S_I)$ and $v(C_O)$ as input and output $Y(L_I, S_I, C_O)$. This is a critical step, where the tone color $C_O$ from the reference speaker is embodied into the feature maps. Then $Y(L_I, S_I, C_O)$ is decoded into the raw waveform $X(L_I, S_I, C_O)$ by HiFi-GAN [7], which contains a stack of transposed 1D convolutions. The entire model in our OpenVoice implementation is feed-forward, without any auto-regressive component. The tone color converter is conceptually similar to voice conversion [14, 11], but with a different emphasis in its functionality, the inductive bias of its model structure, and its training objectives. The flow layers in the tone color converter are structurally similar to those of flow-based TTS methods [6, 5], but with different functionalities and training objectives.
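A minimal sketch of this forward/inverse flow procedure is shown below, assuming a conditional invertible flow exposed through a `reverse` flag (a common interface in flow-based TTS code); the function signatures and names are illustrative, not the actual OpenVoice implementation.

```python
import torch

@torch.no_grad()
def convert_tone_color(spec_in, v_ci, v_co, encoder, flow, decoder):
    """Illustrative inference step: remove the base tone color with the flow,
    then run the flow in reverse conditioned on the reference tone color.
    The `reverse` keyword mimics common flow-based TTS interfaces; it is an
    assumption, not the actual OpenVoice signature."""
    y_in = encoder(spec_in)                   # Y(L_I, S_I, C_I) from the STFT spectrum
    z = flow(y_in, cond=v_ci, reverse=False)  # forward: strip tone color -> Z(L_I, S_I)
    y_out = flow(z, cond=v_co, reverse=True)  # inverse: embody reference tone color C_O
    return decoder(y_out)                     # HiFi-GAN-style decoder -> raw waveform
```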

Alternative Ways and Drawbacks. Although there are alternative ways [4, 9, 14] to extract $Z(L_I, S_I)$, we empirically found that the proposed approach achieves the best audio quality. One can use HuBERT [4] to extract discrete or continuous acoustic units [14] to eliminate tone color information, but we found that such methods also eliminate emotion and accent from the input speech. When the input is an unseen language, this type of method also has issues preserving the natural pronunciation of the phonemes. We also studied another approach [9] that carefully constructs an information bottleneck to preserve only the speech content, but we observed that it is unable to completely eliminate the tone color.

Remark on Novelty. OpenVoice does not intend to invent the submodules in the model structure. Both the base speaker TTS model and the tone color converter borrow their model structure from existing work [5, 6]. The contribution of OpenVoice is the decoupled framework that separates voice style and language control from tone color cloning. This is very simple, but very effective, especially when one wants to control styles and accents or generalize to new languages. Achieving the same control with a coupled framework such as XTTS [3] would require a tremendous amount of data and computing, and it is relatively hard for such a model to fluently speak every language. In OpenVoice, as long as the single-speaker TTS speaks fluently, the cloned voice will be fluent. Decoupling the generation of voice styles and language from the generation of tone color is the core philosophy of OpenVoice. We also provide our insights on using flow layers in the tone color converter, and on the importance of choosing a universal phoneme system for language generalization, in the experiment section.

2.3 Training

In order to train the base speaker TTS model, we collected audio samples from two English speakers (with American and British accents), one Chinese speaker and one Japanese speaker. There are 30K sentences in total, and the average sentence length is 7s. The English and Chinese data have emotion classification labels. We modified the VITS [6] model to input the emotion categorical embedding, language categorical embedding and speaker ID into the text encoder, duration predictor and flow layers. The training follows the standard procedure provided by the authors of VITS [6]. The trained model is able to change the accent and language by switching between different base speakers, and to read the input text with different emotions. We also experimented with additional training data and confirmed that rhythm, pauses and intonation can be learned in exactly the same way as emotions.
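To illustrate the kind of modification described above, the sketch below adds learnable emotion, language and speaker embeddings to the phoneme embeddings before they enter a VITS-style text encoder. The embedding dimension and the injection point are assumptions made for this sketch; the actual modification in OpenVoice may differ in detail.

```python
import torch
import torch.nn as nn

class ConditionedPhonemeEmbedding(nn.Module):
    """Illustrative conditioning scheme: add learnable emotion, language and
    speaker embeddings to the phoneme embeddings before a VITS-style text
    encoder. Dimensions and the injection point are assumptions."""
    def __init__(self, n_phonemes, n_emotions, n_languages, n_speakers, dim=192):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)
        self.emotion_emb = nn.Embedding(n_emotions, dim)
        self.language_emb = nn.Embedding(n_languages, dim)
        self.speaker_emb = nn.Embedding(n_speakers, dim)

    def forward(self, phoneme_ids, emotion_id, language_id, speaker_id):
        x = self.phoneme_emb(phoneme_ids)             # (batch, length, dim)
        cond = (self.emotion_emb(emotion_id)
                + self.language_emb(language_id)
                + self.speaker_emb(speaker_id))       # (batch, dim)
        return x + cond.unsqueeze(1)                  # broadcast over time steps
```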

In order to train the tone color converter, we collected 300K audio samples from 20K individuals. Around 180K samples are English, 60K are Chinese and 60K are Japanese. This is what we call the MSML dataset. The training objectives of the tone color converter are two-fold. First, we require the encoder-decoder to produce natural sound. During training, we feed the encoder output directly to the decoder and supervise the generated waveform using the original waveform with a mel-spectrogram loss and the HiFi-GAN [7] loss. We do not detail this here, as it has been well explained in previous literature [7, 6].
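As a hedged sketch of the first objective, the snippet below computes an L1 mel-spectrogram loss between the decoded and original waveforms. The adversarial and feature-matching terms of the full HiFi-GAN objective are omitted, `mel_fn` is any waveform-to-mel-spectrogram transform, and the weight of 45.0 follows common HiFi-GAN practice rather than a value reported in this paper.

```python
import torch.nn.functional as F

def encoder_decoder_loss(wav_pred, wav_true, mel_fn, lambda_mel=45.0):
    """Sketch of the first objective: an L1 mel-spectrogram loss between the
    decoded waveform and the original. The adversarial and feature-matching
    terms of HiFi-GAN are omitted; the weight is an assumption from common
    HiFi-GAN practice."""
    return lambda_mel * F.l1_loss(mel_fn(wav_pred), mel_fn(wav_true))
```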

Second, we require the flow layers to eliminate as much tone color information as possible from the audio features. During training, for each audio sample, its text is converted to a sequence of phonemes in IPA [1], and each phoneme is represented by a learnable vector embedding. The sequence of vector embeddings is passed to a transformer [15] encoder to produce the feature representation of the text content. Denote this feature as $L \in \mathbb{R}^{c \times l}$, where $c$ is the number of feature channels and $l$ is the number of phonemes in the input text. The audio waveform is processed by the encoder and flow layers to produce the feature representation $Z \in \mathbb{R}^{c \times t}$, where $t$ is the length of the features along the time dimension. We then align $L$ with $Z$ along the time dimension using dynamic time warping [13, 10] (an alternative is monotonic alignment [5, 6]) to produce $\bar{L} \in \mathbb{R}^{c \times t}$, and minimize the KL-divergence between $\bar{L}$ and $Z$. Since $\bar{L}$ does not contain any tone color information, this objective encourages the flow layers to remove tone color information from their output $Z$. The flow layers are conditioned on the tone color information from the tone color extractor, which further helps them identify what information needs to be eliminated. In addition, we do not provide any style or language information for the flow layers to condition on, which prevents them from eliminating information other than tone color. Since the flow layers are invertible, conditioning them on a new piece of tone color information and running the inverse process adds the new tone color back to the feature representations, which are then decoded into the raw waveform with the new tone color embodied.
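The snippet below sketches this second objective under the assumption, borrowed from VITS-style flow models, that the DTW-aligned text features parameterize a Gaussian prior over $Z$. The alignment is taken as given, and the flow log-determinant and constant terms are omitted; all tensor names are illustrative rather than taken from the OpenVoice code.

```python
import torch

def tone_color_removal_loss(z, text_mean, text_logstd, alignment):
    """Sketch of the second objective, assuming the DTW-aligned text features
    parameterize a Gaussian prior over Z (as in VITS-style models). Shapes:
    z (batch, c, t); text_mean, text_logstd (batch, c, l); alignment (batch, t)
    holds the phoneme index assigned to each audio frame by the DTW path."""
    idx = alignment.unsqueeze(1).expand(-1, z.size(1), -1)   # (batch, c, t)
    m = torch.gather(text_mean, 2, idx)                      # aligned mean, i.e. L_bar
    logs = torch.gather(text_logstd, 2, idx)                 # aligned log-std
    # Negative Gaussian log-likelihood of Z under the aligned text prior
    # (constants and the flow log-determinant omitted); minimizing it pulls Z
    # toward the tone-color-free text representation.
    return torch.mean(logs + 0.5 * (z - m) ** 2 * torch.exp(-2.0 * logs))
```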

This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.