Voice Conversion

Many-to-One, One-to-Many, Many-to-Many, and Zero-shot Voice Conversion.

This tutorial is available as an IPython notebook at malaya-speech/example/voice-conversion.

This module is language independent, so it is safe to use on different languages.

Explanation

We built a fast Voice Conversion model called FastVC (Faster and Accurate Voice Conversion using Transformer). No paper was published for it.

The steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/voice-conversion

[1]:
import malaya_speech
import numpy as np

List available Voice Conversion models

[2]:
malaya_speech.voice_conversion.available_deep_conversion()
[2]:
                     Size (MB)  Quantized Size (MB)  Total loss
fastvc-32-vggvox-v2      190.0                 54.1      0.2851
fastvc-64-vggvox-v2      194.0                 55.7      0.2764
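
To put the table in perspective, the size reduction from quantization can be computed directly from the listed numbers. This is a small sketch; the dictionary simply copies the values shown above, and the helper name `size_reduction` is hypothetical.

```python
# Sizes in MB copied from the table above: (full precision, quantized).
MODEL_SIZES_MB = {
    'fastvc-32-vggvox-v2': (190.0, 54.1),
    'fastvc-64-vggvox-v2': (194.0, 55.7),
}


def size_reduction(full_mb: float, quantized_mb: float) -> float:
    """Return the fraction of disk size saved by the quantized model."""
    return 1.0 - quantized_mb / full_mb


for name, (full_mb, quantized_mb) in MODEL_SIZES_MB.items():
    saved = size_reduction(full_mb, quantized_mb)
    print(f'{name}: {saved:.1%} smaller when quantized')
```

Both models shrink by roughly 71% on disk when quantized.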

Load Deep Conversion

def deep_conversion(
    model: str = 'fastvc-vggvox-v2', quantized: bool = False, **kwargs
):
    """
    Load Voice Conversion model.

    Parameters
    ----------
    model : str, optional (default='fastvc-vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'fastvc-32-vggvox-v2'`` - FastVC bottleneck size 32 with VGGVox-v2 Speaker Vector.
        * ``'fastvc-64-vggvox-v2'`` - FastVC bottleneck size 64 with VGGVox-v2 Speaker Vector.

    quantized : bool, optional (default=False)
        If True, loads the 8-bit quantized model.
        A quantized model is not necessarily faster; that depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.supervised.voice_conversion.load function
    """
[3]:
model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2')
quantized_model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
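
The docstring lists two allowed model names, while the default shown in the signature, `'fastvc-vggvox-v2'`, is not among them, so passing the name explicitly (as the cell above does) is the safer habit. Below is a minimal sketch of a guard that could run before loading; `ALLOWED_MODELS` and `check_model_name` are hypothetical names, not part of malaya-speech.

```python
# Names copied from the docstring above; this tuple is not read from the library.
ALLOWED_MODELS = ('fastvc-32-vggvox-v2', 'fastvc-64-vggvox-v2')


def check_model_name(model: str) -> str:
    """Return the name unchanged, or raise ValueError if it is not listed."""
    if model not in ALLOWED_MODELS:
        raise ValueError(
            f'unknown model {model!r}; expected one of {ALLOWED_MODELS}'
        )
    return model


print(check_model_name('fastvc-64-vggvox-v2'))
```

The checked name can then be passed on as `malaya_speech.voice_conversion.deep_conversion(model = name)`.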

Predict

def predict(self, original_audio, target_audio):
    """
    Change original voice audio to follow targeted voice.

    Parameters
    ----------
    original_audio: np.array or malaya_speech.model.frame.Frame
    target_audio: np.array or malaya_speech.model.frame.Frame

    Returns
    -------
    result: Dict[decoder-output, postnet-output]
    """

``original_audio`` and ``target_audio`` must both have a sample rate of 22050 Hz.
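
Because both inputs are expected at 22050 Hz, it can help to keep sample-count arithmetic in one place, for example when trimming a clip to a fixed duration. The sketch below is plain Python; `REQUIRED_SR`, `require_sample_rate` and `seconds_to_samples` are hypothetical helper names, not malaya-speech functions.

```python
REQUIRED_SR = 22050  # sample rate stated in this tutorial


def require_sample_rate(sr: int) -> int:
    """Raise ValueError unless sr matches the rate the model expects."""
    if sr != REQUIRED_SR:
        raise ValueError(f'expected {REQUIRED_SR} Hz audio, got {sr} Hz')
    return sr


def seconds_to_samples(seconds: float, sr: int = REQUIRED_SR) -> int:
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * require_sample_rate(sr)))


print(seconds_to_samples(2))    # 44100
print(seconds_to_samples(0.5))  # 11025
```

A slice such as `audio[:seconds_to_samples(2)]` then keeps the first two seconds of a clip.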

[4]:
sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
[5]:
import IPython.display as ipd

ipd.Audio(original_audio, rate = sr)
[5]:
[6]:
ipd.Audio(target_audio[:sr * 2], rate = sr)
[6]:
[7]:
%%time
r = model.predict(original_audio, target_audio)
r
CPU times: user 9.52 s, sys: 1.6 s, total: 11.1 s
Wall time: 3.21 s
[7]:
{'decoder-output': array([[ 0.16796653,  0.27031827,  0.25115278, ...,  1.9728385 ,
          2.0013132 ,  1.9959606 ],
        [ 0.1876081 ,  0.31539977,  0.21735613, ...,  2.105957  ,
          2.1475153 ,  2.135561  ],
        [ 0.11078158,  0.24430256,  0.13483176, ...,  2.2050035 ,
          2.2327175 ,  2.2086055 ],
        ...,
        [-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,
         -1.4182267 , -1.445814  ],
        [-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,
         -1.6694924 , -1.670802  ],
        [-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,
         -1.7430477 , -1.7306981 ]], dtype=float32),
 'postnet-output': array([[ 0.16796653,  0.27031827,  0.25115278, ...,  1.9728385 ,
          2.0013132 ,  1.9959606 ],
        [ 0.1876081 ,  0.31539977,  0.21735613, ...,  2.105957  ,
          2.1475153 ,  2.135561  ],
        [ 0.11078158,  0.24430256,  0.13483176, ...,  2.2050035 ,
          2.2327175 ,  2.2086055 ],
        ...,
        [-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,
         -1.4182267 , -1.445814  ],
        [-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,
         -1.6694924 , -1.670802  ],
        [-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,
         -1.7430477 , -1.7306981 ]], dtype=float32)}
[8]:
%%time
quantized_r = quantized_model.predict(original_audio, target_audio)
quantized_r
CPU times: user 9.27 s, sys: 1.48 s, total: 10.7 s
Wall time: 3.07 s
[8]:
{'decoder-output': array([[ 0.20622607,  0.31927785,  0.30248964, ...,  1.8387263 ,
          1.8538276 ,  1.8661375 ],
        [ 0.26772612,  0.37867302,  0.28368202, ...,  2.0063264 ,
          2.0300496 ,  2.027563  ],
        [ 0.22045831,  0.35479122,  0.24202934, ...,  2.1292984 ,
          2.1489828 ,  2.1232607 ],
        ...,
        [-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,
         -1.47401   , -1.4887681 ],
        [-0.553902  , -0.47220862, -0.60245174, ..., -1.6579611 ,
         -1.7115406 , -1.7125119 ],
        [-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,
         -1.7799759 , -1.766048  ]], dtype=float32),
 'postnet-output': array([[ 0.20622607,  0.31927785,  0.30248964, ...,  1.8387263 ,
          1.8538276 ,  1.8661375 ],
        [ 0.26772612,  0.37867302,  0.28368202, ...,  2.0063264 ,
          2.0300496 ,  2.027563  ],
        [ 0.22045831,  0.35479122,  0.24202934, ...,  2.1292984 ,
          2.1489828 ,  2.1232607 ],
        ...,
        [-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,
         -1.47401   , -1.4887681 ],
        [-0.553902  , -0.47220862, -0.60245174, ..., -1.6579611 ,
         -1.7115406 , -1.7125119 ],
        [-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,
         -1.7799759 , -1.766048  ]], dtype=float32)}
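
The timings and arrays printed above give a rough feel for what quantization changes on this machine. The sketch below only reuses numbers visible in those outputs (the wall times and the first three values of each `decoder-output`); a single `%%time` run is noisy, so treat the result as indicative rather than as a benchmark.

```python
# Wall times in seconds, copied from the two %%time outputs above.
wall_full = 3.21
wall_quantized = 3.07

speedup = wall_full / wall_quantized
print(f'quantized speedup: {speedup:.2f}x')

# First three values of 'decoder-output' for each model, copied from above.
full_head = [0.16796653, 0.27031827, 0.25115278]
quantized_head = [0.20622607, 0.31927785, 0.30248964]

# Mean absolute difference over the visible values only.
mean_abs_diff = sum(
    abs(a - b) for a, b in zip(full_head, quantized_head)
) / len(full_head)
print(f'mean absolute difference: {mean_abs_diff:.4f}')
```

On this run the quantized model is only marginally faster (about 1.05x), which is consistent with the docstring's note that a quantized model is not necessarily faster.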

Voice Conversion output

  1. The model returns mel features of size 80.

  2. These mel features can only be synthesized with a universal vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
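
Since the output is a sequence of 80-dimensional mel frames, its length can be related to the duration of the synthesized audio once the hop length of the mel extraction is known. This tutorial does not state the hop length, so the sketch below takes it as a parameter; the value 256 in the usage line is only an assumption for illustration, and `frames_to_seconds` is a hypothetical helper.

```python
def frames_to_seconds(n_frames: int, hop_length: int, sr: int = 22050) -> float:
    """Approximate audio duration covered by n_frames mel frames.

    Each frame advances by hop_length samples, so the total is
    n_frames * hop_length samples at sample rate sr.
    """
    if n_frames < 0 or hop_length <= 0 or sr <= 0:
        raise ValueError('n_frames must be >= 0; hop_length and sr must be > 0')
    return n_frames * hop_length / sr


# hop_length = 256 is an assumed value, not taken from this tutorial.
print(round(frames_to_seconds(200, hop_length=256), 3))
```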

Load Universal MelGAN

Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

[9]:
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
[12]:
%%time

y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.6 s, sys: 2.29 s, total: 16.9 s
Wall time: 3.46 s
[12]:
[13]:
%%time

y_ = melgan.predict([quantized_r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.1 s, sys: 1.93 s, total: 16 s
Wall time: 3 s
[13]:

Pretty good!

More examples

This time the original voice is English, and the target voices are Malay and English.
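
Converting one utterance against several target voices repeats the same two calls each time: `model.predict` for the mel features, then the vocoder's `predict` on the `'postnet-output'`. A minimal sketch of wrapping that pattern follows; `convert_voice` and `convert_to_many` are hypothetical helpers, and they only assume objects exposing the `predict` methods used in this tutorial.

```python
from typing import Any, Dict


def convert_voice(model: Any, vocoder: Any,
                  original_audio: Any, target_audio: Any) -> Any:
    """Convert original_audio to the target voice and return a waveform.

    Mirrors the two calls used in this tutorial: model.predict(...) for the
    mel features, then vocoder.predict([...]) on the 'postnet-output'.
    """
    r = model.predict(original_audio, target_audio)
    y_ = vocoder.predict([r['postnet-output']])
    return y_[0]


def convert_to_many(model: Any, vocoder: Any, original_audio: Any,
                    targets: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one source utterance against several named target voices."""
    return {
        name: convert_voice(model, vocoder, original_audio, target_audio)
        for name, target_audio in targets.items()
    }
```

With the objects loaded earlier, a call such as `convert_to_many(model, melgan, original_audio, {'female': target_audio})` would return a dictionary of waveforms, each playable with `ipd.Audio(..., rate = sr)`.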

[14]:
original_audio = malaya_speech.load('speech/44k/test-2.wav', sr = sr)[0]
ipd.Audio(original_audio, rate = sr)
[14]:
[15]:
target_audio = malaya_speech.load('speech/vctk/p300_298_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[15]:
[17]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[17]:
[18]:
target_audio = malaya_speech.load('speech/vctk/p323_158_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[18]:
[19]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[19]:
[20]:
target_audio = malaya_speech.load('speech/vctk/p360_292_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[20]:
[22]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[22]:
[23]:
target_audio = malaya_speech.load('speech/vctk/p361_077_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[23]:
[24]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[24]:
[25]:
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[25]:
[26]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[26]:
[27]:
target_audio = malaya_speech.load('speech/example-speaker/husein-zolkepli.wav', sr = sr)[0]
ipd.Audio(target_audio, rate = sr)
[27]:

If the target audio is low quality, you can apply speech enhancement to it first; see https://malaya-speech.readthedocs.io/en/latest/load-speech-enhancement.html

[28]:
enhancer = malaya_speech.speech_enhancement.deep_enhance(model = 'unet-enhance-24')
[29]:
logits = enhancer.predict(target_audio)
ipd.Audio(logits, rate = sr)
[29]:
[30]:
# use the enhanced target audio (`logits`) rather than the raw recording
r = model.predict(original_audio, logits)
[32]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[32]: