Voice Conversion#

Many-to-One, One-to-Many, Many-to-Many, and Zero-shot Voice Conversion.

This tutorial is available as an IPython notebook at malaya-speech/example/voice-conversion.

This module is language independent, so it is safe to use on different languages.

Explanation#

We created a super fast Voice Conversion model called FastVC: Faster and Accurate Voice Conversion using Transformer. No paper has been produced.

Steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/voice-conversion

[1]:
import malaya_speech
import numpy as np

List available Voice Conversion models#

[2]:
malaya_speech.voice_conversion.available_deep_conversion()
[2]:
                     Size (MB)  Quantized Size (MB)  Total loss
fastvc-32-vggvox-v2      190.0                 54.1      0.2851
fastvc-64-vggvox-v2      194.0                 55.7      0.2764

Load Deep Conversion#

def deep_conversion(
    model: str = 'fastvc-32-vggvox-v2', quantized: bool = False, **kwargs
):
    """
    Load Voice Conversion model.

    Parameters
    ----------
    model : str, optional (default='fastvc-32-vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'fastvc-32-vggvox-v2'`` - FastVC bottleneck size 32 with VGGVox-v2 Speaker Vector.
        * ``'fastvc-64-vggvox-v2'`` - FastVC bottleneck size 64 with VGGVox-v2 Speaker Vector.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.voice_conversion.load function
    """
[3]:
model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2')
quantized_model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Predict#

def predict(self, original_audio, target_audio):
    """
    Change the original voice audio to follow the targeted voice.

    Parameters
    ----------
    original_audio: np.array or malaya_speech.model.frame.Frame
    target_audio: np.array or malaya_speech.model.frame.Frame

    Returns
    -------
    result: Dict[decoder-output, postnet-output]
    """

``original_audio`` and ``target_audio`` must be at a 22050 Hz sample rate.
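If your audio comes from elsewhere at a different sample rate, resample it before calling ``predict``. A minimal sketch using ``librosa`` (assumed installed; the file path is hypothetical, and note ``malaya_speech.load`` already resamples when you pass ``sr``):

import librosa

# hypothetical: a waveform recorded at 16000 Hz
y_16k, _ = librosa.load('some-speaker.wav', sr = 16000)

# resample to the 22050 Hz FastVC expects
y_22k = librosa.resample(y_16k, orig_sr = 16000, target_sr = 22050)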

[4]:
sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
[5]:
import IPython.display as ipd

ipd.Audio(original_audio, rate = sr)
[5]:
[6]:
ipd.Audio(target_audio[:sr * 2], rate = sr)
[6]:
[7]:
%%time
r = model.predict(original_audio, target_audio)
r
CPU times: user 9.52 s, sys: 1.6 s, total: 11.1 s
Wall time: 3.21 s
[7]:
{'decoder-output': array([[ 0.16796653,  0.27031827,  0.25115278, ...,  1.9728385 ,
          2.0013132 ,  1.9959606 ],
        [ 0.1876081 ,  0.31539977,  0.21735613, ...,  2.105957  ,
          2.1475153 ,  2.135561  ],
        [ 0.11078158,  0.24430256,  0.13483176, ...,  2.2050035 ,
          2.2327175 ,  2.2086055 ],
        ...,
        [-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,
         -1.4182267 , -1.445814  ],
        [-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,
         -1.6694924 , -1.670802  ],
        [-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,
         -1.7430477 , -1.7306981 ]], dtype=float32),
 'postnet-output': array([[ 0.16796653,  0.27031827,  0.25115278, ...,  1.9728385 ,
          2.0013132 ,  1.9959606 ],
        [ 0.1876081 ,  0.31539977,  0.21735613, ...,  2.105957  ,
          2.1475153 ,  2.135561  ],
        [ 0.11078158,  0.24430256,  0.13483176, ...,  2.2050035 ,
          2.2327175 ,  2.2086055 ],
        ...,
        [-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,
         -1.4182267 , -1.445814  ],
        [-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,
         -1.6694924 , -1.670802  ],
        [-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,
         -1.7430477 , -1.7306981 ]], dtype=float32)}
[8]:
%%time
quantized_r = quantized_model.predict(original_audio, target_audio)
quantized_r
CPU times: user 9.27 s, sys: 1.48 s, total: 10.7 s
Wall time: 3.07 s
[8]:
{'decoder-output': array([[ 0.20622607,  0.31927785,  0.30248964, ...,  1.8387263 ,
          1.8538276 ,  1.8661375 ],
        [ 0.26772612,  0.37867302,  0.28368202, ...,  2.0063264 ,
          2.0300496 ,  2.027563  ],
        [ 0.22045831,  0.35479122,  0.24202934, ...,  2.1292984 ,
          2.1489828 ,  2.1232607 ],
        ...,
        [-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,
         -1.47401   , -1.4887681 ],
        [-0.553902  , -0.47220862, -0.60245174, ..., -1.6579611 ,
         -1.7115406 , -1.7125119 ],
        [-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,
         -1.7799759 , -1.766048  ]], dtype=float32),
 'postnet-output': array([[ 0.20622607,  0.31927785,  0.30248964, ...,  1.8387263 ,
          1.8538276 ,  1.8661375 ],
        [ 0.26772612,  0.37867302,  0.28368202, ...,  2.0063264 ,
          2.0300496 ,  2.027563  ],
        [ 0.22045831,  0.35479122,  0.24202934, ...,  2.1292984 ,
          2.1489828 ,  2.1232607 ],
        ...,
        [-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,
         -1.47401   , -1.4887681 ],
        [-0.553902  , -0.47220862, -0.60245174, ..., -1.6579611 ,
         -1.7115406 , -1.7125119 ],
        [-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,
         -1.7799759 , -1.766048  ]], dtype=float32)}

Voice Conversion output#

  1. Returns mel features of size 80; you can verify the shape as shown below.

  2. These mel features can only be synthesized using a Universal Vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
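A minimal shape check on the prediction from above (the number of frames depends on the input length):

mel = r['postnet-output']
print(mel.shape)  # (number of frames, 80)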

Load Universal MelGAN#

Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

[9]:
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
[12]:
%%time

y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.6 s, sys: 2.29 s, total: 16.9 s
Wall time: 3.46 s
[12]:
[13]:
%%time

y_ = melgan.predict([quantized_r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.1 s, sys: 1.93 s, total: 16 s
Wall time: 3 s
[13]:

Pretty good!

More examples#

This time the original voice is in English, while the target voices come from Malay and English speakers.

[14]:
original_audio = malaya_speech.load('speech/44k/test-2.wav', sr = sr)[0]
ipd.Audio(original_audio, rate = sr)
[14]:
[15]:
target_audio = malaya_speech.load('speech/vctk/p300_298_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[15]:
[17]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[17]:
[18]:
target_audio = malaya_speech.load('speech/vctk/p323_158_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[18]:
[19]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[19]:
[20]:
target_audio = malaya_speech.load('speech/vctk/p360_292_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[20]:
[22]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[22]:
[23]:
target_audio = malaya_speech.load('speech/vctk/p361_077_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[23]:
[24]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[24]:
[25]:
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[25]:
[26]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[26]:
[27]:
target_audio = malaya_speech.load('speech/example-speaker/husein-zolkepli.wav', sr = sr)[0]
ipd.Audio(target_audio, rate = sr)
[27]:

If you have low quality audio, you can use speech enhancement: https://malaya-speech.readthedocs.io/en/latest/load-speech-enhancement.html

[28]:
enhancer = malaya_speech.speech_enhancement.deep_enhance(model = 'unet-enhance-24')
[29]:
logits = enhancer.predict(target_audio)
ipd.Audio(logits, rate = sr)
[29]:
[30]:
r = model.predict(original_audio, logits)
[32]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[32]:
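
Putting it all together, a minimal end-to-end sketch of the pipeline (the file paths are hypothetical, and ``soundfile`` is assumed to be installed for saving the waveform):

import malaya_speech
import soundfile as sf

sr = 22050
model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2')
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')

# both audios must be at 22050 Hz
original_audio = malaya_speech.load('original.wav', sr = sr)[0]
target_audio = malaya_speech.load('target.wav', sr = sr)[0]

# convert, then synthesize the mel output with the universal vocoder
r = model.predict(original_audio, target_audio)
y_ = melgan.predict([r['postnet-output']])
sf.write('converted.wav', y_[0], sr)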