Voice Conversion#

Many-to-One, One-to-Many, Many-to-Many, and Zero-shot Voice Conversion.

This tutorial is available as an IPython notebook at malaya-speech/example/voice-conversion.

This module is language independent, so it is safe to use on different languages.

Explanation#

We created a super fast Voice Conversion model called FastVC: Faster and Accurate Voice Conversion using Transformer. No paper has been produced.

Steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/voice-conversion

[1]:
import malaya_speech
import numpy as np

List available Voice Conversion models#

[2]:
malaya_speech.voice_conversion.available_deep_conversion()
[2]:
                     Size (MB)  Quantized Size (MB)  Total loss
fastvc-32-vggvox-v2      190.0                 54.1      0.2851
fastvc-64-vggvox-v2      194.0                 55.7      0.2764

Load Deep Conversion#

def deep_conversion(
    model: str = 'fastvc-32-vggvox-v2', quantized: bool = False, **kwargs
):
    """
    Load Voice Conversion model.

    Parameters
    ----------
    model : str, optional (default='fastvc-32-vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'fastvc-32-vggvox-v2'`` - FastVC bottleneck size 32 with VGGVox-v2 Speaker Vector.
        * ``'fastvc-64-vggvox-v2'`` - FastVC bottleneck size 64 with VGGVox-v2 Speaker Vector.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.voice_conversion.load function
    """
[3]:
model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2')
quantized_model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Predict#

def predict(self, original_audio, target_audio):
    """
    Change the original voice audio to follow the targeted voice.

    Parameters
    ----------
    original_audio: np.array or malaya_speech.model.frame.Frame
    target_audio: np.array or malaya_speech.model.frame.Frame

    Returns
    -------
    result: Dict[decoder-output, postnet-output]
    """

``original_audio`` and ``target_audio`` must be at a 22050 Hz sample rate.
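If your audio comes from elsewhere at a different sample rate, resample it before calling ``predict``. A minimal sketch using ``librosa`` (assumed installed; the file path is hypothetical, and note ``malaya_speech.load`` already resamples when you pass ``sr``):

import librosa

# hypothetical: a waveform recorded at 16000 Hz
y_16k, _ = librosa.load('some-speaker.wav', sr = 16000)

# resample to the 22050 Hz FastVC expects
y_22k = librosa.resample(y_16k, orig_sr = 16000, target_sr = 22050)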

[4]:
sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
[5]:
import IPython.display as ipd

ipd.Audio(original_audio, rate = sr)
[5]:
[6]:
ipd.Audio(target_audio[:sr * 2], rate = sr)
[6]:
[7]:
%%time
r = model.predict(original_audio, target_audio)
r
CPU times: user 9.52 s, sys: 1.6 s, total: 11.1 s
Wall time: 3.21 s
[7]:
{'decoder-output': array([[ 0.16796653,  0.27031827,  0.25115278, ...,  1.9728385 ,
          2.0013132 ,  1.9959606 ],
        [ 0.1876081 ,  0.31539977,  0.21735613, ...,  2.105957  ,
          2.1475153 ,  2.135561  ],
        [ 0.11078158,  0.24430256,  0.13483176, ...,  2.2050035 ,
          2.2327175 ,  2.2086055 ],
        ...,
        [-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,
         -1.4182267 , -1.445814  ],
        [-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,
         -1.6694924 , -1.670802  ],
        [-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,
         -1.7430477 , -1.7306981 ]], dtype=float32),
 'postnet-output': array([[ 0.16796653,  0.27031827,  0.25115278, ...,  1.9728385 ,
          2.0013132 ,  1.9959606 ],
        [ 0.1876081 ,  0.31539977,  0.21735613, ...,  2.105957  ,
          2.1475153 ,  2.135561  ],
        [ 0.11078158,  0.24430256,  0.13483176, ...,  2.2050035 ,
          2.2327175 ,  2.2086055 ],
        ...,
        [-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,
         -1.4182267 , -1.445814  ],
        [-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,
         -1.6694924 , -1.670802  ],
        [-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,
         -1.7430477 , -1.7306981 ]], dtype=float32)}
[8]:
%%time
quantized_r = quantized_model.predict(original_audio, target_audio)
quantized_r
CPU times: user 9.27 s, sys: 1.48 s, total: 10.7 s
Wall time: 3.07 s
[8]:
{'decoder-output': array([[ 0.20622607,  0.31927785,  0.30248964, ...,  1.8387263 ,
          1.8538276 ,  1.8661375 ],
        [ 0.26772612,  0.37867302,  0.28368202, ...,  2.0063264 ,
          2.0300496 ,  2.027563  ],
        [ 0.22045831,  0.35479122,  0.24202934, ...,  2.1292984 ,
          2.1489828 ,  2.1232607 ],
        ...,
        [-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,
         -1.47401   , -1.4887681 ],
        [-0.553902  , -0.47220862, -0.60245174, ..., -1.6579611 ,
         -1.7115406 , -1.7125119 ],
        [-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,
         -1.7799759 , -1.766048  ]], dtype=float32),
 'postnet-output': array([[ 0.20622607,  0.31927785,  0.30248964, ...,  1.8387263 ,
          1.8538276 ,  1.8661375 ],
        [ 0.26772612,  0.37867302,  0.28368202, ...,  2.0063264 ,
          2.0300496 ,  2.027563  ],
        [ 0.22045831,  0.35479122,  0.24202934, ...,  2.1292984 ,
          2.1489828 ,  2.1232607 ],
        ...,
        [-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,
         -1.47401   , -1.4887681 ],
        [-0.553902  , -0.47220862, -0.60245174, ..., -1.6579611 ,
         -1.7115406 , -1.7125119 ],
        [-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,
         -1.7799759 , -1.766048  ]], dtype=float32)}

Voice Conversion output#

  1. Returns mel features of size 80; you can verify the shape as shown below.

  2. These mel features can only be synthesized using a Universal Vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
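A minimal shape check on the prediction from above (the number of frames depends on the input length):

mel = r['postnet-output']
print(mel.shape)  # (number of frames, 80)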

Load Universal MelGAN#

Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

[9]:
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
[12]:
%%time

y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.6 s, sys: 2.29 s, total: 16.9 s
Wall time: 3.46 s
[12]:
[13]:
%%time

y_ = melgan.predict([quantized_r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.1 s, sys: 1.93 s, total: 16 s
Wall time: 3 s
[13]:

Pretty good!

More examples#

This time the original voice is in English, while the target voices come from Malay and English speakers.

[14]:
original_audio = malaya_speech.load('speech/44k/test-2.wav', sr = sr)[0]
ipd.Audio(original_audio, rate = sr)
[14]:
[15]:
target_audio = malaya_speech.load('speech/vctk/p300_298_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[15]:
[17]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[17]:
[18]:
target_audio = malaya_speech.load('speech/vctk/p323_158_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[18]:
[19]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[19]:
[20]:
target_audio = malaya_speech.load('speech/vctk/p360_292_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[20]:
[22]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[22]:
[23]:
target_audio = malaya_speech.load('speech/vctk/p361_077_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[23]:
[24]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[24]:
[25]:
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[25]:
[26]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[26]:
[27]:
target_audio = malaya_speech.load('speech/example-speaker/husein-zolkepli.wav', sr = sr)[0]
ipd.Audio(target_audio, rate = sr)
[27]:

If you have low quality audio, you can use speech enhancement: https://malaya-speech.readthedocs.io/en/latest/load-speech-enhancement.html

[28]:
enhancer = malaya_speech.speech_enhancement.deep_enhance(model = 'unet-enhance-24')
[29]:
logits = enhancer.predict(target_audio)
ipd.Audio(logits, rate = sr)
[29]:
[30]:
r = model.predict(original_audio, logits)
[32]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[32]:
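
Putting it all together, a minimal end-to-end sketch of the pipeline (the file paths are hypothetical, and ``soundfile`` is assumed to be installed for saving the waveform):

import malaya_speech
import soundfile as sf

sr = 22050
model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2')
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')

# both audios must be at 22050 Hz
original_audio = malaya_speech.load('original.wav', sr = sr)[0]
target_audio = malaya_speech.load('target.wav', sr = sr)[0]

# convert, then synthesize the mel output with the universal vocoder
r = model.predict(original_audio, target_audio)
y_ = melgan.predict([r['postnet-output']])
sf.write('converted.wav', y_[0], sr)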