Voice¶
Many-to-One, One-to-Many, Many-to-Many, and Zero-shot Voice Conversion.
This tutorial is available as an IPython notebook at malaya-speech/example/voice-conversion.
This module is language independent, so it is safe to use on different languages.
Explanation¶
We created a very fast Voice Conversion model called FastVC (Faster and Accurate Voice Conversion using Transformer). No paper has been published for it.
Steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/voice-conversion
[1]:
import malaya_speech
import numpy as np
List available Voice Conversion models¶
[2]:
malaya_speech.voice_conversion.available_deep_conversion()
[2]:
| Model | Size (MB) | Quantized Size (MB) | Total loss |
|---|---|---|---|
| fastvc-32-vggvox-v2 | 190.0 | 54.1 | 0.2851 |
| fastvc-64-vggvox-v2 | 194.0 | 55.7 | 0.2764 |
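To put the two size columns in perspective, the following minimal sketch computes how much smaller each quantized checkpoint is. The numbers are copied from the table above; the helper name `compression_ratio` is only illustrative and is not part of malaya_speech.

```python
# Sizes in MB copied from the table above: (full, quantized).
sizes = {
    'fastvc-32-vggvox-v2': (190.0, 54.1),
    'fastvc-64-vggvox-v2': (194.0, 55.7),
}


def compression_ratio(full_mb, quantized_mb):
    """Return how many times smaller the quantized checkpoint is."""
    return full_mb / quantized_mb


ratios = {name: round(compression_ratio(*mb), 2) for name, mb in sizes.items()}
print(ratios)
```

Both models shrink by roughly 3.5x when quantized, so the choice between them is mostly about the loss column rather than disk size.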
Load Deep Conversion¶
def deep_conversion(
    model: str = 'fastvc-vggvox-v2', quantized: bool = False, **kwargs
):
    """
    Load Voice Conversion model.
    Parameters
    ----------
    model : str, optional (default='fastvc-vggvox-v2')
        Model architecture supported. Allowed values:
        * ``'fastvc-32-vggvox-v2'`` - FastVC bottleneck size 32 with VGGVox-v2 Speaker Vector.
        * ``'fastvc-64-vggvox-v2'`` - FastVC bottleneck size 64 with VGGVox-v2 Speaker Vector.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.
    Returns
    -------
    result : malaya_speech.supervised.voice_conversion.load function
    """
[3]:
model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2')
quantized_model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
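Because the docstring notes that a quantized model is not necessarily faster, it can be worth timing both variants on your own machine. The helper below is a small sketch using only the standard library; `time_call` is a hypothetical name, not a malaya_speech function.

```python
import time


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn(*args)."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Usage once the audio arrays further down have been loaded (not executed here):
# t_full = time_call(model.predict, original_audio, target_audio)
# t_quant = time_call(quantized_model.predict, original_audio, target_audio)
# print(f'full: {t_full:.2f}s, quantized: {t_quant:.2f}s')
```

Taking the best of several runs reduces the effect of one-off warm-up costs on the comparison.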
Predict¶
def predict(self, original_audio, target_audio):
    """
    Change original voice audio to follow targeted voice.
    Parameters
    ----------
    original_audio: np.array or malaya_speech.model.frame.Frame
    target_audio: np.array or malaya_speech.model.frame.Frame
    Returns
    -------
    result: Dict[decoder-output, postnet-output]
    """
``original_audio`` and ``target_audio`` must both have a sample rate of 22050 Hz.
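If a waveform is already in memory at a different rate, it has to be resampled first. The sketch below uses plain linear interpolation with NumPy; it applies no anti-aliasing filter, so treat it as a rough illustration rather than a replacement for a proper resampler. `resample_linear` and `TARGET_SR` are hypothetical names introduced here.

```python
import numpy as np

TARGET_SR = 22050  # sample rate required by predict(), as stated above


def resample_linear(audio, orig_sr, target_sr=TARGET_SR):
    """Resample a 1-D waveform by linear interpolation (no anti-aliasing)."""
    audio = np.asarray(audio, dtype=np.float32)
    if orig_sr == target_sr or audio.size == 0:
        return audio
    duration = audio.shape[0] / orig_sr
    n_out = int(round(duration * target_sr))
    old_t = np.arange(audio.shape[0]) / orig_sr  # timestamps of input samples
    new_t = np.arange(n_out) / target_sr         # timestamps of output samples
    return np.interp(new_t, old_t, audio).astype(np.float32)


one_second_44k = np.zeros(44100, dtype=np.float32)
print(resample_linear(one_second_44k, 44100).shape)  # (22050,)
```

When reading from disk, passing `sr` to `malaya_speech.load` as in the next cell is the simpler route.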
[4]:
sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
[5]:
import IPython.display as ipd
ipd.Audio(original_audio, rate = sr)
[5]:
[6]:
ipd.Audio(target_audio[:sr * 2], rate = sr)
[6]:
[7]:
%%time
r = model.predict(original_audio, target_audio)
r
CPU times: user 9.52 s, sys: 1.6 s, total: 11.1 s
Wall time: 3.21 s
[7]:
{'decoder-output': array([[ 0.16796653, 0.27031827, 0.25115278, ..., 1.9728385 ,
2.0013132 , 1.9959606 ],
[ 0.1876081 , 0.31539977, 0.21735613, ..., 2.105957 ,
2.1475153 , 2.135561 ],
[ 0.11078158, 0.24430256, 0.13483176, ..., 2.2050035 ,
2.2327175 , 2.2086055 ],
...,
[-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,
-1.4182267 , -1.445814 ],
[-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,
-1.6694924 , -1.670802 ],
[-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,
-1.7430477 , -1.7306981 ]], dtype=float32),
'postnet-output': array([[ 0.16796653, 0.27031827, 0.25115278, ..., 1.9728385 ,
2.0013132 , 1.9959606 ],
[ 0.1876081 , 0.31539977, 0.21735613, ..., 2.105957 ,
2.1475153 , 2.135561 ],
[ 0.11078158, 0.24430256, 0.13483176, ..., 2.2050035 ,
2.2327175 , 2.2086055 ],
...,
[-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,
-1.4182267 , -1.445814 ],
[-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,
-1.6694924 , -1.670802 ],
[-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,
-1.7430477 , -1.7306981 ]], dtype=float32)}
[8]:
%%time
quantized_r = quantized_model.predict(original_audio, target_audio)
quantized_r
CPU times: user 9.27 s, sys: 1.48 s, total: 10.7 s
Wall time: 3.07 s
[8]:
{'decoder-output': array([[ 0.20622607, 0.31927785, 0.30248964, ..., 1.8387263 ,
1.8538276 , 1.8661375 ],
[ 0.26772612, 0.37867302, 0.28368202, ..., 2.0063264 ,
2.0300496 , 2.027563 ],
[ 0.22045831, 0.35479122, 0.24202934, ..., 2.1292984 ,
2.1489828 , 2.1232607 ],
...,
[-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,
-1.47401 , -1.4887681 ],
[-0.553902 , -0.47220862, -0.60245174, ..., -1.6579611 ,
-1.7115406 , -1.7125119 ],
[-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,
-1.7799759 , -1.766048 ]], dtype=float32),
'postnet-output': array([[ 0.20622607, 0.31927785, 0.30248964, ..., 1.8387263 ,
1.8538276 , 1.8661375 ],
[ 0.26772612, 0.37867302, 0.28368202, ..., 2.0063264 ,
2.0300496 , 2.027563 ],
[ 0.22045831, 0.35479122, 0.24202934, ..., 2.1292984 ,
2.1489828 , 2.1232607 ],
...,
[-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,
-1.47401 , -1.4887681 ],
[-0.553902 , -0.47220862, -0.60245174, ..., -1.6579611 ,
-1.7115406 , -1.7125119 ],
[-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,
-1.7799759 , -1.766048 ]], dtype=float32)}
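The printed arrays show that the quantized model's output is close to, but not identical with, the full model's. One way to put a number on that drift is a mean absolute difference over the mel frames. This is a sketch that assumes both outputs are 2-D arrays with frames on the first axis; `mel_difference` is a hypothetical helper, and the arrays below are stand-ins rather than real model output.

```python
import numpy as np


def mel_difference(a, b):
    """Mean absolute difference between two (frames, n_mels) arrays.

    If the frame counts differ, only the overlapping frames are compared.
    """
    a, b = np.asarray(a), np.asarray(b)
    n = min(a.shape[0], b.shape[0])
    return float(np.mean(np.abs(a[:n] - b[:n])))


# Stand-in data; with real results use
# mel_difference(r['postnet-output'], quantized_r['postnet-output'])
full = np.zeros((100, 80), dtype=np.float32)
quant = np.full((100, 80), 0.05, dtype=np.float32)
print(mel_difference(full, quant))
```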
Voice Conversion output¶
The model returns mel features of size 80.
These mel features can only be synthesized with a universal vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
Load Universal MelGAN¶
Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
[9]:
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
[12]:
%%time
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.6 s, sys: 2.29 s, total: 16.9 s
Wall time: 3.46 s
[12]:
[13]:
%%time
y_ = melgan.predict([quantized_r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.1 s, sys: 1.93 s, total: 16 s
Wall time: 3 s
[13]:
Pretty good!
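To keep a converted clip outside the notebook, the waveform can be written to a WAV file. The sketch below uses only NumPy and the standard-library `wave` module, and assumes the vocoder output is a mono float array in the range [-1, 1]; `save_wav` is a hypothetical helper, not part of malaya_speech.

```python
import wave

import numpy as np


def save_wav(path, audio, sr=22050):
    """Write a mono float waveform in [-1, 1] as 16-bit PCM."""
    audio = np.clip(np.asarray(audio, dtype=np.float32), -1.0, 1.0)
    pcm = (audio * 32767).astype('<i2')  # little-endian int16 samples
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 2 bytes per sample
        f.setframerate(sr)
        f.writeframes(pcm.tobytes())


# With the vocoder output above (not executed here):
# save_wav('converted.wav', y_[0], sr)
```

Clipping before the integer conversion avoids wrap-around if the vocoder occasionally overshoots the [-1, 1] range.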
More examples¶
This time the original voice is English, and the target voices are Malay and English.
[14]:
original_audio = malaya_speech.load('speech/44k/test-2.wav', sr = sr)[0]
ipd.Audio(original_audio, rate = sr)
[14]:
[15]:
target_audio = malaya_speech.load('speech/vctk/p300_298_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[15]:
[17]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[17]:
[18]:
target_audio = malaya_speech.load('speech/vctk/p323_158_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[18]:
[19]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[19]:
[20]:
target_audio = malaya_speech.load('speech/vctk/p360_292_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[20]:
[22]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[22]:
[23]:
target_audio = malaya_speech.load('speech/vctk/p361_077_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[23]:
[24]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[24]:
[25]:
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[25]:
[26]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[26]:
[27]:
target_audio = malaya_speech.load('speech/example-speaker/husein-zolkepli.wav', sr = sr)[0]
ipd.Audio(target_audio, rate = sr)
[27]:
If the audio is low quality, you can apply speech enhancement first, see https://malaya-speech.readthedocs.io/en/latest/load-speech-enhancement.html
[28]:
enhancer = malaya_speech.speech_enhancement.deep_enhance(model = 'unet-enhance-24')
[29]:
logits = enhancer.predict(target_audio)
ipd.Audio(logits, rate = sr)
[29]:
[30]:
r = model.predict(original_audio, target_audio)
[32]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[32]:
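The cells above repeat the same two steps for every target speaker: `model.predict` followed by `melgan.predict` on the `'postnet-output'` mel. A small wrapper can make that loop explicit. This is only a sketch built on the two calls shown in this tutorial; `convert_voice` and `convert_to_many` are hypothetical names.

```python
def convert_voice(model, vocoder, original_audio, target_audio):
    """Run voice conversion, then synthesize a waveform from the postnet mel."""
    r = model.predict(original_audio, target_audio)
    return vocoder.predict([r['postnet-output']])[0]


def convert_to_many(model, vocoder, original_audio, targets):
    """Convert one source utterance to each named target voice.

    `targets` maps a label to a target waveform; the result maps the same
    labels to the synthesized waveforms.
    """
    return {
        name: convert_voice(model, vocoder, original_audio, target)
        for name, target in targets.items()
    }


# Usage with the objects from this tutorial (not executed here):
# outputs = convert_to_many(model, melgan, original_audio, {'female': target_audio})
# ipd.Audio(outputs['female'], rate = sr)
```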