Voice#
Many-to-One, One-to-Many, Many-to-Many, and Zero-shot Voice Conversion.
This tutorial is available as an IPython notebook at malaya-speech/example/voice-conversion.
This module is language independent, so it is safe to use on different languages.
Explanation#
We created a very fast Voice Conversion model called FastVC: Faster and Accurate Voice Conversion using Transformer. No paper has been published for it.
Steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/voice-conversion
[1]:
import malaya_speech
import numpy as np
List available Voice Conversion models#
[2]:
malaya_speech.voice_conversion.available_deep_conversion()
[2]:
| | Size (MB) | Quantized Size (MB) | Total loss |
|---|---|---|---|
| fastvc-32-vggvox-v2 | 190.0 | 54.1 | 0.2851 |
| fastvc-64-vggvox-v2 | 194.0 | 55.7 | 0.2764 |
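Reading the table, `fastvc-64-vggvox-v2` is slightly larger but achieves a lower total loss. As an illustrative sketch (the dict below simply mirrors the table; it is not part of the malaya-speech API), you could pick the lowest-loss model programmatically:

```python
# Mirror of the table above; not a malaya-speech API, just the published numbers.
models = {
    'fastvc-32-vggvox-v2': {'size_mb': 190.0, 'quantized_mb': 54.1, 'total_loss': 0.2851},
    'fastvc-64-vggvox-v2': {'size_mb': 194.0, 'quantized_mb': 55.7, 'total_loss': 0.2764},
}

# Pick the model with the smallest total loss.
best = min(models, key=lambda name: models[name]['total_loss'])
```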
Load Deep Conversion#
def deep_conversion(
model: str = 'fastvc-vggvox-v2', quantized: bool = False, **kwargs
):
"""
Load Voice Conversion model.
Parameters
----------
model : str, optional (default='fastvc-vggvox-v2')
Model architecture supported. Allowed values:
* ``'fastvc-32-vggvox-v2'`` - FastVC bottleneck size 32 with VGGVox-v2 Speaker Vector.
* ``'fastvc-64-vggvox-v2'`` - FastVC bottleneck size 64 with VGGVox-v2 Speaker Vector.
quantized : bool, optional (default=False)
if True, will load 8-bit quantized model.
A quantized model is not necessarily faster; it depends on the machine.
Returns
-------
result : malaya_speech.supervised.voice_conversion.load function
"""
[3]:
model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2')
quantized_model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
Predict#
def predict(self, original_audio, target_audio):
"""
Change original voice audio to follow targeted voice.
Parameters
----------
original_audio: np.array or malaya_speech.model.frame.Frame
target_audio: np.array or malaya_speech.model.frame.Frame
Returns
-------
result: Dict[decoder-output, postnet-output]
"""
``original_audio`` and ``target_audio`` must have a 22050 Hz sample rate.
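`malaya_speech.load` handles resampling via its `sr` argument, but if you already have a raw array at another rate, here is a minimal resampling sketch using SciPy (the helper name `to_22050` is our own, not a malaya-speech function):

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def to_22050(audio, orig_sr, target_sr=22050):
    # Resample by the rational factor target_sr / orig_sr.
    g = gcd(target_sr, orig_sr)
    return resample_poly(audio, target_sr // g, orig_sr // g)

audio_44k = np.zeros(44100, dtype=np.float32)  # one second at 44.1 kHz
audio_22k = to_22050(audio_44k, 44100)
```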
[4]:
sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
[5]:
import IPython.display as ipd
ipd.Audio(original_audio, rate = sr)
[5]:
[6]:
ipd.Audio(target_audio[:sr * 2], rate = sr)
[6]:
[7]:
%%time
r = model.predict(original_audio, target_audio)
r
CPU times: user 9.52 s, sys: 1.6 s, total: 11.1 s
Wall time: 3.21 s
[7]:
{'decoder-output': array([[ 0.16796653, 0.27031827, 0.25115278, ..., 1.9728385 ,
2.0013132 , 1.9959606 ],
[ 0.1876081 , 0.31539977, 0.21735613, ..., 2.105957 ,
2.1475153 , 2.135561 ],
[ 0.11078158, 0.24430256, 0.13483176, ..., 2.2050035 ,
2.2327175 , 2.2086055 ],
...,
[-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,
-1.4182267 , -1.445814 ],
[-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,
-1.6694924 , -1.670802 ],
[-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,
-1.7430477 , -1.7306981 ]], dtype=float32),
'postnet-output': array([[ 0.16796653, 0.27031827, 0.25115278, ..., 1.9728385 ,
2.0013132 , 1.9959606 ],
[ 0.1876081 , 0.31539977, 0.21735613, ..., 2.105957 ,
2.1475153 , 2.135561 ],
[ 0.11078158, 0.24430256, 0.13483176, ..., 2.2050035 ,
2.2327175 , 2.2086055 ],
...,
[-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,
-1.4182267 , -1.445814 ],
[-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,
-1.6694924 , -1.670802 ],
[-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,
-1.7430477 , -1.7306981 ]], dtype=float32)}
[8]:
%%time
quantized_r = quantized_model.predict(original_audio, target_audio)
quantized_r
CPU times: user 9.27 s, sys: 1.48 s, total: 10.7 s
Wall time: 3.07 s
[8]:
{'decoder-output': array([[ 0.20622607, 0.31927785, 0.30248964, ..., 1.8387263 ,
1.8538276 , 1.8661375 ],
[ 0.26772612, 0.37867302, 0.28368202, ..., 2.0063264 ,
2.0300496 , 2.027563 ],
[ 0.22045831, 0.35479122, 0.24202934, ..., 2.1292984 ,
2.1489828 , 2.1232607 ],
...,
[-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,
-1.47401 , -1.4887681 ],
[-0.553902 , -0.47220862, -0.60245174, ..., -1.6579611 ,
-1.7115406 , -1.7125119 ],
[-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,
-1.7799759 , -1.766048 ]], dtype=float32),
'postnet-output': array([[ 0.20622607, 0.31927785, 0.30248964, ..., 1.8387263 ,
1.8538276 , 1.8661375 ],
[ 0.26772612, 0.37867302, 0.28368202, ..., 2.0063264 ,
2.0300496 , 2.027563 ],
[ 0.22045831, 0.35479122, 0.24202934, ..., 2.1292984 ,
2.1489828 , 2.1232607 ],
...,
[-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,
-1.47401 , -1.4887681 ],
[-0.553902 , -0.47220862, -0.60245174, ..., -1.6579611 ,
-1.7115406 , -1.7125119 ],
[-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,
-1.7799759 , -1.766048 ]], dtype=float32)}
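To quantify the accuracy drop from quantization, you can compare the two mel outputs directly, e.g. with a mean absolute difference. The sketch below uses synthetic arrays as stand-ins; in practice substitute `r['postnet-output']` and `quantized_r['postnet-output']` from the cells above:

```python
import numpy as np

# Synthetic stand-ins for the float and quantized mel outputs.
rng = np.random.default_rng(0)
mel_fp32 = rng.standard_normal((100, 80)).astype(np.float32)
mel_int8 = mel_fp32 + rng.normal(0, 0.05, size=mel_fp32.shape).astype(np.float32)

# Mean absolute difference as a rough proxy for quantization error.
mad = float(np.mean(np.abs(mel_fp32 - mel_int8)))
```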
Voice Conversion output#
The model returns mel spectrogram features with 80 bins.
This mel output can only be synthesized into a waveform using a Universal Vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
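To make the shapes concrete, here is a sketch with a synthetic stand-in for `r['postnet-output']`: each row is one frame and each frame has 80 mel bins, and the vocoder's `predict` takes a list (a batch) of such arrays, returning one waveform per entry:

```python
import numpy as np

# Synthetic stand-in for r['postnet-output']: 100 frames x 80 mel bins.
mel = np.zeros((100, 80), dtype=np.float32)

# The second dimension is always the 80 mel bins.
n_frames, n_mels = mel.shape

# melgan.predict expects a batch, i.e. a list of mel arrays.
batch = [mel]
```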
Load Universal MelGAN#
Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
[9]:
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
[12]:
%%time
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.6 s, sys: 2.29 s, total: 16.9 s
Wall time: 3.46 s
[12]:
[13]:
%%time
y_ = melgan.predict([quantized_r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.1 s, sys: 1.93 s, total: 16 s
Wall time: 3 s
[13]:
Pretty good!
More example#
This time the original voice is English, and the target voices come from Malay and English speakers.
[14]:
original_audio = malaya_speech.load('speech/44k/test-2.wav', sr = sr)[0]
ipd.Audio(original_audio, rate = sr)
[14]:
[15]:
target_audio = malaya_speech.load('speech/vctk/p300_298_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[15]:
[17]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[17]:
[18]:
target_audio = malaya_speech.load('speech/vctk/p323_158_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[18]:
[19]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[19]:
[20]:
target_audio = malaya_speech.load('speech/vctk/p360_292_mic2.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[20]:
[22]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[22]:
[23]:
target_audio = malaya_speech.load('speech/vctk/p361_077_mic1.flac', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[23]:
[24]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[24]:
[25]:
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
r = model.predict(original_audio, target_audio)
ipd.Audio(target_audio, rate = sr)
[25]:
[26]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[26]:
[27]:
target_audio = malaya_speech.load('speech/example-speaker/husein-zolkepli.wav', sr = sr)[0]
ipd.Audio(target_audio, rate = sr)
[27]:
If you have low-quality audio, you can apply speech enhancement first, https://malaya-speech.readthedocs.io/en/latest/load-speech-enhancement.html
[28]:
enhancer = malaya_speech.speech_enhancement.deep_enhance(model = 'unet-enhance-24')
[29]:
logits = enhancer.predict(target_audio)
ipd.Audio(logits, rate = sr)
[29]:
[30]:
r = model.predict(original_audio, logits)
[32]:
y_ = melgan.predict([r['postnet-output']])
ipd.Audio(y_[0], rate = sr)
[32]:
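The conversion-then-vocoder pattern repeated above can be wrapped in a small helper. `convert_voice` is our own name, assuming only the interfaces shown in this tutorial: a conversion model whose `predict` returns a dict with a `'postnet-output'` key, and a vocoder whose `predict` takes a batch of mels:

```python
def convert_voice(vc_model, vocoder, original_audio, target_audio):
    # Predict the target-voiced mel spectrogram, then synthesize a waveform.
    r = vc_model.predict(original_audio, target_audio)
    return vocoder.predict([r['postnet-output']])[0]
```

Usage would be `y = convert_voice(model, melgan, original_audio, target_audio)`, followed by `ipd.Audio(y, rate = sr)`.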