Vocoder
Contents
Vocoder#
Synthesize melspectrogram to waveform.
This tutorial is available as an IPython notebook at malaya-speech/example/vocoder.
This module is not language independent, so it is not safe to use on different languages. Pretrained models were trained on hyperlocal languages.
[2]:
import malaya_speech
import numpy as np
import IPython.display as ipd
Available Vocoder#
MelGAN, https://arxiv.org/abs/1910.06711
Multiband MelGAN, https://arxiv.org/abs/2005.05106
Universal MelGAN, https://arxiv.org/abs/2011.09631
HiFiGAN, https://arxiv.org/abs/2010.05646
Vocoder description#
These vocoder models are only able to convert melspectrograms generated by malaya-speech TTS models.
Only accept mel feature size of 80.
Will generate waveforms with a 22050 Hz sample rate.
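Given those constraints, the output length can be estimated directly from the mel shape. A minimal sketch, assuming the common hop length of 256 samples (the hop length is an assumption here, not stated by this page; check your TTS frontend):

```python
import numpy as np

SAMPLE_RATE = 22050   # output sample rate of these vocoders
N_MELS = 80           # required mel feature size
HOP_LENGTH = 256      # assumed hop length; verify against your TTS frontend

def expected_duration(mel: np.ndarray) -> float:
    """Estimate output duration in seconds for a mel of shape (frames, 80)."""
    assert mel.ndim == 2 and mel.shape[1] == N_MELS, 'expected shape (frames, 80)'
    num_samples = mel.shape[0] * HOP_LENGTH
    return num_samples / SAMPLE_RATE

mel = np.zeros((860, 80), dtype=np.float32)  # dummy melspectrogram
print(round(expected_duration(mel), 2))      # ~10 seconds of audio
```

This is only a sanity check for sizing buffers or batching; the actual waveform length comes from the vocoder itself.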
List available MelGAN#
[3]:
malaya_speech.vocoder.available_melgan()
[3]:
| | Size (MB) | Quantized Size (MB) | Mel loss |
|---|---|---|---|
| male | 17.3 | 4.53 | 0.4443 |
| female | 17.3 | 4.53 | 0.4434 |
| husein | 17.3 | 4.53 | 0.4442 |
| haqkiem | 17.3 | 4.53 | 0.4819 |
| universal | 309.0 | 77.50 | 0.4463 |
| universal-1024 | 78.4 | 19.90 | 0.4591 |
husein
voice contributed by Husein-Zolkepli, recorded using a low-end microphone in a small room with no reverberation absorber.
haqkiem
voice contributed by Haqkiem Hamdan, recorded using a high-end microphone in an audio studio.
List available MB MelGAN#
[4]:
malaya_speech.vocoder.available_mbmelgan()
[4]:
| | Size (MB) | Quantized Size (MB) | Mel loss |
|---|---|---|---|
| female | 10.4 | 2.82 | 0.4356 |
| male | 10.4 | 2.82 | 0.3735 |
| husein | 10.4 | 2.82 | 0.4356 |
| haqkiem | 10.4 | 2.82 | 0.4192 |
husein
voice contributed by Husein-Zolkepli, recorded using a low-end microphone in a small room with no reverberation absorber.
haqkiem
voice contributed by Haqkiem Hamdan, recorded using a high-end microphone in an audio studio.
List available HiFiGAN#
[5]:
malaya_speech.vocoder.available_hifigan()
[5]:
| | Size (MB) | Quantized Size (MB) | Mel loss |
|---|---|---|---|
| male | 8.8 | 2.49 | 0.4650 |
| female | 8.8 | 2.49 | 0.5547 |
| universal-768 | 72.8 | 18.50 | 0.3617 |
| universal-512 | 32.6 | 8.60 | 0.3253 |
Load MelGAN model#
def melgan(model: str = 'female', quantized: bool = False, **kwargs):
"""
Load MelGAN Vocoder model.
Parameters
----------
model : str, optional (default='female')
Model architecture supported. Allowed values:
* ``'female'`` - MelGAN trained on female voice.
* ``'male'`` - MelGAN trained on male voice.
* ``'husein'`` - MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
* ``'haqkiem'`` - MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
* ``'universal'`` - Universal MelGAN trained on multiple speakers.
* ``'universal-1024'`` - Universal MelGAN with 1024 filters trained on multiple speakers.
quantized : bool, optional (default=False)
If True, will load an 8-bit quantized model.
A quantized model is not necessarily faster; it totally depends on the machine.
Returns
-------
result : malaya_speech.supervised.vocoder.load function
"""
[4]:
melgan = malaya_speech.vocoder.melgan(model = 'female')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:66: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:68: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:61: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
[7]:
husein_melgan = malaya_speech.vocoder.melgan(model = 'husein')
[6]:
quantized_melgan = malaya_speech.vocoder.melgan(model = 'female', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
[7]:
quantized_husein_melgan = malaya_speech.vocoder.melgan(model = 'husein', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
Load Multiband MelGAN model#
def mbmelgan(model: str = 'female', quantized: bool = False, **kwargs):
"""
Load Multiband MelGAN Vocoder model.
Parameters
----------
model : str, optional (default='female')
Model architecture supported. Allowed values:
* ``'female'`` - Multiband MelGAN trained on female voice.
* ``'male'`` - Multiband MelGAN trained on male voice.
* ``'husein'`` - Multiband MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
* ``'haqkiem'`` - Multiband MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
quantized : bool, optional (default=False)
If True, will load an 8-bit quantized model.
A quantized model is not necessarily faster; it totally depends on the machine.
Returns
-------
result : malaya_speech.supervised.vocoder.load function
"""
[8]:
mbmelgan = malaya_speech.vocoder.mbmelgan(model = 'female')
[9]:
quantized_mbmelgan = malaya_speech.vocoder.mbmelgan(model = 'female', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
Load HiFiGAN model#
def hifigan(model: str = 'female', quantized: bool = False, **kwargs):
"""
Load HiFiGAN Vocoder model.
Parameters
----------
model : str, optional (default='female')
Model architecture supported. Allowed values:
* ``'female'`` - HiFiGAN trained on female voice.
* ``'male'`` - HiFiGAN trained on male voice.
* ``'universal-768'`` - Universal HiFiGAN with 768 filters trained on multiple speakers.
* ``'universal-512'`` - Universal HiFiGAN with 512 filters trained on multiple speakers.
quantized : bool, optional (default=False)
If True, will load an 8-bit quantized model.
A quantized model is not necessarily faster; it totally depends on the machine.
Returns
-------
result : malaya_speech.supervised.vocoder.load function
"""
[6]:
hifigan = malaya_speech.vocoder.hifigan(model = 'female')
[7]:
quantized_hifigan = malaya_speech.vocoder.hifigan(model = 'female', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
Load data#
This data is from the validation set.
[8]:
import pickle
with open('speech/pickle/example-female.pkl', 'rb') as fopen:
example = pickle.load(fopen)
example
[8]:
{'mel': array([[-5.135642 , -4.9542203, -5.045578 , ..., -2.9754655, -2.4993045,
-2.552092 ],
[-4.9802437, -5.013033 , -4.753161 , ..., -2.8229384, -2.4302876,
-2.4801488],
[-5.174154 , -5.3979354, -4.799525 , ..., -2.5164714, -2.5151956,
-2.750568 ],
...,
[-1.4169824, -1.1434933, -1.3719425, ..., -1.5436271, -1.6565201,
-1.8572053],
[-1.5044638, -1.6360878, -1.6556237, ..., -1.5360395, -1.6257277,
-1.8962083],
[-2.642538 , -2.923341 , -2.8665295, ..., -2.355686 , -2.3283741,
-2.5134325]], dtype=float32),
'audio': array([-6.1828476e-05, -6.1828476e-05, 0.0000000e+00, ...,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00], dtype=float32)}
Predict#
def predict(self, inputs):
"""
Convert melspectrogram to waveform.
Parameters
----------
inputs: List[np.array]
List[np.array] or List[malaya_speech.model.frame.FRAME].
Returns
-------
result: List
"""
[11]:
ipd.Audio(example['audio'], rate = 22050)
[11]:
[12]:
%%time
y_ = melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.81 s, sys: 401 ms, total: 2.21 s
Wall time: 651 ms
[12]:
[13]:
np.abs(example['audio'] - y_).mean()
[13]:
0.110681236
[14]:
%%time
y_ = quantized_melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.81 s, sys: 383 ms, total: 2.19 s
Wall time: 688 ms
[14]:
[15]:
np.abs(example['audio'] - y_).mean()
[15]:
0.111384235
[16]:
%%time
y_ = mbmelgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 782 ms, sys: 153 ms, total: 935 ms
Wall time: 340 ms
[16]:
[17]:
np.abs(example['audio'] - y_).mean()
[17]:
0.141786
[18]:
%%time
y_ = quantized_mbmelgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 846 ms, sys: 142 ms, total: 988 ms
Wall time: 356 ms
[18]:
[19]:
np.abs(example['audio'] - y_).mean()
[19]:
0.1441468
[14]:
%%time
y_ = hifigan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.79 s, sys: 260 ms, total: 2.05 s
Wall time: 486 ms
[14]:
[12]:
%%time
y_ = quantized_hifigan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.59 s, sys: 273 ms, total: 1.86 s
Wall time: 449 ms
[12]:
[20]:
with open('speech/pickle/example-husein.pkl', 'rb') as fopen:
example = pickle.load(fopen)
example
[20]:
{'mel': array([[-3.0871143 , -2.1933894 , -1.9330187 , ..., -1.0192316 ,
-1.3248271 , -1.0974363 ],
[-2.6055963 , -2.3006353 , -1.6657838 , ..., -1.0752352 ,
-0.9072396 , -0.96403134],
[-1.9718108 , -2.147272 , -2.0438907 , ..., -0.89703083,
-0.73646724, -0.99775743],
...,
[-1.3376398 , -1.70222 , -1.4850315 , ..., -1.0758084 ,
-1.4132309 , -0.98251915],
[-1.1394596 , -1.5324714 , -1.5667722 , ..., -1.1989553 ,
-1.2888682 , -1.0891267 ],
[-1.3729159 , -1.4634348 , -1.9626601 , ..., -1.4223598 ,
-1.1820908 , -1.3431906 ]], dtype=float32),
'audio': array([0.00092771, 0.00076513, 0.00053338, ..., 0.00252281, 0.00252281,
0.00252281], dtype=float32)}
[22]:
%%time
y_ = husein_melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.29 s, sys: 325 ms, total: 1.62 s
Wall time: 686 ms
[22]:
[24]:
%%time
y_ = quantized_husein_melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.29 s, sys: 218 ms, total: 1.51 s
Wall time: 345 ms
[24]: