Vocoder#

Synthesize mel spectrograms to waveforms.

This tutorial is available as an IPython notebook at malaya-speech/example/vocoder.

This module is not language independent, so it is not safe to use on different languages. Pretrained models were trained on hyperlocal languages.

[2]:
import malaya_speech
import numpy as np
import IPython.display as ipd

Vocoder description#

  1. These vocoder models are only able to convert mel spectrograms generated by malaya-speech TTS models.

  2. Only accept a mel feature size of 80 (see the sketch after this list).

  3. Will generate waveforms with a 22050 Hz sample rate.
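
A minimal sketch of this contract; mel below is only a stand-in with the right shape, real input should come from a malaya-speech TTS model:

import numpy as np

# stand-in mel spectrogram of shape (time, 80); a real one should come from
# a malaya-speech TTS model, not random values
mel = np.random.uniform(-5.0, 0.0, size = (300, 80)).astype(np.float32)
assert mel.shape[-1] == 80  # these vocoders only accept a mel feature size of 80

# vocoder.predict([mel]) then returns a list of waveforms at 22050 Hz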

List available MelGAN#

[3]:
malaya_speech.vocoder.available_melgan()
[3]:
                Size (MB)  Quantized Size (MB)  Mel loss
male            17.3       4.53                 0.4443
female          17.3       4.53                 0.4434
husein          17.3       4.53                 0.4442
haqkiem         17.3       4.53                 0.4819
universal       309.0      77.50                0.4463
universal-1024  78.4       19.90                0.4591

husein voice contributed by Husein-Zolkepli, recorded using a low-end microphone in a small room with no reverberation absorber.

haqkiem voice contributed by Haqkiem Hamdan, recorded using a high-end microphone in an audio studio.
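
The universal and universal-1024 checkpoints are trained on multiple speakers and are the usual choice when the target voice is not one of the single-speaker models above. They load through the same API; a minimal sketch, not executed in this notebook:

# larger universal MelGAN, trained on multiple speakers
universal_melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')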

List available MB MelGAN#

[4]:
malaya_speech.vocoder.available_mbmelgan()
[4]:
         Size (MB)  Quantized Size (MB)  Mel loss
female   10.4       2.82                 0.4356
male     10.4       2.82                 0.3735
husein   10.4       2.82                 0.4356
haqkiem  10.4       2.82                 0.4192

husein voice contributed by Husein-Zolkepli, recorded using a low-end microphone in a small room with no reverberation absorber.

haqkiem voice contributed by Haqkiem Hamdan, recorded using a high-end microphone in an audio studio.

List available HiFiGAN#

[5]:
malaya_speech.vocoder.available_hifigan()
[5]:
               Size (MB)  Quantized Size (MB)  Mel loss
male           8.8        2.49                 0.4650
female         8.8        2.49                 0.5547
universal-768  72.8       18.50                0.3617
universal-512  32.6       8.60                 0.3253

Load MelGAN model#

def melgan(model: str = 'female', quantized: bool = False, **kwargs):
    """
    Load MelGAN Vocoder model.

    Parameters
    ----------
    model : str, optional (default='female')
        Model architecture supported. Allowed values:

        * ``'female'`` - MelGAN trained on female voice.
        * ``'male'`` - MelGAN trained on male voice.
        * ``'husein'`` - MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
        * ``'haqkiem'`` - MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
        * ``'universal'`` - Universal MelGAN trained on multiple speakers.
        * ``'universal-1024'`` - Universal MelGAN with 1024 filters trained on multiple speakers.

    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        The quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.vocoder.load function
    """
[4]:
melgan = malaya_speech.vocoder.melgan(model = 'female')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:66: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:68: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:61: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[7]:
husein_melgan = malaya_speech.vocoder.melgan(model = 'husein')
[6]:
quantized_melgan = malaya_speech.vocoder.melgan(model = 'female', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
[7]:
quantized_husein_melgan = malaya_speech.vocoder.melgan(model = 'husein', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Load Multiband MelGAN model#

def mbmelgan(model: str = 'female', quantized: bool = False, **kwargs):
    """
    Load Multiband MelGAN Vocoder model.

    Parameters
    ----------
    model : str, optional (default='female')
        Model architecture supported. Allowed values:

        * ``'female'`` - Multiband MelGAN trained on female voice.
        * ``'male'`` - Multiband MelGAN trained on male voice.
        * ``'husein'`` - Multiband MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
        * ``'haqkiem'`` - Multiband MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/

    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        The quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.vocoder.load function
    """
[8]:
mbmelgan = malaya_speech.vocoder.mbmelgan(model = 'female')
[9]:
quantized_mbmelgan = malaya_speech.vocoder.mbmelgan(model = 'female', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Load HiFiGAN model#

def hifigan(model: str = 'female', quantized: bool = False, **kwargs):
    """
    Load HiFiGAN Vocoder model.

    Parameters
    ----------
    model : str, optional (default='female')
        Model architecture supported. Allowed values:

        * ``'female'`` - HiFiGAN trained on female voice.
        * ``'male'`` - HiFiGAN trained on male voice.
        * ``'universal-768'`` - Universal HiFiGAN with 768 filters trained on multiple speakers.
        * ``'universal-512'`` - Universal HiFiGAN with 512 filters trained on multiple speakers.


    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        The quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.vocoder.load function
    """
[6]:
hifigan = malaya_speech.vocoder.hifigan(model = 'female')
[7]:
quantized_hifigan = malaya_speech.vocoder.hifigan(model = 'female', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Load data#

This data is from the validation set.

[8]:
import pickle

with open('speech/pickle/example-female.pkl', 'rb') as fopen:
    example = pickle.load(fopen)

example
[8]:
{'mel': array([[-5.135642 , -4.9542203, -5.045578 , ..., -2.9754655, -2.4993045,
         -2.552092 ],
        [-4.9802437, -5.013033 , -4.753161 , ..., -2.8229384, -2.4302876,
         -2.4801488],
        [-5.174154 , -5.3979354, -4.799525 , ..., -2.5164714, -2.5151956,
         -2.750568 ],
        ...,
        [-1.4169824, -1.1434933, -1.3719425, ..., -1.5436271, -1.6565201,
         -1.8572053],
        [-1.5044638, -1.6360878, -1.6556237, ..., -1.5360395, -1.6257277,
         -1.8962083],
        [-2.642538 , -2.923341 , -2.8665295, ..., -2.355686 , -2.3283741,
         -2.5134325]], dtype=float32),
 'audio': array([-6.1828476e-05, -6.1828476e-05,  0.0000000e+00, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00], dtype=float32)}
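
For reference, a compatible 80-bin log-mel spectrogram can be computed from raw audio with librosa. The FFT size, hop length and log clipping below are assumptions for illustration and may not match the preprocessing used to train these vocoders; in practice the vocoders are only guaranteed to work with mels produced by malaya-speech TTS models:

import librosa
import numpy as np

# hypothetical path; any 22050 Hz mono recording
audio, sr = librosa.load('speech/example.wav', sr = 22050)

# 80 mel bins to match the vocoder input; n_fft / hop_length are assumptions
mel = librosa.feature.melspectrogram(y = audio, sr = sr, n_fft = 1024,
                                     hop_length = 256, n_mels = 80)
log_mel = np.log(np.clip(mel, 1e-5, None)).T.astype(np.float32)  # (time, 80)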

Predict#

def predict(self, inputs):
    """
    Convert mel spectrograms to waveforms.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.FRAME].
    Returns
    -------
    result: List
    """
[11]:
ipd.Audio(example['audio'], rate = 22050)
[11]:
[12]:
%%time

y_ = melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.81 s, sys: 401 ms, total: 2.21 s
Wall time: 651 ms
[12]:
[13]:
np.abs(example['audio'] - y_).mean()
[13]:
0.110681236
[14]:
%%time

y_ = quantized_melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.81 s, sys: 383 ms, total: 2.19 s
Wall time: 688 ms
[14]:
[15]:
np.abs(example['audio'] - y_).mean()
[15]:
0.111384235
[16]:
%%time

y_ = mbmelgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 782 ms, sys: 153 ms, total: 935 ms
Wall time: 340 ms
[16]:
[17]:
np.abs(example['audio'] - y_).mean()
[17]:
0.141786
[18]:
%%time

y_ = quantized_mbmelgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 846 ms, sys: 142 ms, total: 988 ms
Wall time: 356 ms
[18]:
[19]:
np.abs(example['audio'] - y_).mean()
[19]:
0.1441468
[14]:
%%time

y_ = hifigan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.79 s, sys: 260 ms, total: 2.05 s
Wall time: 486 ms
[14]:
[12]:
%%time

y_ = quantized_hifigan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.59 s, sys: 273 ms, total: 1.86 s
Wall time: 449 ms
[12]:
[20]:
with open('speech/pickle/example-husein.pkl', 'rb') as fopen:
    example = pickle.load(fopen)

example
[20]:
{'mel': array([[-3.0871143 , -2.1933894 , -1.9330187 , ..., -1.0192316 ,
         -1.3248271 , -1.0974363 ],
        [-2.6055963 , -2.3006353 , -1.6657838 , ..., -1.0752352 ,
         -0.9072396 , -0.96403134],
        [-1.9718108 , -2.147272  , -2.0438907 , ..., -0.89703083,
         -0.73646724, -0.99775743],
        ...,
        [-1.3376398 , -1.70222   , -1.4850315 , ..., -1.0758084 ,
         -1.4132309 , -0.98251915],
        [-1.1394596 , -1.5324714 , -1.5667722 , ..., -1.1989553 ,
         -1.2888682 , -1.0891267 ],
        [-1.3729159 , -1.4634348 , -1.9626601 , ..., -1.4223598 ,
         -1.1820908 , -1.3431906 ]], dtype=float32),
 'audio': array([0.00092771, 0.00076513, 0.00053338, ..., 0.00252281, 0.00252281,
        0.00252281], dtype=float32)}
[22]:
%%time

y_ = husein_melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.29 s, sys: 325 ms, total: 1.62 s
Wall time: 686 ms
[22]:
[24]:
%%time

y_ = quantized_husein_melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.29 s, sys: 218 ms, total: 1.51 s
Wall time: 345 ms
[24]:
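
To keep any of the generated waveforms, write them to disk at the vocoder's 22050 Hz output rate; a minimal sketch using soundfile, an extra dependency not imported above:

import soundfile as sf

# y_[0] is the float32 waveform produced by the vocoder above
sf.write('generated.wav', y_[0], samplerate = 22050)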