Vocoder#

Synthesize mel spectrograms to waveforms.

This tutorial is available as an IPython notebook at malaya-speech/example/vocoder.

This module is not language independent, so it is not safe to use on different languages. Pretrained models were trained on hyperlocal languages.

[2]:
import malaya_speech
import numpy as np
import IPython.display as ipd

Vocoder description#

  1. These vocoder models are only able to convert mel spectrograms generated by malaya-speech TTS models.

  2. Only accept a mel feature size of 80 (see the sketch after this list).

  3. Will generate waveforms with a 22050 Hz sample rate.
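
A minimal sketch of this contract; mel below is only a stand-in with the right shape, real input should come from a malaya-speech TTS model:

import numpy as np

# stand-in mel spectrogram of shape (time, 80); a real one should come from
# a malaya-speech TTS model, not random values
mel = np.random.uniform(-5.0, 0.0, size = (300, 80)).astype(np.float32)
assert mel.shape[-1] == 80  # these vocoders only accept a mel feature size of 80

# vocoder.predict([mel]) then returns a list of waveforms at 22050 Hz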

List available MelGAN#

[3]:
malaya_speech.vocoder.available_melgan()
[3]:
                Size (MB)  Quantized Size (MB)  Mel loss
male            17.3       4.53                 0.4443
female          17.3       4.53                 0.4434
husein          17.3       4.53                 0.4442
haqkiem         17.3       4.53                 0.4819
universal       309.0      77.50                0.4463
universal-1024  78.4       19.90                0.4591

husein voice contributed by Husein-Zolkepli, recorded using a low-end microphone in a small room with no reverberation absorber.

haqkiem voice contributed by Haqkiem Hamdan, recorded using a high-end microphone in an audio studio.
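
The universal and universal-1024 checkpoints are trained on multiple speakers and are the usual choice when the target voice is not one of the single-speaker models above. They load through the same API; a minimal sketch, not executed in this notebook:

# larger universal MelGAN, trained on multiple speakers
universal_melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')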

List available MB MelGAN#

[4]:
malaya_speech.vocoder.available_mbmelgan()
[4]:
         Size (MB)  Quantized Size (MB)  Mel loss
female   10.4       2.82                 0.4356
male     10.4       2.82                 0.3735
husein   10.4       2.82                 0.4356
haqkiem  10.4       2.82                 0.4192

husein voice contributed by Husein-Zolkepli, recorded using a low-end microphone in a small room with no reverberation absorber.

haqkiem voice contributed by Haqkiem Hamdan, recorded using a high-end microphone in an audio studio.

List available HiFiGAN#

[5]:
malaya_speech.vocoder.available_hifigan()
[5]:
               Size (MB)  Quantized Size (MB)  Mel loss
male           8.8        2.49                 0.4650
female         8.8        2.49                 0.5547
universal-768  72.8       18.50                0.3617
universal-512  32.6       8.60                 0.3253

Load MelGAN model#

def melgan(model: str = 'female', quantized: bool = False, **kwargs):
    """
    Load MelGAN Vocoder model.

    Parameters
    ----------
    model : str, optional (default='female')
        Model architecture supported. Allowed values:

        * ``'female'`` - MelGAN trained on female voice.
        * ``'male'`` - MelGAN trained on male voice.
        * ``'husein'`` - MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
        * ``'haqkiem'`` - MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
        * ``'universal'`` - Universal MelGAN trained on multiple speakers.
        * ``'universal-1024'`` - Universal MelGAN with 1024 filters trained on multiple speakers.

    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        The quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.vocoder.load function
    """
[4]:
melgan = malaya_speech.vocoder.melgan(model = 'female')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:66: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:68: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:61: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

[7]:
husein_melgan = malaya_speech.vocoder.melgan(model = 'husein')
[6]:
quantized_melgan = malaya_speech.vocoder.melgan(model = 'female', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
[7]:
quantized_husein_melgan = malaya_speech.vocoder.melgan(model = 'husein', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Load Multiband MelGAN model#

def mbmelgan(model: str = 'female', quantized: bool = False, **kwargs):
    """
    Load Multiband MelGAN Vocoder model.

    Parameters
    ----------
    model : str, optional (default='female')
        Model architecture supported. Allowed values:

        * ``'female'`` - Multiband MelGAN trained on female voice.
        * ``'male'`` - Multiband MelGAN trained on male voice.
        * ``'husein'`` - Multiband MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
        * ``'haqkiem'`` - Multiband MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/

    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        The quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.vocoder.load function
    """
[8]:
mbmelgan = malaya_speech.vocoder.mbmelgan(model = 'female')
[9]:
quantized_mbmelgan = malaya_speech.vocoder.mbmelgan(model = 'female', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Load HiFiGAN model#

def hifigan(model: str = 'female', quantized: bool = False, **kwargs):
    """
    Load HiFiGAN Vocoder model.

    Parameters
    ----------
    model : str, optional (default='female')
        Model architecture supported. Allowed values:

        * ``'female'`` - HiFiGAN trained on female voice.
        * ``'male'`` - HiFiGAN trained on male voice.
        * ``'universal-768'`` - Universal HiFiGAN with 768 filters trained on multiple speakers.
        * ``'universal-512'`` - Universal HiFiGAN with 512 filters trained on multiple speakers.


    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        The quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.vocoder.load function
    """
[6]:
hifigan = malaya_speech.vocoder.hifigan(model = 'female')
[7]:
quantized_hifigan = malaya_speech.vocoder.hifigan(model = 'female', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Load data#

This data is from the validation set.

[8]:
import pickle

with open('speech/pickle/example-female.pkl', 'rb') as fopen:
    example = pickle.load(fopen)

example
[8]:
{'mel': array([[-5.135642 , -4.9542203, -5.045578 , ..., -2.9754655, -2.4993045,
         -2.552092 ],
        [-4.9802437, -5.013033 , -4.753161 , ..., -2.8229384, -2.4302876,
         -2.4801488],
        [-5.174154 , -5.3979354, -4.799525 , ..., -2.5164714, -2.5151956,
         -2.750568 ],
        ...,
        [-1.4169824, -1.1434933, -1.3719425, ..., -1.5436271, -1.6565201,
         -1.8572053],
        [-1.5044638, -1.6360878, -1.6556237, ..., -1.5360395, -1.6257277,
         -1.8962083],
        [-2.642538 , -2.923341 , -2.8665295, ..., -2.355686 , -2.3283741,
         -2.5134325]], dtype=float32),
 'audio': array([-6.1828476e-05, -6.1828476e-05,  0.0000000e+00, ...,
         0.0000000e+00,  0.0000000e+00,  0.0000000e+00], dtype=float32)}
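
For reference, a compatible 80-bin log-mel spectrogram can be computed from raw audio with librosa. The FFT size, hop length and log clipping below are assumptions for illustration and may not match the preprocessing used to train these vocoders; in practice the vocoders are only guaranteed to work with mels produced by malaya-speech TTS models:

import librosa
import numpy as np

# hypothetical path; any 22050 Hz mono recording
audio, sr = librosa.load('speech/example.wav', sr = 22050)

# 80 mel bins to match the vocoder input; n_fft / hop_length are assumptions
mel = librosa.feature.melspectrogram(y = audio, sr = sr, n_fft = 1024,
                                     hop_length = 256, n_mels = 80)
log_mel = np.log(np.clip(mel, 1e-5, None)).T.astype(np.float32)  # (time, 80)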

Predict#

def predict(self, inputs):
    """
    Convert mel spectrograms to waveforms.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.FRAME].
    Returns
    -------
    result: List
    """
[11]:
ipd.Audio(example['audio'], rate = 22050)
[11]:
[12]:
%%time

y_ = melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.81 s, sys: 401 ms, total: 2.21 s
Wall time: 651 ms
[12]:
[13]:
np.abs(example['audio'] - y_).mean()
[13]:
0.110681236
[14]:
%%time

y_ = quantized_melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.81 s, sys: 383 ms, total: 2.19 s
Wall time: 688 ms
[14]:
[15]:
np.abs(example['audio'] - y_).mean()
[15]:
0.111384235
[16]:
%%time

y_ = mbmelgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 782 ms, sys: 153 ms, total: 935 ms
Wall time: 340 ms
[16]:
[17]:
np.abs(example['audio'] - y_).mean()
[17]:
0.141786
[18]:
%%time

y_ = quantized_mbmelgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 846 ms, sys: 142 ms, total: 988 ms
Wall time: 356 ms
[18]:
[19]:
np.abs(example['audio'] - y_).mean()
[19]:
0.1441468
[14]:
%%time

y_ = hifigan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.79 s, sys: 260 ms, total: 2.05 s
Wall time: 486 ms
[14]:
[12]:
%%time

y_ = quantized_hifigan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.59 s, sys: 273 ms, total: 1.86 s
Wall time: 449 ms
[12]:
[20]:
with open('speech/pickle/example-husein.pkl', 'rb') as fopen:
    example = pickle.load(fopen)

example
[20]:
{'mel': array([[-3.0871143 , -2.1933894 , -1.9330187 , ..., -1.0192316 ,
         -1.3248271 , -1.0974363 ],
        [-2.6055963 , -2.3006353 , -1.6657838 , ..., -1.0752352 ,
         -0.9072396 , -0.96403134],
        [-1.9718108 , -2.147272  , -2.0438907 , ..., -0.89703083,
         -0.73646724, -0.99775743],
        ...,
        [-1.3376398 , -1.70222   , -1.4850315 , ..., -1.0758084 ,
         -1.4132309 , -0.98251915],
        [-1.1394596 , -1.5324714 , -1.5667722 , ..., -1.1989553 ,
         -1.2888682 , -1.0891267 ],
        [-1.3729159 , -1.4634348 , -1.9626601 , ..., -1.4223598 ,
         -1.1820908 , -1.3431906 ]], dtype=float32),
 'audio': array([0.00092771, 0.00076513, 0.00053338, ..., 0.00252281, 0.00252281,
        0.00252281], dtype=float32)}
[22]:
%%time

y_ = husein_melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.29 s, sys: 325 ms, total: 1.62 s
Wall time: 686 ms
[22]:
[24]:
%%time

y_ = quantized_husein_melgan.predict([example['mel']])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 1.29 s, sys: 218 ms, total: 1.51 s
Wall time: 345 ms
[24]:
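
To keep any of the generated waveforms, write them to disk at the vocoder's 22050 Hz output rate; a minimal sketch using soundfile, an extra dependency not imported above:

import soundfile as sf

# y_[0] is the float32 waveform produced by the vocoder above
sf.write('generated.wav', y_[0], samplerate = 22050)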