Universal MelGAN#

This module synthesizes melspectrograms into waveforms, and these models are able to synthesize multiple speakers.

This tutorial is available as an IPython notebook at malaya-speech/example/universal-melgan.

This module is language independent, so it is safe to use on different languages.

Vocoder description#

  1. Only accepts a mel feature size of 80.

  2. Will generate waveforms with a 22050 Hz sample rate.
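
The two constraints above can be sketched as a quick input check. `check_mel` below is a hypothetical helper for illustration, not part of malaya-speech:

```python
import numpy as np

def check_mel(mel, n_mels=80):
    # These vocoders only accept mel features of size 80,
    # i.e. arrays shaped (time, 80).
    if mel.ndim != 2 or mel.shape[-1] != n_mels:
        raise ValueError(f'expected (T, {n_mels}) mel, got {mel.shape}')
    return mel.astype(np.float32)

mel = check_mel(np.zeros((100, 80)))  # 100 frames, 80 mel bins -> OK
```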

Explanation#

If you use the MelGAN Vocoder from https://malaya-speech.readthedocs.io/en/latest/load-vocoder.html, each speaker needs their own MelGAN Vocoder.

So Universal MelGAN, https://arxiv.org/abs/2011.09631, solves this problem: a single model is able to synthesize any melspectrogram into a waveform.

[1]:
import malaya_speech
import numpy as np

List available MelGAN#

[2]:
malaya_speech.vocoder.available_melgan()
[2]:
                Size (MB)  Quantized Size (MB)  Mel loss
male                 17.3                 4.53    0.4443
female               17.3                 4.53    0.4434
husein               17.3                 4.53    0.4442
haqkiem              17.3                 4.53    0.4819
yasmin               17.3                 4.53    0.4867
osman                17.3                 4.53    0.4819
universal           309.0                77.50    0.4463
universal-1024       78.4                19.90    0.4591
universal-384        11.3                 3.06    0.4445

Load MelGAN model#

def melgan(model: str = 'female', quantized: bool = False, **kwargs):
    """
    Load MelGAN Vocoder model.

    Parameters
    ----------
    model : str, optional (default='female')
        Model architecture supported. Allowed values:

        * ``'female'`` - MelGAN trained on female voice.
        * ``'male'`` - MelGAN trained on male voice.
        * ``'husein'`` - MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
        * ``'haqkiem'`` - MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
        * ``'yasmin'`` - MelGAN trained on Yasmin voice.
        * ``'osman'`` - MelGAN trained on Osman voice.
        * ``'universal'`` - Universal MelGAN trained on multiple speakers.
        * ``'universal-1024'`` - Universal MelGAN with 1024 filters trained on multiple speakers.
        * ``'universal-384'`` - Universal MelGAN with 384 filters trained on multiple speakers.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.vocoder.load function
    """
[3]:
melgan = malaya_speech.vocoder.melgan(model = 'universal')
quantized_melgan = malaya_speech.vocoder.melgan(model = 'universal', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
  warnings.warn('An interactive session is already active. This can '
[4]:
melgan_1024 = malaya_speech.vocoder.melgan(model = 'universal-1024')
quantized_melgan_1024 = malaya_speech.vocoder.melgan(model = 'universal-1024', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Load some examples#

We use specific STFT parameters and steps to convert waveforms into melspectrograms during training; otherwise these Universal MelGAN models will not work. Our steps,

  1. Convert the waveform into a melspectrogram.

  2. Take log10 of that melspectrogram.

  3. Normalize using a global mean and std.

The models should also be able to train without the global norm.

So, to reuse the same steps, use the malaya_speech.featurization.universal_mel function.
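
For intuition, steps 2 and 3 can be sketched in plain NumPy. The real STFT parameters and global statistics live inside malaya_speech.featurization.universal_mel, so the values below are placeholders, not the library's actual numbers:

```python
import numpy as np

def log_and_normalize(linear_mel, global_mean, global_std, eps=1e-10):
    # Step 2: take log10 of the melspectrogram (clipped to avoid log(0)).
    log_mel = np.log10(np.maximum(eps, linear_mel))
    # Step 3: normalize with the global mean and std used during training.
    return (log_mel - global_mean) / global_std

# Step 1 (waveform -> melspectrogram) is stood in for by random magnitudes;
# the zero mean / unit std below are placeholders for the real statistics.
linear_mel = np.abs(np.random.randn(100, 80))
normed = log_and_normalize(linear_mel, np.zeros(80), np.ones(80))
```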

[5]:
y, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav', sr = 22050)
mel = malaya_speech.featurization.universal_mel(y)
[6]:
import IPython.display as ipd

ipd.Audio(y, rate = 22050)
[6]:
[7]:
%%time

y_ = melgan.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 24.2 s, sys: 3.11 s, total: 27.3 s
Wall time: 6.18 s
[7]:
[8]:
%%time

y_ = quantized_melgan.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 23.4 s, sys: 2.47 s, total: 25.9 s
Wall time: 5.25 s
[8]:
[9]:
%%time

y_ = melgan_1024.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 6.79 s, sys: 930 ms, total: 7.72 s
Wall time: 1.85 s
[9]:
[10]:
%%time

y_ = quantized_melgan_1024.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 6.7 s, sys: 841 ms, total: 7.54 s
Wall time: 1.71 s
[10]:
[11]:
# try english audio
y, sr = malaya_speech.load('speech/44k/test-2.wav', sr = 22050)
y = y[:sr * 4]
mel = malaya_speech.featurization.universal_mel(y)
ipd.Audio(y, rate = 22050)
[11]:
[12]:
%%time

y_ = melgan.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 25.1 s, sys: 2.16 s, total: 27.2 s
Wall time: 4.39 s
[12]:
[13]:
%%time

y_ = melgan_1024.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 6.81 s, sys: 725 ms, total: 7.54 s
Wall time: 1.34 s
[13]:

Combine with FastSpeech2 TTS#

[14]:
female_v2 = malaya_speech.tts.fastspeech2(model = 'female-v2')
haqkiem = malaya_speech.tts.fastspeech2(model = 'haqkiem')
/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
[15]:
string = 'husein busuk masam ketiak pun masam tapi nasib baik comel'
[16]:
%%time

r_female_v2 = female_v2.predict(string)
CPU times: user 667 ms, sys: 181 ms, total: 848 ms
Wall time: 606 ms
[17]:
%%time

r_haqkiem = haqkiem.predict(string)
CPU times: user 1.11 s, sys: 323 ms, total: 1.43 s
Wall time: 1.07 s
[18]:
y_ = melgan(r_female_v2['universal-output'])
ipd.Audio(y_, rate = 22050)
[18]:
[19]:
y_ = melgan_1024(r_female_v2['universal-output'])
ipd.Audio(y_, rate = 22050)
[19]:
[20]:
y_ = melgan(r_haqkiem['universal-output'])
ipd.Audio(y_, rate = 22050)
[20]:
[21]:
y_ = melgan_1024(r_haqkiem['universal-output'])
ipd.Audio(y_, rate = 22050)
[21]:
[22]:
string = 'kau ni apehal bodoh? nak gaduh ke siaaaal'
[23]:
%%time

r_female_v2 = female_v2.predict(string)
CPU times: user 187 ms, sys: 34.7 ms, total: 221 ms
Wall time: 60.7 ms
[24]:
%%time

r_haqkiem = haqkiem.predict(string)
CPU times: user 325 ms, sys: 58 ms, total: 383 ms
Wall time: 71.5 ms
[25]:
y_ = melgan(r_female_v2['universal-output'])
ipd.Audio(y_, rate = 22050)
[25]:
[26]:
y_ = melgan_1024(r_female_v2['universal-output'])
ipd.Audio(y_, rate = 22050)
[26]:
[27]:
y_ = melgan(r_haqkiem['universal-output'])
ipd.Audio(y_, rate = 22050)
[27]:
[28]:
y_ = melgan_1024(r_haqkiem['universal-output'])
ipd.Audio(y_, rate = 22050)
[28]: