Universal HiFiGAN#

Synthesize melspectrograms into waveforms; these models are able to synthesize multiple speakers.

This tutorial is available as an IPython notebook at malaya-speech/example/universal-hifigan.

This module is language independent, so it is safe to use on different languages.

Vocoder description#

  1. Only accepts mel features of size 80.

  2. Generates waveforms at a 22050 Hz sample rate (see the sketch below).
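
To make this input/output contract concrete, here is a minimal sketch (our own illustration, not part of the original tutorial). The random mel only demonstrates the expected (T, 80) shape, so the resulting waveform is just noise:

import malaya_speech
import numpy as np

# Any model from this module follows the same contract; universal-512 is the
# smallest, used here purely for illustration.
model = malaya_speech.vocoder.hifigan(model = 'universal-512')

# The mel input must be time-major with exactly 80 feature bins: shape (T, 80).
T = 500
mel = np.random.uniform(size = (T, 80)).astype(np.float32)

# predict accepts a list of mels and returns the corresponding waveforms,
# always at a 22050 Hz sample rate.
y = model.predict([mel])[0]
print(y.shape)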

Explanation#

If you use the HiFiGAN Vocoder from https://malaya-speech.readthedocs.io/en/latest/load-vocoder.html, each speaker has their own HiFiGAN Vocoder.

So for the universal models we scaled up the model size and trained on multiple speakers.

[1]:
import malaya_speech
import numpy as np

List available HiFiGAN#

[2]:
malaya_speech.vocoder.available_hifigan()
[2]:
                Size (MB)  Quantized Size (MB)  Mel loss
male                  8.8                 2.49    0.4650
female                8.8                 2.49    0.5547
universal-1024      170.0                42.90    0.3346
universal-768        72.8                18.50    0.3617
universal-512        32.6                 8.60    0.3253

Load HiFiGAN model#

def hifigan(model: str = 'universal-768', quantized: bool = False, **kwargs):
    """
    Load HiFiGAN Vocoder model.

    Parameters
    ----------
    model : str, optional (default='universal-768')
        Model architecture supported. Allowed values:

        * ``'female'`` - HiFiGAN trained on female voice.
        * ``'male'`` - HiFiGAN trained on male voice.
        * ``'universal-1024'`` - Universal HiFiGAN with 1024 filters trained on multiple speakers.
        * ``'universal-768'`` - Universal HiFiGAN with 768 filters trained on multiple speakers.
        * ``'universal-512'`` - Universal HiFiGAN with 512 filters trained on multiple speakers.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.vocoder.load function
    """
[20]:
model_768 = malaya_speech.vocoder.hifigan(model = 'universal-768')
quantized_model_768 = malaya_speech.vocoder.hifigan(model = 'universal-768', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
[21]:
model_512 = malaya_speech.vocoder.hifigan(model = 'universal-512')
quantized_model_512 = malaya_speech.vocoder.hifigan(model = 'universal-512', quantized = True)

Load some examples#

We used specific STFT parameters and steps to convert waveforms into melspectrograms during the training session; without them, these universal HiFiGAN models will not work. Our steps:

  1. Convert the waveform into a melspectrogram.

  2. Take log10 of that melspectrogram.

  3. Normalize using a global mean and std.

The models should also be able to train without global normalization.

So, to reuse the same steps, use the malaya_speech.featurization.universal_mel function.
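
For intuition only, here is a rough librosa-based sketch of those three steps. The STFT parameters (n_fft, hop_length) and the global statistics below are assumed placeholders for illustration, not the exact values universal_mel uses internally, so always call the library function for real inputs:

import librosa
import numpy as np

def universal_mel_sketch(y, sr = 22050, n_fft = 1024, hop_length = 256,
                         n_mels = 80, mean = 0.0, std = 1.0):
    # 1. Convert the waveform into a melspectrogram.
    mel = librosa.feature.melspectrogram(
        y = y, sr = sr, n_fft = n_fft, hop_length = hop_length, n_mels = n_mels
    )
    # 2. Take log10, with a small floor to avoid log(0).
    log_mel = np.log10(np.maximum(1e-10, mel))
    # 3. Normalize using a global mean and std (placeholders here;
    #    malaya-speech ships its own precomputed statistics).
    normed = (log_mel - mean) / std
    # Return time-major frames, shape (T, 80), as the vocoder expects.
    return normed.T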

[6]:
y, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav', sr = 22050)
mel = malaya_speech.featurization.universal_mel(y)
[7]:
import IPython.display as ipd

ipd.Audio(y, rate = 22050)
[7]:
[11]:
%%time

y_ = model_768.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 6.89 s, sys: 597 ms, total: 7.49 s
Wall time: 1.63 s
[11]:
[12]:
%%time

y_ = quantized_model_768.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 6.93 s, sys: 617 ms, total: 7.55 s
Wall time: 1.5 s
[12]:
[13]:
%%time

y_ = model_512.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 3.36 s, sys: 604 ms, total: 3.97 s
Wall time: 696 ms
[13]:
[15]:
%%time

y_ = quantized_model_512.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 3.43 s, sys: 605 ms, total: 4.04 s
Wall time: 760 ms
[15]:
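
As an informal check of what the quantization warning means in practice, you can compare the full and quantized universal-512 outputs on the same mel input. This comparison is our own addition, not a benchmark from the tutorial:

# Both models are loaded above and receive the same mel, so the outputs
# are directly comparable.
full = model_512.predict([mel])[0]
quantized = quantized_model_512.predict([mel])[0]

# Lengths should match for the same input; min() is just a defensive guard.
n = min(len(full), len(quantized))
print('mean absolute difference:', np.mean(np.abs(full[:n] - quantized[:n])))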
[16]:
# try English audio
y, sr = malaya_speech.load('speech/44k/test-2.wav', sr = 22050)
y = y[:sr * 4]
mel = malaya_speech.featurization.universal_mel(y)
ipd.Audio(y, rate = 22050)
[16]:
[17]:
%%time

y_ = model_768.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 7.11 s, sys: 598 ms, total: 7.71 s
Wall time: 1.56 s
[17]:
[18]:
%%time

y_ = model_512.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 3.42 s, sys: 583 ms, total: 4.01 s
Wall time: 789 ms
[18]: