Universal HiFiGAN#
Synthesize melspectrograms into waveforms; these models are able to synthesize multiple speakers.
This tutorial is available as an IPython notebook at malaya-speech/example/universal-hifigan.
This module is language independent, so it is safe to use on different languages.
Vocoder description#
Only accepts mel features of size 80.
Will generate waveforms at a 22050 Hz sample rate.
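The two constraints above can be checked before calling the vocoder. A minimal sketch, assuming the mel feature is a NumPy array of shape (frames, 80) as produced by the featurization step later in this tutorial; `check_mel` is a hypothetical helper, not part of malaya-speech:

```python
import numpy as np

MEL_SIZE = 80        # the only mel feature size the vocoder accepts
SAMPLE_RATE = 22050  # sample rate of the generated waveform

def check_mel(mel):
    """Raise if a mel feature cannot be fed to the universal vocoder."""
    mel = np.asarray(mel)
    if mel.ndim != 2 or mel.shape[-1] != MEL_SIZE:
        raise ValueError(f'expected shape (frames, {MEL_SIZE}), got {mel.shape}')
    return mel

mel = np.zeros((200, 80), dtype = np.float32)
check_mel(mel)  # passes: 200 frames, 80 mel bins
```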
Explanation#
If you use the HiFiGAN Vocoder from https://malaya-speech.readthedocs.io/en/latest/load-vocoder.html, each speaker gets their own HiFiGAN Vocoder.
For the universal models, we basically scaled up the model size and trained on multiple speakers.
[1]:
import malaya_speech
import numpy as np
List available HiFiGAN#
[2]:
malaya_speech.vocoder.available_hifigan()
[2]:
Size (MB) | Quantized Size (MB) | Mel loss | |
---|---|---|---|
male | 8.8 | 2.49 | 0.4650 |
female | 8.8 | 2.49 | 0.5547 |
universal-1024 | 170.0 | 42.90 | 0.3346 |
universal-768 | 72.8 | 18.50 | 0.3617 |
universal-512 | 32.6 | 8.60 | 0.3253 |
Load HiFiGAN model#
def hifigan(model: str = 'universal-768', quantized: bool = False, **kwargs):
"""
Load HiFiGAN Vocoder model.
Parameters
----------
model : str, optional (default='universal-768')
Model architecture supported. Allowed values:
* ``'female'`` - HiFiGAN trained on female voice.
* ``'male'`` - HiFiGAN trained on male voice.
* ``'universal-1024'`` - Universal HiFiGAN with 1024 filters trained on multiple speakers.
* ``'universal-768'`` - Universal HiFiGAN with 768 filters trained on multiple speakers.
* ``'universal-512'`` - Universal HiFiGAN with 512 filters trained on multiple speakers.
quantized : bool, optional (default=False)
if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends on the machine.
Returns
-------
result : malaya_speech.supervised.vocoder.load function
"""
[20]:
model_768 = malaya_speech.vocoder.hifigan(model = 'universal-768')
quantized_model_768 = malaya_speech.vocoder.hifigan(model = 'universal-768', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
[21]:
model_512 = malaya_speech.vocoder.hifigan(model = 'universal-512')
quantized_model_512 = malaya_speech.vocoder.hifigan(model = 'universal-512', quantized = True)
Load some examples#
We used specific STFT parameters and steps to convert waveforms into melspectrograms during training; otherwise these universal vocoder models will not work. Our steps:
Convert into a melspectrogram.
Apply log10 to that melspectrogram.
Normalize using global mean and std.
The models should also be able to train without global normalization.
So, to reuse the same steps, use the malaya_speech.featurization.universal_mel
function.
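The log10 and normalization steps above can be sketched as follows. This is a minimal sketch only: the actual STFT parameters, mel filter bank, and global mean/std values live inside malaya_speech.featurization.universal_mel, and the statistics below are made up for illustration:

```python
import numpy as np

def sketch_universal_mel(mel_power, mean, std):
    """Apply the log10 + global normalization steps to a mel power spectrogram.

    mel_power : np.ndarray of shape (frames, 80), linear mel power values.
    mean, std : per-bin global statistics of shape (80,) (hypothetical values).
    """
    log_mel = np.log10(np.maximum(mel_power, 1e-10))  # log10, clipped to avoid -inf
    return (log_mel - mean) / std                     # global mean/std normalization

# toy example with made-up statistics
mel_power = np.abs(np.random.randn(100, 80)) + 1e-3
normalized = sketch_universal_mel(mel_power, np.zeros(80), np.ones(80))
print(normalized.shape)  # (100, 80)
```

In practice, always call malaya_speech.featurization.universal_mel so the statistics match what the models were trained with.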
[6]:
y, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav', sr = 22050)
mel = malaya_speech.featurization.universal_mel(y)
[7]:
import IPython.display as ipd
ipd.Audio(y, rate = 22050)
[7]:
[11]:
%%time
y_ = model_768.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 6.89 s, sys: 597 ms, total: 7.49 s
Wall time: 1.63 s
[11]:
[12]:
%%time
y_ = quantized_model_768.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 6.93 s, sys: 617 ms, total: 7.55 s
Wall time: 1.5 s
[12]:
[13]:
%%time
y_ = model_512.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 3.36 s, sys: 604 ms, total: 3.97 s
Wall time: 696 ms
[13]:
[15]:
%%time
y_ = quantized_model_512.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 3.43 s, sys: 605 ms, total: 4.04 s
Wall time: 760 ms
[15]:
[16]:
# try English audio
y, sr = malaya_speech.load('speech/44k/test-2.wav', sr = 22050)
y = y[:sr * 4]
mel = malaya_speech.featurization.universal_mel(y)
ipd.Audio(y, rate = 22050)
[16]:
[17]:
%%time
y_ = model_768.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 7.11 s, sys: 598 ms, total: 7.71 s
Wall time: 1.56 s
[17]:
[18]:
%%time
y_ = model_512.predict([mel])
ipd.Audio(y_[0], rate = 22050)
CPU times: user 3.42 s, sys: 583 ms, total: 4.01 s
Wall time: 789 ms
[18]: