Text-to-Speech FastSpeech2#

FastSpeech2, Text to Melspectrogram.

This tutorial is available as an IPython notebook at malaya-speech/example/tts-fastspeech2.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of the malaya-speech Pipeline; read more about the malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:

import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import matplotlib.pyplot as plt
import IPython.display as ipd

FastSpeech2 description#

Malaya-speech FastSpeech2 generates a melspectrogram with feature size 80.
Use a Malaya-speech vocoder to convert the melspectrogram to a waveform.
It cannot generate a melspectrogram longer than 2000 timesteps and will throw an error if asked to, so make sure the input texts are not too long.
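
Because a single call is limited to 2000 melspectrogram timesteps, a long passage is best split before synthesis. The sketch below is a minimal sentence-based splitter. `split_text` is a hypothetical helper, not part of malaya-speech, and the `max_chars` budget is only an assumed proxy for the frame limit, since the real ratio of characters to frames depends on the voice and the speed setting.

```python
import re


def split_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    max_chars is an assumed stand-in for the 2000-timestep limit, not a
    value taken from the library. A single sentence longer than max_chars
    is kept whole rather than cut mid-sentence.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_text('Ayat pertama. Ayat kedua! Ayat ketiga?', max_chars=20)
print(chunks)  # ['Ayat pertama.', 'Ayat kedua!', 'Ayat ketiga?']
```

Each chunk can then be passed to the model separately and the resulting waveforms concatenated.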

List available FastSpeech2#

[2]:

malaya_speech.tts.available_fastspeech2()

[2]:

	Size (MB)	Quantized Size (MB)	Understand punctuation	Is lowercase
male	125	31.7	True	True
female	125	31.7	True	True
husein	125	31.7	True	True
haqkiem	125	31.7	True	True
female-singlish	125	31.7	True	True
osman	125	31.7	True	False
yasmin	125	31.7	True	False
yasmin-sdp	128	33.1	True	False
osman-sdp	128	33.1	True	False

The husein voice was contributed by Husein-Zolkepli, recorded using a low-end microphone in a small room with no reverberation absorber.

The haqkiem voice was contributed by Haqkiem Hamdan, recorded using a high-end microphone in an audio studio.

The female-singlish voice was contributed by SG National Speech Corpus, recorded using a high-end microphone in an audio studio.

Load FastSpeech2 model#

FastSpeech2 uses the text normalizer from Malaya, https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer.

Make sure you install Malaya version > 4.0 for it to work; for better speech synthesis, make sure the Malaya version is > 4.9.1:

pip install malaya -U
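
The version requirement can also be checked in code before loading a model. The sketch below compares dotted version strings as integer tuples. `parse_version` and `is_newer` are hypothetical helpers written for this example, and the sample version strings are illustrative only.

```python
import re


def parse_version(version):
    """Turn '4.9.1' into (4, 9, 1), stopping at the first non-numeric piece."""
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_newer(installed, required):
    """Return True when installed is strictly greater than required."""
    return parse_version(installed) > parse_version(required)


print(is_newer('4.9.2', '4.9.1'))   # True
print(is_newer('4.9.1', '4.9.1'))   # False
print(is_newer('4.10.0', '4.9.1'))  # True
```

In practice the installed version string could be read with the standard-library call `importlib.metadata.version('malaya')`.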

def fastspeech2(
    model: str = 'male',
    quantized: bool = False,
    pad_to: int = 8,
    **kwargs
):
    """
    Load Fastspeech2 TTS model.

    Parameters
    ----------
    model : str, optional (default='male')
        Model architecture supported. Allowed values:

        * ``'female'`` - Fastspeech2 trained on female voice.
        * ``'male'`` - Fastspeech2 trained on male voice.
        * ``'husein'`` - Fastspeech2 trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
        * ``'haqkiem'`` - Fastspeech2 trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
        * ``'yasmin'`` - Fastspeech2 trained on female Yasmin voice.
        * ``'osman'`` - Fastspeech2 trained on male Osman voice.
        * ``'female-singlish'`` - Fastspeech2 trained on female Singlish voice, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus

    quantized : bool, optional (default=False)
        if True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.
    pad_to : int, optional (default=8)
        size of padding with 0. Increasing it can stabilize prediction on short sentences; we trained on 8.

    Returns
    -------
    result : malaya_speech.model.synthesis.Fastspeech class
    """
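
One plausible reading of `pad_to`, sketched below under that assumption, is that the encoded character ids are right-padded with 0 until the sequence length is a multiple of `pad_to`. The helper `pad_ids` and the example ids are hypothetical; they are not taken from the malaya-speech source and only illustrate the idea.

```python
def pad_ids(ids, pad_to=8, pad_value=0):
    """Right-pad a list of ids with pad_value up to a multiple of pad_to.

    Assumption: this mirrors what `pad_to` does conceptually; the library's
    actual padding logic may differ.
    """
    ids = list(ids)
    remainder = len(ids) % pad_to
    if remainder:
        ids.extend([pad_value] * (pad_to - remainder))
    return ids


padded = pad_ids([12, 7, 30, 4, 9], pad_to=8)
print(padded)  # [12, 7, 30, 4, 9, 0, 0, 0]
```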

[3]:

male = malaya_speech.tts.fastspeech2(model = 'osman')

[4]:

# https://www.sinarharian.com.my/article/115216/BERITA/Politik/Syed-Saddiq-pertahan-Dr-Mahathir
string1 = 'Syed Saddiq berkata, mereka seharusnya mengingati bahawa semasa menjadi Perdana Menteri Pakatan Harapan'

Predict#

def predict(
    self,
    string,
    speed_ratio: float = 1.0,
    f0_ratio: float = 1.0,
    energy_ratio: float = 1.0,
):
    """
    Change string to Mel.

    Parameters
    ----------
    string: str
    speed_ratio: float, optional (default=1.0)
        Increasing this value increases the duration of the generated voice.
    f0_ratio: float, optional (default=1.0)
        Increasing this value increases the frequency; a lower frequency generates a deeper voice.
    energy_ratio: float, optional (default=1.0)
        Increasing this value increases loudness.

    Returns
    -------
    result: Dict[string, decoder-output, universal-output, mel-output]
    """
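
In the FastSpeech2 architecture, a variance adaptor predicts duration, pitch (f0) and energy values, and a length regulator repeats each token's hidden state according to its duration. The sketch below shows how three scalar ratios could scale such predictions. It is a conceptual illustration with made-up numbers, not malaya-speech's code: `apply_ratios` and `length_regulate` are hypothetical, and treating `speed_ratio` as a multiplier on durations is an assumption chosen to match the docstring.

```python
import numpy as np


def apply_ratios(durations, f0, energy,
                 speed_ratio=1.0, f0_ratio=1.0, energy_ratio=1.0):
    """Scale per-token duration, pitch and energy predictions by scalar ratios."""
    # Durations are frame counts, so round back to integers after scaling.
    scaled_durations = np.round(np.asarray(durations) * speed_ratio).astype(np.int32)
    scaled_f0 = np.asarray(f0, dtype=np.float32) * f0_ratio
    scaled_energy = np.asarray(energy, dtype=np.float32) * energy_ratio
    return scaled_durations, scaled_f0, scaled_energy


def length_regulate(hidden, durations):
    """Repeat each token's hidden vector durations[i] times along the time axis."""
    return np.repeat(hidden, durations, axis=0)


hidden = np.arange(6, dtype=np.float32).reshape(3, 2)  # 3 tokens, 2 features
durations, f0, energy = apply_ratios(
    [2, 4, 2], [100.0, 120.0, 110.0], [1.0, 1.0, 1.0], speed_ratio=1.5
)
frames = length_regulate(hidden, durations)
print(durations.tolist(), frames.shape)  # [3, 6, 3] (12, 2)
```

With `speed_ratio=1.5` the three tokens expand from 8 to 12 frames, which is why a larger ratio yields longer audio.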

It can only predict 1 text per feed-forward pass.

[6]:

r_male = male.predict(string1)

[7]:

fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(311)
ax1.set_title('Predicted Mel-Spectrogram')
im = ax1.imshow(np.rot90(r_male['mel-output']), aspect='auto', interpolation='none')
fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1)
plt.show()

Load Vocoder model#

There are 2 ways to synthesize the melspectrogram output from TTS models:

1. If you are going to use an individual speaker vocoder, make sure the speakers are the same; for example, if you use female tacotron2, you also need to use female MelGAN. Use mel-output from the TTS model. Read more at https://malaya-speech.readthedocs.io/en/latest/load-vocoder.html
2. If you are going to use the universal MelGAN, use universal-output from the TTS model. Read more at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

We prefer the individual speaker vocoder: it is only 17MB and faster than the universal vocoder.
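
The rule above, that a speaker-specific vocoder takes `mel-output` while the universal vocoder takes `universal-output`, can be captured in a small lookup. `vocoder_input` and the labels `'speaker'` and `'universal'` are hypothetical names chosen for this sketch, and the dictionary stands in for the result of a real `predict` call.

```python
def vocoder_input(result, vocoder_kind):
    """Pick the TTS output that matches the kind of vocoder in use."""
    keys = {'speaker': 'mel-output', 'universal': 'universal-output'}
    if vocoder_kind not in keys:
        raise ValueError(f'vocoder_kind must be one of {sorted(keys)}')
    return result[keys[vocoder_kind]]


# Stand-in for the dictionary returned by predict(); real values are arrays.
fake_result = {'mel-output': 'mel', 'universal-output': 'universal-mel'}
print(vocoder_input(fake_result, 'universal'))  # universal-mel
```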

[8]:

universal_melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')

[9]:

y_ = universal_melgan(r_male['universal-output'])
ipd.Audio(y_, rate = 22050)
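
`ipd.Audio` only plays the waveform inside the notebook. To keep the result, the samples can be written to a WAV file. The sketch below uses numpy and the standard-library `wave` module. It assumes the vocoder returns a 1-D float array in the range [-1, 1], which is an assumption rather than something stated here, and `save_wav` is a hypothetical helper. A synthetic sine wave stands in for `y_`.

```python
import os
import tempfile
import wave

import numpy as np


def save_wav(path, samples, rate=22050):
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    samples = np.clip(np.asarray(samples, dtype=np.float32), -1.0, 1.0)
    pcm = (samples * 32767).astype('<i2')  # little-endian int16
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(rate)
        f.writeframes(pcm.tobytes())


# One second of a 440 Hz tone as a stand-in for the vocoder output.
t = np.arange(22050) / 22050
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

path = os.path.join(tempfile.gettempdir(), 'fastspeech2-example.wav')
save_wav(path, tone)

# Read the header back to confirm what was written.
with wave.open(path, 'rb') as f:
    written_rate, written_frames = f.getframerate(), f.getnframes()
print(written_rate, written_frames)  # 22050 22050
```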

Playing around with speed, f0 and energy ratio#

[10]:

r_male = male.predict(string1, speed_ratio = 1.5)

[12]:

y_ = universal_melgan(r_male['universal-output'])
ipd.Audio(y_, rate = 22050)

[17]:

r_male = male.predict(string1, f0_ratio = -1.5)

[18]:

y_ = universal_melgan(r_male['universal-output'])
ipd.Audio(y_, rate = 22050)

Combine everything using Pipeline#

[25]:

p = Pipeline()
pipeline = (
    p.map(male)
    .map(lambda x: x['universal-output'])
    .map(universal_melgan)
)
p.visualize()
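
Conceptually, the chain above feeds the output of each `map` step into the next one. The plain-Python sketch below mimics that data flow with stand-in functions. It does not reproduce the malaya-speech `Pipeline` class, and `compose`, `fake_tts` and `fake_vocoder` are hypothetical names used only for illustration.

```python
from functools import reduce


def compose(*steps):
    """Return a function that applies each step to the previous step's output."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def fake_tts(text):
    # Stand-in for the TTS model: one "frame" per word, valued by word length.
    return {'universal-output': [len(word) for word in text.split()]}


def fake_vocoder(mel):
    # Stand-in for the vocoder: doubles every frame value.
    return [frame * 2 for frame in mel]


run = compose(fake_tts, lambda x: x['universal-output'], fake_vocoder)
audio = run('nasib baik kacak')
print(audio)  # [10, 8, 10]
```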

[26]:

r = p('husein wangi tetapi ketiak masam nasib baik kacak')

[27]:

ipd.Audio(r['vocoder-melgan'], rate = 22050)

Compare speed with Tacotron2#

[28]:

# https://www.hmetro.com.my/mutakhir/2020/12/657604/6-cadangan-tangani-kelemahan-kawal-selia-halal

text = 'Kuasa pensijilan halal, dan penguatkuasaan halal terletak di bawah bidang kuasa agensi yang berbeza.'

[29]:

male_tacotron2 = malaya_speech.tts.tacotron2(model = 'osman')

[34]:

%%time

r_male_tacotron2 = male_tacotron2.predict(text)

CPU times: user 1.66 s, sys: 52.8 ms, total: 1.71 s
Wall time: 1.42 s

[31]:

y_ = universal_melgan(r_male_tacotron2['universal-output'])
ipd.Audio(y_, rate = 22050)

[32]:

%%time

r_male = male.predict(text)

CPU times: user 593 ms, sys: 52.4 ms, total: 646 ms
Wall time: 118 ms

[33]:

y_ = universal_melgan(r_male['universal-output'])
ipd.Audio(y_, rate = 22050)
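
From the wall times reported above (1.42 s for Tacotron2 and 118 ms for FastSpeech2), FastSpeech2 is roughly 12 times faster on this sentence on the machine used for the notebook. Outside a notebook, where `%%time` is unavailable, a similar measurement can be taken with `time.perf_counter`. `time_call` is a hypothetical helper, and `sum` stands in for a model's `predict` method.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn several times and return its last result with the best wall time."""
    best = float('inf')
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


# Wall times copied from the two %%time cells, in seconds.
tacotron2_wall = 1.42
fastspeech2_wall = 0.118
speedup = tacotron2_wall / fastspeech2_wall
print(f'{speedup:.1f}x')  # 12.0x

# Stand-in workload; replace sum(...) with a real predict call when available.
result, seconds = time_call(sum, range(1000))
```

Taking the best of several runs reduces the influence of one-off costs such as the first call warming up the model.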
