Text-to-Speech End-to-End FastSpeech2#

FastSpeech2 + Neural Vocoder Generator, End-to-End.

This tutorial is available as an IPython notebook at malaya-speech/example/tts-e2e-fastspeech2.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of the malaya-speech Pipeline; read more about it at malaya-speech/example/pipeline.

[1]:

import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

[2]:

import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import matplotlib.pyplot as plt
import IPython.display as ipd

End-to-End FastSpeech2 description#

Malaya-speech End-to-End FastSpeech2 generates waveforms directly from text input, at a 22050 Hz sample rate.
It cannot generate a mel-spectrogram longer than 2000 timesteps and will throw an error if that limit is exceeded, so make sure the input text is not too long.
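One way to stay under the limit is to split long input into sentence-sized chunks and synthesize each chunk separately. The sketch below is a minimal, library-independent helper. The name `split_text` and the character budget `max_chars` are assumptions for illustration: the 2000-timestep limit applies to the generated mel-spectrogram, not to the character count, so a safe budget has to be found by experiment.

```python
import re


def split_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    max_chars is a hypothetical budget, not a value taken from malaya-speech.
    A single sentence longer than max_chars is kept whole.
    """
    # Break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_text('Ayat pertama. Ayat kedua! Ayat ketiga?', max_chars=20)
print(chunks)
```

Each chunk can then be passed to the model on its own, and the resulting waveforms joined afterwards.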

List available End-to-End FastSpeech2#

[3]:

malaya_speech.tts.available_e2e_fastspeech2()

[3]:

	Size (MB)	Quantized Size (MB)	Understand punctuation	Is lowercase
osman	167	43.3	True	False
yasmin	167	43.3	True	False

Load End-to-End FastSpeech2 model#

FastSpeech2 uses the text normalizer from Malaya, https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer.

Make sure you install Malaya version > 4.0 for it to work. For better speech synthesis, use Malaya version >= 4.9.1:

pip install malaya -U
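To confirm which of the two thresholds an environment meets, the installed version can be read with the standard library. This is a minimal sketch: `parse_version` and `malaya_support` are hypothetical helper names, the labels they return are illustrative, and the parser only handles plain dotted versions.

```python
from importlib import metadata


def parse_version(version):
    """Turn '4.9.1' into (4, 9, 1); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split('.')[:3]:
        digits = ''
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    # Pad short versions such as '5' or '4.9' to three components.
    return tuple(parts + [0] * (3 - len(parts)))


def malaya_support(version):
    """Classify a Malaya version against the thresholds stated above."""
    parsed = parse_version(version)
    if parsed <= (4, 0, 0):
        return 'unsupported'   # needs version > 4.0
    if parsed < (4, 9, 1):
        return 'works'         # runs, but below the recommended 4.9.1
    return 'recommended'


try:
    print(malaya_support(metadata.version('malaya')))
except metadata.PackageNotFoundError:
    print('malaya is not installed')
```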

def e2e_fastspeech2(
    model: str = 'osman',
    quantized: bool = False,
    pad_to: int = 8,
    **kwargs,
):
    """
    Load Fastspeech2 Text-to-Mel TTS model.

    Parameters
    ----------
    model : str, optional (default='osman')
        Model architecture supported. Allowed values:

        * ``'yasmin'`` - Fastspeech2 trained on female Yasmin voice.
        * ``'osman'`` - Fastspeech2 trained on male Osman voice.

    quantized : bool, optional (default=False)
        If True, load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.
    pad_to : int, optional (default=8)
        Size used to pad the input characters with 0. Increasing it can stabilize prediction on short sentences; the models were trained with 8.

    Returns
    -------
    result : malaya_speech.model.synthesis.E2E_FastSpeech class
    """

[5]:

osman = malaya_speech.tts.e2e_fastspeech2(model = 'osman')
yasmin = malaya_speech.tts.e2e_fastspeech2(model = 'yasmin')

[6]:

# https://www.sinarharian.com.my/article/115216/BERITA/Politik/Syed-Saddiq-pertahan-Dr-Mahathir
string1 = 'Syed Saddiq berkata, mereka seharusnya mengingati bahawa semasa menjadi Perdana Menteri Pakatan Harapan'

Predict#

def predict(
    self,
    string,
    speed_ratio: float = 1.0,
    f0_ratio: float = 1.0,
    energy_ratio: float = 1.0,
    temperature_durator: float = 0.6666,
    **kwargs,
):
    """
    Change string to Mel.

    Parameters
    ----------
    string: str
    speed_ratio: float, optional (default=1.0)
        Increasing this value lengthens the generated speech.
    f0_ratio: float, optional (default=1.0)
        Increasing this value raises the pitch; a lower value produces a deeper voice.
    energy_ratio: float, optional (default=1.0)
        Increasing this value increases loudness.
    temperature_durator: float, optional (default=0.6666)
        The durator predicts alignment using random.normal() * temperature_durator.

    Returns
    -------
    result: Dict[string, ids, y]
    """

It can only predict one text per feed-forward pass.
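Several texts therefore have to be synthesized with one call each. A minimal sketch of such a loop follows; `predict_many` is a hypothetical helper, and the only assumption about the model is the `predict(string, **kwargs)` method shown above.

```python
def predict_many(model, texts, **kwargs):
    """Call model.predict once per text and collect the result dicts in order."""
    results = []
    for text in texts:
        # Each call is a separate feed-forward pass over a single string.
        results.append(model.predict(text, **kwargs))
    return results


# Intended usage with a loaded model:
# results = predict_many(osman, ['Selamat pagi.', 'Apa khabar?'], speed_ratio = 1.0)
```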

[7]:

r_osman = osman.predict(string1)
r_osman.keys()

[7]:

dict_keys(['string', 'ids', 'y'])
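Since the output is sampled at 22050 Hz, the length of the synthesized audio follows directly from the number of samples in `y`. The sketch below assumes `y` is a one-dimensional waveform, which is how it is passed to the audio player in this tutorial; `duration_seconds` is a hypothetical helper, and it is a convenient way to see the effect of `speed_ratio` on the same text.

```python
SAMPLE_RATE = 22050  # Hz, the output rate stated in the description above


def duration_seconds(result, sample_rate=SAMPLE_RATE):
    """Length of the synthesized audio in seconds, from the 'y' waveform."""
    return len(result['y']) / sample_rate


# Stand-in result holding one second of silence; a real call would be
# duration_seconds(osman.predict(string1)).
example = {'string': 'contoh', 'ids': [], 'y': [0.0] * SAMPLE_RATE}
print(duration_seconds(example))  # 1.0
```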

[8]:

ipd.Audio(r_osman['y'], rate = 22050)
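Outside a notebook, the waveform can be written to disk instead of played inline. The following is a sketch using only the Python standard library; `save_wav` is a hypothetical helper, and it assumes `y` holds float samples in the range [-1.0, 1.0], which this tutorial does not state explicitly. Values outside that range are clipped.

```python
import struct
import wave


def save_wav(path, samples, sample_rate=22050):
    """Write float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack('<h', int(clipped * 32767))
    with wave.open(path, 'wb') as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Intended usage with a prediction result:
# save_wav('osman.wav', r_osman['y'])
```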

[9]:

r_yasmin = yasmin.predict(string1)
r_yasmin.keys()

[9]:

dict_keys(['string', 'ids', 'y'])

[10]:

ipd.Audio(r_yasmin['y'], rate = 22050)

[11]:

string2 = 'Haqkiem adalah pelajar tahun akhir yang mengambil Ijazah Sarjana Muda Sains Komputer Kecerdasan Buatan utama dari Universiti Teknikal Malaysia Melaka (UTeM) yang kini berusaha untuk latihan industri di mana dia secara praktikal dapat menerapkan pengetahuannya dalam Perisikan Perisian dan Pengaturcaraan ke arah organisasi atau industri yang berkaitan.'

[12]:

r_osman = osman.predict(string2)
r_osman.keys()

[12]:

dict_keys(['string', 'ids', 'y'])

[13]:

ipd.Audio(r_osman['y'], rate = 22050)

[14]:

r_yasmin = yasmin.predict(string2)
r_yasmin.keys()

[14]:

dict_keys(['string', 'ids', 'y'])

[15]:

ipd.Audio(r_yasmin['y'], rate = 22050)

[20]:

string3 = """
Profesor di Fakulti Pengajian Umum dan Pendidikan Lanjutan di Universiti Sultan Zainal Abidin (UniSZA), Dr Mohd Ridhuan Tee Abdullah, berkata rakyat kini menaruh harapan supaya pemimpin sedia dapat membawa negara keluar daripada kemelut yang tidak berkesudahan.
"""

[21]:

r_osman = osman.predict(string3)
r_osman.keys()

[21]:

dict_keys(['string', 'ids', 'y'])

[22]:

ipd.Audio(r_osman['y'], rate = 22050)

[23]:

r_yasmin = yasmin.predict(string3)
r_yasmin.keys()

[23]:

dict_keys(['string', 'ids', 'y'])

[24]:

ipd.Audio(r_yasmin['y'], rate = 22050)
