Text-to-Speech FastPitch#

FastPitch: text to melspectrogram.

This tutorial is available as an IPython notebook at malaya-speech/example/tts-fastpitch.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import matplotlib.pyplot as plt
import IPython.display as ipd

FastPitch description#

  1. Malaya-speech FastPitch will generate melspectrogram with feature size 80.

  2. Use Malaya-speech vocoder to convert melspectrogram to waveform.

  3. Cannot generate a melspectrogram longer than 2000 timestamps; it will throw an error. Make sure the texts are not too long.
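Because of the 2000-timestamp limit, one practical workaround is to split long input into shorter chunks and synthesize each chunk separately. A minimal sketch (the helper name and the `max_chars` heuristic are illustrative, not part of malaya-speech; character count is only a rough proxy for melspectrogram length):

```python
import re

def split_into_chunks(text, max_chars=200):
    """Naively split text on sentence boundaries so each chunk stays short.

    max_chars is a rough proxy for the 2000-timestamp melspectrogram limit;
    tune it for your own texts and speaking rate.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_chars:
            # Current chunk is full, start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `predict` individually and the resulting waveforms concatenated.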

List available FastPitch#

[2]:
malaya_speech.tts.available_fastpitch()
[2]:
          Size (MB)  Quantized Size (MB)  Understand punctuation  Is lowercase
male      123        31.1                 True                    True
female    123        31.1                 True                    True
husein    123        31.1                 True                    True
haqkiem   123        31.1                 True                    True

husein voice contributed by Husein-Zolkepli, recorded using a low-end microphone in a small room with no reverberation absorber.

haqkiem voice contributed by Haqkiem Hamdan, recorded using a high-end microphone in an audio studio.

Load FastPitch model#

FastPitch uses the text normalizer from Malaya, https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer.

Make sure you install Malaya version > 4.0 to make it work; for better speech synthesis, use Malaya version > 4.9.1:

pip install malaya -U
def fastpitch(
    model: str = 'male',
    quantized: bool = False,
    pad_to: int = 8,
    **kwargs
):
    """
    Load Fastpitch TTS model.

    Parameters
    ----------
    model : str, optional (default='male')
        Model architecture supported. Allowed values:

        * ``'female'`` - Fastpitch trained on female voice.
        * ``'male'`` - Fastpitch trained on male voice.
        * ``'husein'`` - Fastpitch trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
        * ``'haqkiem'`` - Fastpitch trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster; it depends entirely on the machine.

    pad_to : int, optional (default=8)
        number of characters to pad with 0. Increasing this can stabilize predictions on short sentences; we trained with 8.

    Returns
    -------
    result : malaya_speech.model.tf.Fastpitch class
    """
[3]:
male = malaya_speech.tts.fastpitch(model = 'male')
female = malaya_speech.tts.fastpitch(model = 'female')
husein = malaya_speech.tts.fastpitch(model = 'husein')
haqkiem = malaya_speech.tts.fastpitch(model = 'haqkiem')
INFO:root:running tts/fastpitch-male using device /device:CPU:0
INFO:root:running tts/fastpitch-female using device /device:CPU:0
INFO:root:running tts/fastpitch-husein using device /device:CPU:0
INFO:root:running tts/fastpitch-haqkiem using device /device:CPU:0
[4]:
# https://www.sinarharian.com.my/article/115216/BERITA/Politik/Syed-Saddiq-pertahan-Dr-Mahathir
string1 = 'Syed Saddiq berkata, mereka seharusnya mengingati bahawa semasa menjadi Perdana Menteri Pakatan Harapan.'

Predict#

def predict(
    self,
    string,
    speed_ratio: float = 1.0,
    pitch_ratio: float = 1.0,
    pitch_addition: float = 0.0,
    **kwargs,
):
    """
    Change string to Mel.

    Parameters
    ----------
    string: str
    speed_ratio: float, optional (default=1.0)
        Increasing this value will increase the duration of the generated voice.
    pitch_ratio: float, optional (default=1.0)
        pitch = pitch * pitch_ratio, amplify existing pitch contour.
    pitch_addition: float, optional (default=0.0)
        pitch = pitch + pitch_addition, change pitch contour.

    Returns
    -------
    result: Dict[string, decoder-output, mel-output, pitch-output, universal-output]
    """

It is only able to predict one text per feed-forward.
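Since `predict` accepts a single string, synthesizing several texts means looping. A thin wrapper (a hypothetical helper, not part of malaya-speech) makes this explicit:

```python
def synthesize_batch(model_predict, strings, **kwargs):
    """Call a single-string predict function once per text.

    model_predict is any callable with the same signature as
    malaya-speech's Fastpitch.predict, e.g. male.predict.
    Extra kwargs (speed_ratio, pitch_ratio, ...) apply to every text.
    """
    return [model_predict(s, **kwargs) for s in strings]
```

Usage would look like `results = synthesize_batch(male.predict, [string1, string2], speed_ratio=1.0)`.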

[5]:
%%time

r_male = male.predict(string1)
CPU times: user 3.35 s, sys: 1.64 s, total: 4.99 s
Wall time: 4.59 s
[6]:
%%time

r_female = female.predict(string1)
CPU times: user 3.42 s, sys: 1.62 s, total: 5.03 s
Wall time: 4.46 s
[7]:
%%time

r_husein = husein.predict(string1)
CPU times: user 3.31 s, sys: 1.59 s, total: 4.9 s
Wall time: 4.4 s
[8]:
%%time

r_haqkiem = haqkiem.predict(string1)
CPU times: user 3.32 s, sys: 1.56 s, total: 4.88 s
Wall time: 4.35 s
[9]:
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(311)
ax1.set_title('Male Predicted Mel-Spectrogram')
im = ax1.imshow(np.rot90(r_male['mel-output']), aspect='auto', interpolation='none')
fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1)
plt.show()
_images/tts-fastpitch-model_18_0.png
[10]:
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(311)
ax1.set_title('Female Predicted Mel-Spectrogram')
im = ax1.imshow(np.rot90(r_female['mel-output']), aspect='auto', interpolation='none')
fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1)
plt.show()
_images/tts-fastpitch-model_19_0.png
[11]:
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(311)
ax1.set_title('Husein Predicted Mel-Spectrogram')
im = ax1.imshow(np.rot90(r_husein['mel-output']), aspect='auto', interpolation='none')
fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1)
plt.show()
_images/tts-fastpitch-model_20_0.png
[12]:
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(311)
ax1.set_title('Haqkiem Predicted Mel-Spectrogram')
im = ax1.imshow(np.rot90(r_haqkiem['mel-output']), aspect='auto', interpolation='none')
fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1)
plt.show()
_images/tts-fastpitch-model_21_0.png
[13]:
fig = plt.figure(figsize=(10, 3))
original_male_pitch = r_male['pitch-output']
original_female_pitch = r_female['pitch-output']
original_husein_pitch = r_husein['pitch-output']
original_haqkiem_pitch = r_haqkiem['pitch-output']

plt.plot(original_male_pitch, label = 'male')
plt.plot(original_female_pitch, label = 'female')
plt.plot(original_husein_pitch, label = 'husein')
plt.plot(original_haqkiem_pitch, label = 'haqkiem')
plt.legend()
plt.show()
_images/tts-fastpitch-model_22_0.png

Load Vocoder model#

There are 2 ways to synthesize the melspectrogram output from TTS models:

  1. If you are going to use an individual speaker vocoder, make sure the speakers are the same; if you use female Tacotron2, you need to use female MelGAN as well. Use mel-output from the TTS model. Read more at https://malaya-speech.readthedocs.io/en/latest/load-vocoder.html

  2. If you are going to use the universal MelGAN, use universal-output from the TTS model. Read more at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
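The rule above can be encoded as a small guard (a hypothetical helper, not part of malaya-speech) so the wrong key is never fed to a vocoder:

```python
def vocoder_input(tts_result, universal=True):
    """Pick the melspectrogram matching the vocoder type.

    Universal MelGAN expects 'universal-output'; a speaker-specific
    vocoder expects 'mel-output'.
    """
    key = 'universal-output' if universal else 'mel-output'
    if key not in tts_result:
        raise KeyError(f'TTS result has no {key!r}; got keys: {list(tts_result)}')
    return tts_result[key]
```

For example, `universal_melgan(vocoder_input(r_male))` would always use the universal output.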

[14]:
universal_melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
INFO:root:running vocoder-melgan/universal-1024 using device /device:CPU:0
[15]:
y_ = universal_melgan(r_male['universal-output'])
ipd.Audio(y_, rate = 22050)
[15]:
[16]:
y_ = universal_melgan(r_female['universal-output'])
ipd.Audio(y_, rate = 22050)
[16]:
[17]:
y_ = universal_melgan(r_husein['universal-output'])
ipd.Audio(y_, rate = 22050)
[17]:
[18]:
y_ = universal_melgan(r_haqkiem['universal-output'])
ipd.Audio(y_, rate = 22050)
[18]:

Playing around with speed, pitch ratio and pitch addition#

[35]:
r_male = male.predict(string1, pitch_ratio = 2.0, pitch_addition = 1.0)
y_ = universal_melgan(r_male['universal-output'])
ipd.Audio(y_, rate = 22050)
[35]:
[19]:
r_husein = husein.predict(string1, speed_ratio = 1.5)
[20]:
y_ = universal_melgan(r_husein['universal-output'])
ipd.Audio(y_, rate = 22050)
[20]:
[21]:
r_haqkiem = haqkiem.predict(string1, speed_ratio = 1.5)
[22]:
y_ = universal_melgan(r_haqkiem['universal-output'])
ipd.Audio(y_, rate = 22050)
[22]:
[23]:
r_husein = husein.predict(string1, pitch_ratio = 1.2)
[24]:
y_ = universal_melgan(r_husein['universal-output'])
ipd.Audio(y_, rate = 22050)
[24]:
[25]:
r_haqkiem = haqkiem.predict(string1, pitch_ratio = 2.0)
[26]:
y_ = universal_melgan(r_haqkiem['universal-output'])
ipd.Audio(y_, rate = 22050)
[26]:
[27]:
r_haqkiem = haqkiem.predict(string1, pitch_ratio = -0.5)
[28]:
y_ = universal_melgan(r_haqkiem['universal-output'])
ipd.Audio(y_, rate = 22050)
[28]:
[29]:
r_haqkiem = haqkiem.predict(string1, pitch_ratio = 1.0, pitch_addition = 1.1)
[30]:
y_ = universal_melgan(r_haqkiem['universal-output'])
ipd.Audio(y_, rate = 22050)
[30]:
[31]:
r_haqkiem = haqkiem.predict(string1, pitch_ratio = 1.0, pitch_addition = -1.3)
[32]:
y_ = universal_melgan(r_haqkiem['universal-output'])
ipd.Audio(y_, rate = 22050)
[32]: