Text-to-Speech Singlish#

Text to melspectrogram using Tacotron2, FastSpeech2 and GlowTTS, and text to waveform using VITS, trained on the Singapore National Speech Corpus, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus

This tutorial is available as an IPython notebook at malaya-speech/example/tts-singlish.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

Tacotron2 description#

  1. Malaya-speech Tacotron2 will generate melspectrogram with feature size 80.

  2. Use Malaya-speech vocoder to convert melspectrogram to waveform.

FastSpeech2 description#

  1. Malaya-speech FastSpeech2 will generate melspectrogram with feature size 80.

  2. Use Malaya-speech vocoder to convert melspectrogram to waveform.

  3. Cannot generate a melspectrogram longer than 2000 timesteps; it will throw an error. Make sure the texts are not too long.
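To get a feel for the 2000-timestep limit, the frame count can be converted to seconds of audio. This is a rough sketch only: the hop size of 256 samples per mel frame is an assumption that this tutorial does not state, and `frames_to_seconds` is a hypothetical helper, so check the model configuration before relying on the number.

```python
# Rough conversion from mel frames to audio duration.
# ASSUMPTION: hop_size=256 is a common setting for 22050 Hz TTS models,
# but it is not confirmed by this tutorial.
def frames_to_seconds(n_frames, hop_size=256, sample_rate=22050):
    """Return the approximate audio length in seconds for n_frames mel frames."""
    return n_frames * hop_size / sample_rate


limit_seconds = frames_to_seconds(2000)
print(f'2000 frames is about {limit_seconds:.1f} seconds of audio')
```

Under that assumption the limit corresponds to roughly 23 seconds of speech, which is why long paragraphs should be split before prediction.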

GlowTTS description#

  1. Malaya-speech GlowTTS will generate melspectrogram with feature size 80.

  2. Use Malaya-speech vocoder to convert melspectrogram to waveform.

VITS description#

  1. Malaya-speech VITS generates end-to-end, from text input to waveforms at a 22050 Hz sample rate.

  2. No length limit, but to get better results, split the text.
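One simple way to split the text is at sentence punctuation, packing short sentences together up to a character budget. The sketch below is an illustration, not part of malaya-speech: `split_sentences` and the `max_chars` default are hypothetical, and the regular expression only handles `.`, `!` and `?`.

```python
import re


def split_sentences(text, max_chars=200):
    """Split text at sentence punctuation, then pack sentences into chunks
    of at most max_chars characters (a single longer sentence stays whole)."""
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_chars:
            # The budget would be exceeded, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = 'Hello there. How are you today? I am fine!'
for chunk in split_sentences(text, max_chars=20):
    print(chunk)
```

Each chunk can then be passed to `predict` on its own, and the resulting waveforms played or concatenated in order.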

List available Tacotron2#

Model | Size (MB) | Quantized Size (MB) | Understand punctuation | Is lowercase
male | 104 | 26.3 | True | True
female | 104 | 26.3 | True | True
husein | 104 | 26.3 | True | True
haqkiem | 104 | 26.3 | True | True
female-singlish | 104 | 26.3 | True | True
yasmin | 104 | 26.3 | True | False
osman | 104 | 26.3 | True | False

List available FastSpeech2#

Model | Size (MB) | Quantized Size (MB) | Understand punctuation | Is lowercase
male | 125 | 31.7 | True | True
female | 125 | 31.7 | True | True
husein | 125 | 31.7 | True | True
haqkiem | 125 | 31.7 | True | True
female-singlish | 125 | 31.7 | True | True
osman | 125 | 31.7 | True | False
yasmin | 125 | 31.7 | True | False
yasmin-sdp | 128 | 33.1 | True | False
osman-sdp | 128 | 33.1 | True | False

List available GlowTTS#

Model | Size (MB) | Quantized Size (MB) | Understand punctuation | Is lowercase
male | 119 | 27.6 | True | True
female | 119 | 27.6 | True | True
haqkiem | 119 | 27.6 | True | True
female-singlish | 119 | 27.6 | True | True
yasmin | 119 | 27.6 | True | False
osman | 119 | 27.6 | True | False
multispeaker | 404 | 79.9 | True | True

List available VITS#

Model | Size (MB) | Understand punctuation | Is lowercase
mesolitica/VITS-osman | 145 | True | False
mesolitica/VITS-yasmin | 145 | True | False
mesolitica/VITS-female-singlish | 145 | True | True
mesolitica/VITS-haqkiem | 145 | True | True

Load Tacotron2 model#

Read more about Tacotron2 model at https://malaya-speech.readthedocs.io/en/latest/tts-tacotron2-model.html

import malaya_speech

tacotron = malaya_speech.tts.tacotron2(model = 'female-singlish')
quantized_tacotron = malaya_speech.tts.tacotron2(model = 'female-singlish', quantized = True)

Load FastSpeech2 model#

Read more about FastSpeech2 model at https://malaya-speech.readthedocs.io/en/latest/tts-fastspeech2-model.html

fastspeech = malaya_speech.tts.fastspeech2(model = 'female-singlish')
quantized_fastspeech = malaya_speech.tts.fastspeech2(model = 'female-singlish', quantized = True)

Load GlowTTS model#

Read more about GlowTTS model at https://malaya-speech.readthedocs.io/en/latest/tts-glowtts-model.html

glowtts = malaya_speech.tts.glowtts(model = 'female-singlish')
quantized_glowtts = malaya_speech.tts.glowtts(model = 'female-singlish', quantized = True)

Load VITS model#

Read more about VITS model at https://malaya-speech.readthedocs.io/en/latest/tts-vits-model.html

vits = malaya_speech.tts.vits(model = 'mesolitica/VITS-female-singlish')

Predict Tacotron2#

def predict(self, string):
    """
    Change string to Mel.

    Parameters
    ----------
    string: str

    Returns
    -------
    result: Dict[string, decoder-output, mel-output, universal-output, alignment]
    """

It can only predict one text per feed-forward pass.

string1 = 'PETALING JAYA: Former prime minister Najib Razak has criticised the Inland Revenue Board’s (LHDN) move to serve him a bankruptcy notice, which his legal team had earlier called a political ploy.'

import matplotlib.pyplot as plt
import numpy as np

r = tacotron.predict(string1)
quantized_r = quantized_tacotron.predict(string1)

# Attention alignment of the full-precision model.
fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(111)
ax.set_title('Female Singlish Attention alignment steps')
im = ax.imshow(r['alignment'], aspect='auto', interpolation='none')
fig.colorbar(im, ax=ax)
plt.xlabel('Decoder timestep')
plt.ylabel('Encoder timestep')

# Attention alignment of the quantized model.
fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(111)
ax.set_title('Female Singlish Attention alignment steps')
im = ax.imshow(quantized_r['alignment'], aspect='auto', interpolation='none')
fig.colorbar(im, ax=ax)
plt.xlabel('Decoder timestep')
plt.ylabel('Encoder timestep')

# Predicted melspectrogram of the full-precision model.
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(311)
ax1.set_title('Female Singlish predicted Mel-Spectrogram')
im = ax1.imshow(np.rot90(r['mel-output']), aspect='auto', interpolation='none')
fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1)

# Predicted melspectrogram of the quantized model.
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(311)
ax1.set_title('Female Singlish predicted Mel-Spectrogram')
im = ax1.imshow(np.rot90(quantized_r['mel-output']), aspect='auto', interpolation='none')
fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1)

Predict FastSpeech2#

def predict(
    self,
    string,
    speed_ratio: float = 1.0,
    f0_ratio: float = 1.0,
    energy_ratio: float = 1.0,
):
    """
    Change string to Mel.

    Parameters
    ----------
    string: str
    speed_ratio: float, optional (default=1.0)
        Increasing this value increases the duration of the generated voice.
    f0_ratio: float, optional (default=1.0)
        Increasing this value raises the frequency; a lower frequency produces a deeper voice.
    energy_ratio: float, optional (default=1.0)
        Increasing this value increases loudness.

    Returns
    -------
    result: Dict[string, decoder-output, universal-output, mel-output]
    """

It can only predict one text per feed-forward pass.
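The effect of `speed_ratio` can be checked by predicting the same text at several ratios and comparing the number of mel frames. The sketch below is hypothetical: `sweep_speed` is not part of malaya-speech, it assumes the first axis of `mel-output` is time, and `FakeFastSpeech` is a made-up stand-in so the example runs without downloading a model.

```python
def sweep_speed(model, text, speed_ratios=(0.8, 1.0, 1.2)):
    """Predict the same text at several speed ratios.

    Returns a dict mapping each ratio to the number of mel frames,
    read from the first axis of 'mel-output' (assumed to be time).
    """
    frames = {}
    for ratio in speed_ratios:
        result = model.predict(text, speed_ratio=ratio)
        frames[ratio] = len(result['mel-output'])
    return frames


class FakeFastSpeech:
    """Stand-in model: 10 frames per word, scaled by speed_ratio (made-up rule)."""

    def predict(self, string, speed_ratio=1.0, f0_ratio=1.0, energy_ratio=1.0):
        n_frames = int(round(10 * len(string.split()) * speed_ratio))
        return {'string': string, 'mel-output': [[0.0] * 80 for _ in range(n_frames)]}


print(sweep_speed(FakeFastSpeech(), 'one two three'))
```

With a real model, pass the loaded `fastspeech` object instead of the stand-in; a larger ratio should give more frames and therefore longer audio.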


r_fastspeech = fastspeech.predict(string1)
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(311)
ax1.set_title('Female Singlish predicted Mel-Spectrogram')
im = ax1.imshow(np.rot90(r_fastspeech['mel-output']), aspect='auto', interpolation='none')
fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1)

Predict GlowTTS#

def predict(
    self,
    string,
    temperature: float = 0.3333,
    length_ratio: float = 1.0,
):
    """
    Change string to Mel.

    Parameters
    ----------
    string: str
    temperature: float, optional (default=0.3333)
        The decoder decodes from encoder(text) + random.normal() * temperature.
    length_ratio: float, optional (default=1.0)
        Increasing this value increases the duration of the generated voice.

    Returns
    -------
    result: Dict[string, ids, mel-output, alignment, universal-output]
    """

It can only predict one text per feed-forward pass.
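The role of `temperature` can be illustrated in isolation with NumPy. This is a sketch of the formula in the docstring only, not the malaya-speech implementation: `sample_latent` is a hypothetical helper and the zero array merely stands in for the encoder output.

```python
import numpy as np


def sample_latent(mean, temperature=0.3333, seed=0):
    """Return mean + noise * temperature, with noise from a standard normal."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(mean.shape)
    return mean + noise * temperature


mean = np.zeros((1000, 80))  # stand-in for encoder(text)
for temperature in (0.0, 0.3333, 1.0):
    z = sample_latent(mean, temperature)
    # The spread of the sampled latent grows with the temperature.
    print(temperature, float(z.std()))
```

A temperature of 0 returns the encoder output unchanged, so the result is deterministic; higher values add more variation to the generated speech.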


r_glowtts = glowtts.predict(string1)
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(311)
ax1.set_title('Female Singlish predicted Mel-Spectrogram')
im = ax1.imshow(np.rot90(r_glowtts['mel-output']), aspect='auto', interpolation='none')
fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=ax1)

Predict VITS#

def predict(
    self,
    string,
    temperature: float = 0.6666,
    temperature_durator: float = 0.6666,
    length_ratio: float = 1.0,
):
    """
    Change string to waveform.

    Parameters
    ----------
    string: str
    temperature: float, optional (default=0.6666)
        The decoder decodes from encoder(text) + random.normal() * temperature.
    temperature_durator: float, optional (default=0.6666)
        The durator predicts the alignment with random.normal() * temperature_durator.
    length_ratio: float, optional (default=1.0)
        Increasing this value increases the duration of the generated voice.

    Returns
    -------
    result: Dict[string, ids, alignment, y]
    """

It can only predict one text per feed-forward pass.
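To see why a larger `length_ratio` produces longer audio, consider how duration-based TTS models commonly apply such a ratio: each token's predicted duration is multiplied by the ratio and rounded up. The sketch below is a generic illustration under that assumption; it is not confirmed to be what malaya-speech does internally, and `scale_durations` is a hypothetical helper.

```python
import numpy as np


def scale_durations(durations, length_ratio=1.0):
    """Multiply per-token durations (in frames) by length_ratio and round up."""
    scaled = np.ceil(np.asarray(durations, dtype=float) * length_ratio)
    return scaled.astype(int)


durations = [3, 5, 2]  # made-up frame counts for three tokens
for ratio in (0.5, 1.0, 2.0):
    total = int(scale_durations(durations, ratio).sum())
    print(ratio, total)
```

Doubling the ratio doubles the total frame count here, while halving it gives slightly more than half because of the rounding.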


r_vits = vits.predict(string1)
fig = plt.figure(figsize = (8, 6))
ax = fig.add_subplot(111)
ax.set_title('Female Singlish Attention alignment steps')
im = ax.imshow(r_vits['alignment'], aspect='auto', interpolation='none')
fig.colorbar(im, ax=ax)
plt.xlabel('Decoder timestep')
plt.ylabel('Encoder timestep')

Load Vocoder model#

There is only one way to synthesize the melspectrogram output of the Singlish TTS models:

  1. Use universal MelGAN with the universal-output from the TTS model. Read more at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

import IPython.display as ipd

vocoder = malaya_speech.vocoder.melgan(model = 'universal-1024')
# Tacotron2
y_ = vocoder(r['universal-output'])
ipd.Audio(y_, rate = 22050)
# Quantized Tacotron2
y_ = vocoder(quantized_r['universal-output'])
ipd.Audio(y_, rate = 22050)
# FastSpeech2
y_ = vocoder(r_fastspeech['universal-output'])
ipd.Audio(y_, rate = 22050)
# GlowTTS
y_ = vocoder(r_glowtts['universal-output'])
ipd.Audio(y_, rate = 22050)
# VITS returns the waveform directly, so no vocoder is needed
ipd.Audio(r_vits['y'], rate = 22050)
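The calls above follow one rule: VITS results already contain the waveform under `y`, while the other models need their `universal-output` passed through the vocoder. A small helper can capture that rule. `to_waveform` is hypothetical and not part of malaya-speech, and the fake vocoder exists only so the sketch runs without a model.

```python
def to_waveform(result, vocoder=None):
    """Return audio samples for a TTS result dict.

    VITS results already carry the waveform under 'y'; the other models
    return a melspectrogram under 'universal-output' that needs a vocoder.
    """
    if 'y' in result:
        return result['y']
    if vocoder is None:
        raise ValueError('a vocoder is required for melspectrogram outputs')
    return vocoder(result['universal-output'])


def fake_vocoder(mel):
    """Pretend vocoder for the demo: 256 silent samples per mel frame."""
    return [0.0] * (len(mel) * 256)


vits_like = {'y': [0.1, 0.2]}
mel_like = {'universal-output': [[0.0] * 80 for _ in range(3)]}
print(len(to_waveform(vits_like)))
print(len(to_waveform(mel_like, fake_vocoder)))
```

With real models, `to_waveform(r_vits)` and `to_waveform(r, vocoder)` would return arrays that can be passed to `ipd.Audio(..., rate = 22050)`.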

Predict Bahasa text#

string = 'husein zolkepli sangatlah comel, ketiak wangi dan mempunyai baby yang comel. Husein juga suka mandi pada waktu pagi dan petang sambil menggunakan sabun lifeboy.'

r = tacotron.predict(string)
quantized_r = quantized_tacotron.predict(string)
r_fastspeech = fastspeech.predict(string)
r_vits = vits.predict(string)
y_ = vocoder(r['universal-output'])
ipd.Audio(y_, rate = 22050)
y_ = vocoder(quantized_r['universal-output'])
ipd.Audio(y_, rate = 22050)
y_ = vocoder(r_fastspeech['universal-output'])
ipd.Audio(y_, rate = 22050)
ipd.Audio(r_vits['y'], rate = 22050)
