More examples FastSpeech2#

This tutorial is available as an IPython notebook at malaya-speech/example/tts-more-fastspeech2.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of the malaya-speech Pipeline; read more about it at malaya-speech/example/pipeline.

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import matplotlib.pyplot as plt
import IPython.display as ipd

List available FastSpeech2#

[2]:
malaya_speech.tts.available_fastspeech2()
[2]:
                 Size (MB)  Quantized Size (MB)  Understand punctuation  Is lowercase
male                   125                 31.7                    True          True
female                 125                 31.7                    True          True
husein                 125                 31.7                    True          True
haqkiem                125                 31.7                    True          True
female-singlish        125                 31.7                    True          True
osman                  125                 31.7                    True         False
yasmin                 125                 31.7                    True         False
yasmin-sdp             128                 33.1                    True         False
osman-sdp              128                 33.1                    True         False

The husein voice was contributed by Husein-Zolkepli, recorded using a low-end microphone in a small room with no reverberation absorber.

The haqkiem voice was contributed by Haqkiem Hamdan, recorded using a high-end microphone in an audio studio.

The female-singlish voice was contributed by the SG National Speech Corpus, recorded using a high-end microphone in an audio studio.

Load FastSpeech2 model#

FastSpeech2 uses the text normalizer from Malaya, https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer.

Make sure you install Malaya version > 4.0 for this to work. For better speech synthesis, make sure the Malaya version is > 4.9.1:

pip install malaya -U
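
The two version thresholds above can be checked before loading any model. The following is a minimal sketch, not part of malaya-speech: the helper names `version_tuple` and `malaya_status` are hypothetical, and it assumes Malaya reports a plain dotted version string such as `4.9.1`.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '4.9.1' into (4, 9, 1), padding short versions with zeros."""
    parts = []
    for piece in text.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def malaya_status(installed):
    """Classify a Malaya version against the thresholds stated above."""
    v = version_tuple(installed)
    if v <= (4, 0, 0):
        return 'too old'       # needs version > 4.0
    if v <= (4, 9, 1):
        return 'works'         # runs, but > 4.9.1 gives better synthesis
    return 'recommended'


try:
    installed = version('malaya')
except PackageNotFoundError:
    installed = None

print(installed, malaya_status(installed) if installed else 'not installed')
```

If the status is not `recommended`, upgrading with the `pip` command above is the simplest fix.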
[3]:
male = malaya_speech.tts.fastspeech2(model = 'male')
female = malaya_speech.tts.fastspeech2(model = 'female')
husein = malaya_speech.tts.fastspeech2(model = 'husein')
haqkiem = malaya_speech.tts.fastspeech2(model = 'haqkiem')
[4]:
yasmin = malaya_speech.tts.fastspeech2(model = 'yasmin')
osman = malaya_speech.tts.fastspeech2(model = 'osman')

Load Vocoder model#

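This example uses MelGAN. Make sure the vocoder speaker matches the FastSpeech2 speaker: if you use the female FastSpeech2 model, use the female MelGAN as well. The yasmin and osman models are paired with the universal-1024 MelGAN in this example.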

[4]:
vocoder_male = malaya_speech.vocoder.melgan(model = 'male')
vocoder_female = malaya_speech.vocoder.melgan(model = 'female')
vocoder_husein = malaya_speech.vocoder.melgan(model = 'husein')
vocoder_haqkiem = malaya_speech.vocoder.melgan(model = 'haqkiem')
INFO:root:running vocoder-melgan/male using device /device:CPU:0
INFO:root:running vocoder-melgan/female using device /device:CPU:0
INFO:root:running vocoder-melgan/husein using device /device:CPU:0
INFO:root:running vocoder-melgan/haqkiem using device /device:CPU:0
[5]:
universal_melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
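
The speaker-to-vocoder pairing used in this notebook can be written down as a lookup table, which makes a mismatch harder to introduce by accident. This is only a sketch: `PAIRING` and `vocoder_for` are hypothetical names, and the output keys (`mel-output`, `universal-output`) are taken from the Predict cells of this notebook. Models that this notebook never runs (female-singlish, yasmin-sdp, osman-sdp) are left out rather than guessed.

```python
# FastSpeech2 model name -> (MelGAN model name, key in the predict() result)
PAIRING = {
    'male': ('male', 'mel-output'),
    'female': ('female', 'mel-output'),
    'husein': ('husein', 'mel-output'),
    'haqkiem': ('haqkiem', 'mel-output'),
    'yasmin': ('universal-1024', 'universal-output'),
    'osman': ('universal-1024', 'universal-output'),
}


def vocoder_for(tts_model):
    """Return the (vocoder name, output key) pair used in this notebook."""
    try:
        return PAIRING[tts_model]
    except KeyError:
        raise ValueError(f'no pairing recorded for {tts_model!r}') from None


print(vocoder_for('female'))
print(vocoder_for('osman'))
```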

Predict#

[6]:
string = 'Masa aku kat kuala lumpur, ada main mintak rasuah. Aku cakap kat agensi kerajaan tu. Dia cuma kata tak payah bagi. Tak ambik nama pun. mungkin itu la kot selemah lemah iman, tolak dalam hati tapi tak mampu buat tindakan.'
[6]:
%%time

r_male = male.predict(string)
CPU times: user 4.06 s, sys: 1.71 s, total: 5.77 s
Wall time: 4.46 s
[7]:
%%time

r_female = female.predict(string)
CPU times: user 4.22 s, sys: 1.75 s, total: 5.97 s
Wall time: 4.55 s
[8]:
%%time

r_husein = husein.predict(string)
CPU times: user 4.14 s, sys: 1.76 s, total: 5.91 s
Wall time: 4.62 s
[9]:
%%time

r_haqkiem = haqkiem.predict(string)
CPU times: user 4.14 s, sys: 1.66 s, total: 5.8 s
Wall time: 4.44 s
[10]:
%%time

r_yasmin = yasmin.predict(string)
CPU times: user 1.8 s, sys: 118 ms, total: 1.92 s
Wall time: 320 ms
[9]:
%%time

r_osman = osman.predict(string)
CPU times: user 1.46 s, sys: 143 ms, total: 1.61 s
Wall time: 588 ms
[12]:
y_ = vocoder_male(r_male['mel-output'])
ipd.Audio(y_, rate = 22050)
[12]:
[13]:
y_ = vocoder_female(r_female['mel-output'])
ipd.Audio(y_, rate = 22050)
[13]:
[14]:
y_ = vocoder_husein(r_husein['mel-output'])
ipd.Audio(y_, rate = 22050)
[14]:
[15]:
y_ = vocoder_haqkiem(r_haqkiem['mel-output'])
ipd.Audio(y_, rate = 22050)
[15]:
[11]:
y_ = universal_melgan(r_yasmin['universal-output'])
ipd.Audio(y_, rate = 22050)
[11]:
[12]:
y_ = universal_melgan(r_osman['universal-output'])
ipd.Audio(y_, rate = 22050)
[12]:
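
The cells above repeat the same two steps for every speaker: call `predict` on the FastSpeech2 model, then pass one entry of the result to the matching vocoder. A minimal sketch of that loop follows; `synthesize_all` is a hypothetical helper, not a malaya-speech function, and it assumes only what the cells above show (a `predict` method returning a dictionary, and a callable vocoder).

```python
def synthesize_all(text, speakers):
    """Run text-to-mel and vocoding for several speakers.

    `speakers` maps a label to a (tts_model, vocoder, output_key) tuple,
    where output_key selects which entry of the predict() result is
    passed to the vocoder.
    """
    audio = {}
    for name, (tts_model, vocoder, output_key) in speakers.items():
        result = tts_model.predict(text)         # dictionary of model outputs
        audio[name] = vocoder(result[output_key])  # waveform from the vocoder
    return audio


# With the objects loaded earlier in this notebook, a call could look like:
#
# audio = synthesize_all(string, {
#     'male': (male, vocoder_male, 'mel-output'),
#     'yasmin': (yasmin, universal_melgan, 'universal-output'),
# })
# ipd.Audio(audio['male'], rate = 22050)
```

Keeping the output key next to each model and vocoder keeps the speaker pairing explicit in one place.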
[13]:
string = 'husein busuk masam ketiak pun masam tapi nasib baik comel'
[17]:
%%time

r_male = male.predict(string)
CPU times: user 368 ms, sys: 72.7 ms, total: 440 ms
Wall time: 114 ms
[18]:
%%time

r_female = female.predict(string)
CPU times: user 442 ms, sys: 56.3 ms, total: 498 ms
Wall time: 112 ms
[19]:
%%time

r_husein = husein.predict(string)
CPU times: user 407 ms, sys: 54.4 ms, total: 461 ms
Wall time: 95.9 ms
[20]:
%%time

r_haqkiem = haqkiem.predict(string)
CPU times: user 393 ms, sys: 51 ms, total: 444 ms
Wall time: 92.2 ms
[14]:
%%time

r_yasmin = yasmin.predict(string)
CPU times: user 384 ms, sys: 62.6 ms, total: 446 ms
Wall time: 100 ms
[15]:
%%time

r_osman = osman.predict(string)
CPU times: user 409 ms, sys: 69.1 ms, total: 478 ms
Wall time: 101 ms
[21]:
y_ = vocoder_male(r_male['mel-output'])
ipd.Audio(y_, rate = 22050)
[21]:
[22]:
y_ = vocoder_female(r_female['mel-output'])
ipd.Audio(y_, rate = 22050)
[22]:
[23]:
y_ = vocoder_husein(r_husein['mel-output'])
ipd.Audio(y_, rate = 22050)
[23]:
[24]:
y_ = vocoder_haqkiem(r_haqkiem['mel-output'])
ipd.Audio(y_, rate = 22050)
[24]:
[16]:
y_ = universal_melgan(r_yasmin['universal-output'])
ipd.Audio(y_, rate = 22050)
[16]:
[17]:
y_ = universal_melgan(r_osman['universal-output'])
ipd.Audio(y_, rate = 22050)
[17]:
[18]:
string = 'emel saya ialah husein.zol123456@gmail.com, dan emel ini adalah palsuu'
[26]:
%%time

r_male = male.predict(string)
CPU times: user 652 ms, sys: 55 ms, total: 707 ms
Wall time: 145 ms
[27]:
%%time

r_female = female.predict(string)
CPU times: user 802 ms, sys: 47.5 ms, total: 849 ms
Wall time: 163 ms
[28]:
%%time

r_husein = husein.predict(string)
CPU times: user 867 ms, sys: 63.8 ms, total: 930 ms
Wall time: 171 ms
[29]:
%%time

r_haqkiem = haqkiem.predict(string)
CPU times: user 744 ms, sys: 52.5 ms, total: 796 ms
Wall time: 157 ms
[19]:
%%time

r_yasmin = yasmin.predict(string)
CPU times: user 782 ms, sys: 66.2 ms, total: 848 ms
Wall time: 159 ms
[20]:
%%time

r_osman = osman.predict(string)
CPU times: user 810 ms, sys: 63.6 ms, total: 873 ms
Wall time: 159 ms
[30]:
y_ = vocoder_male(r_male['mel-output'])
ipd.Audio(y_, rate = 22050)
[30]:
[31]:
y_ = vocoder_female(r_female['mel-output'])
ipd.Audio(y_, rate = 22050)
[31]:
[32]:
y_ = vocoder_husein(r_husein['mel-output'])
ipd.Audio(y_, rate = 22050)
[32]:
[33]:
y_ = vocoder_haqkiem(r_haqkiem['mel-output'])
ipd.Audio(y_, rate = 22050)
[33]:
[21]:
y_ = universal_melgan(r_yasmin['universal-output'])
ipd.Audio(y_, rate = 22050)
[21]:
[22]:
y_ = universal_melgan(r_osman['universal-output'])
ipd.Audio(y_, rate = 22050)
[22]:
[24]:
# https://www.sinarharian.com.my/article/116460/BERITA/Nasional/Tiada-isu-kartel-daging-ketika-jadi-PM-Najib
string = 'Najib berkata, walaupun media melaporkan ia telah berlaku sejak 40 tahun lalu, kerajaan Barisan Nasional (BN) tidak pernah menerima apa-apa aduan rasmi berhubung perkara itu.'
[35]:
%%time

r_male = male.predict(string)
CPU times: user 1.12 s, sys: 101 ms, total: 1.22 s
Wall time: 249 ms
[36]:
%%time

r_female = female.predict(string)
CPU times: user 1.28 s, sys: 88.2 ms, total: 1.36 s
Wall time: 254 ms
[37]:
%%time

r_husein = husein.predict(string)
CPU times: user 1.32 s, sys: 95.2 ms, total: 1.41 s
Wall time: 257 ms
[38]:
%%time

r_haqkiem = haqkiem.predict(string)
CPU times: user 1.21 s, sys: 85.7 ms, total: 1.3 s
Wall time: 220 ms
[25]:
%%time

r_yasmin = yasmin.predict(string)
CPU times: user 1.47 s, sys: 98.6 ms, total: 1.57 s
Wall time: 267 ms
[26]:
%%time

r_osman = osman.predict(string)
CPU times: user 1.38 s, sys: 108 ms, total: 1.49 s
Wall time: 250 ms
[39]:
y_ = vocoder_male(r_male['mel-output'])
ipd.Audio(y_, rate = 22050)
[39]:
[40]:
y_ = vocoder_female(r_female['mel-output'])
ipd.Audio(y_, rate = 22050)
[40]:
[41]:
y_ = vocoder_husein(r_husein['mel-output'])
ipd.Audio(y_, rate = 22050)
[41]:
[42]:
y_ = vocoder_haqkiem(r_haqkiem['mel-output'])
ipd.Audio(y_, rate = 22050)
[42]:
[27]:
y_ = universal_melgan(r_yasmin['universal-output'])
ipd.Audio(y_, rate = 22050)
[27]:
[28]:
y_ = universal_melgan(r_osman['universal-output'])
ipd.Audio(y_, rate = 22050)
[28]: