Text-to-Speech VITS Multispeaker
VITS Multispeaker, End-to-End.
This tutorial is available as an IPython notebook at malaya-speech/example/tts-vits-multispeaker.
This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
This is an application of the malaya-speech Pipeline; read more about the malaya-speech Pipeline at malaya-speech/example/pipeline.
[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import matplotlib.pyplot as plt
import IPython.display as ipd
`pyaudio` is not available, `malaya_speech.streaming.pyaudio` is not able to use.
VITS description#
Malaya-speech VITS generates speech end-to-end, from text input to waveforms at a 22050 Hz sample rate.
There is no length limit, but splitting long text into shorter segments gives better results.
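The splitting advice above can be sketched as follows. This is a minimal sketch and not part of malaya-speech: `split_text` is a hypothetical helper that detects sentences naively by punctuation and packs them into chunks under an assumed character budget (`max_chars`, an arbitrary value chosen for illustration).

```python
import re


def split_text(text, max_chars=200):
    """Hypothetical helper: split text into chunks made of whole sentences.

    Sentences are detected naively as '.', '!' or '?' followed by whitespace.
    `max_chars` is an arbitrary budget, not a malaya-speech requirement.
    """
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_text('Ayat pertama. Ayat kedua! Ayat ketiga?', max_chars=20)
print(chunks)
```

A single sentence longer than `max_chars` is kept whole in this sketch; splitting inside a sentence would need a different rule, such as breaking at commas.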
List available VITS#
[3]:
malaya_speech.tts.available_vits()
[3]:
Model | Size (MB) | Understand punctuation | Is lowercase | num speakers |
---|---|---|---|---|
mesolitica/VITS-osman | 145 | True | False | 1 |
mesolitica/VITS-yasmin | 145 | True | False | 1 |
mesolitica/VITS-female-singlish | 145 | True | True | 1 |
mesolitica/VITS-haqkiem | 145 | True | True | 1 |
mesolitica/VITS-orkid | 145 | True | False | 1 |
mesolitica/VITS-bunga | 145 | True | False | 1 |
mesolitica/VITS-jebat | 145 | True | False | 1 |
mesolitica/VITS-tuah | 145 | True | False | 1 |
mesolitica/VITS-male | 145 | True | False | 1 |
mesolitica/VITS-female | 145 | True | False | 1 |
mesolitica/VITS-multispeaker-clean | 159 | True | False | 9 |
mesolitica/VITS-multispeaker-noisy | 159 | True | False | 3 |
Load VITS model#
VITS uses the text normalizer from Malaya, https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer.
Make sure you install Malaya version > 4.0 for it to work; for better speech synthesis, use Malaya version > 4.9.1:
pip install malaya -U
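The version requirements above could be checked from Python before loading a model. This is a minimal sketch using only the standard library; `parse_version` and `is_newer` are hypothetical helpers written for this example, and the comparison is a simple numeric one that ignores pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '4.9.1' into (4, 9, 1); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_newer(installed, minimum):
    """True if `installed` is strictly greater than `minimum`."""
    return parse_version(installed) > parse_version(minimum)


try:
    installed = version('malaya')
except PackageNotFoundError:
    print('malaya is not installed')
else:
    print('normalizer works:', is_newer(installed, '4.0'))
    print('better synthesis:', is_newer(installed, '4.9.1'))
```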
def vits(model: str = 'mesolitica/VITS-osman', **kwargs):
    """
    Load VITS End-to-End TTS model.

    Parameters
    ----------
    model : str, optional (default='mesolitica/VITS-osman')
        Check available models at `malaya_speech.tts.available_vits()`.

    Returns
    -------
    result : malaya_speech.torch_model.synthesis.VITS class
    """
[22]:
osman = malaya_speech.tts.vits(model = 'mesolitica/VITS-osman')
[4]:
model = malaya_speech.tts.vits(model = 'mesolitica/VITS-multispeaker-clean')
[5]:
# https://www.sinarharian.com.my/article/115216/BERITA/Politik/Syed-Saddiq-pertahan-Dr-Mahathir
string1 = 'Syed Saddiq berkata, mereka seharusnya mengingati bahawa semasa menjadi Perdana Menteri Pakatan Harapan'
List available speakers#
[6]:
model.list_sid()
[6]:
{0: 'yasmin',
1: 'osman',
2: 'orkid',
3: 'tuah',
4: 'bunga',
5: 'jebat',
6: 'haqkiem',
7: 'male',
8: 'female'}
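Since `predict` takes a numeric `sid`, it can be convenient to look speakers up by name. A minimal sketch, assuming `model.list_sid()` returns a plain dict like the output above; the dict is copied here so the example runs without loading a model, and `name_to_sid` is a name introduced for this example.

```python
# Copied from the `model.list_sid()` output shown above.
sid_to_name = {
    0: 'yasmin',
    1: 'osman',
    2: 'orkid',
    3: 'tuah',
    4: 'bunga',
    5: 'jebat',
    6: 'haqkiem',
    7: 'male',
    8: 'female',
}

# Invert the mapping so a speaker name resolves to its id.
name_to_sid = {name: sid for sid, name in sid_to_name.items()}

print(name_to_sid['osman'])
```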
Predict#
def predict(
    self,
    string,
    temperature: float = 0.0,
    temperature_durator: float = 0.0,
    length_ratio: float = 1.0,
    sid: int = None,
    **kwargs,
):
    """
    Change string to waveform.

    Parameters
    ----------
    string: str
    temperature: float, optional (default=0.0)
        The decoder decodes from encoder(text) + random.normal() * temperature.
        Changing this value changes the speaking style.
    temperature_durator: float, optional (default=0.0)
        The durator predicts the alignment with random.normal() * temperature_durator.
        Changing this value changes the speaking style.
    length_ratio: float, optional (default=1.0)
        Changing this value changes the number of frames generated.
    sid: int, optional (default=None)
        Speaker ID, only available for multispeaker models.
        Raises an error if sid is None for a multispeaker model.

    Returns
    -------
    result: Dict[string, ids, alignment, y]
    """
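The role of `temperature` in the docstring can be illustrated numerically. This is only a sketch of the formula `encoder(text) + random.normal() * temperature`, not the model's actual code: `sample_latent` is a hypothetical function and `mean` is a made-up stand-in for the encoder output.

```python
import numpy as np


def sample_latent(mean, temperature, seed=0):
    """Illustrative only: mean plus standard normal noise scaled by temperature."""
    rng = np.random.default_rng(seed)
    return mean + rng.standard_normal(mean.shape) * temperature


mean = np.array([0.5, -1.0, 2.0])  # made-up stand-in for encoder(text)

# With temperature 0.0 the noise term vanishes, so the result is deterministic.
print(sample_latent(mean, temperature=0.0))

# A larger temperature moves the sample further from the mean.
print(sample_latent(mean, temperature=0.5))
```

With the default of 0.0 the same text should therefore produce the same latent input on every call, while non-zero values introduce variation.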
It can only predict one text per feed-forward pass.
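Because each call handles a single text, several segments have to be synthesized one at a time and joined afterwards. The sketch below shows one way to do that; `synthesize_segments` is a hypothetical helper, the pause length is arbitrary, and `fake_predict` is a stand-in so the example runs without malaya-speech. It assumes the `'y'` entry of the result is a 1-D waveform.

```python
import numpy as np

SAMPLE_RATE = 22050  # sample rate stated in this tutorial


def synthesize_segments(predict_fn, segments, pause_seconds=0.25, **predict_kwargs):
    """Hypothetical helper: call predict_fn once per segment and join the audio.

    predict_fn is assumed to behave like `model.predict`: it takes one string
    and returns a dict whose 'y' entry is a 1-D waveform.
    """
    pause = np.zeros(int(SAMPLE_RATE * pause_seconds), dtype=np.float32)
    pieces = []
    for index, segment in enumerate(segments):
        if index > 0:
            pieces.append(pause)  # short silence between segments
        result = predict_fn(segment, **predict_kwargs)
        pieces.append(np.asarray(result['y'], dtype=np.float32))
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(pieces)


def fake_predict(string, sid=None):
    # Stand-in for a real model: 0.1 s of silence per input character.
    return {'string': string, 'y': np.zeros((SAMPLE_RATE // 10) * len(string))}


audio = synthesize_segments(fake_predict, ['satu', 'dua'], sid=1)
print(audio.shape[0] / SAMPLE_RATE)  # duration in seconds
```

With a loaded model, the same helper would be called as `synthesize_segments(model.predict, segments, sid=1)`.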
[8]:
r = model.predict(string1, sid = 1)
r.keys()
[8]:
dict_keys(['string', 'ids', 'alignment', 'y'])
[9]:
ipd.Audio(r['y'], rate = 22050)
[9]:
[10]:
r_osman = osman.predict(string1)
r_osman.keys()
[10]:
dict_keys(['string', 'ids', 'alignment', 'y'])
[11]:
ipd.Audio(r_osman['y'], rate = 22050)
[11]:
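To keep the audio outside the notebook, the waveform can be written to a WAV file. A minimal sketch using the standard-library `wave` module and NumPy; `save_wav` is a hypothetical helper, and it assumes the waveform holds floats roughly within [-1, 1], which this tutorial does not state. A generated tone stands in for `r['y']`.

```python
import os
import tempfile
import wave

import numpy as np


def save_wav(path, waveform, sample_rate=22050):
    """Write a float waveform to a 16-bit mono WAV file.

    Assumes samples are floats roughly within [-1, 1]; values outside
    that range are clipped.
    """
    samples = np.clip(np.asarray(waveform, dtype=np.float64), -1.0, 1.0)
    pcm = (samples * 32767).astype('<i2')  # little-endian 16-bit PCM
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())


# One second of a 440 Hz tone as a stand-in for r['y'].
t = np.arange(22050) / 22050
path = os.path.join(tempfile.gettempdir(), 'vits-example.wav')
save_wav(path, 0.5 * np.sin(2 * np.pi * 440 * t))
print(path)
```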
Compare different speakers#
[12]:
s = 'Haqkiem adalah pelajar tahun akhir yang mengambil Ijazah Sarjana Muda Sains Komputer Kecerdasan Buatan utama dari Universiti Teknikal Malaysia Melaka (UTeM) yang kini berusaha untuk latihan industri di mana dia secara praktikal dapat menerapkan pengetahuannya dalam Perisikan Perisian dan Pengaturcaraan ke arah organisasi atau industri yang berkaitan.'
[13]:
r = model.predict(s, sid = 0)
ipd.Audio(r['y'], rate = 22050)
[13]:
[14]:
r = model.predict(s, sid = 1)
ipd.Audio(r['y'], rate = 22050)
[14]:
[15]:
r = model.predict(s, sid = 2)
ipd.Audio(r['y'], rate = 22050)
[15]:
[16]:
r = model.predict(s, sid = 3)
ipd.Audio(r['y'], rate = 22050)
[16]:
[17]:
r = model.predict(s, sid = 4)
ipd.Audio(r['y'], rate = 22050)
[17]:
[18]:
r = model.predict(s, sid = 5)
ipd.Audio(r['y'], rate = 22050)
[18]:
[19]:
r = model.predict(s, sid = 6)
ipd.Audio(r['y'], rate = 22050)
[19]:
[20]:
r = model.predict(s, sid = 7)
ipd.Audio(r['y'], rate = 22050)
[20]:
[21]:
r = model.predict(s, sid = 8)
ipd.Audio(r['y'], rate = 22050)
[21]:
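The nine cells above repeat the same call with a different `sid`. The same comparison can be written as a loop; this is a sketch in which `synthesize_all_speakers` is a hypothetical helper and `fake_predict` is a stand-in so the example runs without a loaded model.

```python
SAMPLE_RATE = 22050  # sample rate stated in this tutorial


def synthesize_all_speakers(predict_fn, speakers, text):
    """Hypothetical helper: one waveform per speaker, keyed by speaker name."""
    waveforms = {}
    for sid, name in sorted(speakers.items()):
        waveforms[name] = predict_fn(text, sid=sid)['y']
    return waveforms


def fake_predict(string, sid=None):
    # Stand-in for a real model: speaker `sid` yields (sid + 1) seconds of silence.
    return {'y': [0.0] * ((sid + 1) * SAMPLE_RATE)}


speakers = {0: 'yasmin', 1: 'osman', 2: 'orkid'}
waveforms = synthesize_all_speakers(fake_predict, speakers, 'contoh ayat')
for name, y in waveforms.items():
    print(name, len(y) / SAMPLE_RATE, 'seconds')
```

With the multispeaker model from this tutorial, the call would be `synthesize_all_speakers(model.predict, model.list_sid(), s)`, and each waveform can then be played with `ipd.Audio(y, rate = 22050)`.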