Text-to-Speech VITS Multispeaker Noisy#
End-to-End multispeaker VITS, trained on a small number of hours of Malay audiobooks.
This tutorial is available as an IPython notebook at malaya-speech/example/tts-vits-multispeaker-noisy.
This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
This is an application of the malaya-speech Pipeline; read more about it at malaya-speech/example/pipeline.
[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import matplotlib.pyplot as plt
import IPython.display as ipd
`pyaudio` is not available, `malaya_speech.streaming.pyaudio` is not able to use.
VITS description#
Malaya-speech VITS generates waveforms End-to-End from text input, at a 22050 Hz sample rate.
There is no length limit on the input, but splitting long text into shorter pieces gives better results.
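Since shorter inputs tend to synthesize better, one option is to split long text at sentence boundaries before calling the model. The sketch below is a minimal, assumed approach using a regular expression; `split_sentences` is a hypothetical helper, not part of malaya-speech, and it ignores edge cases such as abbreviations.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty pieces, e.g. when the input is an empty string.
    return [p for p in parts if p]

sentences = split_sentences('Saya suka makan nasi. Awak pula? Jom pergi!')
```

Each element of `sentences` can then be passed to the model separately.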
List available VITS#
[3]:
malaya_speech.tts.available_vits()
[3]:
Model | Size (MB) | Understand punctuation | Is lowercase | Num speakers |
---|---|---|---|---|
mesolitica/VITS-osman | 145 | True | False | 1 |
mesolitica/VITS-yasmin | 145 | True | False | 1 |
mesolitica/VITS-female-singlish | 145 | True | True | 1 |
mesolitica/VITS-haqkiem | 145 | True | True | 1 |
mesolitica/VITS-orkid | 145 | True | False | 1 |
mesolitica/VITS-bunga | 145 | True | False | 1 |
mesolitica/VITS-jebat | 145 | True | False | 1 |
mesolitica/VITS-tuah | 145 | True | False | 1 |
mesolitica/VITS-male | 145 | True | False | 1 |
mesolitica/VITS-female | 145 | True | False | 1 |
mesolitica/VITS-multispeaker-clean | 159 | True | False | 9 |
mesolitica/VITS-multispeaker-noisy | 159 | True | False | 3 |
Load VITS model#
VITS uses the text normalizer from Malaya, https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer.
Make sure you install Malaya version > 4.0 for it to work; for better speech synthesis, use Malaya version > 4.9.1:
pip install malaya -U
def vits(model: str = 'mesolitica/VITS-osman', **kwargs):
    """
    Load VITS End-to-End TTS model.

    Parameters
    ----------
    model : str, optional (default='mesolitica/VITS-osman')
        Check available models at `malaya_speech.tts.available_vits()`.

    Returns
    -------
    result : malaya_speech.torch_model.synthesis.VITS class
    """
[4]:
model = malaya_speech.tts.vits(model = 'mesolitica/VITS-multispeaker-noisy')
[9]:
# https://www.sinarharian.com.my/article/115216/BERITA/Politik/Syed-Saddiq-pertahan-Dr-Mahathir
string1 = 'Syed Saddiq berkata, mereka seharusnya mengingati bahawa semasa menjadi Perdana Menteri Pakatan Harapan'
List available speakers#
[10]:
model.list_sid()
[10]:
{0: 'teme', 1: 'bukan-kerana-aku', 2: 'harry-potter'}
Predict#
def predict(
    self,
    string,
    temperature: float = 0.0,
    temperature_durator: float = 0.0,
    length_ratio: float = 1.0,
    sid: int = None,
    **kwargs,
):
    """
    Change string to waveform.

    Parameters
    ----------
    string: str
    temperature: float, optional (default=0.0)
        Decoder model trying to decode with encoder(text) + random.normal() * temperature.
        Manipulate this variable will change speaking style.
    temperature_durator: float, optional (default=0.0)
        Durator trying to predict alignment with random.normal() * temperature_durator.
        Manipulate this variable will change speaking style.
    length_ratio: float, optional (default=1.0)
        Manipulate this variable will change length frames generated.
    sid: int, optional (default=None)
        speaker id, only available for multispeaker models.
        will throw an error if sid is None for multispeaker models.

    Returns
    -------
    result: Dict[string, ids, alignment, y]
    """
It can only predict one text per forward pass.
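To make the `temperature` and `temperature_durator` descriptions in the docstring more concrete, the toy sketch below mimics the `value + random.normal() * temperature` idea on plain Python lists. It is an illustration only, not malaya-speech's implementation; `add_noise` and the sample values are hypothetical.

```python
import random

def add_noise(means, temperature, seed=None):
    """Return m + N(0, 1) * temperature for each element (toy illustration)."""
    rng = random.Random(seed)
    return [m + rng.gauss(0.0, 1.0) * temperature for m in means]

means = [0.5, -1.0, 2.0]
# temperature = 0.0 removes the random term, so the result is deterministic.
deterministic = add_noise(means, temperature=0.0)
# A non-zero temperature perturbs every value.
varied = add_noise(means, temperature=0.667, seed=0)
```

With the default of 0.0 the random term vanishes, so repeated calls on the same input give the same result; larger values add more variation.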
[11]:
r = model.predict(string1, sid = 1)
r.keys()
[11]:
dict_keys(['string', 'ids', 'alignment', 'y'])
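Because `predict` handles one text per call, several texts can be synthesized by calling it in a loop. The sketch below assumes only the `predict(string, sid=...)` signature and the `y` key shown above; `predict_many` and the stand-in `fake_predict` are hypothetical, used so the example runs without loading a model.

```python
def predict_many(predict_fn, texts, sid):
    """Call predict_fn once per text and collect the waveforms in order."""
    return [predict_fn(text, sid=sid)['y'] for text in texts]

# Stand-in for model.predict so the sketch runs without a model (hypothetical).
def fake_predict(string, sid=None):
    return {'string': string, 'y': [0.0] * len(string)}

waveforms = predict_many(fake_predict, ['apa khabar', 'terima kasih'], sid=1)

# With a real model (assuming each `y` is a 1-D array):
# waveforms = predict_many(model.predict, texts, sid=1)
# y = np.concatenate(waveforms)
```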
[12]:
ipd.Audio(r['y'], rate = 22050)
[12]:
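Outside a notebook, the waveform can be written to disk instead of played inline. The following is a minimal sketch using only the standard library; it assumes `r['y']` is a one-dimensional sequence of floats roughly in [-1, 1], which this tutorial does not state explicitly. `save_wav` is a hypothetical helper.

```python
import os
import struct
import tempfile
import wave

def save_wav(path, samples, sample_rate=22050):
    """Write float samples (assumed in [-1, 1]) as 16-bit mono PCM."""
    # Clip to [-1, 1], then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)
        f.setsampwidth(2)  # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(struct.pack('<%dh' % len(pcm), *pcm))

wav_path = os.path.join(tempfile.gettempdir(), 'vits_example.wav')
save_wav(wav_path, [0.0, 0.5, -0.5, 1.0])

# With a real result: save_wav('output.wav', r['y'])
```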
Compare different speakers#
[17]:
r = model.predict(string1, sid = 0)
ipd.Audio(r['y'], rate = 22050)
[17]:
[18]:
r = model.predict(string1, sid = 1)
ipd.Audio(r['y'], rate = 22050)
[18]:
[19]:
r = model.predict(string1, sid = 2)
ipd.Audio(r['y'], rate = 22050)
[19]:
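The three cells above differ only in `sid`, so the comparison can also be written as a loop over the speaker mapping. This sketch reuses the mapping printed by `model.list_sid()`; `predict_all_speakers` and `fake_predict` are hypothetical names, with the stand-in used so the example runs without a model.

```python
def predict_all_speakers(predict_fn, text, speakers):
    """Return {speaker_name: waveform} for every speaker id in the mapping."""
    return {name: predict_fn(text, sid=sid)['y'] for sid, name in speakers.items()}

# Mapping as printed by model.list_sid() earlier in this tutorial.
speakers = {0: 'teme', 1: 'bukan-kerana-aku', 2: 'harry-potter'}

# Stand-in for model.predict (hypothetical); it encodes sid in the samples.
def fake_predict(string, sid=None):
    return {'y': [float(sid)] * 4}

by_speaker = predict_all_speakers(fake_predict, 'apa khabar', speakers)

# With a real model:
# by_speaker = predict_all_speakers(model.predict, string1, model.list_sid())
# for name, y in by_speaker.items():
#     ipd.display(ipd.Audio(y, rate=22050))
```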