Text-to-Speech VITS Multispeaker Noisy#

VITS Multispeaker, End-to-End, trained on a few hours of Malay audiobooks.

This tutorial is available as an IPython notebook at malaya-speech/example/tts-vits-multispeaker-noisy.

This module is not language independent, so it is not safe to use on languages other than the ones it was trained on. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import matplotlib.pyplot as plt
import IPython.display as ipd
`pyaudio` is not available, `malaya_speech.streaming.pyaudio` is not able to use.

VITS description#

  1. Malaya-speech VITS generates waveforms End-to-End from text input, at a 22050 Hz sample rate.

  2. There is no length limit, but to get better results, split long text into sentences, as shown in the sketch below.
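
For example, long text can be split into sentences before synthesis. A minimal sketch using Python's standard library; the splitting heuristic here is an assumption, not part of malaya-speech:

import re

def split_sentences(text):
    # naive splitter on ., ? or ! followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.?!])\s+', text) if s.strip()]

split_sentences('Ini ayat pertama. Ini ayat kedua?')
# ['Ini ayat pertama.', 'Ini ayat kedua?']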

List available VITS#

[3]:
malaya_speech.tts.available_vits()
[3]:
                                     Size (MB)  Understand punctuation  Is lowercase  num speakers
mesolitica/VITS-osman                      145                    True         False             1
mesolitica/VITS-yasmin                     145                    True         False             1
mesolitica/VITS-female-singlish            145                    True          True             1
mesolitica/VITS-haqkiem                    145                    True          True             1
mesolitica/VITS-orkid                      145                    True         False             1
mesolitica/VITS-bunga                      145                    True         False             1
mesolitica/VITS-jebat                      145                    True         False             1
mesolitica/VITS-tuah                       145                    True         False             1
mesolitica/VITS-male                       145                    True         False             1
mesolitica/VITS-female                     145                    True         False             1
mesolitica/VITS-multispeaker-clean         159                    True         False             9
mesolitica/VITS-multispeaker-noisy         159                    True         False             3
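
The tabular output above suggests available_vits() returns a pandas DataFrame, so the multispeaker checkpoints can be filtered as below; a sketch under that assumption, with the column name copied from the printed header:

df = malaya_speech.tts.available_vits()
# keep only checkpoints with more than one speaker
df[df['num speakers'] > 1]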

Load VITS model#

VITS uses the text normalizer from Malaya, https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer.

Make sure you install Malaya version > 4.0 to make it work; to get better speech synthesis, install Malaya version > 4.9.1:

pip install malaya -U
def vits(model: str = 'mesolitica/VITS-osman', **kwargs):
    """
    Load VITS End-to-End TTS model.

    Parameters
    ----------
    model : str, optional (default='mesolitica/VITS-osman')
        Check available models at `malaya_speech.tts.available_vits()`.

    Returns
    -------
    result : malaya_speech.torch_model.synthesis.VITS class
    """
[4]:
model = malaya_speech.tts.vits(model = 'mesolitica/VITS-multispeaker-noisy')
[9]:
# https://www.sinarharian.com.my/article/115216/BERITA/Politik/Syed-Saddiq-pertahan-Dr-Mahathir
string1 = 'Syed Saddiq berkata, mereka seharusnya mengingati bahawa semasa menjadi Perdana Menteri Pakatan Harapan'

List available speakers#

[10]:
model.list_sid()
[10]:
{0: 'teme', 1: 'bukan-kerana-aku', 2: 'harry-potter'}

Predict#

def predict(
    self,
    string,
    temperature: float = 0.0,
    temperature_durator: float = 0.0,
    length_ratio: float = 1.0,
    sid: int = None,
    **kwargs,
):
    """
    Change string to waveform.

    Parameters
    ----------
    string: str
    temperature: float, optional (default=0.0)
        The decoder decodes with encoder(text) + random.normal() * temperature.
        Manipulating this variable will change the speaking style.
    temperature_durator: float, optional (default=0.0)
        The durator predicts the alignment with random.normal() * temperature_durator.
        Manipulating this variable will change the speaking style.
    length_ratio: float, optional (default=1.0)
        Manipulating this variable will change the number of frames generated.
    sid: int, optional (default=None)
        Speaker id, only available for multispeaker models.
        Will throw an error if sid is None for multispeaker models.

    Returns
    -------
    result: Dict[string, ids, alignment, y]
    """

It is only able to predict one text per feed-forward; to synthesize several sentences, loop over them as in the sketch below.
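
To synthesize a longer passage, one simple approach is to predict each sentence separately and concatenate the waveforms; a minimal sketch, where the 0.25 second silence gap is an arbitrary choice, not something the API requires:

import numpy as np

sentences = [
    'Syed Saddiq berkata, mereka seharusnya mengingati bahawa',
    'semasa menjadi Perdana Menteri Pakatan Harapan.',
]
silence = np.zeros(int(0.25 * 22050), dtype = np.float32)  # 0.25 s at 22050 Hz
chunks = []
for s in sentences:
    r = model.predict(s, sid = 1)
    chunks.extend([r['y'], silence])
audio = np.concatenate(chunks[:-1])  # drop the trailing silence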

[11]:
r = model.predict(string1, sid = 1)
r.keys()
[11]:
dict_keys(['string', 'ids', 'alignment', 'y'])
[12]:
ipd.Audio(r['y'], rate = 22050)
[12]:
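
To keep the synthesized audio, the waveform can be written to disk; a sketch assuming the soundfile package is installed (pip install soundfile):

import soundfile as sf

# write the generated waveform as a 22050 Hz WAV file
sf.write('vits-multispeaker-noisy.wav', r['y'], 22050)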

Compare different speakers#

[17]:
r = model.predict(string1, sid = 0)
ipd.Audio(r['y'], rate = 22050)
[17]:
[18]:
r = model.predict(string1, sid = 1)
ipd.Audio(r['y'], rate = 22050)
[18]:
[19]:
r = model.predict(string1, sid = 2)
ipd.Audio(r['y'], rate = 22050)
[19]:
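
The three cells above can be condensed into a loop over every speaker id returned by list_sid(); a sketch along the same lines:

for sid, name in model.list_sid().items():
    print(sid, name)
    r = model.predict(string1, sid = sid)
    # render one audio widget per speaker
    ipd.display(ipd.Audio(r['y'], rate = 22050))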