Text-to-Speech VITS Multispeaker Noisy#

VITS Multispeaker, End-to-End, trained on a few hours of Malay audiobooks.

This tutorial is available as an IPython notebook at malaya-speech/example/tts-vits-multispeaker-noisy.

This module is not language independent, so it is not safe to use on languages other than the ones it was trained on. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import matplotlib.pyplot as plt
import IPython.display as ipd
`pyaudio` is not available, `malaya_speech.streaming.pyaudio` is not able to use.

VITS description#

  1. Malaya-speech VITS generates waveforms End-to-End from text input, at a 22050 Hz sample rate.

  2. There is no length limit, but to get better results, split long text into sentences, as shown in the sketch below.
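
For example, long text can be split into sentences before synthesis. A minimal sketch using Python's standard library; the splitting heuristic here is an assumption, not part of malaya-speech:

import re

def split_sentences(text):
    # naive splitter on ., ? or ! followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.?!])\s+', text) if s.strip()]

split_sentences('Ini ayat pertama. Ini ayat kedua?')
# ['Ini ayat pertama.', 'Ini ayat kedua?']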

List available VITS#

[3]:
malaya_speech.tts.available_vits()
[3]:
                                     Size (MB)  Understand punctuation  Is lowercase  num speakers
mesolitica/VITS-osman                      145                    True         False             1
mesolitica/VITS-yasmin                     145                    True         False             1
mesolitica/VITS-female-singlish            145                    True          True             1
mesolitica/VITS-haqkiem                    145                    True          True             1
mesolitica/VITS-orkid                      145                    True         False             1
mesolitica/VITS-bunga                      145                    True         False             1
mesolitica/VITS-jebat                      145                    True         False             1
mesolitica/VITS-tuah                       145                    True         False             1
mesolitica/VITS-male                       145                    True         False             1
mesolitica/VITS-female                     145                    True         False             1
mesolitica/VITS-multispeaker-clean         159                    True         False             9
mesolitica/VITS-multispeaker-noisy         159                    True         False             3
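
The tabular output above suggests available_vits() returns a pandas DataFrame, so the multispeaker checkpoints can be filtered as below; a sketch under that assumption, with the column name copied from the printed header:

df = malaya_speech.tts.available_vits()
# keep only checkpoints with more than one speaker
df[df['num speakers'] > 1]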

Load VITS model#

VITS uses the text normalizer from Malaya, https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer.

Make sure you install Malaya version > 4.0 to make it work; to get better speech synthesis, install Malaya version > 4.9.1:

pip install malaya -U
def vits(model: str = 'mesolitica/VITS-osman', **kwargs):
    """
    Load VITS End-to-End TTS model.

    Parameters
    ----------
    model : str, optional (default='mesolitica/VITS-osman')
        Check available models at `malaya_speech.tts.available_vits()`.

    Returns
    -------
    result : malaya_speech.torch_model.synthesis.VITS class
    """
[4]:
model = malaya_speech.tts.vits(model = 'mesolitica/VITS-multispeaker-noisy')
[9]:
# https://www.sinarharian.com.my/article/115216/BERITA/Politik/Syed-Saddiq-pertahan-Dr-Mahathir
string1 = 'Syed Saddiq berkata, mereka seharusnya mengingati bahawa semasa menjadi Perdana Menteri Pakatan Harapan'

List available speakers#

[10]:
model.list_sid()
[10]:
{0: 'teme', 1: 'bukan-kerana-aku', 2: 'harry-potter'}

Predict#

def predict(
    self,
    string,
    temperature: float = 0.0,
    temperature_durator: float = 0.0,
    length_ratio: float = 1.0,
    sid: int = None,
    **kwargs,
):
    """
    Change string to waveform.

    Parameters
    ----------
    string: str
    temperature: float, optional (default=0.0)
        The decoder decodes with encoder(text) + random.normal() * temperature.
        Manipulating this variable will change the speaking style.
    temperature_durator: float, optional (default=0.0)
        The durator predicts the alignment with random.normal() * temperature_durator.
        Manipulating this variable will change the speaking style.
    length_ratio: float, optional (default=1.0)
        Manipulating this variable will change the number of frames generated.
    sid: int, optional (default=None)
        Speaker id, only available for multispeaker models.
        Will throw an error if sid is None for multispeaker models.

    Returns
    -------
    result: Dict[string, ids, alignment, y]
    """

It is only able to predict one text per feed-forward; to synthesize several sentences, loop over them as in the sketch below.
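
To synthesize a longer passage, one simple approach is to predict each sentence separately and concatenate the waveforms; a minimal sketch, where the 0.25 second silence gap is an arbitrary choice, not something the API requires:

import numpy as np

sentences = [
    'Syed Saddiq berkata, mereka seharusnya mengingati bahawa',
    'semasa menjadi Perdana Menteri Pakatan Harapan.',
]
silence = np.zeros(int(0.25 * 22050), dtype = np.float32)  # 0.25 s at 22050 Hz
chunks = []
for s in sentences:
    r = model.predict(s, sid = 1)
    chunks.extend([r['y'], silence])
audio = np.concatenate(chunks[:-1])  # drop the trailing silence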

[11]:
r = model.predict(string1, sid = 1)
r.keys()
[11]:
dict_keys(['string', 'ids', 'alignment', 'y'])
[12]:
ipd.Audio(r['y'], rate = 22050)
[12]:
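
To keep the synthesized audio, the waveform can be written to disk; a sketch assuming the soundfile package is installed (pip install soundfile):

import soundfile as sf

# write the generated waveform as a 22050 Hz WAV file
sf.write('vits-multispeaker-noisy.wav', r['y'], 22050)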

Compare different speakers#

[17]:
r = model.predict(string1, sid = 0)
ipd.Audio(r['y'], rate = 22050)
[17]:
[18]:
r = model.predict(string1, sid = 1)
ipd.Audio(r['y'], rate = 22050)
[18]:
[19]:
r = model.predict(string1, sid = 2)
ipd.Audio(r['y'], rate = 22050)
[19]:
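
The three cells above can be condensed into a loop over every speaker id returned by list_sid(); a sketch along the same lines:

for sid, name in model.list_sid().items():
    print(sid, name)
    r = model.predict(string1, sid = sid)
    # render one audio widget per speaker
    ipd.display(ipd.Audio(r['y'], rate = 22050))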