Text-to-Speech VITS Multispeaker#

VITS Multispeaker, End-to-End.

This tutorial is available as an IPython notebook at malaya-speech/example/tts-vits-multispeaker.

This module is not language independent, so it is not safe to use on different languages. The pretrained models were trained on hyperlocal languages.

This is an application of the malaya-speech Pipeline; read more about the malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:

import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

[2]:

import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import matplotlib.pyplot as plt
import IPython.display as ipd

`pyaudio` is not available, `malaya_speech.streaming.pyaudio` is not able to use.

VITS description#

Malaya-speech VITS generates speech end-to-end, from text input to waveforms at a 22050 Hz sample rate.
There is no length limit, but splitting long text into shorter pieces gives better results.
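One way to split long text before synthesis is to break it at sentence-ending punctuation and group sentences up to a size limit. The following is a minimal sketch only: `split_text` and its `max_chars` default are hypothetical and not part of malaya-speech, and the regular expression assumes sentences end with `.`, `!` or `?` followed by whitespace.

```python
import re


def split_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars where possible.

    Hypothetical helper, not part of malaya-speech. A single sentence
    longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be synthesized on its own. The right chunk size depends on the model and the text, so treat `max_chars` as a value to tune.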

List available VITS#

[3]:

malaya_speech.tts.available_vits()

[3]:

Model	Size (MB)	Understand punctuation	Is lowercase	Num speakers
mesolitica/VITS-osman	145	True	False	1
mesolitica/VITS-yasmin	145	True	False	1
mesolitica/VITS-female-singlish	145	True	True	1
mesolitica/VITS-haqkiem	145	True	True	1
mesolitica/VITS-orkid	145	True	False	1
mesolitica/VITS-bunga	145	True	False	1
mesolitica/VITS-jebat	145	True	False	1
mesolitica/VITS-tuah	145	True	False	1
mesolitica/VITS-male	145	True	False	1
mesolitica/VITS-female	145	True	False	1
mesolitica/VITS-multispeaker-clean	159	True	False	9
mesolitica/VITS-multispeaker-noisy	159	True	False	3

Load VITS model#

VITS uses the text normalizer from Malaya, https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer.

Make sure Malaya version > 4.0 is installed for this to work; for better speech synthesis, use Malaya version > 4.9.1:

pip install malaya -U

def vits(model: str = 'mesolitica/VITS-osman', **kwargs):
    """
    Load VITS End-to-End TTS model.

    Parameters
    ----------
    model : str, optional (default='mesolitica/VITS-osman')
        Check available models at `malaya_speech.tts.available_vits()`.
    Returns
    -------
    result : malaya_speech.torch_model.synthesis.VITS class
    """

[22]:

osman = malaya_speech.tts.vits(model = 'mesolitica/VITS-osman')

[4]:

model = malaya_speech.tts.vits(model = 'mesolitica/VITS-multispeaker-clean')

[5]:

# https://www.sinarharian.com.my/article/115216/BERITA/Politik/Syed-Saddiq-pertahan-Dr-Mahathir
string1 = 'Syed Saddiq berkata, mereka seharusnya mengingati bahawa semasa menjadi Perdana Menteri Pakatan Harapan'

List available speakers#

[6]:

model.list_sid()

[6]:

{0: 'yasmin',
 1: 'osman',
 2: 'orkid',
 3: 'tuah',
 4: 'bunga',
 5: 'jebat',
 6: 'haqkiem',
 7: 'male',
 8: 'female'}
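Since `predict` takes a numeric `sid`, it can be convenient to look a speaker up by name. The sketch below inverts the mapping printed above; the dictionary is copied from that output, and `speaker_id` is a hypothetical helper, not a malaya-speech function.

```python
# Mapping copied from the `model.list_sid()` output shown above.
speakers = {
    0: 'yasmin',
    1: 'osman',
    2: 'orkid',
    3: 'tuah',
    4: 'bunga',
    5: 'jebat',
    6: 'haqkiem',
    7: 'male',
    8: 'female',
}

# Invert id -> name into name -> id.
name_to_sid = {name: sid for sid, name in speakers.items()}


def speaker_id(name):
    """Return the speaker id for a name, or raise with the valid choices."""
    try:
        return name_to_sid[name]
    except KeyError:
        raise ValueError(
            f'unknown speaker {name!r}, choose from {sorted(name_to_sid)}'
        ) from None
```

With a loaded model, the same inversion could be built from `model.list_sid()` directly instead of a hard-coded dictionary.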

Predict#

def predict(
    self,
    string,
    temperature: float = 0.0,
    temperature_durator: float = 0.0,
    length_ratio: float = 1.0,
    sid: int = None,
    **kwargs,
):
    """
    Change string to waveform.

    Parameters
    ----------
    string: str
    temperature: float, optional (default=0.0)
        Decoder model trying to decode with encoder(text) + random.normal() * temperature.
        Manipulate this variable will change speaking style.
    temperature_durator: float, optional (default=0.0)
        Durator trying to predict alignment with random.normal() * temperature_durator.
        Manipulate this variable will change speaking style.
    length_ratio: float, optional (default=1.0)
        Manipulate this variable will change length frames generated.
    sid: int, optional (default=None)
        speaker id, only available for multispeaker models.
        will throw an error if sid is None for multispeaker models.

    Returns
    -------
    result: Dict[string, ids, alignment, y]
    """
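The docstring describes both temperatures as a scale on Gaussian noise added to a model output. The toy sketch below illustrates only that arithmetic, `value + random.normal() * temperature`, on a plain list of numbers. It is not the library's implementation, and `perturb` is a hypothetical name.

```python
import random


def perturb(encoded, temperature, seed=None):
    """Return encoded + N(0, 1) * temperature, element-wise.

    Toy illustration of the formula in the docstring, not malaya-speech code.
    """
    rng = random.Random(seed)
    return [value + rng.gauss(0.0, 1.0) * temperature for value in encoded]


encoded = [0.2, -0.1, 0.7]

# With temperature 0.0 the noise term vanishes, so the output is unchanged.
unchanged = perturb(encoded, temperature=0.0)

# A larger temperature moves the output further from the input on average.
varied = perturb(encoded, temperature=0.5, seed=1)
```

This is why the default `temperature=0.0` is repeatable, while a nonzero temperature changes the result from call to call.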

It is only able to predict one text per feed-forward pass.
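Because each call handles a single text, several texts need a loop. The sketch below assumes only what the docstring states, that `predict` returns a dictionary with a `y` entry; `predict_each` is a hypothetical helper and works with any object exposing a compatible `predict` method.

```python
def predict_each(model, texts, **predict_kwargs):
    """Call model.predict once per text and collect the 'y' waveforms in order.

    Hypothetical helper; extra keyword arguments such as sid are passed
    through to predict unchanged.
    """
    waveforms = []
    for text in texts:
        result = model.predict(text, **predict_kwargs)
        waveforms.append(result['y'])
    return waveforms
```

For the multispeaker model this could be called as `predict_each(model, texts, sid = 1)`. If the waveforms are NumPy arrays, which is an assumption here, `np.concatenate(waveforms)` would join them into one signal.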

[8]:

r = model.predict(string1, sid = 1)
r.keys()

[8]:

dict_keys(['string', 'ids', 'alignment', 'y'])

[9]:

ipd.Audio(r['y'], rate = 22050)

[9]:

[10]:

r_osman = osman.predict(string1)
r_osman.keys()

[10]:

dict_keys(['string', 'ids', 'alignment', 'y'])

[11]:

ipd.Audio(r_osman['y'], rate = 22050)

[11]:
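Outside a notebook, the waveform can be written to a WAV file instead of played inline. The following is a minimal sketch using only the Python standard library. It assumes `r['y']` is a mono sequence of floats in the range [-1.0, 1.0], which the source does not state; `save_wav` and `duration_seconds` are hypothetical helpers.

```python
import struct
import wave


def save_wav(path, samples, sample_rate=22050):
    """Write mono float samples as a 16-bit PCM WAV file.

    Assumes samples lie in [-1.0, 1.0]; values outside are clipped.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        # Scale to the signed 16-bit range, little-endian.
        frames += struct.pack('<h', int(clipped * 32767))
    with wave.open(path, 'wb') as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


def duration_seconds(num_samples, sample_rate=22050):
    """Length of the audio in seconds for a given sample count."""
    return num_samples / sample_rate
```

A call such as `save_wav('osman.wav', r_osman['y'])` would then store the result, and `duration_seconds(len(r_osman['y']))` gives its length at the 22050 Hz rate used throughout this tutorial.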

Compare different speakers#

[12]:

s = 'Haqkiem adalah pelajar tahun akhir yang mengambil Ijazah Sarjana Muda Sains Komputer Kecerdasan Buatan utama dari Universiti Teknikal Malaysia Melaka (UTeM) yang kini berusaha untuk latihan industri di mana dia secara praktikal dapat menerapkan pengetahuannya dalam Perisikan Perisian dan Pengaturcaraan ke arah organisasi atau industri yang berkaitan.'

[13]:

r = model.predict(s, sid = 0)
ipd.Audio(r['y'], rate = 22050)

[13]:

[14]:

r = model.predict(s, sid = 1)
ipd.Audio(r['y'], rate = 22050)

[14]:

[15]:

r = model.predict(s, sid = 2)
ipd.Audio(r['y'], rate = 22050)

[15]:

[16]:

r = model.predict(s, sid = 3)
ipd.Audio(r['y'], rate = 22050)

[16]:

[17]:

r = model.predict(s, sid = 4)
ipd.Audio(r['y'], rate = 22050)

[17]:

[18]:

r = model.predict(s, sid = 5)
ipd.Audio(r['y'], rate = 22050)

[18]:

[19]:

r = model.predict(s, sid = 6)
ipd.Audio(r['y'], rate = 22050)

[19]:

[20]:

r = model.predict(s, sid = 7)
ipd.Audio(r['y'], rate = 22050)

[20]:

[21]:

r = model.predict(s, sid = 8)
ipd.Audio(r['y'], rate = 22050)

[21]:
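The nine cells above differ only in `sid`, so the comparison can also be written as a loop over `list_sid()`. This is a sketch built on the two methods shown in this tutorial, `list_sid` and `predict`; `predict_all_speakers` is a hypothetical helper name.

```python
def predict_all_speakers(model, text, **predict_kwargs):
    """Synthesize the same text once per speaker.

    Returns a dict mapping speaker name to the predict() result.
    Hypothetical helper built on list_sid() and predict().
    """
    results = {}
    for sid, name in model.list_sid().items():
        results[name] = model.predict(text, sid=sid, **predict_kwargs)
    return results
```

In a notebook, each entry could then be played with `ipd.Audio(results[name]['y'], rate = 22050)`.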
