Text-to-Speech VITS Multispeaker#

VITS Multispeaker, End-to-End.

This tutorial is available as an IPython notebook at malaya-speech/example/tts-vits-multispeaker.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of the malaya-speech Pipeline. Read more about the malaya-speech Pipeline at malaya-speech/example/pipeline.

import os

# Hide all GPUs so the model runs on CPU.
os.environ['CUDA_VISIBLE_DEVICES'] = ''
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import matplotlib.pyplot as plt
import IPython.display as ipd
`pyaudio` is not available, `malaya_speech.streaming.pyaudio` is not able to use.

VITS description#

  1. Malaya-speech VITS generates audio end-to-end, from text input to waveforms at a 22050 Hz sample rate.

  2. There is no length limit, but splitting long text into shorter pieces gives better results.
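
One way to split a long text before synthesis is at sentence boundaries. The sketch below is a minimal illustration and not part of malaya-speech: `split_text` and its `max_chars` default are hypothetical, and the right chunk size for a given model is an assumption to tune by ear.

```python
import re


def split_text(text: str, max_chars: int = 200) -> list:
    """Split text into sentence-based chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept whole.
    """
    # Split after '.', '!' or '?' when followed by whitespace, keeping the punctuation.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


short_chunks = split_text('Satu. Dua! Tiga?', max_chars=5)
one_chunk = split_text('Satu. Dua! Tiga?', max_chars=200)
```

Each chunk can then be passed to the model separately.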

List available VITS#

Model                               Size (MB)  Understand punctuation  Is lowercase  Num speakers
mesolitica/VITS-osman               145        True                    False         1
mesolitica/VITS-yasmin              145        True                    False         1
mesolitica/VITS-female-singlish     145        True                    True          1
mesolitica/VITS-haqkiem             145        True                    True          1
mesolitica/VITS-orkid               145        True                    False         1
mesolitica/VITS-bunga               145        True                    False         1
mesolitica/VITS-jebat               145        True                    False         1
mesolitica/VITS-tuah                145        True                    False         1
mesolitica/VITS-male                145        True                    False         1
mesolitica/VITS-female              145        True                    False         1
mesolitica/VITS-multispeaker-clean  159        True                    False         9
mesolitica/VITS-multispeaker-noisy  159        True                    False         3

Load VITS model#

VITS uses the text normalizer from Malaya, see https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer.

Make sure you have installed Malaya version > 4.0 for this to work. For better speech synthesis, use Malaya version > 4.9.1:

pip install malaya -U
def vits(model: str = 'mesolitica/VITS-osman', **kwargs):
    """
    Load VITS End-to-End TTS model.

    Parameters
    ----------
    model : str, optional (default='mesolitica/VITS-osman')
        Check available models at `malaya_speech.tts.available_vits()`.
    Returns
    -------
    result : malaya_speech.torch_model.synthesis.VITS class
    """
osman = malaya_speech.tts.vits(model = 'mesolitica/VITS-osman')
model = malaya_speech.tts.vits(model = 'mesolitica/VITS-multispeaker-clean')
# https://www.sinarharian.com.my/article/115216/BERITA/Politik/Syed-Saddiq-pertahan-Dr-Mahathir
string1 = 'Syed Saddiq berkata, mereka seharusnya mengingati bahawa semasa menjadi Perdana Menteri Pakatan Harapan'

List available speakers#

{0: 'yasmin',
 1: 'osman',
 2: 'orkid',
 3: 'tuah',
 4: 'bunga',
 5: 'jebat',
 6: 'haqkiem',
 7: 'male',
 8: 'female'}
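
When a speaker is easier to remember by name than by number, the mapping above can be inverted. This is a small convenience sketch: the dictionary is copied from the listing above, while `name_to_sid` and `speaker_id` are hypothetical helpers, not part of malaya-speech. Other multispeaker models may use a different mapping.

```python
# Speaker mapping copied from the listing above.
speakers = {
    0: 'yasmin',
    1: 'osman',
    2: 'orkid',
    3: 'tuah',
    4: 'bunga',
    5: 'jebat',
    6: 'haqkiem',
    7: 'male',
    8: 'female',
}

# Invert the mapping so a speaker id can be looked up by name.
name_to_sid = {name: sid for sid, name in speakers.items()}


def speaker_id(name: str) -> int:
    """Return the speaker id for a speaker name (hypothetical helper)."""
    try:
        return name_to_sid[name.lower()]
    except KeyError:
        raise ValueError(f'unknown speaker: {name!r}') from None


osman_sid = speaker_id('osman')
```

The returned value is meant to be passed as the `sid` argument of `predict`.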


def predict(
    self,
    string: str,
    temperature: float = 0.0,
    temperature_durator: float = 0.0,
    length_ratio: float = 1.0,
    sid: int = None,
):
    """
    Change string to waveform.

    Parameters
    ----------
    string: str
    temperature: float, optional (default=0.0)
        The decoder decodes from encoder(text) + random.normal() * temperature.
        Changing this value changes the speaking style.
    temperature_durator: float, optional (default=0.0)
        The durator predicts the alignment with random.normal() * temperature_durator.
        Changing this value changes the speaking style.
    length_ratio: float, optional (default=1.0)
        Changing this value changes the number of frames generated.
    sid: int, optional (default=None)
        Speaker id, only available for multispeaker models.
        An error is thrown if sid is None for a multispeaker model.

    Returns
    -------
    result: Dict[string, ids, alignment, y]
    """
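
The docstring describes `temperature` as a scale on Gaussian noise added to the encoder output, and `length_ratio` as a control on the number of generated frames. The NumPy sketch below only illustrates those two ideas. It is not the library's implementation: `sample_latent` and `scale_durations` are hypothetical names, and rounding the scaled durations up to whole frames is an assumption.

```python
import numpy as np


def sample_latent(mean, temperature, rng):
    """Return mean plus Gaussian noise scaled by temperature."""
    noise = rng.standard_normal(mean.shape)
    return mean + noise * temperature


def scale_durations(durations, length_ratio):
    """Scale per-token frame counts and round up to whole frames (assumed rounding)."""
    return np.ceil(np.asarray(durations) * length_ratio).astype(int)


rng = np.random.default_rng(0)
mean = np.zeros(4)

# temperature = 0.0 removes the noise, so the output equals the mean.
deterministic = sample_latent(mean, 0.0, rng)

# A positive temperature perturbs the latent, which would vary the speaking style.
varied = sample_latent(mean, 0.667, rng)

# length_ratio > 1.0 yields more frames, that is, slower speech.
frames = scale_durations([2, 3, 1], 1.5)
```

With the default `temperature = 0.0` the result is repeatable from call to call, which is consistent with the defaults in the signature above.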

It can only predict one text per feed-forward pass.
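
Since `predict` takes one text per call, several texts have to be synthesised in a loop and joined afterwards. The sketch below is one possible approach under stated assumptions: `synthesize_many` and `fake_predict` are hypothetical, the `'y'` entry is assumed to be a 1-D waveform that `np.asarray` can convert, and the silence between pieces is an illustrative choice.

```python
import numpy as np

SAMPLE_RATE = 22050  # sample rate stated for the VITS models in this tutorial


def synthesize_many(predict, texts, pause_seconds=0.2, **predict_kwargs):
    """Call predict once per text and join the waveforms with short silences."""
    silence = np.zeros(int(SAMPLE_RATE * pause_seconds), dtype=np.float32)
    pieces = []
    for text in texts:
        y = np.asarray(predict(text, **predict_kwargs)['y'], dtype=np.float32)
        pieces.extend([y, silence])
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    # Drop the trailing silence after the last text.
    return np.concatenate(pieces[:-1])


def fake_predict(string, sid=None):
    """Stand-in for model.predict so the sketch runs without a model download."""
    y = np.ones(len(string), dtype=np.float32)
    return {'string': string, 'ids': None, 'alignment': None, 'y': y}


audio = synthesize_many(fake_predict, ['abc', 'de'], pause_seconds=1.0, sid=1)
```

With a loaded model, the call would look like `synthesize_many(model.predict, texts, sid = 1)`.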

r = model.predict(string1, sid = 1)
r.keys()
dict_keys(['string', 'ids', 'alignment', 'y'])
ipd.Audio(r['y'], rate = 22050)
r_osman = osman.predict(string1)
r_osman.keys()
dict_keys(['string', 'ids', 'alignment', 'y'])
ipd.Audio(r_osman['y'], rate = 22050)

Compare different speakers#

s = 'Haqkiem adalah pelajar tahun akhir yang mengambil Ijazah Sarjana Muda Sains Komputer Kecerdasan Buatan utama dari Universiti Teknikal Malaysia Melaka (UTeM) yang kini berusaha untuk latihan industri di mana dia secara praktikal dapat menerapkan pengetahuannya dalam Perisikan Perisian dan Pengaturcaraan ke arah organisasi atau industri yang berkaitan.'
r = model.predict(s, sid = 0)
ipd.Audio(r['y'], rate = 22050)
r = model.predict(s, sid = 1)
ipd.Audio(r['y'], rate = 22050)
r = model.predict(s, sid = 2)
ipd.Audio(r['y'], rate = 22050)
r = model.predict(s, sid = 3)
ipd.Audio(r['y'], rate = 22050)
r = model.predict(s, sid = 4)
ipd.Audio(r['y'], rate = 22050)
r = model.predict(s, sid = 5)
ipd.Audio(r['y'], rate = 22050)
r = model.predict(s, sid = 6)
ipd.Audio(r['y'], rate = 22050)
r = model.predict(s, sid = 7)
ipd.Audio(r['y'], rate = 22050)
r = model.predict(s, sid = 8)
ipd.Audio(r['y'], rate = 22050)