{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text-to-Speech VITS Multispeaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "VITS Multispeaker, End-to-End." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/tts-vits-multispeaker](https://github.com/huseinzol05/malaya-speech/tree/master/example/tts-vits-multispeaker).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of the malaya-speech Pipeline; read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Hide all GPUs so the models run on CPU.\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`pyaudio` is not available, `malaya_speech.streaming.pyaudio` is not able to use.\n" ] } ], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline\n", "import matplotlib.pyplot as plt\n", "import IPython.display as ipd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### VITS description\n", "\n", "1. Malaya-speech VITS generates speech end-to-end, from text input to waveforms at a 22050 Hz sample rate.\n", "2. There is no length limit, but splitting long text into shorter segments gives better results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available VITS" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| Model | Size (MB) | Understand punctuation | Is lowercase | num speakers |\n",
"| --- | --- | --- | --- | --- |\n",
"| mesolitica/VITS-osman | 145 | True | False | 1 |\n",
"| mesolitica/VITS-yasmin | 145 | True | False | 1 |\n",
"| mesolitica/VITS-female-singlish | 145 | True | True | 1 |\n",
"| mesolitica/VITS-haqkiem | 145 | True | True | 1 |\n",
"| mesolitica/VITS-orkid | 145 | True | False | 1 |\n",
"| mesolitica/VITS-bunga | 145 | True | False | 1 |\n",
"| mesolitica/VITS-jebat | 145 | True | False | 1 |\n",
"| mesolitica/VITS-tuah | 145 | True | False | 1 |\n",
"| mesolitica/VITS-male | 145 | True | False | 1 |\n",
"| mesolitica/VITS-female | 145 | True | False | 1 |\n",
"| mesolitica/VITS-multispeaker-clean | 159 | True | False | 9 |\n",
"| mesolitica/VITS-multispeaker-noisy | 159 | True | False | 3 |
\n", "
" ], "text/plain": [ " Size (MB) Understand punctuation \\\n", "mesolitica/VITS-osman 145 True \n", "mesolitica/VITS-yasmin 145 True \n", "mesolitica/VITS-female-singlish 145 True \n", "mesolitica/VITS-haqkiem 145 True \n", "mesolitica/VITS-orkid 145 True \n", "mesolitica/VITS-bunga 145 True \n", "mesolitica/VITS-jebat 145 True \n", "mesolitica/VITS-tuah 145 True \n", "mesolitica/VITS-male 145 True \n", "mesolitica/VITS-female 145 True \n", "mesolitica/VITS-multispeaker-clean 159 True \n", "mesolitica/VITS-multispeaker-noisy 159 True \n", "\n", " Is lowercase num speakers \n", "mesolitica/VITS-osman False 1 \n", "mesolitica/VITS-yasmin False 1 \n", "mesolitica/VITS-female-singlish True 1 \n", "mesolitica/VITS-haqkiem True 1 \n", "mesolitica/VITS-orkid False 1 \n", "mesolitica/VITS-bunga False 1 \n", "mesolitica/VITS-jebat False 1 \n", "mesolitica/VITS-tuah False 1 \n", "mesolitica/VITS-male False 1 \n", "mesolitica/VITS-female False 1 \n", "mesolitica/VITS-multispeaker-clean False 9 \n", "mesolitica/VITS-multispeaker-noisy False 3 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.tts.available_vits()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load VITS model\n", "\n", "VITS uses the text normalizer from Malaya: https://malaya.readthedocs.io/en/latest/load-normalizer.html#Load-normalizer\n", "\n", "Make sure you install Malaya version > 4.0 for this to work; **for better speech synthesis, make sure the Malaya version is > 4.9.1**:\n", "\n", "```bash\n", "pip install malaya -U\n", "```\n", "\n", "```python\n", "def vits(model: str = 'mesolitica/VITS-osman', **kwargs):\n", " \"\"\"\n", " Load VITS End-to-End TTS model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='mesolitica/VITS-osman')\n", " Check available models at `malaya_speech.tts.available_vits()`.\n", " Returns\n", " -------\n", " result : malaya_speech.torch_model.synthesis.VITS class\n", " \"\"\"\n", 
"```" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "osman = malaya_speech.tts.vits(model = 'mesolitica/VITS-osman')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.tts.vits(model = 'mesolitica/VITS-multispeaker-clean')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# https://www.sinarharian.com.my/article/115216/BERITA/Politik/Syed-Saddiq-pertahan-Dr-Mahathir\n", "string1 = 'Syed Saddiq berkata, mereka seharusnya mengingati bahawa semasa menjadi Perdana Menteri Pakatan Harapan'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available speakers" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{0: 'yasmin',\n", " 1: 'osman',\n", " 2: 'orkid',\n", " 3: 'tuah',\n", " 4: 'bunga',\n", " 5: 'jebat',\n", " 6: 'haqkiem',\n", " 7: 'male',\n", " 8: 'female'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.list_sid()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict\n", "\n", "```python\n", "def predict(\n", " self,\n", " string,\n", " temperature: float = 0.0,\n", " temperature_durator: float = 0.0,\n", " length_ratio: float = 1.0,\n", " sid: int = None,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Change string to waveform.\n", "\n", " Parameters\n", " ----------\n", " string: str\n", " temperature: float, optional (default=0.0)\n", " Decoder model trying to decode with encoder(text) + random.normal() * temperature.\n", " Manipulate this variable will change speaking style.\n", " temperature_durator: float, optional (default=0.0)\n", " Durator trying to predict alignment with random.normal() * temperature_durator.\n", " Manipulate this variable will change speaking style.\n", " length_ratio: float, optional (default=1.0)\n", " Manipulate this variable will 
change length frames generated.\n", " sid: int, optional (default=None)\n", " speaker id, only available for multispeaker models.\n", " will throw an error if sid is None for multispeaker models.\n", "\n", " Returns\n", " -------\n", " result: Dict[string, ids, alignment, y]\n", " \"\"\"\n", "```\n", "\n", "It can only predict one text per feed-forward pass." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['string', 'ids', 'alignment', 'y'])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# sid = 1 selects 'osman' (see model.list_sid() above).\n", "r = model.predict(string1, sid = 1)\n", "r.keys()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(r['y'], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['string', 'ids', 'alignment', 'y'])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r_osman = osman.predict(string1)\n", "r_osman.keys()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(r_osman['y'], rate = 22050)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compare different speakers" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "s = 'Haqkiem adalah pelajar tahun akhir yang mengambil Ijazah Sarjana Muda Sains Komputer Kecerdasan Buatan utama dari Universiti Teknikal Malaysia Melaka (UTeM) yang kini berusaha untuk latihan industri di mana dia secara praktikal dapat menerapkan pengetahuannya dalam Perisikan Perisian 
dan Pengaturcaraan ke arah organisasi atau industri yang berkaitan.'" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = model.predict(s, sid = 0)\n", "ipd.Audio(r['y'], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = model.predict(s, sid = 1)\n", "ipd.Audio(r['y'], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = model.predict(s, sid = 2)\n", "ipd.Audio(r['y'], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = model.predict(s, sid = 3)\n", "ipd.Audio(r['y'], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = model.predict(s, sid = 4)\n", "ipd.Audio(r['y'], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = model.predict(s, sid = 5)\n", "ipd.Audio(r['y'], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { 
"data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = model.predict(s, sid = 6)\n", "ipd.Audio(r['y'], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = model.predict(s, sid = 7)\n", "ipd.Audio(r['y'], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = model.predict(s, sid = 8)\n", "ipd.Audio(r['y'], rate = 22050)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }