{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Universal MelGAN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Synthesize a mel spectrogram into a waveform. These models are able to synthesize multiple speakers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/universal-melgan](https://github.com/huseinzol05/malaya-speech/tree/master/example/universal-melgan).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is language independent, so it is safe to use on different languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vocoder description\n", "\n", "1. Only accepts mel features of size 80.\n", "2. Generates waveforms at a 22050 Hz sample rate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explanation\n", "\n", "If you use the MelGAN vocoder from https://malaya-speech.readthedocs.io/en/latest/load-vocoder.html, each speaker needs their own MelGAN vocoder.\n", "\n", "Universal MelGAN, https://arxiv.org/abs/2011.09631, solves this problem: a single model is able to synthesize any mel spectrogram into a waveform." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available MelGAN" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>Mel loss</th></tr>\n", "</thead>\n", "<tbody>\n", "
<tr><th>male</th><td>17.3</td><td>4.53</td><td>0.4443</td></tr>\n", "
<tr><th>female</th><td>17.3</td><td>4.53</td><td>0.4434</td></tr>\n", "
<tr><th>husein</th><td>17.3</td><td>4.53</td><td>0.4442</td></tr>\n", "
<tr><th>haqkiem</th><td>17.3</td><td>4.53</td><td>0.4819</td></tr>\n", "
<tr><th>yasmin</th><td>17.3</td><td>4.53</td><td>0.4867</td></tr>\n", "
<tr><th>osman</th><td>17.3</td><td>4.53</td><td>0.4819</td></tr>\n", "
<tr><th>universal</th><td>309.0</td><td>77.50</td><td>0.4463</td></tr>\n", "
<tr><th>universal-1024</th><td>78.4</td><td>19.90</td><td>0.4591</td></tr>\n", "
<tr><th>universal-384</th><td>11.3</td><td>3.06</td><td>0.4445</td></tr>\n", "
</tbody>\n", "</table>
" ], "text/plain": [ " Size (MB) Quantized Size (MB) Mel loss\n", "male 17.3 4.53 0.4443\n", "female 17.3 4.53 0.4434\n", "husein 17.3 4.53 0.4442\n", "haqkiem 17.3 4.53 0.4819\n", "yasmin 17.3 4.53 0.4867\n", "osman 17.3 4.53 0.4819\n", "universal 309.0 77.50 0.4463\n", "universal-1024 78.4 19.90 0.4591\n", "universal-384 11.3 3.06 0.4445" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.vocoder.available_melgan()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load MelGAN model\n", "\n", "```python\n", "def melgan(model: str = 'female', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load MelGAN Vocoder model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='universal-1024')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'female'`` - MelGAN trained on female voice.\n", " * ``'male'`` - MelGAN trained on male voice.\n", " * ``'husein'`` - MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/\n", " * ``'haqkiem'`` - MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/\n", " * ``'universal'`` - Universal MelGAN trained on multiple speakers.\n", " * ``'universal-1024'`` - Universal MelGAN with 1024 filters trained on multiple speakers.\n", " * ``'universal-384'`` - Universal MelGAN with 384 filters trained on multiple speakers.\n", " \n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. 
\n", " A quantized model is not necessarily faster; it depends entirely on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.supervised.vocoder.load function\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n", "/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).\n", " warnings.warn('An interactive session is already active. This can '\n" ] } ], "source": [ "melgan = malaya_speech.vocoder.melgan(model = 'universal')\n", "quantized_melgan = malaya_speech.vocoder.melgan(model = 'universal', quantized = True)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "melgan_1024 = malaya_speech.vocoder.melgan(model = 'universal-1024')\n", "quantized_melgan_1024 = malaya_speech.vocoder.melgan(model = 'universal-1024', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load some examples\n", "\n", "We used specific STFT parameters and steps to convert waveforms to mel spectrograms during training; the universal MelGAN models will not work on features computed any other way. Our steps:\n", "\n", "1. Convert the waveform into a mel spectrogram.\n", "2. Take log10 of that mel spectrogram.\n", "3. Normalize using the global mean and standard deviation.\n", "\n", "The models should also be trainable without global normalization.\n", "\n", "So, to reuse the same steps, use the `malaya_speech.featurization.universal_mel` function." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "y, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav', sr = 22050)\n", "mel = malaya_speech.featurization.universal_mel(y)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(y, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 24.2 s, sys: 3.11 s, total: 27.3 s\n", "Wall time: 6.18 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 23.4 s, sys: 2.47 s, total: 25.9 s\n", "Wall time: 5.25 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = quantized_melgan.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.79 s, sys: 930 ms, total: 7.72 s\n", "Wall time: 1.85 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan_1024.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.7 s, sys: 841 ms, total: 7.54 s\n", "Wall time: 1.71 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = quantized_melgan_1024.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# try english audio\n", "y, sr = malaya_speech.load('speech/44k/test-2.wav', sr = 22050)\n", "y = y[:sr * 4]\n", "mel = malaya_speech.featurization.universal_mel(y)\n", "ipd.Audio(y, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 25.1 s, sys: 2.16 s, total: 27.2 s\n", "Wall time: 4.39 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.81 s, sys: 725 ms, total: 7.54 s\n", "Wall time: 1.34 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan_1024.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combine with FastSpeech2 TTS" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ 
"/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/malaya/preprocessing.py:259: FutureWarning: Possible nested set at position 2289\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] } ], "source": [ "female_v2 = malaya_speech.tts.fastspeech2(model = 'female-v2')\n", "haqkiem = malaya_speech.tts.fastspeech2(model = 'haqkiem')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "string = 'husein busuk masam ketiak pun masam tapi nasib baik comel'" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 667 ms, sys: 181 ms, total: 848 ms\n", "Wall time: 606 ms\n" ] } ], "source": [ "%%time\n", "\n", "r_female_v2 = female_v2.predict(string)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.11 s, sys: 323 ms, total: 1.43 s\n", "Wall time: 1.07 s\n" ] } ], "source": [ "%%time\n", "\n", "r_haqkiem = haqkiem.predict(string)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan(r_female_v2['universal-output'])\n", "ipd.Audio(y_, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan_1024(r_female_v2['universal-output'])\n", "ipd.Audio(y_, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ 
= melgan(r_haqkiem['universal-output'])\n", "ipd.Audio(y_, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan_1024(r_haqkiem['universal-output'])\n", "ipd.Audio(y_, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "string = 'kau ni apehal bodoh? nak gaduh ke siaaaal'" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 187 ms, sys: 34.7 ms, total: 221 ms\n", "Wall time: 60.7 ms\n" ] } ], "source": [ "%%time\n", "\n", "r_female_v2 = female_v2.predict(string)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 325 ms, sys: 58 ms, total: 383 ms\n", "Wall time: 71.5 ms\n" ] } ], "source": [ "%%time\n", "\n", "r_haqkiem = haqkiem.predict(string)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan(r_female_v2['universal-output'])\n", "ipd.Audio(y_, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan_1024(r_female_v2['universal-output'])\n", "ipd.Audio(y_, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "y_ = melgan(r_haqkiem['universal-output'])\n", "ipd.Audio(y_, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan_1024(r_haqkiem['universal-output'])\n", "ipd.Audio(y_, rate = 22050)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }