{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Vocoder" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Synthesize a mel spectrogram into a waveform." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/vocoder](https://github.com/huseinzol05/malaya-speech/tree/master/example/vocoder).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on different languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "import IPython.display as ipd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Available Vocoder\n", "\n", "1. MelGAN, https://arxiv.org/abs/1910.06711\n", "2. Multiband MelGAN, https://arxiv.org/abs/2005.05106\n", "3. Universal MelGAN, https://arxiv.org/abs/2011.09631\n", "4. HiFiGAN, https://arxiv.org/abs/2010.05646" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vocoder description\n", "\n", "1. These vocoder models are only able to convert mel spectrograms generated by malaya-speech TTS models.\n", "2. They only accept a mel feature size of 80.\n", "3. They generate waveforms with a 22050 Hz sample rate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available MelGAN" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
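The constraints above (80 mel bins in, 22050 Hz audio out) can be sketched as a small shape check. This is an illustrative sketch only: the mel size and sample rate are stated above, but `HOP_LENGTH = 256` and the helper name `validate_mel` are assumptions for illustration, not documented malaya-speech values.

```python
import numpy as np

HOP_LENGTH = 256     # assumed frames-to-samples hop, for illustration only
SAMPLE_RATE = 22050  # output sample rate stated above
N_MELS = 80          # required mel feature size

def validate_mel(mel: np.ndarray) -> float:
    """Check a mel input's shape and return the estimated output duration in seconds."""
    if mel.ndim != 2 or mel.shape[1] != N_MELS:
        raise ValueError(f'expected shape (frames, {N_MELS}), got {mel.shape}')
    n_samples = mel.shape[0] * HOP_LENGTH  # one hop of waveform per mel frame
    return n_samples / SAMPLE_RATE

mel = np.zeros((860, 80), dtype=np.float32)  # roughly 10 seconds of frames
print(f'{validate_mel(mel):.2f} s')  # → 9.98 s
```

A mel with any feature size other than 80 (for example a 64-bin mel from another toolkit) would be rejected by this check, mirroring constraint 2 above.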
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Size (MB)</th>\n", "      <th>Quantized Size (MB)</th>\n", "      <th>Mel loss</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>male</th>\n", "      <td>17.3</td>\n", "      <td>4.53</td>\n", "      <td>0.4443</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>female</th>\n", "      <td>17.3</td>\n", "      <td>4.53</td>\n", "      <td>0.4434</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>husein</th>\n", "      <td>17.3</td>\n", "      <td>4.53</td>\n", "      <td>0.4442</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>haqkiem</th>\n", "      <td>17.3</td>\n", "      <td>4.53</td>\n", "      <td>0.4819</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>universal</th>\n", "      <td>309.0</td>\n", "      <td>77.50</td>\n", "      <td>0.4463</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>universal-1024</th>\n", "      <td>78.4</td>\n", "      <td>19.90</td>\n", "      <td>0.4591</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Size (MB) Quantized Size (MB) Mel loss\n", "male 17.3 4.53 0.4443\n", "female 17.3 4.53 0.4434\n", "husein 17.3 4.53 0.4442\n", "haqkiem 17.3 4.53 0.4819\n", "universal 309.0 77.50 0.4463\n", "universal-1024 78.4 19.90 0.4591" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.vocoder.available_melgan()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`husein` voice contributed by [Husein-Zolkepli](https://www.linkedin.com/in/husein-zolkepli/), recorded using a low-end microphone in a small room with no reverberation absorbers.\n", "\n", "`haqkiem` voice contributed by [Haqkiem Hamdan](https://www.linkedin.com/in/haqkiem-daim/), recorded using a high-end microphone in an audio studio." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available MB MelGAN" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Size (MB)</th>\n", "      <th>Quantized Size (MB)</th>\n", "      <th>Mel loss</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>female</th>\n", "      <td>10.4</td>\n", "      <td>2.82</td>\n", "      <td>0.4356</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>male</th>\n", "      <td>10.4</td>\n", "      <td>2.82</td>\n", "      <td>0.3735</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>husein</th>\n", "      <td>10.4</td>\n", "      <td>2.82</td>\n", "      <td>0.4356</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>haqkiem</th>\n", "      <td>10.4</td>\n", "      <td>2.82</td>\n", "      <td>0.4192</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Size (MB) Quantized Size (MB) Mel loss\n", "female 10.4 2.82 0.4356\n", "male 10.4 2.82 0.3735\n", "husein 10.4 2.82 0.4356\n", "haqkiem 10.4 2.82 0.4192" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.vocoder.available_mbmelgan()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`husein` voice contributed by [Husein-Zolkepli](https://www.linkedin.com/in/husein-zolkepli/), recorded using a low-end microphone in a small room with no reverberation absorbers.\n", "\n", "`haqkiem` voice contributed by [Haqkiem Hamdan](https://www.linkedin.com/in/haqkiem-daim/), recorded using a high-end microphone in an audio studio." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HiFiGAN" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Size (MB)</th>\n", "      <th>Quantized Size (MB)</th>\n", "      <th>Mel loss</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>male</th>\n", "      <td>8.8</td>\n", "      <td>2.49</td>\n", "      <td>0.4650</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>female</th>\n", "      <td>8.8</td>\n", "      <td>2.49</td>\n", "      <td>0.5547</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>universal-768</th>\n", "      <td>72.8</td>\n", "      <td>18.50</td>\n", "      <td>0.3617</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>universal-512</th>\n", "      <td>32.6</td>\n", "      <td>8.60</td>\n", "      <td>0.3253</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Size (MB) Quantized Size (MB) Mel loss\n", "male 8.8 2.49 0.4650\n", "female 8.8 2.49 0.5547\n", "universal-768 72.8 18.50 0.3617\n", "universal-512 32.6 8.60 0.3253" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.vocoder.available_hifigan()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load MelGAN model\n", "\n", "```python\n", "def melgan(model: str = 'female', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load MelGAN Vocoder model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='female')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'female'`` - MelGAN trained on female voice.\n", " * ``'male'`` - MelGAN trained on male voice.\n", " * ``'husein'`` - MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/\n", " * ``'haqkiem'`` - MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/\n", " * ``'universal'`` - Universal MelGAN trained on multiple speakers.\n", " * ``'universal-1024'`` - Universal MelGAN with 1024 filters trained on multiple speakers.\n", " \n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. \n", " Quantized model is not necessarily faster; it depends entirely on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.supervised.vocoder.load function\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:66: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.\n", "\n", "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:68: The name tf.GraphDef is deprecated. 
Please use tf.compat.v1.GraphDef instead.\n", "\n", "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:61: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.\n", "\n" ] } ], "source": [ "melgan = malaya_speech.vocoder.melgan(model = 'female')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "husein_melgan = malaya_speech.vocoder.melgan(model = 'husein')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "quantized_melgan = malaya_speech.vocoder.melgan(model = 'female', quantized = True)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "quantized_husein_melgan = malaya_speech.vocoder.melgan(model = 'husein', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Multiband MelGAN model\n", "\n", "```python\n", "def mbmelgan(model: str = 'female', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load Multiband MelGAN Vocoder model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='female')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'female'`` - Multiband MelGAN trained on female voice.\n", " * ``'male'`` - Multiband MelGAN trained on male voice.\n", " * ``'husein'`` - Multiband MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/\n", " * ``'haqkiem'`` - Multiband MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/\n", " \n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. 
\n", " Quantized model is not necessarily faster; it depends entirely on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.supervised.vocoder.load function\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "mbmelgan = malaya_speech.vocoder.mbmelgan(model = 'female')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "quantized_mbmelgan = malaya_speech.vocoder.mbmelgan(model = 'female', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load HiFiGAN model\n", "\n", "```python\n", "def hifigan(model: str = 'female', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load HiFiGAN Vocoder model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='female')\n", " Model architecture supported. 
Allowed values:\n", "\n", " * ``'female'`` - HiFiGAN trained on female voice.\n", " * ``'male'`` - HiFiGAN trained on male voice.\n", " * ``'universal-768'`` - Universal HiFiGAN with 768 filters trained on multiple speakers.\n", " * ``'universal-512'`` - Universal HiFiGAN with 512 filters trained on multiple speakers.\n", " \n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model is not necessarily faster; it depends entirely on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.supervised.vocoder.load function\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "hifigan = malaya_speech.vocoder.hifigan(model = 'female')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "quantized_hifigan = malaya_speech.vocoder.hifigan(model = 'female', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load data\n", "\n", "This data is from the validation set." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mel': array([[-5.135642 , -4.9542203, -5.045578 , ..., -2.9754655, -2.4993045,\n", " -2.552092 ],\n", " [-4.9802437, -5.013033 , -4.753161 , ..., -2.8229384, -2.4302876,\n", " -2.4801488],\n", " [-5.174154 , -5.3979354, -4.799525 , ..., -2.5164714, -2.5151956,\n", " -2.750568 ],\n", " ...,\n", " [-1.4169824, -1.1434933, -1.3719425, ..., -1.5436271, -1.6565201,\n", " -1.8572053],\n", " [-1.5044638, -1.6360878, -1.6556237, ..., -1.5360395, -1.6257277,\n", " -1.8962083],\n", " [-2.642538 , -2.923341 , -2.8665295, ..., -2.355686 , -2.3283741,\n", " -2.5134325]], dtype=float32),\n", " 'audio': array([-6.1828476e-05, -6.1828476e-05, 0.0000000e+00, ...,\n", " 0.0000000e+00, 0.0000000e+00, 0.0000000e+00], dtype=float32)}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pickle\n", "\n", "with open('speech/pickle/example-female.pkl', 'rb') as fopen:\n", " example = pickle.load(fopen)\n", " \n", "example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict\n", "\n", "```python\n", "def predict(self, inputs):\n", " \"\"\"\n", " Change Mel to Waveform.\n", "\n", " Parameters\n", " ----------\n", " inputs: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.FRAME].\n", " Returns\n", " -------\n", " result: List\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(example['audio'], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.81 s, sys: 401 ms, total: 2.21 s\n", "Wall time: 651 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], 
"text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([example['mel']])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.110681236" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(example['audio'] - y_).mean()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.81 s, sys: 383 ms, total: 2.19 s\n", "Wall time: 688 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = quantized_melgan.predict([example['mel']])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.111384235" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(example['audio'] - y_).mean()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 782 ms, sys: 153 ms, total: 935 ms\n", "Wall time: 340 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = mbmelgan.predict([example['mel']])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.141786" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(example['audio'] - y_).mean()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ 
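The comparisons in the surrounding cells score each vocoder with `np.abs(example['audio'] - y_).mean()`. A standalone sketch of that metric, using synthetic arrays in place of real audio and a hypothetical helper `waveform_mae` that first truncates both signals to a common length (a vocoder's output length can differ slightly from the reference):

```python
import numpy as np

def waveform_mae(reference: np.ndarray, synthesized: np.ndarray) -> float:
    """Mean absolute error between two waveforms, truncated to a common length."""
    n = min(len(reference), len(synthesized))
    return float(np.abs(reference[:n] - synthesized[:n]).mean())

rng = np.random.default_rng(0)
reference = rng.uniform(-1, 1, 22050).astype(np.float32)  # 1 second of fake audio
synthesized = reference + 0.1                             # constant-offset stand-in
print(round(waveform_mae(reference, synthesized), 4))     # → 0.1
```

Lower is better; a perfect reconstruction would score 0.0, and the values around 0.11 to 0.14 in these cells reflect how closely each vocoder's waveform tracks the recorded audio.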
{ "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 846 ms, sys: 142 ms, total: 988 ms\n", "Wall time: 356 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = quantized_mbmelgan.predict([example['mel']])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.1441468" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.abs(example['audio'] - y_).mean()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.79 s, sys: 260 ms, total: 2.05 s\n", "Wall time: 486 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = hifigan.predict([example['mel']])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.59 s, sys: 273 ms, total: 1.86 s\n", "Wall time: 449 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = quantized_hifigan.predict([example['mel']])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Husein speaker" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mel': array([[-3.0871143 , -2.1933894 , -1.9330187 , ..., -1.0192316 ,\n", " -1.3248271 , -1.0974363 ],\n", " [-2.6055963 , -2.3006353 , -1.6657838 , ..., -1.0752352 ,\n", " -0.9072396 , 
-0.96403134],\n", " [-1.9718108 , -2.147272 , -2.0438907 , ..., -0.89703083,\n", " -0.73646724, -0.99775743],\n", " ...,\n", " [-1.3376398 , -1.70222 , -1.4850315 , ..., -1.0758084 ,\n", " -1.4132309 , -0.98251915],\n", " [-1.1394596 , -1.5324714 , -1.5667722 , ..., -1.1989553 ,\n", " -1.2888682 , -1.0891267 ],\n", " [-1.3729159 , -1.4634348 , -1.9626601 , ..., -1.4223598 ,\n", " -1.1820908 , -1.3431906 ]], dtype=float32),\n", " 'audio': array([0.00092771, 0.00076513, 0.00053338, ..., 0.00252281, 0.00252281,\n", " 0.00252281], dtype=float32)}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open('speech/pickle/example-husein.pkl', 'rb') as fopen:\n", " example = pickle.load(fopen)\n", " \n", "example" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.29 s, sys: 325 ms, total: 1.62 s\n", "Wall time: 686 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = husein_melgan.predict([example['mel']])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.29 s, sys: 218 ms, total: 1.51 s\n", "Wall time: 345 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = quantized_husein_melgan.predict([example['mel']])\n", "ipd.Audio(y_[0], rate = 22050)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }