{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Universal HiFiGAN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "synthesize Melspectrogram to waveform and these models able to synthesize multiple speakers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/universal-hifigan](https://github.com/huseinzol05/malaya-speech/tree/master/example/universal-hifigan).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is language independent, so it save to use on different languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vocoder description\n", "\n", "1. Only accept mel feature size 80.\n", "2. Will generate waveform with 22050 sample rate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explanation\n", "\n", "If you use HiFiGAN Vocoder from https://malaya-speech.readthedocs.io/en/latest/load-vocoder.html, each speaker got their own HiFiGAN Vocoder.\n", "\n", "So we basically scale up the size and trained on multispeakers." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HiFiGAN" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Size (MB)Quantized Size (MB)Mel loss
male8.82.490.4650
female8.82.490.5547
universal-1024170.042.900.3346
universal-76872.818.500.3617
universal-51232.68.600.3253
\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) Mel loss\n", "male 8.8 2.49 0.4650\n", "female 8.8 2.49 0.5547\n", "universal-1024 170.0 42.90 0.3346\n", "universal-768 72.8 18.50 0.3617\n", "universal-512 32.6 8.60 0.3253" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.vocoder.available_hifigan()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load HiFiGAN model\n", "\n", "```python\n", "def hifigan(model: str = 'universal-768', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load HiFiGAN Vocoder model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='universal-768')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'female'`` - HiFiGAN trained on female voice.\n", " * ``'male'`` - HiFiGAN trained on male voice.\n", " * ``'universal-1024'`` - Universal HiFiGAN with 1024 filters trained on multiple speakers.\n", " * ``'universal-768'`` - Universal HiFiGAN with 768 filters trained on multiple speakers.\n", " * ``'universal-512'`` - Universal HiFiGAN with 512 filters trained on multiple speakers.\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.supervised.vocoder.load function\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "model_768 = malaya_speech.vocoder.hifigan(model = 'universal-768')\n", "quantized_model_768 = malaya_speech.vocoder.hifigan(model = 'universal-768', quantized = True)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "model_512 = malaya_speech.vocoder.hifigan(model = 'universal-512')\n", "quantized_model_512 = malaya_speech.vocoder.hifigan(model = 'universal-512', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load some examples\n", "\n", "We use specific stft parameters and steps to convert waveform to melspectrogram for training session, or else these universal melgan models not able to work. Our steps,\n", "\n", "1. Change into melspectrogram.\n", "2. log 10 that melspectrogram.\n", "3. Normalize using global mean and std.\n", "\n", "The models should be able to train without global norm.\n", "\n", "So, to reuse the same steps, use `malaya_speech.featurization.universal_mel` function." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "y, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav', sr = 22050)\n", "mel = malaya_speech.featurization.universal_mel(y)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(y, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.89 s, sys: 597 ms, total: 7.49 s\n", "Wall time: 1.63 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = model_768.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.93 s, sys: 617 ms, total: 7.55 s\n", "Wall time: 1.5 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = quantized_model_768.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.36 s, sys: 604 ms, total: 3.97 s\n", "Wall time: 696 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = model_512.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.43 s, sys: 605 ms, total: 4.04 s\n", "Wall time: 760 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = quantized_model_512.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# try english audio\n", "y, sr = malaya_speech.load('speech/44k/test-2.wav', sr = 22050)\n", "y = y[:sr * 4]\n", "mel = malaya_speech.featurization.universal_mel(y)\n", "ipd.Audio(y, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.11 s, sys: 598 ms, total: 7.71 s\n", "Wall time: 1.56 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = model_768.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.42 s, sys: 583 ms, total: 4.01 s\n", "Wall time: 789 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = model_512.predict([mel])\n", "ipd.Audio(y_[0], rate = 22050)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }