{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Voice" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many-to-One, One-to-Many, Many-to-Many, and Zero-shot Voice Conversion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/voice-conversion](https://github.com/huseinzol05/malaya-speech/tree/master/example/voice-conversion).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is language independent, so it is safe to use with different languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explanation\n", "\n", "We created super fast Voice Conversion model, called FastVC, Faster and Accurate Voice Conversion using Transformer. No paper produced.\n", "\n", "Steps to reproduce can check at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/voice-conversion" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available Voice Conversion models" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>Total loss</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>fastvc-32-vggvox-v2</th><td>190.0</td><td>54.1</td><td>0.2851</td></tr>\n", "<tr><th>fastvc-64-vggvox-v2</th><td>194.0</td><td>55.7</td><td>0.2764</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Size (MB) Quantized Size (MB) Total loss\n", "fastvc-32-vggvox-v2 190.0 54.1 0.2851\n", "fastvc-64-vggvox-v2 194.0 55.7 0.2764" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.voice_conversion.available_deep_conversion()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Deep Conversion\n", "\n", "```python\n", "def deep_conversion(\n", " model: str = 'fastvc-vggvox-v2', quantized: bool = False, **kwargs\n", "):\n", " \"\"\"\n", " Load Voice Conversion model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='fastvc-vggvox-v2')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'fastvc-32-vggvox-v2'`` - FastVC bottleneck size 32 with VGGVox-v2 Speaker Vector.\n", " * ``'fastvc-64-vggvox-v2'`` - FastVC bottleneck size 64 with VGGVox-v2 Speaker Vector.\n", " \n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. 
\n", " Quantized model is not necessarily faster; it depends entirely on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.supervised.voice_conversion.load function\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2')\n", "quantized_model = malaya_speech.voice_conversion.deep_conversion(model = 'fastvc-32-vggvox-v2', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict\n", "\n", "```python\n", "def predict(self, original_audio, target_audio):\n", " \"\"\"\n", " Change original voice audio to follow targeted voice.\n", "\n", " Parameters\n", " ----------\n", " original_audio: np.array or malaya_speech.model.frame.Frame\n", " target_audio: np.array or malaya_speech.model.frame.Frame\n", "\n", " Returns\n", " -------\n", " result: Dict[decoder-output, postnet-output]\n", " \"\"\"\n", "```\n", "\n", "**`original_audio` and `target_audio` must have a 22050 Hz sample rate**." 
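, "

The cells in this tutorial get the 22050 Hz rate by passing `sr = sr` to `malaya_speech.load`. If a waveform is already in memory at another rate, it has to be resampled first. The sketch below is only an illustration with plain NumPy: `resample_linear` is a hypothetical helper, not part of malaya-speech, and linear interpolation is a crude stand-in for a proper band-limited resampler (it applies no low-pass filter, so downsampling can alias).

```python
import numpy as np


def resample_linear(audio, orig_sr, target_sr=22050):
    # Hypothetical helper (not a malaya-speech API): resample a 1-D
    # waveform by linear interpolation between neighbouring samples.
    audio = np.asarray(audio, dtype=np.float32)
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Time stamps, in seconds, of the input and output samples.
    t_in = np.arange(len(audio)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, audio).astype(np.float32)


# One second of a 440 Hz tone sampled at 44100 Hz.
tone = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
resampled = resample_linear(tone, orig_sr=44100)
print(resampled.shape, resampled.dtype)
```

For real recordings, loading the file with `malaya_speech.load(path, sr = 22050)` as done in this tutorial is likely the safer choice.
"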
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sr = 22050\n", "original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]\n", "target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(original_audio, rate = sr)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(target_audio[:sr * 2], rate = sr)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9.52 s, sys: 1.6 s, total: 11.1 s\n", "Wall time: 3.21 s\n" ] }, { "data": { "text/plain": [ "{'decoder-output': array([[ 0.16796653, 0.27031827, 0.25115278, ..., 1.9728385 ,\n", " 2.0013132 , 1.9959606 ],\n", " [ 0.1876081 , 0.31539977, 0.21735613, ..., 2.105957 ,\n", " 2.1475153 , 2.135561 ],\n", " [ 0.11078158, 0.24430256, 0.13483176, ..., 2.2050035 ,\n", " 2.2327175 , 2.2086055 ],\n", " ...,\n", " [-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,\n", " -1.4182267 , -1.445814 ],\n", " [-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,\n", " -1.6694924 , -1.670802 ],\n", " [-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,\n", " -1.7430477 , -1.7306981 ]], dtype=float32),\n", " 'postnet-output': array([[ 0.16796653, 0.27031827, 0.25115278, ..., 1.9728385 ,\n", " 2.0013132 , 1.9959606 ],\n", " [ 0.1876081 , 0.31539977, 0.21735613, ..., 2.105957 ,\n", " 2.1475153 , 2.135561 ],\n", " [ 0.11078158, 0.24430256, 0.13483176, ..., 
2.2050035 ,\n", " 2.2327175 , 2.2086055 ],\n", " ...,\n", " [-0.46983352, -0.37537116, -0.46007934, ..., -1.3968909 ,\n", " -1.4182267 , -1.445814 ],\n", " [-0.6261345 , -0.52298963, -0.6305046 , ..., -1.6692938 ,\n", " -1.6694924 , -1.670802 ],\n", " [-0.7858655 , -0.6631793 , -0.7685092 , ..., -1.7505003 ,\n", " -1.7430477 , -1.7306981 ]], dtype=float32)}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "r = model.predict(original_audio, target_audio)\n", "r" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9.27 s, sys: 1.48 s, total: 10.7 s\n", "Wall time: 3.07 s\n" ] }, { "data": { "text/plain": [ "{'decoder-output': array([[ 0.20622607, 0.31927785, 0.30248964, ..., 1.8387263 ,\n", " 1.8538276 , 1.8661375 ],\n", " [ 0.26772612, 0.37867302, 0.28368202, ..., 2.0063264 ,\n", " 2.0300496 , 2.027563 ],\n", " [ 0.22045831, 0.35479122, 0.24202934, ..., 2.1292984 ,\n", " 2.1489828 , 2.1232607 ],\n", " ...,\n", " [-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,\n", " -1.47401 , -1.4887681 ],\n", " [-0.553902 , -0.47220862, -0.60245174, ..., -1.6579611 ,\n", " -1.7115406 , -1.7125119 ],\n", " [-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,\n", " -1.7799759 , -1.766048 ]], dtype=float32),\n", " 'postnet-output': array([[ 0.20622607, 0.31927785, 0.30248964, ..., 1.8387263 ,\n", " 1.8538276 , 1.8661375 ],\n", " [ 0.26772612, 0.37867302, 0.28368202, ..., 2.0063264 ,\n", " 2.0300496 , 2.027563 ],\n", " [ 0.22045831, 0.35479122, 0.24202934, ..., 2.1292984 ,\n", " 2.1489828 , 2.1232607 ],\n", " ...,\n", " [-0.37217844, -0.30496663, -0.40188327, ..., -1.4102241 ,\n", " -1.47401 , -1.4887681 ],\n", " [-0.553902 , -0.47220862, -0.60245174, ..., -1.6579611 ,\n", " -1.7115406 , -1.7125119 ],\n", " [-0.7077116 , -0.60712785, -0.7680642 , ..., -1.7266644 ,\n", " -1.7799759 , -1.766048 ]], dtype=float32)}" ] }, 
"execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "quantized_r = quantized_model.predict(original_audio, target_audio)\n", "quantized_r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Voice Conversion output\n", "\n", "1. Will returned mel feature size 80.\n", "2. This mel feature only able to synthesize using Universal Vocoder, eg, Universal Melgan, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Universal MelGAN\n", "\n", "Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.6 s, sys: 2.29 s, total: 16.9 s\n", "Wall time: 3.46 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r['postnet-output']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.1 s, sys: 1.93 s, total: 16 s\n", "Wall time: 3 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([quantized_r['postnet-output']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretty good!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More examples\n", "\n", "This time the original voice is English, and the target voices are Malay and English." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "original_audio = malaya_speech.load('speech/44k/test-2.wav', sr = sr)[0]\n", "ipd.Audio(original_audio, rate = sr)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_audio = malaya_speech.load('speech/vctk/p300_298_mic1.flac', sr = sr)[0]\n", "r = model.predict(original_audio, target_audio)\n", "ipd.Audio(target_audio, rate = sr)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan.predict([r['postnet-output']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_audio = malaya_speech.load('speech/vctk/p323_158_mic2.flac', sr = sr)[0]\n", "r = model.predict(original_audio, target_audio)\n", "ipd.Audio(target_audio, rate = sr)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan.predict([r['postnet-output']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { 
"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_audio = malaya_speech.load('speech/vctk/p360_292_mic2.flac', sr = sr)[0]\n", "r = model.predict(original_audio, target_audio)\n", "ipd.Audio(target_audio, rate = sr)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan.predict([r['postnet-output']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_audio = malaya_speech.load('speech/vctk/p361_077_mic1.flac', sr = sr)[0]\n", "r = model.predict(original_audio, target_audio)\n", "ipd.Audio(target_audio, rate = sr)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan.predict([r['postnet-output']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]\n", "r = model.predict(original_audio, target_audio)\n", "ipd.Audio(target_audio, rate = sr)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], 
"text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan.predict([r['postnet-output']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_audio = malaya_speech.load('speech/example-speaker/husein-zolkepli.wav', sr = sr)[0]\n", "ipd.Audio(target_audio, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a low quality audio, you can use speech enhancement, https://malaya-speech.readthedocs.io/en/latest/load-speech-enhancement.html" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "enhancer = malaya_speech.speech_enhancement.deep_enhance(model = 'unet-enhance-24')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logits = enhancer.predict(target_audio)\n", "ipd.Audio(logits, rate = sr)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "r = model.predict(original_audio, target_audio)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_ = melgan.predict([r['postnet-output']])\n", "ipd.Audio(y_[0], rate = sr)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": 
"python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }