{ "cells": [ { "cell_type": "markdown", "id": "adapted-channel", "metadata": {}, "source": [ "# Speech Split PySPTK" ] }, { "cell_type": "markdown", "id": "accessory-relief", "metadata": {}, "source": [ "Detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch, using PySPTK." ] }, { "cell_type": "markdown", "id": "incoming-willow", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/speechsplit-conversion-pysptk](https://github.com/huseinzol05/malaya-speech/tree/master/example/speechsplit-conversion-pysptk).\n", " \n", "
" ] }, { "cell_type": "markdown", "id": "confident-preference", "metadata": {}, "source": [ "
\n", "\n", "This module is language independent, so it is safe to use on different languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "id": "removable-texture", "metadata": {}, "source": [ "### Explanation\n", "\n", "We created a fast Speech Split Conversion model called FastSpeechSplit (Faster and Accurate Speech Split Conversion using Transformer). No paper has been published for it.\n", "\n", "Steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/speechsplit-conversion" ] }, { "cell_type": "markdown", "id": "quality-instrument", "metadata": {}, "source": [ "### F0 Conversion\n", "\n", "Make sure pysptk is already installed:\n", "\n", "```bash\n", "pip install pysptk\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "opponent-wealth", "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "mature-enlargement", "metadata": {}, "source": [ "### List available Speech Split models" ] }, { "cell_type": "code", "execution_count": 2, "id": "strong-might", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<table>\n", " <thead>\n", " <tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>fastspeechsplit-vggvox-v2</th><td>232.0</td><td>59.2</td></tr>\n", " <tr><th>fastspeechsplit-v2-vggvox-v2</th><td>105.0</td><td>411.0</td></tr>\n", " </tbody>\n", "</table>\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB)\n", "fastspeechsplit-vggvox-v2 232.0 59.2\n", "fastspeechsplit-v2-vggvox-v2 105.0 411.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.speechsplit_conversion.available_deep_conversion(f0_mode = 'pysptk')" ] }, { "cell_type": "markdown", "id": "blank-label", "metadata": {}, "source": [ "### Load Deep Conversion\n", "\n", "```python\n", "def deep_conversion(\n", " model: str = 'fastspeechsplit-v2-vggvox-v2',\n", " f0_mode = 'pysptk',\n", " quantized: bool = False,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load Voice Conversion model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='fastspeechsplit-v2-vggvox-v2')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'fastspeechsplit-vggvox-v2'`` - FastSpeechSplit with VGGVox-v2 Speaker Vector.\n", " * ``'fastspeechsplit-v2-vggvox-v2'`` - FastSpeechSplit V2 with VGGVox-v2 Speaker Vector.\n", "\n", " f0_mode : str, optional (default='pysptk')\n", " F0 conversion supported. Allowed values:\n", "\n", " * ``'pysptk'`` - https://github.com/r9y9/pysptk, sensitive towards gender.\n", " * ``'pyworld'`` - https://pypi.org/project/pyworld/\n", " \n", " quantized : bool, optional (default=False)\n", " If True, will load the 8-bit quantized model. 
\n", " A quantized model is not necessarily faster; it depends entirely on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.supervised.speechsplit_conversion.load function\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 23, "id": "collective-recorder", "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-vggvox-v2')\n", "model_v2 = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-v2-vggvox-v2')" ] }, { "cell_type": "markdown", "id": "animal-proxy", "metadata": {}, "source": [ "### Predict\n", "\n", "```python\n", "def predict(\n", " self,\n", " original_audio,\n", " target_audio,\n", " modes = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'],\n", "):\n", " \"\"\"\n", " Change original voice audio to follow targeted voice.\n", "\n", " Parameters\n", " ----------\n", " original_audio: np.array or malaya_speech.model.frame.Frame\n", " target_audio: np.array or malaya_speech.model.frame.Frame\n", " modes: List[str], optional (default = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])\n", " R denotes rhythm, F denotes pitch target, U denotes speaker target (vector).\n", "\n", " * ``'R'`` - maintain `original_audio` F and U on `target_audio` R.\n", " * ``'F'`` - maintain `original_audio` R and U on `target_audio` F.\n", " * ``'U'`` - maintain `original_audio` R and F on `target_audio` U.\n", " * ``'RF'`` - maintain `original_audio` U on `target_audio` R and F.\n", " * ``'RU'`` - maintain `original_audio` F on `target_audio` R and U.\n", " * ``'FU'`` - maintain `original_audio` R on `target_audio` F and U.\n", " * ``'RFU'`` - no conversion happens, just runs the encoder-decoder on `target_audio`.\n", "\n", " Returns\n", " -------\n", " result: Dict[modes]\n", " \"\"\"\n", "```\n", "\n", "**`original_audio` and `target_audio` must have a sample rate of 22050 Hz**." 
] }, { "cell_type": "code", "execution_count": 4, "id": "expanded-wages", "metadata": {}, "outputs": [], "source": [ "sr = 22050\n", "original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]\n", "target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]" ] }, { "cell_type": "code", "execution_count": 5, "id": "ethical-personal", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(original_audio, rate = sr)" ] }, { "cell_type": "code", "execution_count": 6, "id": "silent-thumb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(target_audio[:sr * 2], rate = sr)" ] }, { "cell_type": "code", "execution_count": 7, "id": "sized-swedish", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 32 s, sys: 6.65 s, total: 38.7 s\n", "Wall time: 17.7 s\n" ] }, { "data": { "text/plain": [ "dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "r = model.predict(original_audio, target_audio)\n", "r.keys()" ] }, { "cell_type": "code", "execution_count": 8, "id": "compound-intro", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 58.7 s, sys: 12.5 s, total: 1min 11s\n", "Wall time: 30.5 s\n" ] }, { "data": { "text/plain": [ "dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "r_v2 = model_v2.predict(original_audio, target_audio)\n", "r_v2.keys()" ] }, { "cell_type": "markdown", "id": 
"intense-shadow", "metadata": {}, "source": [ "### Speech Split output\n", "\n", "1. The model returns mel features of size 80.\n", "2. These mel features can only be synthesized with a universal vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html" ] }, { "cell_type": "markdown", "id": "local-works", "metadata": {}, "source": [ "### Load Universal MelGAN\n", "\n", "Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html" ] }, { "cell_type": "code", "execution_count": 10, "id": "hungarian-literature", "metadata": {}, "outputs": [], "source": [ "melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')" ] }, { "cell_type": "code", "execution_count": 11, "id": "marine-impression", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 17.3 s, sys: 3.62 s, total: 20.9 s\n", "Wall time: 6.16 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r['R']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 12, "id": "exclusive-static", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.1 s, sys: 2.42 s, total: 18.5 s\n", "Wall time: 3.41 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r_v2['R']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 13, "id": "dental-slovenia", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 13.6 s, sys: 2.18 s, total: 15.8 s\n", "Wall time: 2.82 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] 
}, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r['F']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 14, "id": "hispanic-textbook", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 13.7 s, sys: 2.25 s, total: 15.9 s\n", "Wall time: 3.09 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r_v2['F']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 15, "id": "twelve-magnet", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.8 s, sys: 2.35 s, total: 17.2 s\n", "Wall time: 3.46 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r['U']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 16, "id": "average-popularity", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14 s, sys: 2.37 s, total: 16.4 s\n", "Wall time: 2.97 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r_v2['U']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 17, "id": "peaceful-width", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.2 s, sys: 2.48 s, total: 18.7 s\n", "Wall time: 3.46 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r['RF']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 18, "id": "gross-samoa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.5 s, sys: 2.62 s, total: 19.1 s\n", "Wall time: 3.64 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r_v2['RF']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 19, "id": "accessory-nudist", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.9 s, sys: 2.59 s, total: 19.5 s\n", "Wall time: 3.85 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r['RU']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 20, "id": "unexpected-sociology", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.8 s, sys: 2.65 s, total: 19.4 s\n", "Wall time: 3.5 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r_v2['RU']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 21, "id": "upper-emission", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.1 s, sys: 2.04 s, total: 16.1 s\n", "Wall time: 2.99 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"%%time\n", "\n", "y_ = melgan.predict([r['FU']])\n", "ipd.Audio(y_[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 22, "id": "egyptian-electricity", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 14.2 s, sys: 2.25 s, total: 16.5 s\n", "Wall time: 2.93 s\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y_ = melgan.predict([r_v2['FU']])\n", "ipd.Audio(y_[0], rate = sr)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 5 }