{ "cells": [ { "cell_type": "markdown", "id": "frozen-champion", "metadata": {}, "source": [ "# WAV separation" ] }, { "cell_type": "markdown", "id": "challenging-viking", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/multispeaker-separation-wav](https://github.com/huseinzol05/malaya-speech/tree/master/example/multispeaker-separation-wav).\n", " \n", "
" ] }, { "cell_type": "markdown", "id": "dense-canadian", "metadata": {}, "source": [ "
\n", "\n", "This module is language independent, so it is safe to use on different languages.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "id": "structured-machine", "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline\n", "import matplotlib.pyplot as plt\n", "import IPython.display as ipd" ] }, { "cell_type": "markdown", "id": "legislative-bulletin", "metadata": {}, "source": [ "### Multispeaker separation description\n", "\n", "1. FastSep-WAV only able to separate 8k sample rate.\n", "2. FastSep-WAV trained to separate 4 unique speakers.\n", "3. Trained on VCTK, Nepali, Mandarin and Malay mixed speakers." ] }, { "cell_type": "markdown", "id": "subsequent-benefit", "metadata": {}, "source": [ "### List available FastSep-WAV" ] }, { "cell_type": "code", "execution_count": 2, "id": "arctic-stone", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Tested on 1k samples\n" ] }, { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>SISNR PIT</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>fastsep-2</th><td>78.7</td><td>20.5</td><td>14.156882</td></tr>\n", "    <tr><th>fastsep-4</th><td>155.0</td><td>40.2</td><td>19.682500</td></tr>\n", "  </tbody>\n", "</table>
" ], "text/plain": [ " Size (MB) Quantized Size (MB) SISNR PIT\n", "fastsep-2 78.7 20.5 14.156882\n", "fastsep-4 155.0 40.2 19.682500" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.multispeaker_separation.available_deep_wav()" ] }, { "cell_type": "markdown", "id": "reverse-pantyhose", "metadata": {}, "source": [ "### Load model\n", "\n", "```python\n", "def deep_wav(model: str = 'fastsep-4', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load FastSep model, trained on raw 8k wav using SISNR PIT loss.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='fastsep-4')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'fastsep-4'`` - FastSep 4 layers trained on raw 8k wav.\n", " * ``'fastsep-2'`` - FastSep 2 layers trained on raw 8k wav.\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. \n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.tf.Split class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "id": "divine-horse", "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.multispeaker_separation.deep_wav('fastsep-4')" ] }, { "cell_type": "markdown", "id": "affiliated-morgan", "metadata": {}, "source": [ "### Load quantized model" ] }, { "cell_type": "code", "execution_count": 4, "id": "fourth-rendering", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "quantized_model = malaya_speech.multispeaker_separation.deep_wav('fastsep-4', quantized = True)" ] }, { "cell_type": "markdown", "id": "controversial-composite", "metadata": {}, "source": [ "### Generate random mixed audio" ] }, { "cell_type": "code", "execution_count": 5, "id": "governing-madrid", "metadata": {}, 
"outputs": [], "source": [ "import random\n", "import malaya_speech.augmentation.waveform as augmentation\n", "\n", "sr = 8000\n", "speakers_size = 4\n", "\n", "def read_wav(f):\n", "    # Returns a tuple; index 0 holds the samples.\n", "    return malaya_speech.load(f, sr = sr)\n", "\n", "\n", "def random_sampling(s, length):\n", "    return augmentation.random_sampling(s, sr = sr, length = length)\n", "\n", "def combine_speakers(files, n = 5, limit = 4):\n", "    w_samples = random.sample(files, n)\n", "    w_samples = [read_wav(f)[0] for f in w_samples]\n", "    # Clip length in milliseconds: the shortest file or a random 3-7 s, whichever is smaller.\n", "    w_lens = [len(w) / sr for w in w_samples]\n", "    w_lens = int(min(min(w_lens) * 1000, random.randint(3000, 7000)))\n", "    w_samples = [random_sampling(w, length = w_lens) for w in w_samples]\n", "    y = [w_samples[0]]\n", "    left = w_samples[0].copy()\n", "\n", "    combined = None\n", "\n", "    for i in range(1, n):\n", "        right = w_samples[i].copy()\n", "        # Each new speaker overlaps the running mix almost completely (98-100%).\n", "        overlap = random.uniform(0.98, 1.0)\n", "        len_overlap = int(overlap * len(right))\n", "        minus = len(left) - len_overlap\n", "        if minus < 0:\n", "            minus = 0\n", "        padded_right = np.pad(right, (minus, 0))\n", "        left = np.pad(left, (0, len(padded_right) - len(left)))\n", "\n", "        left = left + padded_right\n", "\n", "        # Speakers beyond `limit` are summed into one extra channel.\n", "        if i >= (limit - 1):\n", "            if combined is None:\n", "                combined = padded_right\n", "            else:\n", "                combined = np.pad(\n", "                    combined, (0, len(padded_right) - len(combined))\n", "                )\n", "                combined += padded_right\n", "\n", "        else:\n", "            y.append(padded_right)\n", "\n", "    if combined is not None:\n", "        y.append(combined)\n", "\n", "    # Pad every source to the mix length and track the peak of each one.\n", "    maxs = [max(left)]\n", "    for i in range(len(y)):\n", "        if len(y[i]) != len(left):\n", "            y[i] = np.pad(y[i], (0, len(left) - len(y[i])))\n", "        maxs.append(max(y[i]))\n", "\n", "    # Scale the mix and the sources together so the largest peak becomes 0.9.\n", "    max_amp = max(maxs)\n", "    mix_scaling = 1 / max_amp * 0.9\n", "    left = left * mix_scaling\n", "\n", "    for i in range(len(y)):\n", "        y[i] = y[i] * mix_scaling\n", "\n", "    return left, y" ] }, { "cell_type": "code", "execution_count": 6, "id": "effective-bidding", "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"23" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from glob import glob\n", "\n", "wavs = glob('speech/example-speaker/*.wav')\n", "wavs.extend(glob('speech/vctk/*.flac'))\n", "len(wavs)" ] }, { "cell_type": "code", "execution_count": 18, "id": "portable-medication", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3.45525, 4)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left, y = combine_speakers(wavs, speakers_size)\n", "len(left) / sr, len(y)" ] }, { "cell_type": "code", "execution_count": 19, "id": "biological-bridge", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(left, rate = sr)" ] }, { "cell_type": "code", "execution_count": 20, "id": "liquid-westminster", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.plot(left, label = 'mixed')\n", "plt.plot(y[0], label = 'y0')\n", "plt.plot(y[1], label = 'y1')\n", "plt.plot(y[2], label = 'y2')\n", "plt.plot(y[3], label = 'y3')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "official-ensemble", "metadata": {}, "source": [ "### Predict\n", "\n", "```python\n", "def predict(self, input):\n", " \"\"\"\n", " Split an audio into 4 different speakers.\n", "\n", " Parameters\n", " ----------\n", " input: np.array or malaya_speech.model.frame.Frame\n", "\n", " Returns\n", " -------\n", " result: np.array\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 21, "id": "neither-lancaster", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 20.5 s, sys: 1.68 s, total: 22.2 s\n", "Wall time: 3.51 s\n" ] }, { "data": { "text/plain": [ "(4, 27642)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "y = model.predict(left)\n", "y.shape" ] }, { "cell_type": "code", "execution_count": 22, "id": "conditional-namibia", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 20.8 s, sys: 1.73 s, total: 22.5 s\n", "Wall time: 3.45 s\n" ] }, { "data": { "text/plain": [ "(4, 27642)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_y = quantized_model.predict(left)\n", "quantized_y.shape" ] }, { "cell_type": "markdown", "id": "popular-wednesday", "metadata": {}, "source": [ "### Results" ] }, { "cell_type": "code", "execution_count": 23, "id": "above-fetish", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(y[0], rate = sr)" ] }, { 
"cell_type": "code", "execution_count": 24, "id": "olympic-combination", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(y[1], rate = sr)" ] }, { "cell_type": "code", "execution_count": 25, "id": "dedicated-movie", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(y[2], rate = sr)" ] }, { "cell_type": "code", "execution_count": 26, "id": "mexican-algorithm", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(y[3], rate = sr)" ] }, { "cell_type": "markdown", "id": "weird-baseball", "metadata": {}, "source": [ "### Quantized results" ] }, { "cell_type": "code", "execution_count": 27, "id": "proper-labor", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(quantized_y[0], rate = sr)" ] }, { "cell_type": "code", "execution_count": 28, "id": "arctic-sessions", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(quantized_y[1], rate = sr)" ] }, { "cell_type": "code", "execution_count": 29, "id": "reduced-million", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(quantized_y[2], rate = sr)" ] }, { "cell_type": "code", "execution_count": 30, "id": "elementary-ribbon", "metadata": {}, "outputs": [ 
{ "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(quantized_y[3], rate = sr)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 5 }