{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Noise Reduction\n", "\n", "Reduce background music, noise, and other interference while preserving voice activity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/noise-reduction](https://github.com/huseinzol05/malaya-speech/tree/master/example/noise-reduction).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline; read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset\n", "\n", "Trained on English, Manglish and Bahasa podcasts with augmented noises, gathered at https://github.com/huseinzol05/malaya-speech/tree/master/data/podcast" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2628043, 44100, 59.59281179138322)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y, sr = malaya_speech.load('speech/podcast/SKOLAR.wav', sr = 44100)\n", "len(y), sr, len(y) / sr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So total length is 60 seconds." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "ipd.Audio(y[:10 * sr], rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This audio extracted from https://www.youtube.com/watch?v=blaIfSWf38Q&t=25s&ab_channel=SkolarMalaysia\n", "\n", "As you can hear, the audio got introduction music overlapped with speakers. So we want to reduce that introduction music and possibly split the audio into voice and background noise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available deep model" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Only calculate SDR, ISR, SAR on voice sample. Higher is better.\n" ] }, { "data": { "text/html": [ "
<table>\n", "<thead>\n",
"<tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>SUM MAE</th><th>MAE_SPEAKER</th><th>MAE_NOISE</th><th>SDR</th><th>ISR</th><th>SAR</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>unet</th><td>78.9</td><td>20.0</td><td>0.862316</td><td>0.460676</td><td>0.40164</td><td>9.173120</td><td>13.92435</td><td>13.20592</td></tr>\n",
"<tr><th>resnet-unet</th><td>96.4</td><td>24.6</td><td>0.825350</td><td>0.438850</td><td>0.38649</td><td>9.454130</td><td>13.96390</td><td>13.60276</td></tr>\n",
"<tr><th>resnext-unet</th><td>75.4</td><td>19.0</td><td>0.811020</td><td>0.447190</td><td>0.36383</td><td>8.992832</td><td>13.49194</td><td>13.13210</td></tr>\n",
"</tbody>\n", "</table>\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) SUM MAE MAE_SPEAKER \\\n", "unet 78.9 20.0 0.862316 0.460676 \n", "resnet-unet 96.4 24.6 0.825350 0.438850 \n", "resnext-unet 75.4 19.0 0.811020 0.447190 \n", "\n", " MAE_NOISE SDR ISR SAR \n", "unet 0.40164 9.173120 13.92435 13.20592 \n", "resnet-unet 0.38649 9.454130 13.96390 13.60276 \n", "resnext-unet 0.36383 8.992832 13.49194 13.13210 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.noise_reduction.available_model()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load deep model\n", "\n", "```python\n", "def deep_model(model: str = 'resnet-unet', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load Noise Reduction deep learning model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='wavenet')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'unet'`` - pretrained UNET.\n", " * ``'resnet-unet'`` - pretrained resnet-UNET.\n", " * ``'resnext'`` - pretrained resnext-UNET.\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. \n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.tf.UNET_STFT class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.noise_reduction.deep_model(model = 'resnet-unet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized deep model\n", "\n", "To load 8-bit quantized model, simply pass `quantized = True`, default is `False`.\n", "\n", "We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "quantized_model = malaya_speech.noise_reduction.deep_model(model = 'resnet-unet', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Important factors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. The noise reduction model was trained at a 44.1 kHz sample rate, so make sure to load the audio at 44100 Hz.\n", "\n", "```python\n", "# with malaya-speech\n", "malaya_speech.load(audio_file, sr = 44100)\n", "# or with librosa\n", "librosa.load(audio_file, sr = 44100)\n", "```\n", "\n", "2. You can feed audio of any length; there is no need to cap it, because the model does the padding itself. However, the longer the audio, the longer the computation takes, unless you have a GPU to speed it up.\n", "3. STFT and inverse STFT can run on the GPU, so the model is really fast on a GPU." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 27.3 s, sys: 3.48 s, total: 30.8 s\n", "Wall time: 6.54 s\n" ] } ], "source": [ "%%time\n", "\n", "output = model(y)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'voice': array([ 7.4655384e-07, -5.3525662e-07, -3.8191757e-07, ...,\n", " -3.0058224e-02, -2.9105157e-02, -2.1171883e-02], dtype=float32),\n", " 'noise': array([-4.3224041e-08, -2.3430280e-06, -3.2800205e-07, ...,\n", " -1.3801644e-03, -3.3497461e-03, -1.9985531e-03], dtype=float32)}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "output" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"ipd.Audio(output['voice'][:10 * sr], rate = sr)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(output['noise'][:10 * sr], rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nicely done! How about our quantized model?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 28.6 s, sys: 3.58 s, total: 32.2 s\n", "Wall time: 8.05 s\n" ] }, { "data": { "text/plain": [ "{'voice': array([ 6.0242473e-07, -6.7520131e-07, -4.9965337e-07, ...,\n", " -3.0453464e-02, -2.9502867e-02, -2.1368120e-02], dtype=float32),\n", " 'noise': array([ 1.01274054e-07, -2.20296351e-06, -2.10108894e-07, ...,\n", " -9.84926941e-04, -2.95203505e-03, -1.80230662e-03], dtype=float32)}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "output_quantized = quantized_model(y)\n", "output_quantized" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(output_quantized['voice'][:10 * sr], rate = sr)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(output_quantized['noise'][:10 * sr], rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Incase your audio is too long and you do not want to burden your machine. 
So, you can use malaya-speech Pipeline to split the audio splitted to 15 seconds, predict one-by-one and combine after that." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p = Pipeline()\n", "pipeline = (\n", " p.map(malaya_speech.generator.frames, frame_duration_ms = 15000, sample_rate = sr)\n", " .foreach_map(model)\n", " .foreach_map(lambda x: x['voice'])\n", " .map(np.concatenate)\n", ")\n", "p.visualize()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 34.8 s, sys: 4.52 s, total: 39.3 s\n", "Wall time: 7.67 s\n" ] } ], "source": [ "%%time\n", "\n", "results = p.emit(y)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['frames', 'noise-reduction', '', 'concatenate'])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.keys()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(results['concatenate'][:10 * sr], rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reference\n", "\n", "1. Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation, Daniel Stoller, Sebastian Ewert, Simon Dixon, https://arxiv.org/abs/1806.03185\n", "2. 
SKOLAR MALAYSIA PODCAST, https://www.youtube.com/watch?v=blaIfSWf38Q&t=25s&ab_channel=SkolarMalaysia" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }