{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech enhancement" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/speech-enhancement](https://github.com/huseinzol05/malaya-speech/tree/master/example/speech-enhancement).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline; read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset\n", "\n", "Trained on English, Manglish and Bahasa podcasts with augmented noise, gathered at https://github.com/huseinzol05/malaya-speech/tree/master/data/podcast" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The purpose of this module is to enhance voice activity and to reduce reverberation, loudness and broken voices.\n", "\n", "**voice -> malaya-speech noise reduction -> malaya-speech speech enhancement**." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(220500, 22050, 10.0)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sr = 22050\n", "y, _ = malaya_speech.load('speech/khutbah/wadi-annuar.wav', sr = sr)\n", "len(y), sr, len(y) / sr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the total length is 10 seconds." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "ipd.Audio(y[:10 * sr], rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The speech has room echo and a slightly broken high pitch because it was recorded in a mosque." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['speech/enhance/461-y_.wav',\n", " 'speech/enhance/125-y_.wav',\n", " 'speech/enhance/371-y_.wav',\n", " 'speech/enhance/328-y_.wav']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from glob import glob\n", "\n", "wavs = glob('speech/enhance/*.wav')\n", "wavs" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "wavs = [malaya_speech.load(f, sr = sr)[0] for f in wavs]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available deep enhance" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Only calculate SDR, ISR, SAR on voice sample. Higher is better.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
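| | Size (MB) | Quantized Size (MB) | SDR | ISR | SAR |
|---|---|---|---|---|---|
| unet | 40.7 | 10.30 | 9.877178 | 15.916217 | 13.70913 |
| resnet-unet | 36.4 | 9.29 | 9.436170 | 16.861030 | 12.32157 |
| resnext-unet | 36.1 | 9.26 | 9.685578 | 16.421370 | 12.45115 |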
\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) SDR ISR SAR\n", "unet 40.7 10.30 9.877178 15.916217 13.70913\n", "resnet-unet 36.4 9.29 9.436170 16.861030 12.32157\n", "resnext-unet 36.1 9.26 9.685578 16.421370 12.45115" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.speech_enhancement.available_deep_enhance()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load deep enhance\n", "\n", "```python\n", "def deep_enhance(model: str = 'unet', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load Speech Enhancement UNET Waveform sampling deep learning model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='unet')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'unet'`` - pretrained UNET Speech Enhancement.\n", " * ``'resnet-unet'`` - pretrained resnet-UNET Speech Enhancement.\n", " * ``'resnext-unet'`` - pretrained resnext-UNET Speech Enhancement.\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. \n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.tf.UNET1D class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n", "/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).\n", " warnings.warn('An interactive session is already active. 
This can '\n" ] } ], "source": [ "model = malaya_speech.speech_enhancement.deep_enhance(model = 'unet')\n", "quantized_model = malaya_speech.speech_enhancement.deep_enhance(model = 'unet', quantized = True)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "resnet = malaya_speech.speech_enhancement.deep_enhance(model = 'resnet-unet')\n", "quantized_resnet = malaya_speech.speech_enhancement.deep_enhance(model = 'resnet-unet', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Important factors for deep enhance\n", "\n", "1. The deep enhance models were trained at a 22,050 Hz sample rate, so make sure to load the audio at 22,050 Hz.\n", "\n", "```python\n", "malaya_speech.load(audio_file, sr = 22050)\n", "librosa.load(audio_file, sr = 22050)\n", "```\n", "\n", "2. You can feed audio of any length without capping it; the model pads the input by itself. However, the longer the audio, the longer the computation takes, unless you have a GPU to speed it up.\n", "3. The model operates at the waveform level, so no STFT or inverse STFT is involved." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict\n", "\n", "The speech enhancement model accepts only one audio input per feed-forward pass:\n", "\n", "```python\n", "def predict(self, input):\n", " \"\"\"\n", " Enhance inputs, will return waveform.\n", "\n", " Parameters\n", " ----------\n", " input: np.array\n", " np.array or malaya_speech.model.frame.Frame.\n", "\n", " Returns\n", " -------\n", " result: np.array\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.1 s, sys: 489 ms, total: 2.59 s\n", "Wall time: 508 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "logits = model.predict(y)\n", "ipd.Audio(logits, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.28 s, sys: 481 ms, total: 2.76 s\n", "Wall time: 515 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_logits = quantized_model.predict(y)\n", "ipd.Audio(quantized_logits, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.62 s, sys: 770 ms, total: 3.39 s\n", "Wall time: 907 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "logits = resnet.predict(y)\n", "ipd.Audio(logits, rate = 22050)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": 
[ "CPU times: user 2.59 s, sys: 728 ms, total: 3.32 s\n", "Wall time: 874 ms\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "logits = quantized_resnet.predict(y)\n", "ipd.Audio(logits, rate = 22050)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Try more examples" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.44 s, sys: 1.16 s, total: 5.6 s\n", "Wall time: 1.01 s\n" ] } ], "source": [ "%%time\n", "\n", "logits = [model.predict(w) for w in wavs]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.68 s, sys: 1.23 s, total: 5.91 s\n", "Wall time: 1.12 s\n" ] } ], "source": [ "%%time\n", "\n", "resnet_logits = [resnet.predict(w) for w in wavs]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.15 s, sys: 1.08 s, total: 5.23 s\n", "Wall time: 952 ms\n" ] } ], "source": [ "%%time\n", "\n", "quantized_logits = [quantized_model.predict(w) for w in wavs]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.66 s, sys: 1.18 s, total: 5.83 s\n", "Wall time: 1.14 s\n" ] } ], "source": [ "%%time\n", "\n", "quantized_resnet_logits = [quantized_resnet.predict(w) for w in wavs]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(wavs[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ 
{ "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(logits[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(resnet_logits[0], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(wavs[1], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(logits[1], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(resnet_logits[1], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(wavs[2], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(logits[2], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], 
"text/plain": [ "" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(resnet_logits[2], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(wavs[3], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(logits[3], rate = 22050)" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(resnet_logits[3], rate = 22050)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available masking model\n", "\n", "The masking model simply masks the STFT of the input to reduce echo, reverberation and broken pitch. It cannot generate new waveform content; e.g., if low or high frequencies have been filtered out of the input waveform, this model cannot restore them. So we prefer to use `malaya_speech.speech_enhancement.deep_enhance`." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Only calculate SDR, ISR, SAR on voice sample. Higher is better.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
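| | Size (MB) | Quantized Size (MB) | SUM MAE | MAE_SPEAKER | MAE_NOISE | SDR | ISR | SAR |
|---|---|---|---|---|---|---|---|---|
| unet | 78.9 | 20.0 | 0.85896 | 0.468490 | 0.390460 | 12.128050 | 14.67067 | 15.019682 |
| resnet-unet | 91.4 | 23.0 | 0.81540 | 0.447958 | 0.367441 | 12.349259 | 14.85418 | 15.217510 |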
\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) SUM MAE MAE_SPEAKER MAE_NOISE \\\n", "unet 78.9 20.0 0.85896 0.468490 0.390460 \n", "resnet-unet 91.4 23.0 0.81540 0.447958 0.367441 \n", "\n", " SDR ISR SAR \n", "unet 12.128050 14.67067 15.019682 \n", "resnet-unet 12.349259 14.85418 15.217510 " ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.speech_enhancement.available_deep_masking()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load masking model\n", "\n", "```python\n", "def deep_masking(model: str = 'resnet-unet', quantized: bool = False, **kwargs):\n", " \"\"\"\n", " Load Speech Enhancement STFT UNET masking deep learning model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='resnet-unet')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'unet'`` - pretrained UNET.\n", " * ``'resnet-unet'`` - pretrained resnet-UNET.\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model. \n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.tf.UNETSTFT class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.speech_enhancement.deep_masking(model = 'resnet-unet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Important factors for deep masking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. The speech enhancement masking models were trained at a 44,100 Hz sample rate, so make sure to load the audio at 44,100 Hz.\n", "\n", "```python\n", "malaya_speech.load(audio_file, sr = 44100)\n", "librosa.load(audio_file, sr = 44100)\n", "```\n", "\n", "2. You can feed audio of any length without capping it; the model pads the input by itself. 
However, the longer the audio, the longer the computation takes, unless you have a GPU to speed it up.\n", "3. The STFT and inverse STFT can run on the GPU, so the model is really fast on a GPU." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(441000, 44100, 10.0)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sr = 44100\n", "y, _ = malaya_speech.load('speech/khutbah/wadi-annuar.wav', sr = sr)\n", "len(y), sr, len(y) / sr" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.82 s, sys: 919 ms, total: 5.73 s\n", "Wall time: 1.8 s\n" ] } ], "source": [ "%%time\n", "\n", "output = model(y)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'voice': array([1.9417714e-08, 2.0993056e-08, 2.4434440e-08, ..., 2.1756661e-01,\n", " 1.9999057e-01, 1.4723262e-01], dtype=float32),\n", " 'noise': array([-1.9704540e-08, -2.3319327e-08, -2.4154849e-08, ...,\n", " 1.5757367e-01, 1.5660551e-01, 9.5091663e-02], dtype=float32)}" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "output" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(output['voice'], rate = sr)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.22 s, sys: 647 ms, total: 4.86 s\n", "Wall time: 795 ms\n" ] } ], "source": [ "%%time\n", "\n", "output = model(malaya_speech.resample(wavs[0], 22050, sr))" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ 
"\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(output['voice'], rate = sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }