{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Realtime VAD\n", "\n", "Let's say you want to cut a realtime audio recording into segments using VAD; malaya-speech can do that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/realtime-vad](https://github.com/huseinzol05/malaya-speech/tree/master/example/realtime-vad).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of the malaya-speech Pipeline; read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load VAD model\n", "\n", "The fastest and most commonly used model is webrtc. Read more about VAD at https://malaya-speech.readthedocs.io/en/latest/load-vad.html" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "webrtc = malaya_speech.vad.webrtc()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recording interface\n", "\n", "To record audio with realtime VAD, use `malaya_speech.streaming.record`. It uses the `pyaudio` library as the backend.\n", "\n", "```python\n", "def record(\n", " vad,\n", " asr_model = None,\n", " classification_model = None,\n", " device = None,\n", " input_rate: int = 16000,\n", " sample_rate: int = 16000,\n", " blocks_per_second: int = 50,\n", " padding_ms: int = 300,\n", " ratio: float = 0.75,\n", " min_length: float = 0.1,\n", " filename: str = None,\n", " spinner: bool = False,\n", "):\n", " \"\"\"\n", " Record an audio using pyaudio library. 
This record interface required a VAD model.\n", "\n", " Parameters\n", " ----------\n", " vad: object\n", " vad model / pipeline.\n", " asr_model: object\n", " ASR model / pipeline, will transcribe each subsamples realtime.\n", " classification_model: object\n", " classification pipeline, will classify each subsamples realtime.\n", " device: None\n", " `device` parameter for pyaudio, check available devices from `sounddevice.query_devices()`.\n", " input_rate: int, optional (default = 16000)\n", " sample rate from input device, this will auto resampling.\n", " sample_rate: int, optional (default = 16000)\n", " output sample rate.\n", " blocks_per_second: int, optional (default = 50)\n", " size of frame returned from pyaudio, frame size = sample rate / (blocks_per_second / 2).\n", " 50 is good for WebRTC, 30 or less is good for Malaya Speech VAD.\n", " padding_ms: int, optional (default = 300)\n", " size of queue to store frames, size = padding_ms // (1000 * blocks_per_second // sample_rate)\n", " ratio: float, optional (default = 0.75)\n", " if 75% of the queue is positive, assumed it is a voice activity.\n", " min_length: float, optional (default=0.1)\n", " minimum length (s) to accept a subsample.\n", " filename: str, optional (default=None)\n", " if None, will auto generate name based on timestamp.\n", " spinner: bool, optional (default=False)\n", " if True, will use spinner object from halo library.\n", "\n", "\n", " Returns\n", " -------\n", " result : [filename, samples]\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Once you run the code below, it will start recording your voice straight away**. Right now I am using the built-in microphone.\n", "\n", "If you run it in a Jupyter notebook, press the stop button in the toolbar to stop recording; in a terminal, press `CTRL + c`." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Listening (ctrl-C to stop recording) ...\n", "saved audio to savewav_2020-11-26_22-36-06_294832.wav\n" ] }, { "data": { "text/plain": [ "'savewav_2020-11-26_22-36-06_294832.wav'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "\r" ] } ], "source": [ "file, samples = malaya_speech.streaming.record(webrtc)\n", "file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the audio at [malaya-speech/speech/record](https://github.com/huseinzol05/malaya-speech/tree/master/speech/record)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the recording is automatically saved to a file, and you can check the length of `samples`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "\r" ] } ], "source": [ "len(samples)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means we got 4 subsamples!" 
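] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The split into subsamples follows from the `padding_ms` and `ratio` parameters in the docstring above: recent VAD decisions are kept in a short queue, and a subsample is opened or closed once enough of the queue agrees. The sketch below illustrates that idea on a hand-made list of per-frame decisions. It is modelled on the common ring-buffer pattern and is not taken from the malaya-speech source; `split_by_vad`, `flags` and `num_padding_frames` are hypothetical names, and the closing rule (enough unvoiced frames in the queue) is an assumption.

```python
from collections import deque


def split_by_vad(flags, num_padding_frames=3, ratio=0.75):
    '''Group frame indices into voiced segments using a sliding queue.

    flags: list of booleans, one VAD decision per frame (hypothetical input).
    '''
    queue = deque(maxlen=num_padding_frames)
    triggered = False
    current, segments = [], []
    for index, is_speech in enumerate(flags):
        queue.append((index, is_speech))
        if not triggered:
            voiced = sum(1 for _, flag in queue if flag)
            if voiced > ratio * queue.maxlen:
                # Enough recent frames are speech: open a segment and keep
                # the queued frames so the onset is not clipped.
                triggered = True
                current = [i for i, _ in queue]
                queue.clear()
        else:
            current.append(index)
            unvoiced = sum(1 for _, flag in queue if not flag)
            if unvoiced > ratio * queue.maxlen:
                # Enough recent frames are silence: close the segment.
                triggered = False
                segments.append(current)
                current = []
                queue.clear()
    if current:
        # Recording stopped while a segment was still open.
        segments.append(current)
    return segments


# Hand-made VAD decisions standing in for a real stream of frames.
flags = [False, False, True, True, True, True, False, False,
         False, False, True, True, True, False, False, False]
segments = split_by_vad(flags)
print(segments)
```

Each inner list holds the frame indices of one subsample; with a real recording the entries would be audio frames rather than indices, and a longer queue makes the cut less sensitive to short pauses."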
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "\r" ] } ], "source": [ "import IPython.display as ipd\n", "ipd.Audio(file)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "\r" ] } ], "source": [ "ipd.Audio(malaya_speech.astype.to_ndarray(samples[0][0]), rate = 16000)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "\r" ] } ], "source": [ "ipd.Audio(malaya_speech.astype.to_ndarray(samples[1][0]), rate = 16000)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "\r" ] } ], "source": [ "ipd.Audio(malaya_speech.astype.to_ndarray(samples[2][0]), rate = 16000)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "\r" ] } ], "source": [ "ipd.Audio(malaya_speech.astype.to_ndarray(samples[3][0]), rate = 16000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use pipeline\n", "\n", "webrtc does not work well in 
noisy environments, so to improve on that, we can use a deep VAD model from malaya-speech." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:66: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:66: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:68: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:68: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:61: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:61: The name tf.InteractiveSession is deprecated. 
Please use tf.compat.v1.InteractiveSession instead.\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r" ] } ], "source": [ "vad_model = malaya_speech.vad.deep_model(model = 'vggvox-v2', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**pyaudio returns int16 bytes, so we need to convert them to a numpy array and normalize it to floating point between -1 and +1**." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "\r" ] } ], "source": [ "p = Pipeline()\n", "pipeline = (\n", " p.map(malaya_speech.astype.to_ndarray)\n", " .map(malaya_speech.astype.int_to_float)\n", " .map(vad_model)\n", ")\n", "p.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Once you run the code below, it will start recording your voice straight away**. Right now I am using the built-in microphone.\n", "\n", "If you run it in a Jupyter notebook, press the stop button in the toolbar to stop recording; in a terminal, press `CTRL + c`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=512 is too small for input signal of length=320\n", " n_fft, y.shape[-1]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Listening (ctrl-C to stop recording) ...\n", "saved audio to savewav_2020-11-26_22-40-56_929661.wav\n" ] },
{ "data": { "text/plain": [ "'savewav_2020-11-26_22-40-56_929661.wav'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [] } ], "source": [ "file, samples = malaya_speech.streaming.record(p)\n", "file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the audio at [malaya-speech/speech/record](https://github.com/huseinzol05/malaya-speech/tree/master/speech/record)." 
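] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first two steps of the pipeline above turn the raw int16 bytes from pyaudio into a normalized float array before the VAD model sees them. As a rough, library-independent sketch of that kind of conversion, the snippet below uses plain numpy. `bytes_to_float` is a hypothetical helper, and dividing by 32768 is an assumption about the scaling; `malaya_speech.astype.int_to_float` may normalize differently.

```python
import numpy as np


def bytes_to_float(frame):
    '''Convert raw little-endian int16 PCM bytes into float32 in [-1, 1).'''
    ints = np.frombuffer(frame, dtype='<i2')
    # 32768 is the magnitude of the most negative int16 value, so the
    # result always stays inside the range [-1, 1).
    return ints.astype(np.float32) / 32768.0


# Three hand-made samples standing in for one pyaudio frame.
frame = np.array([0, 16384, -32768], dtype='<i2').tobytes()
floats = bytes_to_float(frame)
print(floats)
```

Silence maps to 0.0, half of full scale to 0.5, and the most negative sample to -1.0."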
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [] } ], "source": [ "len(samples)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [] } ], "source": [ "ipd.Audio(file)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [] } ], "source": [ "ipd.Audio(malaya_speech.astype.to_ndarray(samples[0]), rate = 16000)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [] } ], "source": [ "ipd.Audio(malaya_speech.astype.to_ndarray(samples[1]), rate = 16000)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [] } ], "source": [ "ipd.Audio(malaya_speech.astype.to_ndarray(samples[2]), rate = 16000)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [] } ], "source": [ 
"ipd.Audio(malaya_speech.astype.to_ndarray(samples[3]), rate = 16000)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }