{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Realtime ASR + Rubberband\n", "\n", "To improve Realtime ASR, you can stretch voice output using [pyrubberband](https://github.com/bmcfee/pyrubberband)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/realtime-asr](https://github.com/huseinzol05/malaya-speech/tree/master/example/realtime-asr).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it not save to use on different languages. Pretrained models trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install pyrubberband\n", "\n", "```bash\n", "pip3 install pyrubberband\n", "```\n", "\n", "[pyrubberband](https://github.com/bmcfee/pyrubberband) is a python wrapper for [rubberband](http://breakfastquay.com/rubberband/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load VAD model\n", "\n", "We are going to use WebRTC VAD model, read more about VAD at https://malaya-speech.readthedocs.io/en/latest/load-vad.html" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [], "source": [ "vad_model = malaya_speech.vad.webrtc()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recording interface\n", "\n", "So, to start recording audio including realtime VAD and Classification, we need to use `malaya_speech.streaming.record`. We use `pyaudio` library as the backend.\n", "\n", "```python\n", "def record(\n", " vad,\n", " asr_model = None,\n", " classification_model = None,\n", " device = None,\n", " input_rate: int = 16000,\n", " sample_rate: int = 16000,\n", " blocks_per_second: int = 50,\n", " padding_ms: int = 300,\n", " ratio: float = 0.75,\n", " min_length: float = 0.1,\n", " filename: str = None,\n", " spinner: bool = False,\n", "):\n", " \"\"\"\n", " Record an audio using pyaudio library. 
This record interface required a VAD model.\n", "\n", " Parameters\n", " ----------\n", " vad: object\n", " vad model / pipeline.\n", " asr_model: object\n", " ASR model / pipeline, will transcribe each subsamples realtime.\n", " classification_model: object\n", " classification pipeline, will classify each subsamples realtime.\n", " device: None\n", " `device` parameter for pyaudio, check available devices from `sounddevice.query_devices()`.\n", " input_rate: int, optional (default = 16000)\n", " sample rate from input device, this will auto resampling.\n", " sample_rate: int, optional (default = 16000)\n", " output sample rate.\n", " blocks_per_second: int, optional (default = 50)\n", " size of frame returned from pyaudio, frame size = sample rate / (blocks_per_second / 2).\n", " 50 is good for WebRTC, 30 or less is good for Malaya Speech VAD.\n", " padding_ms: int, optional (default = 300)\n", " size of queue to store frames, size = padding_ms // (1000 * blocks_per_second // sample_rate)\n", " ratio: float, optional (default = 0.75)\n", " if 75% of the queue is positive, assumed it is a voice activity.\n", " min_length: float, optional (default=0.1)\n", " minimum length (s) to accept a subsample.\n", " filename: str, optional (default=None)\n", " if None, will auto generate name based on timestamp.\n", " spinner: bool, optional (default=False)\n", " if True, will use spinner object from halo library.\n", "\n", "\n", " Returns\n", " -------\n", " result : [filename, samples]\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**pyaudio returns int16 bytes, so we need to convert them to a numpy array and normalize the values to floating point between -1 and +1**." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check available devices" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "> 0 External Microphone, Core Audio (1 in, 0 out)\n", "< 1 External Headphones, Core Audio (0 in, 2 out)\n", " 2 MacBook Pro Microphone, Core Audio (1 in, 0 out)\n", " 3 MacBook Pro Speakers, Core Audio (0 in, 2 out)\n", " 4 JustStream Audio Driver, Core Audio (2 in, 2 out)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sounddevice\n", "\n", "sounddevice.query_devices()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, it uses the device at index `0`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load ASR model" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.stt.deep_transducer(model = 'conformer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test pyrubberband" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import pyrubberband as pyrb\n", "import IPython.display as ipd" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "y, sr = malaya_speech.load('speech/example-speaker/husein-zolkepli.wav')\n", "# a rate below 1 slows the audio down, so 0.8 makes it 25% longer\n", "y_stretch = pyrb.time_stretch(y, sr, 0.8)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(y, rate = sr)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(y_stretch, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "### ASR Pipeline\n", "\n", "Because pyaudio will returned int16 bytes, so we need to change to numpy array then normalize to float, feel free to add speech enhancement or any function, but in this example, I just keep it simple." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sr = 16000\n", "stretch_ratio = 0.85\n", "p_asr = Pipeline()\n", "pipeline_asr = (\n", " p_asr.map(malaya_speech.astype.to_ndarray)\n", " .map(malaya_speech.astype.int_to_float)\n", " .map(lambda x: pyrb.time_stretch(x, sr, stretch_ratio), name = 'stretch')\n", " .map(lambda x: model(x), name = 'speech-to-text')\n", ")\n", "p_asr.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**You need to make sure the last output should named as `speech-to-text` or else the realtime engine will throw an error**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start Recording" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Again, once you start to run the code below, it will straight away recording your voice**. \n", "\n", "If you run in jupyter notebook, press button stop up there to stop recording, if in terminal, press `CTRL + c`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Listening (ctrl-C to stop recording) ... 
\n", "\n", "Sample 0 2021-06-12 11:03:39.804085: helo nama saya hussein \n", "Sample 1 2021-06-12 11:03:40.041678: hari ini saya mahu cakap tentang bohong\n", "Sample 2 2021-06-12 11:03:40.274277: apakah itu bon\n", "Sample 3 2021-06-12 11:03:40.543731: yalah pinjaman yang dikeluarkan oleh\n", "Sample 4 2021-06-12 11:03:42.442084: atau kerajaan\n", "Sample 5 2021-06-12 11:03:46.897266: tak ada investor dengan bunga setiap tahun\n", "Sample 6 2021-06-12 11:03:51.883495: enam bulan tiga bulan bergantung kepada bank dan akan ada tempoh mata\n", "Sample 7 2021-06-12 11:03:54.447205: terima kasih\n", "saved audio to realtime-asr-rubberband.wav\n" ] }, { "data": { "text/plain": [ "'realtime-asr-rubberband.wav'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file, samples = malaya_speech.streaming.record(vad = vad_model, asr_model = p_asr, spinner = False,\n", " filename = 'realtime-asr-rubberband.wav')\n", "file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "the wav file can get at [malaya-speech/speech/record](https://github.com/huseinzol05/malaya-speech/tree/master/speech/record)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Actually it is pretty nice. As you can see, it able to transcribe realtime, you can try it by yourself." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(file)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "(bytearray,\n", " 'enam bulan tiga bulan bergantung kepada bank dan akan ada tempoh mata')" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(samples[6][0]), samples[6][1]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = malaya_speech.utils.astype.to_ndarray(samples[6][0])\n", "ipd.Audio(y, rate = 16000)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }