{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Realtime ASR without VAD\n", "\n", "Suppose you want to transcribe a realtime recording / input using PyAudio without VAD; malaya-speech is able to do that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/realtime-asr-without-vad](https://github.com/huseinzol05/malaya-speech/tree/master/example/realtime-asr-without-vad).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of the malaya-speech Pipeline; read more about the malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Streaming interface\n", "\n", "```python\n", "def stream(\n", " vad_model=None,\n", " asr_model=None,\n", " classification_model=None,\n", " sample_rate: int = 16000,\n", " segment_length: int = 2560,\n", " num_padding_frames: int = 20,\n", " ratio: float = 0.75,\n", " min_length: float = 0.1,\n", " max_length: float = 10.0,\n", " realtime_print: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Stream an audio using pyaudio library.\n", "\n", " Parameters\n", " ----------\n", " vad_model: object, optional (default=None)\n", " vad model / pipeline.\n", " asr_model: object, optional (default=None)\n", " ASR model / pipeline, will transcribe each subsamples realtime.\n", " classification_model: object, optional (default=None)\n", " classification pipeline, will classify each subsamples realtime.\n", " device: None, optional (default=None)\n", " `device` parameter for pyaudio, check available devices from `sounddevice.query_devices()`.\n", " sample_rate: int, optional (default = 16000)\n", " output sample rate.\n", " segment_length: int, optional (default=2560)\n", " usually derived from asr_model.segment_length * asr_model.hop_length,\n", " size of audio chunks, actual size in term of second is `segment_length` / `sample_rate`.\n", " ratio: float, optional (default = 0.75)\n", " if 75% of the queue is positive, assumed it is a voice activity.\n", " min_length: float, optional (default=0.1)\n", " minimum length (second) to accept a subsample.\n", " max_length: float, optional (default=10.0)\n", " maximum length (second) to accept a subsample.\n", " realtime_print: bool, optional (default=True)\n", " Will print results for ASR.\n", " **kwargs: vector argument\n", " vector argument pass to malaya_speech.streaming.pyaudio.Audio interface.\n", 
"\n", " Returns\n", " -------\n", " result : List[dict]\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check available devices" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "> 0 MacBook Air Microphone, Core Audio (1 in, 0 out)\n", "< 1 MacBook Air Speakers, Core Audio (0 in, 2 out)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sounddevice\n", "\n", "sounddevice.query_devices()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the device at index `0` is used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load ASR model" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "
<table><thead><tr><th></th><th>Size (MB)</th><th>malay-malaya</th><th>malay-fleur102</th><th>Language</th><th>singlish</th></tr></thead><tbody>
<tr><th>mesolitica/conformer-tiny</th><td>38.5</td><td>{'WER': 0.17341180814, 'CER': 0.05957485024}</td><td>{'WER': 0.19524478979, 'CER': 0.0830808938}</td><td>[malay]</td><td>NaN</td></tr>
<tr><th>mesolitica/conformer-base</th><td>121</td><td>{'WER': 0.122076123261, 'CER': 0.03879606324}</td><td>{'WER': 0.1326737206665, 'CER': 0.05032914857}</td><td>[malay]</td><td>NaN</td></tr>
<tr><th>mesolitica/conformer-medium</th><td>243</td><td>{'WER': 0.1054817492564, 'CER': 0.0313518992842}</td><td>{'WER': 0.1172708897486, 'CER': 0.0431050488}</td><td>[malay]</td><td>NaN</td></tr>
<tr><th>mesolitica/emformer-base</th><td>162</td><td>{'WER': 0.175762423786, 'CER': 0.06233919000537}</td><td>{'WER': 0.18303839134, 'CER': 0.0773853362}</td><td>[malay]</td><td>NaN</td></tr>
<tr><th>mesolitica/conformer-base-singlish</th><td>121</td><td>NaN</td><td>NaN</td><td>[singlish]</td><td>{'WER': 0.06517537334361, 'CER': 0.03265430876}</td></tr>
<tr><th>mesolitica/conformer-medium-mixed</th><td>243</td><td>{'WER': 0.111166517935, 'CER': 0.03410958328}</td><td>{'WER': 0.108354748, 'CER': 0.037785722}</td><td>[malay, singlish]</td><td>{'WER': 0.091969755225, 'CER': 0.044627194623}</td></tr>
<tr><th>mesolitica/conformer-medium-mixed-augmented</th><td>243</td><td>{'WER': 0.1015719878, 'CER': 0.0326360923}</td><td>{'WER': 0.1103884742, 'CER': 0.0385676182}</td><td>[malay, singlish]</td><td>{'WER': 0.086342166, 'CER': 0.0413572066}</td></tr>
<tr><th>mesolitica/conformer-large-mixed-augmented</th><td>413</td><td>{'WER': 0.0919852874, 'CER': 0.026612152}</td><td>{'WER': 0.103593636, 'CER': 0.036611048}</td><td>[malay, singlish]</td><td>{'WER': 0.08727157, 'CER': 0.04318735972}</td></tr></tbody></table>
\n", "
" ], "text/plain": [ " Size (MB) \\\n", "mesolitica/conformer-tiny 38.5 \n", "mesolitica/conformer-base 121 \n", "mesolitica/conformer-medium 243 \n", "mesolitica/emformer-base 162 \n", "mesolitica/conformer-base-singlish 121 \n", "mesolitica/conformer-medium-mixed 243 \n", "mesolitica/conformer-medium-mixed-augmented 243 \n", "mesolitica/conformer-large-mixed-augmented 413 \n", "\n", " malay-malaya \\\n", "mesolitica/conformer-tiny {'WER': 0.17341180814, 'CER': 0.05957485024} \n", "mesolitica/conformer-base {'WER': 0.122076123261, 'CER': 0.03879606324} \n", "mesolitica/conformer-medium {'WER': 0.1054817492564, 'CER': 0.0313518992842} \n", "mesolitica/emformer-base {'WER': 0.175762423786, 'CER': 0.06233919000537} \n", "mesolitica/conformer-base-singlish NaN \n", "mesolitica/conformer-medium-mixed {'WER': 0.111166517935, 'CER': 0.03410958328} \n", "mesolitica/conformer-medium-mixed-augmented {'WER': 0.1015719878, 'CER': 0.0326360923} \n", "mesolitica/conformer-large-mixed-augmented {'WER': 0.0919852874, 'CER': 0.026612152} \n", "\n", " malay-fleur102 \\\n", "mesolitica/conformer-tiny {'WER': 0.19524478979, 'CER': 0.0830808938} \n", "mesolitica/conformer-base {'WER': 0.1326737206665, 'CER': 0.05032914857} \n", "mesolitica/conformer-medium {'WER': 0.1172708897486, 'CER': 0.0431050488} \n", "mesolitica/emformer-base {'WER': 0.18303839134, 'CER': 0.0773853362} \n", "mesolitica/conformer-base-singlish NaN \n", "mesolitica/conformer-medium-mixed {'WER': 0.108354748, 'CER': 0.037785722} \n", "mesolitica/conformer-medium-mixed-augmented {'WER': 0.1103884742, 'CER': 0.0385676182} \n", "mesolitica/conformer-large-mixed-augmented {'WER': 0.103593636, 'CER': 0.036611048} \n", "\n", " Language \\\n", "mesolitica/conformer-tiny [malay] \n", "mesolitica/conformer-base [malay] \n", "mesolitica/conformer-medium [malay] \n", "mesolitica/emformer-base [malay] \n", "mesolitica/conformer-base-singlish [singlish] \n", "mesolitica/conformer-medium-mixed [malay, singlish] \n", 
"mesolitica/conformer-medium-mixed-augmented [malay, singlish] \n", "mesolitica/conformer-large-mixed-augmented [malay, singlish] \n", "\n", " singlish \n", "mesolitica/conformer-tiny NaN \n", "mesolitica/conformer-base NaN \n", "mesolitica/conformer-medium NaN \n", "mesolitica/emformer-base NaN \n", "mesolitica/conformer-base-singlish {'WER': 0.06517537334361, 'CER': 0.03265430876} \n", "mesolitica/conformer-medium-mixed {'WER': 0.091969755225, 'CER': 0.044627194623} \n", "mesolitica/conformer-medium-mixed-augmented {'WER': 0.086342166, 'CER': 0.0413572066} \n", "mesolitica/conformer-large-mixed-augmented {'WER': 0.08727157, 'CER': 0.04318735972} " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.transducer.available_pt_transformer()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model = malaya_speech.stt.transducer.pt_transformer(model = 'mesolitica/conformer-medium')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "_ = model.eval()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ASR Pipeline\n", "\n", "PyAudio returns int16 bytes, so we need to convert them to a NumPy array and then normalize to float. Feel free to add speech enhancement or any other function; this example keeps it simple." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p_asr = Pipeline()\n", "pipeline_asr = (\n", " p_asr.map(lambda x: model.beam_decoder([x])[0], name = 'speech-to-text')\n", ")\n", "p_asr.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Make sure the last output is named `speech-to-text`, otherwise the streaming interface will throw an error**." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start Recording" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Once you run the code below, it will start recording your voice straight away**. \n", "\n", "If you are running in a Jupyter notebook, press the stop button in the toolbar to stop recording; if in a terminal, press `CTRL + c`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**If you do not provide a VAD model, make sure `max_length` is set to a proper value so that the audio gets chunked**." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "helo nama saya hussein bin kapal anak saya masam isteri saya pun masam semua orang masa terima kasih " ] } ], "source": [ "samples = malaya_speech.streaming.pyaudio.stream(asr_model = p_asr, max_length = 5.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It works quite well. As you can see, it is able to transcribe in realtime; you can try it yourself." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(samples)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'wav_data': array([ 0.02063065, 0.02536985, 0.02458022, ..., -0.02863501,\n", " -0.04477567, -0.07088793], dtype=float32),\n", " 'start': 5.12,\n", " 'asr_model': 'nama saya hussein bin',\n", " 'end': 10.24}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "samples[1]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "import IPython.display as ipd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(samples[1]['wav_data'], rate = 16000)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(np.concatenate([s['wav_data'] for s in samples[:3]]), rate = 16000)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" 
}, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }