{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# RNNT Streaming\n", "\n", "Transducer streaming using TorchAudio, malaya-speech able to do that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/rnnt-streaming-torchaudio](https://github.com/huseinzol05/malaya-speech/tree/master/example/rnnt-streaming-torchaudio).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it not save to use on different languages. Pretrained models trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import malaya_speech" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Starting malaya-speech 1.4.0, streaming always returned a float32 array between -1 and +1 values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Streaming interface\n", "\n", "```python\n", "def stream_rnnt(\n", " src,\n", " asr_model=None,\n", " classification_model=None,\n", " format=None,\n", " option=None,\n", " beam_width: int = 10,\n", " buffer_size: int = 4096,\n", " sample_rate: int = 16000,\n", " segment_length: int = 2560,\n", " context_length: int = 640,\n", " realtime_print: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Parameters\n", " -----------\n", " src: str\n", " Supported `src` for `torchaudio.io.StreamReader`\n", " Read more at https://pytorch.org/audio/stable/tutorials/streamreader_basic_tutorial.html#sphx-glr-tutorials-streamreader-basic-tutorial-py\n", " or https://pytorch.org/audio/stable/tutorials/streamreader_advanced_tutorial.html#sphx-glr-tutorials-streamreader-advanced-tutorial-py\n", " asr_model: object, optional (default=None)\n", " ASR model / pipeline, will transcribe each subsamples realtime.\n", " must be an object of `malaya_speech.torch_model.torchaudio.Conformer`.\n", " classification_model: object, optional (default=None)\n", " classification pipeline, will classify each subsamples realtime.\n", " format: str, optional (default=None)\n", " Supported `format` for `torchaudio.io.StreamReader`,\n", " https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader\n", " option: dict, optional (default=None)\n", " Supported `option` for `torchaudio.io.StreamReader`,\n", " https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader\n", " buffer_size: int, optional (default=4096)\n", " Supported `buffer_size` for `torchaudio.io.StreamReader`, buffer size in byte. Used only when src is file-like object,\n", " https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader\n", " sample_rate: int, optional (default=16000)\n", " sample rate from input device, this will auto resampling.\n", " segment_length: int, optional (default=2560)\n", " usually derived from asr_model.segment_length * asr_model.hop_length,\n", " size of audio chunks, actual size in term of second is `segment_length` / `sample_rate`.\n", " context_length: int, optional (default=640)\n", " usually derived from asr_model.right_context_length * asr_model.hop_length,\n", " size of append context chunks, only useful for streaming RNNT.\n", " beam_width: int, optional (default=10)\n", " width for beam decoding.\n", " realtime_print: bool, optional (default=True)\n", " Will print results for ASR.\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load ASR model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Size (MB)malay-malayamalay-fleur102Languagesinglish
mesolitica/conformer-tiny38.5{'WER': 0.17341180814, 'CER': 0.05957485024}{'WER': 0.19524478979, 'CER': 0.0830808938}[malay]NaN
mesolitica/conformer-base121{'WER': 0.122076123261, 'CER': 0.03879606324}{'WER': 0.1326737206665, 'CER': 0.05032914857}[malay]NaN
mesolitica/conformer-medium243{'WER': 0.12777757303, 'CER': 0.0393998776}{'WER': 0.1379928549, 'CER': 0.05876827088}[malay]NaN
mesolitica/emformer-base162{'WER': 0.175762423786, 'CER': 0.06233919000537}{'WER': 0.18303839134, 'CER': 0.0773853362}[malay]NaN
mesolitica/conformer-singlish121NaNNaN[singlish]{'WER': 0.08535878149, 'CER': 0.0452357273822,...
mesolitica/conformer-medium-mixed243{'WER': 0.122076123261, 'CER': 0.03879606324}{'WER': 0.1326737206665, 'CER': 0.05032914857}[malay, singlish]{'WER': 0.08535878149, 'CER': 0.0452357273822,...
\n", "
" ], "text/plain": [ " Size (MB) \\\n", "mesolitica/conformer-tiny 38.5 \n", "mesolitica/conformer-base 121 \n", "mesolitica/conformer-medium 243 \n", "mesolitica/emformer-base 162 \n", "mesolitica/conformer-singlish 121 \n", "mesolitica/conformer-medium-mixed 243 \n", "\n", " malay-malaya \\\n", "mesolitica/conformer-tiny {'WER': 0.17341180814, 'CER': 0.05957485024} \n", "mesolitica/conformer-base {'WER': 0.122076123261, 'CER': 0.03879606324} \n", "mesolitica/conformer-medium {'WER': 0.12777757303, 'CER': 0.0393998776} \n", "mesolitica/emformer-base {'WER': 0.175762423786, 'CER': 0.06233919000537} \n", "mesolitica/conformer-singlish NaN \n", "mesolitica/conformer-medium-mixed {'WER': 0.122076123261, 'CER': 0.03879606324} \n", "\n", " malay-fleur102 \\\n", "mesolitica/conformer-tiny {'WER': 0.19524478979, 'CER': 0.0830808938} \n", "mesolitica/conformer-base {'WER': 0.1326737206665, 'CER': 0.05032914857} \n", "mesolitica/conformer-medium {'WER': 0.1379928549, 'CER': 0.05876827088} \n", "mesolitica/emformer-base {'WER': 0.18303839134, 'CER': 0.0773853362} \n", "mesolitica/conformer-singlish NaN \n", "mesolitica/conformer-medium-mixed {'WER': 0.1326737206665, 'CER': 0.05032914857} \n", "\n", " Language \\\n", "mesolitica/conformer-tiny [malay] \n", "mesolitica/conformer-base [malay] \n", "mesolitica/conformer-medium [malay] \n", "mesolitica/emformer-base [malay] \n", "mesolitica/conformer-singlish [singlish] \n", "mesolitica/conformer-medium-mixed [malay, singlish] \n", "\n", " singlish \n", "mesolitica/conformer-tiny NaN \n", "mesolitica/conformer-base NaN \n", "mesolitica/conformer-medium NaN \n", "mesolitica/emformer-base NaN \n", "mesolitica/conformer-singlish {'WER': 0.08535878149, 'CER': 0.0452357273822,... \n", "mesolitica/conformer-medium-mixed {'WER': 0.08535878149, 'CER': 0.0452357273822,... " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.transducer.available_pt_transformer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**RNNT Streaming only support Emformer or else TorchAudio will throw an error**." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model = malaya_speech.stt.transducer.pt_transformer(model = 'mesolitica/emformer-base')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "_ = model.eval()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**You need to make sure the last output should named as `speech-to-text` or else the streaming interface will throw an error**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start streaming" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " amalan kuling tapi kalau boleh aku nak kena buat dulu mandi jah kan semalam tu jah dah habisan ke tengok kita dah mai orang yang kita nak sihat yalah premia dengan awak sho aku suka pergi ya aku suka" ] } ], "source": [ "samples = malaya_speech.streaming.torchaudio.stream_rnnt('speech/podcast/toodia.mp3',\n", " asr_model = model)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "375" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(samples)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }