{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text RNNT Singlish" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoder model + RNNT loss for Singlish language, trained on Singapore National Speech Corpus, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-transducer-model-singlish](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-transducer-model-singlish).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it not save to use on different languages. Pretrained models trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available RNNT model" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Size (MB)Quantized Size (MB)WERCERWER-LMCER-LMLanguage
tiny-conformer24.49.140.2128110.0813690.1996830.077004[malay]
small-conformer49.218.10.1985330.0744950.1853610.071143[malay]
conformer12537.10.1636020.0587440.1561820.05719[malay]
large-conformer4041070.1566840.0619710.1486220.05901[malay]
conformer-stack-2mixed13038.50.1036080.0500690.1029110.050201[malay, singlish]
conformer-stack-3mixed13038.50.2347680.1339440.2292410.130702[malay, singlish, mandarin]
small-conformer-singlish49.218.10.0878310.0456860.0873330.045317[singlish]
conformer-singlish12537.10.0777920.0403620.0771860.03987[singlish]
large-conformer-singlish4041070.0701470.0358720.0698120.035723[singlish]
\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) WER CER \\\n", "tiny-conformer 24.4 9.14 0.212811 0.081369 \n", "small-conformer 49.2 18.1 0.198533 0.074495 \n", "conformer 125 37.1 0.163602 0.058744 \n", "large-conformer 404 107 0.156684 0.061971 \n", "conformer-stack-2mixed 130 38.5 0.103608 0.050069 \n", "conformer-stack-3mixed 130 38.5 0.234768 0.133944 \n", "small-conformer-singlish 49.2 18.1 0.087831 0.045686 \n", "conformer-singlish 125 37.1 0.077792 0.040362 \n", "large-conformer-singlish 404 107 0.070147 0.035872 \n", "\n", " WER-LM CER-LM Language \n", "tiny-conformer 0.199683 0.077004 [malay] \n", "small-conformer 0.185361 0.071143 [malay] \n", "conformer 0.156182 0.05719 [malay] \n", "large-conformer 0.148622 0.05901 [malay] \n", "conformer-stack-2mixed 0.102911 0.050201 [malay, singlish] \n", "conformer-stack-3mixed 0.229241 0.130702 [malay, singlish, mandarin] \n", "small-conformer-singlish 0.087333 0.045317 [singlish] \n", "conformer-singlish 0.077186 0.03987 [singlish] \n", "large-conformer-singlish 0.069812 0.035723 [singlish] " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.available_transducer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lower is better. Mixed models tested on different dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Google Speech-to-Text accuracy\n", "\n", "We tested on the same malay dataset to compare malaya-speech models and Google Speech-to-Text, check the notebook at [benchmark-google-speech-singlish-dataset.ipynb](https://github.com/huseinzol05/malaya-speech/blob/master/pretrained-model/prepare-stt/benchmark-google-speech-singlish-dataset.ipynb)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'malay': {'WER': 0.164775, 'CER': 0.059732},\n", " 'singlish': {'WER': 0.4941349, 'CER': 0.3026296}}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.google_accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load RNNT model\n", "\n", "```python\n", "def deep_transducer(\n", " model: str = 'conformer', quantized: bool = False, **kwargs\n", "):\n", " \"\"\"\n", " Load Encoder-Transducer ASR model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='conformer')\n", " Model architecture supported. 
Allowed values:\n", "\n", " * ``'tiny-conformer'`` - TINY size Google Conformer.\n", " * ``'small-conformer'`` - SMALL size Google Conformer.\n", " * ``'conformer'`` - BASE size Google Conformer.\n", " * ``'large-conformer'`` - LARGE size Google Conformer.\n", " * ``'conformer-stack-2mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish) languages.\n", " * ``'conformer-stack-3mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish + Mandarin) languages.\n", " * ``'small-conformer-singlish'`` - SMALL size Google Conformer for singlish language.\n", " * ``'conformer-singlish'`` - BASE size Google Conformer for singlish language.\n", " * ``'large-conformer-singlish'`` - LARGE size Google Conformer for singlish language.\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.tf.Transducer class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model = malaya_speech.stt.deep_transducer(model = 'conformer-singlish')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized deep model\n", "\n", "To load 8-bit quantized model, simply pass `quantized = True`, default is `False`.\n", "\n", "We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "quantized_model = malaya_speech.stt.deep_transducer(model = 'conformer-singlish', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')\n", "singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')\n", "singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')\n", "imda0, sr = malaya_speech.load('speech/imda/221931702.WAV')\n", "imda1, sr = malaya_speech.load('speech/imda/221931727.WAV')\n", "imda2, sr = malaya_speech.load('speech/imda/221931818.WAV')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(singlish0, rate = sr)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(singlish1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(singlish2, rate = sr)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "ipd.Audio(imda0, rate = sr)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(imda1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(imda2, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict\n", "\n", "We can choose,\n", "\n", "1. `greedy` decoder.\n", "2. `beam` decoder, by default `beam_size` is 5, feel free to edit it.\n", "\n", "```python\n", "def predict(\n", " self, inputs, decoder: str = 'greedy', beam_size: int = 5, **kwargs\n", "):\n", " \"\"\"\n", " Transcribe inputs, will return list of strings.\n", "\n", " Parameters\n", " ----------\n", " inputs: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", " decoder: str, optional (default='greedy')\n", " decoder mode, allowed values:\n", "\n", " * ``'greedy'`` - will call self.greedy_decoder\n", " * ``'beam'`` - will call self.beam_decoder\n", " beam_size: int, optional (default=5)\n", " beam size for beam decoder.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Greedy decoder\n", "\n", "Greedy able to utilize batch processing, and faster than beam decoder.\n", "\n", "```python\n", "def greedy_decoder(self, inputs):\n", " \"\"\"\n", " Transcribe inputs, will return list of strings.\n", "\n", " Parameters\n", " ----------\n", " inputs: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 10.3 s, sys: 2.65 s, total: 12.9 s\n", "Wall time: 9.4 s\n" ] }, { "data": { "text/plain": [ "['and then see how they roll it in film okay actually',\n", " 'then you tap to your eyes',\n", " 'sembawang seven in mal',\n", " 'wantan mee is a traditional local cuisine',\n", " 'saravanan gopinathan george yeo yong boon and tay kheng soon',\n", " 'ahmad khan adelene wee chin suan and robert ibbetson']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.greedy_decoder([singlish0, singlish1, singlish2, imda0, imda1, imda2])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 9.92 s, sys: 2.41 s, total: 12.3 s\n", "Wall time: 8.76 s\n" ] }, { "data": { "text/plain": [ "['and then see how they roll it in film okay actually',\n", " 'then you tap to your eyes',\n", " 'sembawang seven in mal',\n", " 'wantan mee is a traditional local cuisine',\n", " 'saravanan gopinathan george yeo yong boon and tay kheng soon',\n", " 'ahmad khan adelene wee chin suan and robert ibbetson']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.greedy_decoder([singlish0, singlish1, singlish2, imda0, imda1, imda2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Beam decoder\n", "\n", "To get 
better results, use beam decoder with optimum beam size.\n", "\n", "```python\n", "def beam_decoder(self, inputs, beam_size: int = 5):\n", " \"\"\"\n", " Transcribe inputs, will return list of strings.\n", "\n", " Parameters\n", " ----------\n", " inputs: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", " beam_size: int, optional (default=5)\n", " beam size for beam decoder.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 27.1 s, sys: 2.52 s, total: 29.6 s\n", "Wall time: 21 s\n" ] }, { "data": { "text/plain": [ "['and then see how they roll it in film okay actually',\n", " 'okay then you tap to your eyes',\n", " 'sembawang seven in male',\n", " 'wantan mee is a traditional local cuisine',\n", " 'saravanan gopinathan george yeo yong boon and tay kheng soon',\n", " 'ahmad khan adelene wee chin suan and robert ibbetson']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.beam_decoder([singlish0, singlish1, singlish2, imda0, imda1, imda2], beam_size = 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**RNNT model beam decoder not able to utilise batch programming, if feed a batch, it will process one by one**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict alignment\n", "\n", "We want to know when the speakers speak certain words, so we can use `predict_timestamp`,\n", "\n", "```python\n", "def predict_alignment(self, input, combined = True):\n", " \"\"\"\n", " Transcribe input and get timestamp, only support greedy decoder.\n", "\n", " Parameters\n", " ----------\n", " input: np.array\n", " np.array or malaya_speech.model.frame.Frame.\n", " combined: bool, optional (default=True)\n", " If True, will combined subwords to become a word.\n", "\n", " Returns\n", " -------\n", " result: List[Dict[text, start, end]]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.08 s, sys: 2.01 s, total: 8.09 s\n", "Wall time: 7.22 s\n" ] }, { "data": { "text/plain": [ "[{'text': 'and', 'start': 0.2, 'end': 0.21},\n", " {'text': 'then', 'start': 0.36, 'end': 0.45},\n", " {'text': 'see', 'start': 0.6, 'end': 0.61},\n", " {'text': 'how', 'start': 0.88, 'end': 0.89},\n", " {'text': 'they', 'start': 1.36, 'end': 1.49},\n", " {'text': 'roll', 'start': 1.96, 'end': 2.09},\n", " {'text': 'it', 'start': 2.16, 'end': 2.17},\n", " {'text': 'in', 'start': 2.4, 'end': 2.41},\n", " {'text': 'film', 'start': 2.6, 'end': 2.85},\n", " {'text': 'okay', 'start': 3.68, 'end': 3.85},\n", " {'text': 'actually', 'start': 3.92, 'end': 4.21}]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.predict_alignment(singlish0)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.12 s, sys: 200 ms, total: 1.32 s\n", "Wall time: 306 ms\n" ] }, { "data": { "text/plain": [ "[{'text': 'and', 'start': 0.2, 'end': 0.21},\n", " {'text': ' ', 'start': 0.28, 'end': 0.29},\n", " {'text': 'the', 'start': 0.36, 'end': 0.37},\n", " {'text': 'n_', 'start': 0.44, 'end': 0.45},\n", " {'text': 'see', 'start': 0.6, 'end': 0.61},\n", " {'text': ' 
{ "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.05 s, sys: 192 ms, total: 1.25 s\n", "Wall time: 278 ms\n" ] }, { "data": { "text/plain": [ "[{'text': 'wantan', 'start': 0.92, 'end': 1.05},\n", " {'text': 'mee', 'start': 1.4, 'end': 1.53},\n", " {'text': 'is', 'start': 1.64, 'end': 1.65},\n", " {'text': 'a', 'start': 1.84, 'end': 1.85},\n", " {'text': 'traditional', 'start': 2.08, 'end': 2.69},\n", " {'text': 'local', 'start': 2.8, 'end': 2.93},\n", " {'text': 'cuisine', 'start': 3.12, 'end': 3.45}]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.predict_alignment(imda0)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }