{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text RNNT Malay + Singlish + Mandarin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoder model + RNNT loss for Malay + Singlish + Mandarin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-transducer-model-3mixed](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-transducer-model-3mixed).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it not save to use on different languages. Pretrained models trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available RNNT model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "malaya_speech.stt.available_transducer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lower is better. Test set can get at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n", "\n", "Malay trained on Malaya Speech dataset, singlish trained on IMDA dataset, mandarin trained on https://openslr.org/68/ and https://openslr.org/38/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Google Speech-to-Text only supported monolanguage, so we are not able to compare the accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load RNNT model\n", "\n", "```python\n", "def deep_transducer(\n", " model: str = 'conformer', quantized: bool = False, **kwargs\n", "):\n", " \"\"\"\n", " Load Encoder-Transducer ASR model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='conformer')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'tiny-conformer'`` - TINY size Google Conformer.\n", " * ``'small-conformer'`` - SMALL size Google Conformer.\n", " * ``'conformer'`` - BASE size Google Conformer.\n", " * ``'large-conformer'`` - LARGE size Google Conformer.\n", " * ``'conformer-stack-2mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish) languages.\n", " * ``'conformer-stack-3mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish + Mandarin) languages.\n", " * ``'small-conformer-singlish'`` - SMALL size Google Conformer for singlish language.\n", " * ``'conformer-singlish'`` - BASE size Google Conformer for singlish language.\n", " * ``'large-conformer-singlish'`` - LARGE size Google Conformer for singlish language.\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.tf.Transducer class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model = malaya_speech.stt.deep_transducer(model = 'conformer-stack-3mixed')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized deep model\n", "\n", "To load 8-bit quantized model, simply pass `quantized = True`, default is `False`.\n", "\n", "We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n", "37.0MB [00:05, 6.65MB/s] \n", "1.00MB [00:00, 233MB/s] \n", "1.00MB [00:00, 227MB/s] \n", "1.00MB [00:00, 296MB/s] \n" ] } ], "source": [ "quantized_model = malaya_speech.stt.deep_transducer(model = 'conformer-stack-3mixed', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n", "record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n", "record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')\n", "singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')\n", "singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')\n", "singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')\n", "mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')\n", "mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')\n", "mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(ceramah, rate = sr)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record2, rate = sr)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(singlish0, rate = sr)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(singlish1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(singlish2, rate = sr)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(mandarin0, rate = sr)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(mandarin1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 14, 
"metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(mandarin2, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using greedy decoder\n", "\n", "```python\n", "def greedy_decoder(self, inputs):\n", " \"\"\"\n", " Transcribe inputs using greedy decoder.\n", "\n", " Parameters\n", " ----------\n", " inputs: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 13.8 s, sys: 3.86 s, total: 17.6 s\n", "Wall time: 10.3 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini ini yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah',\n", " 'kalau nama saya musim saya tak suka mandi kata saya masak',\n", " 'hello im sorry sorry so',\n", " 'and then see how they bring and film okay actually',\n", " 'later to your s',\n", " 'seven seven more',\n", " 'gei wo lai ge zhang jie zui xin de ge',\n", " 'wo xiang ting shu ying do',\n", " 'qiu yi shou ge de ming zi ge ci li you zhuan sheng ji meng wang shi son su']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.greedy_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 13.2 s, sys: 3.55 s, total: 16.7 s\n", "Wall time: 9.94 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah',\n", " 'kalau nama saya musim saya tak suka mandi kata saya masak',\n", " 'hello im sorry saya using saya sekemandi saya mandi jp',\n", " 'and then see how they break and film okay actually',\n", " 'later to your as',\n", " 'seven seven more',\n", " 'gei wo lai ge zhang jie zui xin de ge',\n", " 'wo xiang ting shu ying do',\n", " 'qiu yi shou ge de ming zi ge ci li you zhuan sheng di meng wang shi son su']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.greedy_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using beam decoder\n", "\n", "```python\n", "def beam_decoder(self, inputs, beam_width: int = 5,\n", " temperature: float = 0.0,\n", " score_norm: bool = True):\n", " \"\"\"\n", " Transcribe inputs using beam decoder.\n", "\n", " Parameters\n", " ----------\n", " inputs: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", " beam_width: int, optional (default=5)\n", " beam size for beam decoder.\n", " temperature: float, optional (default=0.0)\n", " apply temperature function for logits, can help for certain case,\n", " logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature\n", " score_norm: bool, optional (default=True)\n", " descending sort beam based on score / length of decoded.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 18, 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 31.3 s, sys: 4.14 s, total: 35.5 s\n", "Wall time: 17.1 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah',\n", " 'kalau nama saya musim saya tak suka mandi kata saya masak',\n", " 'hello n saya sekemandi saya mandi jeti hari',\n", " 'and then see how they broad and film okay actually',\n", " 'later to you as',\n", " 'seven seven eight more',\n", " 'gei wo lai ge zhang jie zui xin de ge',\n", " 'wo xiang shou kan jiang su ying shi pin dao de jie mu',\n", " 'qiu yi shou ge de ming zi ge ci li you zhuan shen ti meng wang shi qing shi xiu']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.beam_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2], beam_width = 5)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 30.8 s, sys: 3.87 s, total: 34.7 s\n", "Wall time: 16.6 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah',\n", " 'kalau nama saya musim saya tak suka mandi kata saya masak',\n", " 'hello n saya sekemandi saya mandi jeti hari',\n", " 'and then see how they break in film okay actually',\n", " 'later to you as',\n", " 'seven seven eight more',\n", " 'gei wo lai ge zhang jie zui xin de ge',\n", " 'wo xiang shou kan jiang su ying shi pin dao de jie mu',\n", " 'qiu yi shou ge de ming zi ge ci li you zhuan shen ti meng wang shi qing shi xiu']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.beam_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2], beam_width = 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**RNNT model beam decoder not able to utilise batch programming, if feed a batch, it will process one by one**." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict alignment\n", "\n", "We want to know when the speakers speak certain words, so we can use `predict_timestamp`,\n", "\n", "```python\n", "def predict_alignment(self, input, combined = True):\n", " \"\"\"\n", " Transcribe input and get timestamp, only support greedy decoder.\n", "\n", " Parameters\n", " ----------\n", " input: np.array\n", " np.array or malaya_speech.model.frame.Frame.\n", " combined: bool, optional (default=True)\n", " If True, will combined subwords to become a word.\n", "\n", " Returns\n", " -------\n", " result: List[Dict[text, start, end]]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.12 s, sys: 2.21 s, total: 8.34 s\n", "Wall time: 7.64 s\n" ] }, { "data": { "text/plain": [ "[{'text': 'gei', 'start': 2.6, 'end': 2.61},\n", " {'text': 'wo', 'start': 2.76, 'end': 2.77},\n", " {'text': 'lai', 'start': 2.88, 'end': 2.89},\n", " {'text': 'ge', 'start': 3.12, 'end': 3.13},\n", " {'text': 'zhang', 'start': 3.24, 'end': 3.49},\n", " {'text': 'jie', 'start': 3.52, 'end': 3.53},\n", " {'text': 'zui', 'start': 3.76, 'end': 3.77},\n", " {'text': 'xin', 'start': 3.88, 'end': 3.89},\n", " {'text': 'de', 'start': 4.0, 'end': 4.01},\n", " {'text': 'ge', 'start': 4.04, 'end': 4.05}]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.predict_alignment(mandarin0)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 799 ms, sys: 133 ms, total: 933 ms\n", "Wall time: 288 ms\n" ] }, { "data": { "text/plain": [ "[{'text': 'gei', 'start': 2.6, 'end': 2.61},\n", " {'text': ' ', 'start': 2.72, 'end': 2.73},\n", " {'text': 'wo_', 'start': 2.76, 'end': 2.77},\n", " {'text': 'lai', 'start': 2.88, 'end': 2.89},\n", " {'text': ' ', 'start': 3.08, 'end': 3.09},\n", " {'text': 'ge_', 'start': 3.12, 'end': 3.13},\n", " {'text': 'zha', 'start': 3.24, 'end': 3.25},\n", " {'text': 'ng_', 'start': 3.48, 'end': 3.49},\n", " {'text': 'jie', 'start': 3.52, 'end': 3.53},\n", " {'text': ' ', 'start': 3.72, 'end': 3.73},\n", " {'text': 'zui', 'start': 3.76, 'end': 3.77},\n", " {'text': ' ', 'start': 3.84, 'end': 3.85},\n", " {'text': 'xin', 'start': 3.88, 'end': 3.89},\n", " {'text': ' ', 'start': 3.96, 'end': 3.97},\n", " {'text': 'de_', 'start': 4.0, 'end': 4.01},\n", " {'text': 'ge', 'start': 4.04, 'end': 4.05}]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.predict_alignment(mandarin0, combined = False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", 
"instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }