{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text CTC Malay + Singlish + Mandarin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoder model + CTC loss for Malay + Singlish + Mandarin languages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-ctc-model-3mixed](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-ctc-model-3mixed).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of the malaya-speech Pipeline; read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available CTC model" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
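<table>
<tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>WER</th><th>CER</th><th>WER-LM</th><th>CER-LM</th><th>Language</th></tr>
<tr><th>hubert-conformer-tiny</th><td>36.6</td><td>10.3</td><td>0.335968</td><td>0.0882573</td><td>0.199227</td><td>0.0635223</td><td>[malay]</td></tr>
<tr><th>hubert-conformer</th><td>115</td><td>31.1</td><td>0.238714</td><td>0.0608998</td><td>0.141479</td><td>0.0450751</td><td>[malay]</td></tr>
<tr><th>hubert-conformer-large</th><td>392</td><td>100</td><td>0.220314</td><td>0.054927</td><td>0.128006</td><td>0.0385329</td><td>[malay]</td></tr>
<tr><th>hubert-conformer-large-3mixed</th><td>392</td><td>100</td><td>0.241126</td><td>0.0787939</td><td>0.132761</td><td>0.057482</td><td>[malay, singlish, mandarin]</td></tr>
<tr><th>best-rq-conformer-tiny</th><td>36.6</td><td>10.3</td><td>0.319291</td><td>0.078988</td><td>0.179582</td><td>0.055521</td><td>[malay]</td></tr>
<tr><th>best-rq-conformer</th><td>115</td><td>31.1</td><td>0.253678</td><td>0.0658045</td><td>0.154206</td><td>0.0482278</td><td>[malay]</td></tr>
<tr><th>best-rq-conformer-large</th><td>392</td><td>100</td><td>0.234651</td><td>0.0601605</td><td>0.130082</td><td>0.044521</td><td>[malay]</td></tr>
</table>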
\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) WER \\\n", "hubert-conformer-tiny 36.6 10.3 0.335968 \n", "hubert-conformer 115 31.1 0.238714 \n", "hubert-conformer-large 392 100 0.220314 \n", "hubert-conformer-large-3mixed 392 100 0.241126 \n", "best-rq-conformer-tiny 36.6 10.3 0.319291 \n", "best-rq-conformer 115 31.1 0.253678 \n", "best-rq-conformer-large 392 100 0.234651 \n", "\n", " CER WER-LM CER-LM \\\n", "hubert-conformer-tiny 0.0882573 0.199227 0.0635223 \n", "hubert-conformer 0.0608998 0.141479 0.0450751 \n", "hubert-conformer-large 0.054927 0.128006 0.0385329 \n", "hubert-conformer-large-3mixed 0.0787939 0.132761 0.057482 \n", "best-rq-conformer-tiny 0.078988 0.179582 0.055521 \n", "best-rq-conformer 0.0658045 0.154206 0.0482278 \n", "best-rq-conformer-large 0.0601605 0.130082 0.044521 \n", "\n", " Language \n", "hubert-conformer-tiny [malay] \n", "hubert-conformer [malay] \n", "hubert-conformer-large [malay] \n", "hubert-conformer-large-3mixed [malay, singlish, mandarin] \n", "best-rq-conformer-tiny [malay] \n", "best-rq-conformer [malay] \n", "best-rq-conformer-large [malay] " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.available_ctc()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load CTC model\n", "\n", "```python\n", "def deep_ctc(\n", " model: str = 'hubert-conformer', quantized: bool = False, **kwargs\n", "):\n", " \"\"\"\n", " Load Encoder-CTC ASR model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='hubert-conformer')\n", " Model architecture supported. 
Allowed values:\n", "\n", " * ``'hubert-conformer-tiny'`` - Finetuned HuBERT Conformer TINY.\n", " * ``'hubert-conformer'`` - Finetuned HuBERT Conformer.\n", " * ``'hubert-conformer-large'`` - Finetuned HuBERT Conformer LARGE.\n", " * ``'hubert-conformer-large-3mixed'`` - Finetuned HuBERT Conformer LARGE for (Malay + Singlish + Mandarin) languages.\n", " * ``'best-rq-conformer-tiny'`` - Finetuned BEST-RQ Conformer TINY.\n", " * ``'best-rq-conformer'`` - Finetuned BEST-RQ Conformer.\n", " * ``'best-rq-conformer-large'`` - Finetuned BEST-RQ Conformer LARGE.\n", "\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.tf.Wav2Vec2_CTC class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model = malaya_speech.stt.deep_ctc(model = 'hubert-conformer-large-3mixed')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized deep model\n", "\n", "To load the 8-bit quantized model, simply pass `quantized = True`; the default is `False`.\n", "\n", "We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine." 
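, "\n", "\n", "
Since the quantized model is not guaranteed to be faster, it may help to time both variants on your own machine. The sketch below is only an illustration: `time_call` and `dummy_decoder` are hypothetical names that are not part of malaya-speech, and a stand-in workload is used so the snippet runs without any model loaded.

```python
import time


def time_call(fn, *args, repeat=3):
    # Hypothetical helper: best wall-clock time (seconds) over `repeat` runs of fn(*args).
    best = float('inf')
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def dummy_decoder(batch):
    # Stand-in workload (assumption); swap in a real decoder method when models are loaded.
    return [str(len(x)) for x in batch]


elapsed = time_call(dummy_decoder, [[0.0] * 16000])
print(f'best of 3: {elapsed:.6f} s')
```

With the models from this notebook loaded, the same helper could be called as `time_call(model.greedy_decoder, [ceramah])` and `time_call(quantized_model.greedy_decoder, [ceramah])` to compare the two.
"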
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [], "source": [ "quantized_model = malaya_speech.stt.deep_ctc(model = 'hubert-conformer-large-3mixed', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n", "record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n", "record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')\n", "singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')\n", "singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')\n", "singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')\n", "mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')\n", "mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')\n", "mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(ceramah, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can hear, the speaker uses the Kedahan dialect plus some Arabic words. Let's see how well our model does." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record2, rate = sr)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(singlish0, rate = sr)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(singlish1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(singlish2, rate = sr)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(mandarin0, rate = sr)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(mandarin1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { 
"text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(mandarin2, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using greedy decoder\n", "\n", "```python\n", "def greedy_decoder(self, inputs):\n", " \"\"\"\n", " Transcribe inputs using greedy decoder.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 12s, sys: 25.8 s, total: 1min 38s\n", "Wall time: 53.8 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini lunia yang susah ini ketika nabi mengajar muaz bin jabar tadi ni alah maini',\n", " 'helo nama saya sin saya taksukamandi saya masang',\n", " 'helo nama saya husin saya suka mandi say mond t hari',\n", " 'and then se how they ro it in film okay actualy',\n", " 'then you tat to your eyes',\n", " 'severson involve',\n", " 'gei wo lai ge zhang jie zui xin de ge',\n", " 'wo xiang shou kan jiang su yi shi pin dao de jie mu',\n", " 'qiu yi shou ge de ming zi ge ci li you zhuan sheng yi meng wang shi qing shi xiu']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.greedy_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 4s, sys: 20.3 s, total: 1min 25s\n", "Wall time: 27 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabar tadi ni alah 
maini',\n", " 'helo nama saya sin saya taksukamandi saya masang',\n", " 'helo nama saya husin saya suka mandi say mond t hari',\n", " 'and then se how they ro it in film okay actualy',\n", " 'then you tat to your eyes',\n", " 'severson involve',\n", " 'gei wo lai ge zhang jie zui xin de ge',\n", " 'wo xiang shou kan jiang su yi shi pin dao de jie mu',\n", " 'qiu yi shou ge de ming zi ge ci li you zhuan sheng yi meng wang shi qing shi xiu']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.greedy_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using beam decoder\n", "\n", "```python\n", "def beam_decoder(self, inputs, beam_width: int = 100):\n", " \"\"\"\n", " Transcribe inputs using beam decoder.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", " beam_width: int, optional (default=100)\n", " beam size for beam decoder.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 20s, sys: 25.7 s, total: 1min 45s\n", "Wall time: 38.9 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini lunia yang susah ini ketika nabi mengajar muaz bin jabar tadi ni alah maini',\n", " 'helo nama saya sine saya taksukamandi k saya masang',\n", " 'helo nama saya husin saya suka mandi saya mond hari',\n", " 'and then se how they roe it in film okay actualy',\n", " ' then you tat to your eys',\n", " 'severson involve',\n", " 'gei wo lai ge zhang jie zui xin de ge',\n", " 'wo xiang shou kan jiang su yin shi pin dao de jie mu',\n", " 'qiu yi shou ge de ming zi ge ci li you zhuan sheng yi meng wang shi qing 
shi xiu']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.beam_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 15s, sys: 23.7 s, total: 1min 38s\n", "Wall time: 36.4 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabar tadi ni alah maini',\n", " 'helo nama saya sine saya tak sukamandi k saya masang',\n", " 'helo nama saya husin saya suka mandi saya mond hari',\n", " 'and then se how they roe it in film okay actualy',\n", " ' then you tat to your eys',\n", " 'severson involve',\n", " 'gei wo lai ge zhang jie zui xin de ge',\n", " 'wo xiang shou kan jiang su yin shi pin dao de jie mu',\n", " 'qiu yi shou ge de ming zi ge ci li you zhuan sheng yi meng wang shi qing shi xiu']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.beam_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict logits\n", "\n", "```python\n", "def predict_logits(self, inputs):\n", " \"\"\"\n", " Predict logits from inputs.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", "\n", "\n", " Returns\n", " -------\n", " result: List[np.array]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 17s, sys: 25.7 s, total: 1min 43s\n", "Wall time: 38.9 s\n" ] } ], "source": [ "%%time\n", "\n", "logits = model.predict_logits([ceramah, 
record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([array([[5.0769908e-13, 4.9920995e-06, 4.1646995e-06, ..., 4.5173483e-11,\n", " 4.9810876e-07, 9.9995333e-01],\n", " [7.0174210e-13, 1.1965726e-06, 8.6241844e-06, ..., 4.4535094e-11,\n", " 3.0639339e-07, 9.9991900e-01],\n", " [2.3351788e-13, 5.8303417e-06, 1.9618931e-06, ..., 1.1045688e-11,\n", " 7.3586186e-08, 9.9996752e-01],\n", " ...,\n", " [2.5784982e-10, 1.1928772e-03, 1.1550728e-05, ..., 8.8894708e-08,\n", " 8.0232326e-07, 9.9622393e-01],\n", " [2.8769603e-11, 2.7891362e-04, 5.6442818e-06, ..., 3.6080590e-09,\n", " 3.2085375e-06, 9.9853903e-01],\n", " [4.6427840e-12, 4.9945329e-06, 1.7572627e-05, ..., 1.3756351e-09,\n", " 8.1958979e-06, 9.9879241e-01]], dtype=float32),\n", " array([[7.3830272e-15, 2.6149910e-06, 2.9792586e-08, ..., 2.6439647e-13,\n", " 1.7397490e-05, 9.9995226e-01],\n", " [4.1778101e-14, 2.1268688e-06, 1.0971124e-08, ..., 1.6074101e-12,\n", " 3.9539827e-04, 9.9956423e-01],\n", " [2.3152864e-10, 1.3606093e-06, 2.8146353e-06, ..., 1.3427521e-08,\n", " 3.4544442e-05, 3.6596761e-03],\n", " ...,\n", " [9.0034176e-13, 1.0301627e-05, 6.1719288e-07, ..., 5.0070459e-10,\n", " 2.1360616e-07, 9.9990749e-01],\n", " [5.9597216e-13, 6.4955448e-06, 3.9596441e-07, ..., 2.0631585e-10,\n", " 9.8242957e-08, 9.9990749e-01],\n", " [4.0117134e-13, 2.9746909e-06, 2.4516078e-07, ..., 8.1560876e-11,\n", " 7.1571016e-08, 9.9989223e-01]], dtype=float32),\n", " array([[4.9308157e-12, 2.4779722e-05, 5.4713229e-05, ..., 6.4286562e-11,\n", " 1.4688317e-07, 9.9918878e-01],\n", " [2.7793004e-12, 9.1287082e-05, 3.6907666e-05, ..., 1.7815747e-11,\n", " 1.6487343e-07, 9.9894202e-01],\n", " [1.3181623e-13, 4.3102671e-05, 1.2385843e-06, ..., 1.9754181e-12,\n", " 1.9433019e-06, 9.9937463e-01],\n", " ...,\n", " [7.5935447e-14, 8.9420282e-06, 6.9871746e-08, ..., 1.4791677e-11,\n", " 
2.0337066e-07, 9.9996471e-01],\n", " [3.9390583e-14, 1.1035587e-06, 1.0905099e-07, ..., 3.1461658e-12,\n", " 1.2711328e-07, 9.9997044e-01],\n", " [3.8269647e-14, 2.5441926e-07, 2.1005840e-07, ..., 1.6235466e-12,\n", " 1.2756625e-07, 9.9995995e-01]], dtype=float32),\n", " array([[2.73753199e-12, 1.41113460e-05, 6.97673095e-05, ...,\n", " 7.00015601e-11, 2.52056367e-07, 9.99187827e-01],\n", " [2.67177686e-12, 5.47746749e-05, 3.18141829e-05, ...,\n", " 4.88219222e-11, 1.54149234e-07, 9.98954356e-01],\n", " [2.06563694e-12, 5.69845797e-05, 1.17870304e-05, ...,\n", " 3.02334373e-11, 5.92762262e-07, 9.99148726e-01],\n", " ...,\n", " [1.05928322e-11, 3.52900570e-05, 1.49284710e-06, ...,\n", " 3.19949567e-10, 1.58309449e-05, 9.99036312e-01],\n", " [9.81266457e-12, 2.45898154e-05, 1.92325160e-06, ...,\n", " 2.47491666e-10, 1.77746479e-05, 9.98341978e-01],\n", " [8.86547780e-12, 1.88725371e-05, 2.89226273e-06, ...,\n", " 3.51084273e-10, 1.02685335e-05, 9.97144938e-01]], dtype=float32),\n", " array([[3.7328987e-10, 4.8693453e-04, 1.4623961e-03, ..., 1.2560929e-08,\n", " 7.6059614e-06, 9.7704321e-01],\n", " [2.6055907e-10, 2.0105850e-03, 2.9868580e-04, ..., 3.1227767e-09,\n", " 6.8202285e-06, 9.7670513e-01],\n", " [2.8789800e-11, 6.0608047e-03, 1.9908768e-05, ..., 1.4141036e-09,\n", " 2.0176631e-05, 9.6654350e-01],\n", " ...,\n", " [1.7319904e-10, 2.4855541e-04, 8.9967716e-06, ..., 9.5362104e-08,\n", " 2.1292526e-05, 9.9565399e-01],\n", " [1.2542030e-10, 1.0046138e-04, 1.5353055e-05, ..., 7.4714642e-08,\n", " 6.1415085e-05, 9.9698526e-01],\n", " [6.0447411e-11, 9.9597302e-05, 1.2155103e-05, ..., 4.0446455e-08,\n", " 7.4573909e-06, 9.9734193e-01]], dtype=float32),\n", " array([[1.43914158e-10, 8.17825639e-05, 1.11076247e-03, ...,\n", " 1.00497495e-08, 2.89867444e-06, 9.93359864e-01],\n", " [5.60313254e-11, 2.31490500e-04, 1.34532733e-04, ...,\n", " 2.26499552e-09, 1.29177897e-06, 9.95656013e-01],\n", " [7.65202918e-12, 4.68969316e-04, 5.32863714e-06, ...,\n", " 9.66771330e-10, 
4.13954331e-06, 9.95621800e-01],\n", " ...,\n", " [2.99896358e-11, 5.04393902e-05, 3.70761677e-06, ...,\n", " 3.50759866e-09, 4.00016518e-07, 9.97396052e-01],\n", " [1.15719179e-11, 1.85157151e-05, 3.35120762e-06, ...,\n", " 7.35826011e-10, 5.00607086e-07, 9.97562528e-01],\n", " [4.55708561e-12, 6.71240468e-06, 2.61677724e-06, ...,\n", " 2.16798593e-10, 3.46757702e-07, 9.98183012e-01]], dtype=float32),\n", " array([[1.26174936e-13, 1.60809429e-06, 8.10027700e-07, ...,\n", " 1.29462222e-11, 5.63254332e-08, 9.99971390e-01],\n", " [1.99867336e-12, 7.03780358e-07, 1.58678413e-05, ...,\n", " 1.61210642e-10, 7.81361678e-08, 9.99813080e-01],\n", " [9.24896119e-13, 2.14554825e-06, 3.25683641e-06, ...,\n", " 6.80284787e-11, 3.70923310e-08, 9.99940872e-01],\n", " ...,\n", " [2.31162935e-13, 2.65076073e-06, 3.30892895e-08, ...,\n", " 8.54729343e-11, 6.96199152e-08, 9.99979079e-01],\n", " [1.57815736e-13, 2.03774584e-06, 2.12347793e-08, ...,\n", " 4.44310769e-11, 4.75518824e-08, 9.99981821e-01],\n", " [1.00010786e-13, 1.51056111e-06, 1.46384007e-08, ...,\n", " 1.84447058e-11, 3.62013779e-08, 9.99983847e-01]], dtype=float32),\n", " array([[3.20663059e-14, 6.05978357e-07, 2.63524299e-07, ...,\n", " 2.76005000e-12, 5.24865627e-08, 9.99990463e-01],\n", " [5.22183208e-13, 3.10281933e-07, 4.54648125e-06, ...,\n", " 3.62388348e-11, 4.90500156e-08, 9.99934256e-01],\n", " [3.85531993e-13, 1.18964090e-06, 1.30783656e-06, ...,\n", " 2.42427953e-11, 1.57408788e-08, 9.99969482e-01],\n", " ...,\n", " [2.64443415e-13, 1.23893403e-06, 2.62333604e-08, ...,\n", " 1.66792136e-10, 1.05531477e-07, 9.99979973e-01],\n", " [2.24179047e-13, 1.42319334e-06, 2.18357545e-08, ...,\n", " 1.07364034e-10, 7.68235964e-08, 9.99976158e-01],\n", " [1.20802741e-13, 1.40528255e-06, 1.22246844e-08, ...,\n", " 4.11593329e-11, 5.55311530e-08, 9.99978125e-01]], dtype=float32),\n", " array([[9.8158745e-14, 1.3049810e-06, 5.5805117e-07, ..., 9.7509830e-12,\n", " 4.9950518e-08, 9.9997705e-01],\n", " [5.5710406e-13, 
5.2367943e-07, 2.9092205e-06, ..., 3.6361244e-11,\n", " 4.2768967e-08, 9.9993038e-01],\n", " [3.9059172e-13, 1.3663632e-06, 9.6394558e-07, ..., 2.5345634e-11,\n", " 2.2321407e-08, 9.9996382e-01],\n", " ...,\n", " [1.6430580e-15, 2.9931834e-07, 2.2682769e-09, ..., 4.4464970e-12,\n", " 1.6207073e-07, 9.9999815e-01],\n", " [2.4066753e-15, 4.7474239e-07, 3.0915743e-09, ..., 4.7773513e-12,\n", " 1.7420831e-07, 9.9999815e-01],\n", " [4.6813690e-15, 3.4006769e-07, 4.1412229e-09, ..., 5.0322411e-12,\n", " 7.9983622e-08, 9.9999815e-01]], dtype=float32)],\n", " (499, 39),\n", " (299, 39))" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logits, logits[0].shape, logits[1].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Language Model\n", "\n", "```python\n", "def language_model(\n", " model: str = 'dump-combined', **kwargs\n", "):\n", " \"\"\"\n", " Load KenLM language model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='dump-combined')\n", " Model architecture supported. 
Allowed values:\n", "\n", " * ``'bahasa'`` - Gathered from malaya-speech ASR bahasa transcript.\n", " * ``'bahasa-news'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).\n", " * ``'bahasa-combined'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).\n", " * ``'redape-community'`` - Mirror for https://github.com/redapesolutions/suara-kami-community\n", " * ``'dump-combined'`` - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.\n", " * ``'manglish'`` - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.\n", " * ``'bahasa-manglish-combined'`` - Combined `dump-combined` and `manglish`.\n", "\n", " Returns\n", " -------\n", " result : str\n", " \"\"\"\n", "```\n", "\n", "For 3 mixed languages, I prefer to load `bahasa-manglish-combined`." 
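, "\n", "\n", "
A language model like this is usually combined with the acoustic scores during beam search. The toy sketch below illustrates the general idea of weighting an LM score with `alpha` and adding a per-word bonus `beta`. It is an assumption-laden illustration only: `fused_score` is a hypothetical function, the candidate scores are made up, and the exact formulas inside `pyctcdecode` or `ctc_decoders` may differ.

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha=0.2, beta=1.0):
    # alpha scales the language-model log score; beta rewards each emitted word.
    return acoustic_logp + alpha * lm_logp + beta * n_words


# Hypothetical candidates: (text, acoustic log-prob, LM log-prob). Values are invented.
candidates = [
    ('helo nama saya husin', -4.0, -30.0),
    ('hello nama saya hussin', -4.5, -12.0),
]

# Pick the hypothesis with the highest combined score.
best = max(
    candidates,
    key=lambda c: fused_score(c[1], c[2], len(c[0].split())),
)
print(best[0])
```

In this toy case the second candidate wins even though its acoustic score is slightly lower, because the language model considers it far more plausible. Raising `alpha` gives the LM more influence; raising `beta` favours hypotheses with more words.
"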
] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "from pyctcdecode import build_ctcdecoder\n", "from ctc_decoders import Scorer\n", "from ctc_decoders import ctc_beam_search_decoder\n", "from malaya_speech.utils.char import CTC_VOCAB\n", "import kenlm" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "lm = malaya_speech.stt.language_model(model = 'bahasa-manglish-combined')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "kenlm_model = kenlm.Model(lm)\n", "decoder = build_ctcdecoder(\n", " CTC_VOCAB + ['_'],\n", " kenlm_model,\n", " alpha=0.2,\n", " beta=1.0,\n", " ctc_token_idx=len(CTC_VOCAB)\n", ")" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muadz bin jabal tadi ni allah ma ini\n", "1 hello nama saya san saya tak suka mandi saya masang\n", "2 hello nama saya hussin saya suka mandi saya mandi hari\n", "3 and then see how they ro it in film okay actually\n", "4 then you tat to your eyes\n", "5 severson involve\n", "6 gei wo lai ge zhang jie zui xin de ge\n", "7 nansenrnrnibaunwoixiagshoukanjiangsugyigshipindaodejiemu\n", "8 qiu yi shou ge de ming zi ge ci li you zhuan sheng yi wen wang shi qing shi xiu\n" ] } ], "source": [ "for no, l in enumerate(logits):\n", " out = decoder.decode_beams(l, prune_history=True)\n", " d_lm, lm_state, timesteps, logit_score, lm_score = out[0]\n", " print(no, d_lm)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muadz bin jabal tadi ni allah ma 
ini\n", "1 hello nama saya san saya tak sumanding\n", "2 hello nama saya hussin saya suka mandi saman hari\n", "3 and then see how they row it in film okay actually\n", "4 then you tat to your eyes\n", "5 severson involve\n", "6 gei wo lai ge zhang jie zui xin de ge\n", "7 wo xiang shou kan jiang su yi shi pin dao de jie mu\n", "8 qiu yi shou ge de ming zi ge ci li you zhuan sheng yi wen wang shi qing shi xiu\n" ] } ], "source": [ "for no, l in enumerate(logits):\n", " o = ctc_beam_search_decoder(l, CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]\n", " print(no, o)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }