{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text CTC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoder model + CTC loss" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-ctc-model](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-ctc-model).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of the malaya-speech Pipeline; read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available CTC model" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>WER</th><th>CER</th><th>WER-LM</th><th>CER-LM</th><th>Language</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>hubert-conformer-tiny</th><td>36.6</td><td>10.3</td><td>0.335968</td><td>0.0882573</td><td>0.199227</td><td>0.0635223</td><td>[malay]</td></tr>\n",
"<tr><th>hubert-conformer</th><td>115</td><td>31.1</td><td>0.238714</td><td>0.0608998</td><td>0.141479</td><td>0.0450751</td><td>[malay]</td></tr>\n",
"<tr><th>hubert-conformer-large</th><td>392</td><td>100</td><td>0.220314</td><td>0.054927</td><td>0.128006</td><td>0.0385329</td><td>[malay]</td></tr>\n",
"<tr><th>hubert-conformer-large-3mixed</th><td>392</td><td>100</td><td>0.241126</td><td>0.0787939</td><td>0.132761</td><td>0.057482</td><td>[malay, singlish, mandarin]</td></tr>\n",
"<tr><th>best-rq-conformer-tiny</th><td>36.6</td><td>10.3</td><td>0.319291</td><td>0.078988</td><td>0.179582</td><td>0.055521</td><td>[malay]</td></tr>\n",
"<tr><th>best-rq-conformer</th><td>115</td><td>31.1</td><td>0.253678</td><td>0.0658045</td><td>0.154206</td><td>0.0482278</td><td>[malay]</td></tr>\n",
"<tr><th>best-rq-conformer-large</th><td>392</td><td>100</td><td>0.234651</td><td>0.0601605</td><td>0.130082</td><td>0.044521</td><td>[malay]</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Size (MB) Quantized Size (MB) WER \\\n", "hubert-conformer-tiny 36.6 10.3 0.335968 \n", "hubert-conformer 115 31.1 0.238714 \n", "hubert-conformer-large 392 100 0.220314 \n", "hubert-conformer-large-3mixed 392 100 0.241126 \n", "best-rq-conformer-tiny 36.6 10.3 0.319291 \n", "best-rq-conformer 115 31.1 0.253678 \n", "best-rq-conformer-large 392 100 0.234651 \n", "\n", " CER WER-LM CER-LM \\\n", "hubert-conformer-tiny 0.0882573 0.199227 0.0635223 \n", "hubert-conformer 0.0608998 0.141479 0.0450751 \n", "hubert-conformer-large 0.054927 0.128006 0.0385329 \n", "hubert-conformer-large-3mixed 0.0787939 0.132761 0.057482 \n", "best-rq-conformer-tiny 0.078988 0.179582 0.055521 \n", "best-rq-conformer 0.0658045 0.154206 0.0482278 \n", "best-rq-conformer-large 0.0601605 0.130082 0.044521 \n", "\n", " Language \n", "hubert-conformer-tiny [malay] \n", "hubert-conformer [malay] \n", "hubert-conformer-large [malay] \n", "hubert-conformer-large-3mixed [malay, singlish, mandarin] \n", "best-rq-conformer-tiny [malay] \n", "best-rq-conformer [malay] \n", "best-rq-conformer-large [malay] " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.available_ctc()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Google Speech-to-Text accuracy\n", "\n", "We tested on the same malay dataset to compare malaya-speech models and Google Speech-to-Text, check the notebook at [benchmark-google-speech-malay-dataset.ipynb](https://github.com/huseinzol05/malaya-speech/blob/master/pretrained-model/prepare-stt/benchmark-google-speech-malay-dataset.ipynb)." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'malay': {'WER': 0.164775, 'CER': 0.059732},\n", " 'singlish': {'WER': 0.4941349, 'CER': 0.3026296}}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.google_accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Again, even though some models beat Google Speech-to-Text on CER, we need to be skeptical of these scores; the test set and postprocessing might favour malaya-speech**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load CTC model\n", "\n", "```python\n", "def deep_ctc(\n", " model: str = 'hubert-conformer', quantized: bool = False, **kwargs\n", "):\n", " \"\"\"\n", " Load Encoder-CTC ASR model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='hubert-conformer')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'hubert-conformer-tiny'`` - Finetuned HuBERT Conformer TINY.\n", " * ``'hubert-conformer'`` - Finetuned HuBERT Conformer.\n", " * ``'hubert-conformer-large'`` - Finetuned HuBERT Conformer LARGE.\n", " * ``'hubert-conformer-large-3mixed'`` - Finetuned HuBERT Conformer LARGE for (Malay + Singlish + Mandarin) languages.\n", " * ``'best-rq-conformer-tiny'`` - Finetuned BEST-RQ Conformer TINY.\n", " * ``'best-rq-conformer'`` - Finetuned BEST-RQ Conformer.\n", " * ``'best-rq-conformer-large'`` - Finetuned BEST-RQ Conformer LARGE.\n", "\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.tf.Wav2Vec2_CTC class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model = malaya_speech.stt.deep_ctc(model = 
'hubert-conformer-large')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Quantized deep model\n", "\n", "To load an 8-bit quantized model, simply pass `quantized = True`; the default is `False`.\n", "\n", "We can expect a slight accuracy drop from a quantized model, and it is not necessarily faster than the normal 32-bit float model; that depends entirely on the machine." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "quantized_model = malaya_speech.stt.deep_ctc(model = 'hubert-conformer-large', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n", "record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n", "record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(ceramah, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can hear, the speaker speaks in the Kedahan dialect with some Arabic words; let's see how well our model does." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record2, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using greedy decoder\n", "\n", "```python\n", "def greedy_decoder(self, inputs):\n", " \"\"\"\n", " Transcribe inputs using greedy decoder.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.5 s, sys: 5.54 s, total: 22 s\n", "Wall time: 4.2 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah maini',\n", " 'helo nama saya esin saya tak suka mandi ketak saya masak',\n", " 'helo nama saya musin saya suka mandi saya mandi titiap hari']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.greedy_decoder([ceramah, record1, record2])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.7 s, sys: 5.5 s, total: 22.2 s\n", "Wall time: 4.15 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah maini',\n", " 'helo nama 
saya esin saya tak suka mandi ketak saya masak',\n", " 'helo nama saya musin saya suka mandi saya mandi titiap hari']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.greedy_decoder([ceramah, record1, record2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using beam decoder\n", "\n", "```python\n", "def beam_decoder(self, inputs, beam_width: int = 100):\n", " \"\"\"\n", " Transcribe inputs using beam decoder.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", " beam_width: int, optional (default=100)\n", " beam size for beam decoder.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 26.9 s, sys: 11.8 s, total: 38.7 s\n", "Wall time: 21.9 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah ma ini',\n", " 'helo nama saya esin saya tak suka mandi ketak saya masak',\n", " 'helo nama saya musin saya suka mandi saya mandi titiap hari']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.beam_decoder([ceramah, record1, record2])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 26.5 s, sys: 11 s, total: 37.5 s\n", "Wall time: 19.3 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah ma ini',\n", " 'helo nama saya esin saya tak suka mandi ketak saya masak',\n", " 'helo nama saya musin saya suka mandi saya mandi titiap hari']" ] }, "execution_count": 12, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "quantized_model.beam_decoder([ceramah, record1, record2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict logits\n", "\n", "```python\n", "def predict_logits(self, inputs, norm_func=softmax):\n", " \"\"\"\n", " Predict logits from inputs.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", " norm_func: Callable, optional (default=malaya.utils.activation.softmax)\n", "\n", "\n", " Returns\n", " -------\n", " result: List[np.array]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 28.1 s, sys: 11.9 s, total: 40 s\n", "Wall time: 22.6 s\n" ] } ], "source": [ "%%time\n", "\n", "logits = model.predict_logits([ceramah, record1, record2])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([array([[1.8236330e-14, 1.0061867e-09, 9.1841962e-10, ..., 1.0257798e-12,\n", " 1.4030079e-09, 2.5916536e-04],\n", " [8.9639464e-15, 1.7033805e-09, 4.6274323e-10, ..., 6.6502495e-13,\n", " 8.0938090e-10, 4.3023058e-04],\n", " [3.1887525e-15, 2.8573975e-08, 1.4154197e-10, ..., 3.1815506e-13,\n", " 2.4540896e-09, 1.0617726e-03],\n", " ...,\n", " [8.8733091e-16, 1.2685842e-08, 6.6173206e-11, ..., 3.7230733e-13,\n", " 4.0759787e-12, 1.5587657e-04],\n", " [6.6120927e-16, 6.5794703e-09, 5.7139283e-11, ..., 4.9776345e-13,\n", " 4.2690807e-12, 8.7887602e-05],\n", " [5.4730716e-16, 2.1361308e-09, 9.3025414e-11, ..., 5.9886417e-13,\n", " 1.5507430e-11, 1.0300719e-04]], dtype=float32),\n", " array([[2.5148340e-15, 3.8227799e-09, 1.0598683e-09, ..., 4.1977169e-13,\n", " 7.1061929e-10, 3.9750684e-04],\n", " [2.9357024e-14, 4.2996367e-11, 8.0736609e-09, ..., 3.3508845e-12,\n", " 3.6292098e-11, 3.5416318e-07],\n", " [1.9910680e-15, 7.9035152e-09, 
3.4416633e-10, ..., 1.3513049e-12,\n", " 1.3454201e-09, 1.0257948e-04],\n", " ...,\n", " [3.6507369e-16, 3.2854697e-09, 3.0207747e-10, ..., 8.0089668e-13,\n", " 1.2993007e-10, 1.5945344e-03],\n", " [4.2754438e-16, 6.5015402e-09, 2.0224432e-10, ..., 8.0650563e-13,\n", " 1.6817953e-10, 1.4560440e-03],\n", " [3.1852618e-16, 6.3009313e-09, 1.1704311e-10, ..., 5.8428116e-13,\n", " 2.6164873e-10, 1.3219280e-03]], dtype=float32),\n", " array([[7.4554222e-15, 8.0248808e-10, 4.3944155e-09, ..., 7.1492242e-13,\n", " 1.8890089e-09, 8.0676808e-05],\n", " [2.0308339e-15, 2.7493043e-09, 9.9622988e-10, ..., 3.6468658e-13,\n", " 4.4663002e-09, 4.2535696e-04],\n", " [9.8700091e-16, 5.8224430e-09, 9.5636510e-10, ..., 2.5640658e-13,\n", " 6.0229244e-10, 5.0613942e-04],\n", " ...,\n", " [6.6229983e-16, 2.5211668e-09, 1.0523804e-10, ..., 8.1805699e-13,\n", " 2.0467164e-10, 1.5298247e-03],\n", " [7.8388564e-16, 3.5194940e-09, 8.4871700e-11, ..., 6.7016152e-13,\n", " 1.1333524e-10, 1.0733902e-03],\n", " [6.9106448e-16, 3.5260113e-09, 6.5353133e-11, ..., 4.5797551e-13,\n", " 1.2999080e-10, 9.0871076e-04]], dtype=float32)],\n", " (499, 39),\n", " (299, 39))" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logits, logits[0].shape, logits[1].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use output from `predict_logits` to feed into [ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders) or [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode) with language model to get better results." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }