{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text CTC + CTC Decoders" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoder model + CTC loss + CTC Decoders with KenLM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-ctc-model-ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-ctc-model-ctc-decoders).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it not save to use on different languages. Pretrained models trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install ctc-decoders\n", "\n", "#### From PYPI\n", "\n", "```bash\n", "pip3 install ctc-decoders\n", "```\n", "\n", "But if you use linux, we unable to upload linux wheels to pypi repository, so download linux wheel at [malaya-speech/ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders#available-whl).\n", "\n", "#### From source\n", "\n", "Check [malaya-speech/ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders#from-source) how to build from source incase there is no available wheel for your operating system.\n", "\n", "Building from source should only take a few minutes.\n", "\n", "#### Benefit\n", "\n", "1. ctc-decoders faster than pyctcdecode, ~26x faster based on husein benchmark, but very slightly less accurate than pyctcdecode." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available CTC model" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Size (MB)Quantized Size (MB)WERCERWER-LMCER-LMLanguage
hubert-conformer-tiny36.610.30.3359680.08825730.1992270.0635223[malay]
hubert-conformer11531.10.2387140.06089980.1414790.0450751[malay]
hubert-conformer-large3921000.2203140.0549270.1280060.0385329[malay]
hubert-conformer-large-3mixed3921000.2411260.07879390.1327610.057482[malay, singlish, mandarin]
best-rq-conformer-tiny36.610.30.3192910.0789880.1795820.055521[malay]
best-rq-conformer11531.10.2536780.06580450.1542060.0482278[malay]
best-rq-conformer-large3921000.2346510.06016050.1300820.044521[malay]
\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) WER \\\n", "hubert-conformer-tiny 36.6 10.3 0.335968 \n", "hubert-conformer 115 31.1 0.238714 \n", "hubert-conformer-large 392 100 0.220314 \n", "hubert-conformer-large-3mixed 392 100 0.241126 \n", "best-rq-conformer-tiny 36.6 10.3 0.319291 \n", "best-rq-conformer 115 31.1 0.253678 \n", "best-rq-conformer-large 392 100 0.234651 \n", "\n", " CER WER-LM CER-LM \\\n", "hubert-conformer-tiny 0.0882573 0.199227 0.0635223 \n", "hubert-conformer 0.0608998 0.141479 0.0450751 \n", "hubert-conformer-large 0.054927 0.128006 0.0385329 \n", "hubert-conformer-large-3mixed 0.0787939 0.132761 0.057482 \n", "best-rq-conformer-tiny 0.078988 0.179582 0.055521 \n", "best-rq-conformer 0.0658045 0.154206 0.0482278 \n", "best-rq-conformer-large 0.0601605 0.130082 0.044521 \n", "\n", " Language \n", "hubert-conformer-tiny [malay] \n", "hubert-conformer [malay] \n", "hubert-conformer-large [malay] \n", "hubert-conformer-large-3mixed [malay, singlish, mandarin] \n", "best-rq-conformer-tiny [malay] \n", "best-rq-conformer [malay] \n", "best-rq-conformer-large [malay] " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.available_ctc()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load CTC model\n", "\n", "```python\n", "def deep_ctc(\n", " model: str = 'hubert-conformer', quantized: bool = False, **kwargs\n", "):\n", " \"\"\"\n", " Load Encoder-CTC ASR model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='hubert-conformer')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'hubert-conformer-tiny'`` - Finetuned HuBERT Conformer TINY.\n", " * ``'hubert-conformer'`` - Finetuned HuBERT Conformer.\n", " * ``'hubert-conformer-large'`` - Finetuned HuBERT Conformer LARGE.\n", " * ``'hubert-conformer-large-3mixed'`` - Finetuned HuBERT Conformer LARGE for (Malay + Singlish + Mandarin) languages.\n", " * ``'best-rq-conformer-tiny'`` - Finetuned BEST-RQ Conformer TINY.\n", " * ``'best-rq-conformer'`` - Finetuned BEST-RQ Conformer.\n", " * ``'best-rq-conformer-large'`` - Finetuned BEST-RQ Conformer LARGE.\n", "\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.tf.Wav2Vec2_CTC class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model = malaya_speech.stt.deep_ctc(model = 'hubert-conformer-large')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n", "record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n", "record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(ceramah, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can hear, the speaker speaks in kedahan dialects plus some arabic words, let see how 
good our model is." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record2, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, below is the output from beam decoder without language model,\n", "\n", "```python\n", "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah ma ini',\n", " 'helo nama saya esin saya tak suka mandi ketak saya masak',\n", " 'helo nama saya musin saya suka mandi saya mandi titiap hari']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict logits\n", "\n", "```python\n", "def predict_logits(self, inputs):\n", " \"\"\"\n", " Predict logits from inputs.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", "\n", "\n", " Returns\n", " -------\n", " result: List[np.array]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 26.5 s, sys: 10.4 s, total: 36.9 s\n", "Wall time: 20.3 s\n" ] } ], "source": [ "%%time\n", "\n", "logits = model.predict_logits([ceramah, record1, record2])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(499, 39)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logits[0].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load ctc-decoders\n", "\n", "I will use `dump-combined` for this example." 
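,
 "\n",
 "\n",
 "The `Scorer(0.5, 1.0, lm, CTC_VOCAB)` call in the cells below follows the DeepSpeech-style `Scorer(alpha, beta, language_model, vocabulary)` interface; here `alpha` is assumed to weight the KenLM score and `beta` to act as a word insertion bonus. Below is a minimal sketch of the same decoding loop applied to every logits matrix at once, assuming `lm` and `logits` from the cells above; the cells that follow do the same thing one utterance at a time.\n",
 "\n",
 "```python\n",
 "from ctc_decoders import Scorer, ctc_beam_search_decoder\n",
 "from malaya_speech.utils.char import CTC_VOCAB\n",
 "\n",
 "# assumption: alpha = 0.5 weights the KenLM score, beta = 1.0 is a word insertion bonus\n",
 "scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)\n",
 "for l in logits:\n",
 "    # beam search with beam size 20; the decoder returns (score, transcript)\n",
 "    # candidates with the best candidate first, so take [0][1]\n",
 "    print(ctc_beam_search_decoder(l, CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1])\n",
 "```"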
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from ctc_decoders import Scorer\n", "from ctc_decoders import ctc_beam_search_decoder\n", "from malaya_speech.utils.char import CTC_VOCAB" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "lm = malaya_speech.language_model.kenlm(model = 'dump-combined')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah maini'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "o = ctc_beam_search_decoder(logits[0], CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]\n", "o" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'helo nama saya mesin saya tak suka mandi ketat saya masak'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "o = ctc_beam_search_decoder(logits[1], CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]\n", "o" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'helo nama saya mesin saya suka mandi saya mandi titik hari'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "o = ctc_beam_search_decoder(logits[2], CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]\n", "o" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }