{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text CTC + pyctcdecode + MLM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoder model + CTC loss + pyctcdecode with Masked Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-ctc-model-pyctcdecode-mlm](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-ctc-model-ctc-pyctcdecode-mlm).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it not save to use on different languages. Pretrained models trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install pyctcdecode\n", "\n", "#### From PYPI\n", "\n", "```bash\n", "pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121\n", "```\n", "\n", "#### From source\n", "\n", "Check https://github.com/kensho-technologies/pyctcdecode how to build from source incase there is no available wheel for your operating system.\n", "\n", "Building from source should only take a few minutes.\n", "\n", "#### Benefit\n", "\n", "1. pyctcdecode accurate than ctc-decoders for certain cases, but slower than pyctcdecode.\n", "2. pip install and done, no need to compile." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available CTC model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Size (MB)Quantized Size (MB)WERCERWER-LMCER-LMLanguage
hubert-conformer-tiny36.610.30.3359680.0882570.1992270.063522[malay]
hubert-conformer11531.10.2387140.06090.1414790.045075[malay]
hubert-conformer-large3921000.2203140.0549270.1280060.038533[malay]
hubert-conformer-large-3mixed3921000.2411260.0787940.1327610.057482[malay, singlish, mandarin]
best-rq-conformer-tiny36.610.30.3192910.0789880.1795820.055521[malay]
best-rq-conformer11531.10.2536780.0658050.1542060.048228[malay]
best-rq-conformer-large3921000.2346510.060160.1300820.044521[malay]
\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) WER \\\n", "hubert-conformer-tiny 36.6 10.3 0.335968 \n", "hubert-conformer 115 31.1 0.238714 \n", "hubert-conformer-large 392 100 0.220314 \n", "hubert-conformer-large-3mixed 392 100 0.241126 \n", "best-rq-conformer-tiny 36.6 10.3 0.319291 \n", "best-rq-conformer 115 31.1 0.253678 \n", "best-rq-conformer-large 392 100 0.234651 \n", "\n", " CER WER-LM CER-LM \\\n", "hubert-conformer-tiny 0.088257 0.199227 0.063522 \n", "hubert-conformer 0.0609 0.141479 0.045075 \n", "hubert-conformer-large 0.054927 0.128006 0.038533 \n", "hubert-conformer-large-3mixed 0.078794 0.132761 0.057482 \n", "best-rq-conformer-tiny 0.078988 0.179582 0.055521 \n", "best-rq-conformer 0.065805 0.154206 0.048228 \n", "best-rq-conformer-large 0.06016 0.130082 0.044521 \n", "\n", " Language \n", "hubert-conformer-tiny [malay] \n", "hubert-conformer [malay] \n", "hubert-conformer-large [malay] \n", "hubert-conformer-large-3mixed [malay, singlish, mandarin] \n", "best-rq-conformer-tiny [malay] \n", "best-rq-conformer [malay] \n", "best-rq-conformer-large [malay] " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.available_ctc()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load CTC model\n", "\n", "```python\n", "def deep_ctc(\n", " model: str = 'hubert-conformer', quantized: bool = False, **kwargs\n", "):\n", " \"\"\"\n", " Load Encoder-CTC ASR model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='hubert-conformer')\n", " Check available models at `malaya_speech.stt.available_ctc()`.\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.wav2vec.Wav2Vec2_CTC class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-09-17 15:08:48.914279: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n", "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "2022-09-17 15:08:48.918355: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n", "2022-09-17 15:08:48.918373: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31\n", "2022-09-17 15:08:48.918376: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31\n", "2022-09-17 15:08:48.918455: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program\n", "2022-09-17 15:08:48.918471: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3\n" ] } ], "source": [ "model = malaya_speech.stt.deep_ctc(model = 'hubert-conformer-large')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n", "record1, sr = 
malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n", "record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "<IPython.lib.display.Audio object>" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(ceramah, rate = sr)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As we can hear, the speaker speaks in a Kedahan dialect plus some Arabic words; let us see how well our model does." ] },
{ "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "<IPython.lib.display.Audio object>" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record1, rate = sr)" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "<IPython.lib.display.Audio object>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record2, rate = sr)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "For comparison, below is the output from the beam decoder without a language model,\n", "\n", "```python\n", "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah ma ini',\n", " 'helo nama saya esin saya tak suka mandi ketak saya masak',\n", " 'helo nama saya musin saya suka mandi saya mandi titiap hari']\n", "```" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Predict logits\n", "\n", "```python\n", "def predict_logits(self, inputs):\n", " \"\"\"\n", " Predict logits from inputs.\n", "\n", " Parameters\n", " ----------\n", " inputs: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", "\n", "\n", " Returns\n", " -------\n", " result: List[np.array]\n", " \"\"\"\n", "```" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 22.1 s, sys: 3.28 s, total: 25.4 s\n", "Wall time: 5.43 s\n" ] } ], "source": [ "%%time\n", "\n", "logits = model.predict_logits([ceramah, record1, record2])" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(499, 39)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logits[0].shape" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Load pyctcdecode + MLM\n", "\n", "**To get better performance, you need a really good masked language model; we are trying our very best to release one**."
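] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cells below, `alpha` and `beta` are assumed to follow the usual pyctcdecode convention: `alpha` scales the language model score during shallow fusion and `beta` is a length-score adjustment; treat the values used here as a starting point rather than tuned constants.\n", "\n", "Once the decoder below is built, a small helper like this (a sketch, not part of malaya-speech) decodes every utterance in one loop using the same `decode_beams` call demonstrated in the next cells:\n", "\n", "```python\n", "from typing import List\n", "\n", "import numpy as np\n", "\n", "def decode_all(decoder, logits: List[np.ndarray], beam_width: int = 10) -> List[str]:\n", "    # decode_beams returns beams with the best first; each beam is\n", "    # (text, lm_state, timesteps, logit_score, lm_score), so [0][0] is the best text.\n", "    results = []\n", "    for logit in logits:\n", "        out = decoder.decode_beams(logit, prune_history=True, beam_width=beam_width)\n", "        results.append(out[0][0])\n", "    return results\n", "\n", "# decode_all(decoder, logits) should reproduce the three transcriptions below.\n", "```"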
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm = malaya_speech.language_model.mlm(alpha = 0.01, beta = 0.5)\n", "lm" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from pyctcdecode import Alphabet, BeamSearchDecoderCTC\n", "from malaya_speech.utils.char import CTC_VOCAB\n", "\n", "labels = CTC_VOCAB + ['_']\n", "ctc_token_idx = len(CTC_VOCAB)\n", "alphabet = Alphabet.build_alphabet(labels, ctc_token_idx=ctc_token_idx)\n", "decoder = BeamSearchDecoderCTC(alphabet, lm)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah maini'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = decoder.decode_beams(logits[0], prune_history=True, beam_width = 10)\n", "d_lm, lm_state, timesteps, logit_score, lm_score = out[0]\n", "d_lm" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'helo nama saya besin saya tak suka mandi ketat saya masak'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = decoder.decode_beams(logits[1], prune_history=True, beam_width = 10)\n", "d_lm, lm_state, timesteps, logit_score, lm_score = out[0]\n", "d_lm" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'helo nama saya musin saya suka mandi saya mandi titip hari'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = decoder.decode_beams(logits[2], prune_history=True, beam_width = 10)\n", "d_lm, lm_state, timesteps, logit_score, lm_score = out[0]\n", "d_lm" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }