{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text CTC + pyctcdecode + MLM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoder model + CTC loss + pyctcdecode with a masked language model (MLM)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-ctc-model-pyctcdecode-mlm](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-ctc-model-ctc-pyctcdecode-mlm).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of the malaya-speech Pipeline; read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This interface is deprecated; use the HuggingFace interface instead.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.\n" ] } ], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('default')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install pyctcdecode\n", "\n", "#### From PyPI\n", "\n", "```bash\n", "pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121\n", "```\n", "\n", "#### From source\n", "\n", "Check https://github.com/kensho-technologies/pyctcdecode for how to build from source in case no wheel is available for your operating system.\n", "\n", "Building from source should only take a few minutes.\n", "\n", "#### Benefit\n", "\n", "1. pyctcdecode is more accurate than ctc-decoders in certain cases, but slower than ctc-decoders.\n", "2. It installs with pip alone, with no need to compile." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available CTC model" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya-speech/malaya_speech/stt/ctc.py:144: DeprecationWarning: `malaya.stt.ctc.available_transformer` is deprecated, use `malaya.stt.ctc.available_huggingface` instead\n", " warnings.warn(\n", "INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n", "INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n", "INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
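<table><thead><tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>malay-malaya</th><th>Language</th></tr></thead><tbody>
<tr><th>hubert-conformer-tiny</th><td>36.6</td><td>10.3</td><td>{'WER': 0.238714008166, 'CER': 0.060899814, 'W...</td><td>[malay]</td></tr>
<tr><th>hubert-conformer</th><td>115</td><td>31.1</td><td>{'WER': 0.2387140081, 'CER': 0.06089981404, 'W...</td><td>[malay]</td></tr>
<tr><th>hubert-conformer-large</th><td>392</td><td>100</td><td>{'WER': 0.2203140421, 'CER': 0.0549270416, 'WE...</td><td>[malay]</td></tr></tbody></table>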
\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) \\\n", "hubert-conformer-tiny 36.6 10.3 \n", "hubert-conformer 115 31.1 \n", "hubert-conformer-large 392 100 \n", "\n", " malay-malaya \\\n", "hubert-conformer-tiny {'WER': 0.238714008166, 'CER': 0.060899814, 'W... \n", "hubert-conformer {'WER': 0.2387140081, 'CER': 0.06089981404, 'W... \n", "hubert-conformer-large {'WER': 0.2203140421, 'CER': 0.0549270416, 'WE... \n", "\n", " Language \n", "hubert-conformer-tiny [malay] \n", "hubert-conformer [malay] \n", "hubert-conformer-large [malay] " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.ctc.available_transformer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load CTC model\n", "\n", "```python\n", "def transformer(\n", " model: str = 'hubert-conformer',\n", " quantized: bool = False,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load Encoder-CTC ASR model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='hubert-conformer')\n", " Check available models at `malaya_speech.stt.ctc.available_transformer()`.\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.wav2vec.Wav2Vec2_CTC class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-09-17 15:08:48.914279: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n", "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "2022-09-17 15:08:48.918355: E 
tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n", "2022-09-17 15:08:48.918373: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31\n", "2022-09-17 15:08:48.918376: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31\n", "2022-09-17 15:08:48.918455: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program\n", "2022-09-17 15:08:48.918471: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3\n" ] } ], "source": [ "model = malaya_speech.stt.ctc.transformer(model = 'hubert-conformer-large')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n", "record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n", "record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(ceramah, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can hear, the speaker uses the Kedahan dialect plus some Arabic words; let's see how well our model handles it." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record2, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is the output from the beam decoder without a language model, for comparison:\n", "\n", "```python\n", "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah ma ini',\n", " 'helo nama saya esin saya tak suka mandi ketak saya masak',\n", " 'helo nama saya musin saya suka mandi saya mandi titiap hari']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict logits\n", "\n", "```python\n", "def predict_logits(self, inputs):\n", " \"\"\"\n", " Predict logits from inputs.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", "\n", "\n", " Returns\n", " -------\n", " result: List[np.array]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 22.1 s, sys: 3.28 s, total: 25.4 s\n", "Wall time: 5.43 s\n" ] } ], "source": [ "%%time\n", "\n", "logits = model.predict_logits([ceramah, record1, record2])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(499, 39)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logits[0].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load pyctcdecode + MLM\n", 
"\n", "**To get better results, you need a really good masked language model; we are doing our best to release one.**" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm = malaya_speech.language_model.mlm(alpha = 0.01, beta = 0.5)\n", "lm" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from pyctcdecode import Alphabet, BeamSearchDecoderCTC\n", "from malaya_speech.utils.char import CTC_VOCAB\n", "\n", "labels = CTC_VOCAB + ['_']\n", "ctc_token_idx = len(CTC_VOCAB)\n", "alphabet = Alphabet.build_alphabet(labels, ctc_token_idx=ctc_token_idx)\n", "decoder = BeamSearchDecoderCTC(alphabet, lm)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah maini'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = decoder.decode_beams(logits[0], prune_history=True, beam_width = 10)\n", "d_lm, lm_state, timesteps, logit_score, lm_score = out[0]\n", "d_lm" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'helo nama saya besin saya tak suka mandi ketat saya masak'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = 
decoder.decode_beams(logits[1], prune_history=True, beam_width = 10)\n", "d_lm, lm_state, timesteps, logit_score, lm_score = out[0]\n", "d_lm" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'helo nama saya musin saya suka mandi saya mandi titip hari'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "out = decoder.decode_beams(logits[2], prune_history=True, beam_width = 10)\n", "d_lm, lm_state, timesteps, logit_score, lm_score = out[0]\n", "d_lm" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }