{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text RNNT + GPT2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Encoder model + RNNT loss + GPT2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-transducer-model-lm-gpt2](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-transducer-model-lm-gpt2).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available RNNT model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>WER</th><th>CER</th><th>WER-LM</th><th>CER-LM</th><th>Language</th></tr>\n",
"<tr><th>tiny-conformer</th><td>24.4</td><td>9.14</td><td>0.212811</td><td>0.081369</td><td>0.199683</td><td>0.077004</td><td>[malay]</td></tr>\n",
"<tr><th>small-conformer</th><td>49.2</td><td>18.1</td><td>0.198533</td><td>0.074495</td><td>0.185361</td><td>0.071143</td><td>[malay]</td></tr>\n",
"<tr><th>conformer</th><td>125</td><td>37.1</td><td>0.163602</td><td>0.058744</td><td>0.156182</td><td>0.05719</td><td>[malay]</td></tr>\n",
"<tr><th>large-conformer</th><td>404</td><td>107</td><td>0.156684</td><td>0.061971</td><td>0.148622</td><td>0.05901</td><td>[malay]</td></tr>\n",
"<tr><th>conformer-stack-2mixed</th><td>130</td><td>38.5</td><td>0.103608</td><td>0.050069</td><td>0.102911</td><td>0.050201</td><td>[malay, singlish]</td></tr>\n",
"<tr><th>conformer-stack-3mixed</th><td>130</td><td>38.5</td><td>0.234768</td><td>0.133944</td><td>0.229241</td><td>0.130702</td><td>[malay, singlish, mandarin]</td></tr>\n",
"<tr><th>small-conformer-singlish</th><td>49.2</td><td>18.1</td><td>0.087831</td><td>0.045686</td><td>0.087333</td><td>0.045317</td><td>[singlish]</td></tr>\n",
"<tr><th>conformer-singlish</th><td>125</td><td>37.1</td><td>0.077792</td><td>0.040362</td><td>0.077186</td><td>0.03987</td><td>[singlish]</td></tr>\n",
"<tr><th>large-conformer-singlish</th><td>404</td><td>107</td><td>0.070147</td><td>0.035872</td><td>0.069812</td><td>0.035723</td><td>[singlish]</td></tr>\n",
"<tr><th>xs-squeezeformer</th><td>51.9</td><td>23.4</td><td>0.198092</td><td>0.079035</td><td>0.198842</td><td>0.078122</td><td>[malay]</td></tr>\n",
"<tr><th>sm-squeezeformer</th><td>147</td><td>47.4</td><td>0.176127</td><td>0.068079</td><td>0.16873</td><td>0.061468</td><td>[malay]</td></tr>\n",
"<tr><th>m-squeezeformer</th><td>261</td><td>78.5</td><td>0.167008</td><td>0.059728</td><td>0.156185</td><td>0.053639</td><td>[malay]</td></tr>\n",
"</table>
\n", "
" ], "text/plain": [ " Size (MB) Quantized Size (MB) WER CER \\\n", "tiny-conformer 24.4 9.14 0.212811 0.081369 \n", "small-conformer 49.2 18.1 0.198533 0.074495 \n", "conformer 125 37.1 0.163602 0.058744 \n", "large-conformer 404 107 0.156684 0.061971 \n", "conformer-stack-2mixed 130 38.5 0.103608 0.050069 \n", "conformer-stack-3mixed 130 38.5 0.234768 0.133944 \n", "small-conformer-singlish 49.2 18.1 0.087831 0.045686 \n", "conformer-singlish 125 37.1 0.077792 0.040362 \n", "large-conformer-singlish 404 107 0.070147 0.035872 \n", "xs-squeezeformer 51.9 23.4 0.198092 0.079035 \n", "sm-squeezeformer 147 47.4 0.176127 0.068079 \n", "m-squeezeformer 261 78.5 0.167008 0.059728 \n", "\n", " WER-LM CER-LM Language \n", "tiny-conformer 0.199683 0.077004 [malay] \n", "small-conformer 0.185361 0.071143 [malay] \n", "conformer 0.156182 0.05719 [malay] \n", "large-conformer 0.148622 0.05901 [malay] \n", "conformer-stack-2mixed 0.102911 0.050201 [malay, singlish] \n", "conformer-stack-3mixed 0.229241 0.130702 [malay, singlish, mandarin] \n", "small-conformer-singlish 0.087333 0.045317 [singlish] \n", "conformer-singlish 0.077186 0.03987 [singlish] \n", "large-conformer-singlish 0.069812 0.035723 [singlish] \n", "xs-squeezeformer 0.198842 0.078122 [malay] \n", "sm-squeezeformer 0.16873 0.061468 [malay] \n", "m-squeezeformer 0.156185 0.053639 [malay] " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.available_transducer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lower is better. The mixed models were tested on a different dataset." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load RNNT model\n", "\n", "```python\n", "def deep_transducer(\n", " model: str = 'conformer', quantized: bool = False, **kwargs\n", "):\n", " \"\"\"\n", " Load Encoder-Transducer ASR model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='conformer')\n", " Check available models at `malaya_speech.stt.available_transducer()`.\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " A quantized model is not necessarily faster; it depends entirely on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.transducer.Transducer class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-09-15 13:15:52.398400: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA\n", "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "2022-09-15 13:15:52.402512: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n", "2022-09-15 13:15:52.402530: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31\n", "2022-09-15 13:15:52.402533: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31\n", "2022-09-15 13:15:52.402613: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program\n", "2022-09-15 13:15:52.402633: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3\n" ] } ], "source": [ 
"small_model = malaya_speech.stt.deep_transducer(model = 'small-conformer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n", "record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n", "record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')\n", "shafiqah_idayu, sr = malaya_speech.load('speech/example-speaker/shafiqah-idayu.wav')\n", "mas_aisyah, sr = malaya_speech.load('speech/example-speaker/mas-aisyah.wav')\n", "khalil, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(ceramah, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can hear, the speaker speaks in the Kedahan dialect with some Arabic words. Let's see how well our model handles it." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record1, rate = sr)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record2, rate = sr)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(shafiqah_idayu, rate = sr)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(mas_aisyah, rate = sr)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(khalil, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load GPT2\n", "\n", "**For better performance you need a really good GPT2 model. We are doing our best to release one**." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "language_model = malaya_speech.language_model.gpt2(alpha = 0.01, beta = 0.2)\n", "language_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using beam decoder language model\n", "\n", "```python\n", "def beam_decoder_lm(self, inputs, language_model,\n", " beam_width: int = 5,\n", " token_min_logp: float = -20.0,\n", " beam_prune_logp: float = -5.0,\n", " temperature: float = 0.0,\n", " score_norm: bool = True):\n", " \"\"\"\n", " Transcribe inputs using beam decoder + KenLM.\n", "\n", " Parameters\n", " ----------\n", " inputs: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", " language_model: pyctcdecode.language_model.LanguageModel\n", " pyctcdecode language model, load from `LanguageModel(kenlm_model, alpha = alpha, beta = beta)`.\n", " beam_width: int, optional (default=5)\n", " beam size for beam decoder.\n", " token_min_logp: float, optional (default=-20.0)\n", " minimum log probability to select a token.\n", " beam_prune_logp: float, optional (default=-5.0)\n", " filter candidates >= max score lm + `beam_prune_logp`.\n", " temperature: float, optional (default=0.0)\n", " apply temperature function for logits, can help for certain case,\n", " logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature\n", " score_norm: bool, optional (default=True)\n", " descending sort beam based on score / length of decoded.\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/dev/malaya-speech/malaya_speech/torch_model/gpt2_lm.py:42: UserWarning: To copy construct from a tensor, it is recommended to use 
sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).\n", " context = to_tensor_cuda(torch.tensor(tokenized)[0], cuda)\n", "/home/husein/dev/malaya-speech/malaya_speech/torch_model/gpt2_lm.py:56: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).\n", " context = to_tensor_cuda(torch.tensor(tokenized), cuda)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 11min 13s, sys: 11.3 s, total: 11min 25s\n", "Wall time: 1min 1s\n" ] }, { "data": { "text/plain": [ "['tolong sebut anti kata']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "small_model.beam_decoder_lm([khalil], language_model, beam_width = 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The RNNT beam decoder with a language model cannot utilise batch processing; if you feed it a batch, the inputs are processed one by one**." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }