{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Language Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/ctc-language-model](https://github.com/huseinzol05/malaya-speech/tree/master/example/ctc-language-model).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Purpose" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When doing CTC or RNNT beam decoding, we want to add language bias during find the optimum alignment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available Language Model\n", "\n", "We provided language model for our ASR models," ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "  <thead>\n", "    <tr><th></th><th>Size (MB)</th><th>LM order</th><th>Description</th><th>Command</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>bahasa</th><td>17</td><td>3</td><td>Gathered from malaya-speech ASR bahasa transcript</td><td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td></tr>\n",
"    <tr><th>bahasa-news</th><td>24</td><td>3</td><td>Gathered from malaya-speech bahasa ASR transcr...</td><td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td></tr>\n",
"    <tr><th>bahasa-combined</th><td>29</td><td>3</td><td>Gathered from malaya-speech ASR bahasa transcr...</td><td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td></tr>\n",
"    <tr><th>redape-community</th><td>887.1</td><td>4</td><td>Mirror for https://github.com/redapesolutions/...</td><td>[./lmplz --text text.txt --arpa out.arpa -o 4 ...</td></tr>\n",
"    <tr><th>dump-combined</th><td>310</td><td>3</td><td>Academia + News + IIUM + Parliament + Watpadd ...</td><td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td></tr>\n",
"    <tr><th>manglish</th><td>202</td><td>3</td><td>Manglish News + Manglish Reddit + Manglish for...</td><td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td></tr>\n",
"    <tr><th>bahasa-manglish-combined</th><td>608</td><td>3</td><td>Combined `dump-combined` and `manglish`.</td><td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td></tr>\n",
"  </tbody>\n", "</table>\n", "
" ], "text/plain": [ " Size (MB) LM order \\\n", "bahasa 17 3 \n", "bahasa-news 24 3 \n", "bahasa-combined 29 3 \n", "redape-community 887.1 4 \n", "dump-combined 310 3 \n", "manglish 202 3 \n", "bahasa-manglish-combined 608 3 \n", "\n", " Description \\\n", "bahasa Gathered from malaya-speech ASR bahasa transcript \n", "bahasa-news Gathered from malaya-speech bahasa ASR transcr... \n", "bahasa-combined Gathered from malaya-speech ASR bahasa transcr... \n", "redape-community Mirror for https://github.com/redapesolutions/... \n", "dump-combined Academia + News + IIUM + Parliament + Watpadd ... \n", "manglish Manglish News + Manglish Reddit + Manglish for... \n", "bahasa-manglish-combined Combined `dump-combined` and `manglish`. \n", "\n", " Command \n", "bahasa [./lmplz --text text.txt --arpa out.arpa -o 3 ... \n", "bahasa-news [./lmplz --text text.txt --arpa out.arpa -o 3 ... \n", "bahasa-combined [./lmplz --text text.txt --arpa out.arpa -o 3 ... \n", "redape-community [./lmplz --text text.txt --arpa out.arpa -o 4 ... \n", "dump-combined [./lmplz --text text.txt --arpa out.arpa -o 3 ... \n", "manglish [./lmplz --text text.txt --arpa out.arpa -o 3 ... \n", "bahasa-manglish-combined [./lmplz --text text.txt --arpa out.arpa -o 3 ... " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.available_language_model()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`redape-community` got from https://github.com/redapesolutions/suara-kami-community, another good malay speech-to-text repository." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Language Model\n", "\n", "```python\n", "def language_model(\n", " model: str = 'dump-combined', **kwargs\n", "):\n", " \"\"\"\n", " Load KenLM language model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='dump-combined')\n", " Model architecture supported. 
Allowed values:\n", "\n", " * ``'bahasa'`` - Gathered from malaya-speech ASR bahasa transcript.\n", " * ``'bahasa-news'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).\n", " * ``'bahasa-combined'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).\n", " * ``'redape-community'`` - Mirror for https://github.com/redapesolutions/suara-kami-community\n", " * ``'dump-combined'`` - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.\n", " * ``'manglish'`` - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.\n", " * ``'bahasa-manglish-combined'`` - Combined `dump-combined` and `manglish`.\n", "\n", " Returns\n", " -------\n", " result : str\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'/Users/huseinzolkepli/Malaya-Speech/language-model/bahasa/model.trie.klm'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm = malaya_speech.stt.language_model(model = 'bahasa')\n", "lm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build custom Language Model\n", "\n", "1. Build KenLM,\n", "\n", "```bash\n", "wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz\n", "mkdir kenlm/build\n", "cd kenlm/build\n", "cmake ..\n", "make -j2\n", "```\n", "\n", "2. Prepare a newline-delimited text file (one sentence per line), then train the n-gram model and convert it to a binary trie. Feel free to use some text from https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping.\n", "\n", "```bash\n", "kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1\n", "kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm\n", "```\n", "\n", "3. 
Once you have `out.trie.klm`, you can load it into the scorer interface.\n", "\n", "```python\n", "from ctc_decoders import Scorer\n", "\n", "scorer = Scorer(alpha, beta, 'out.trie.klm', vocab_list)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use ctc-decoders\n", "\n", "#### From PYPI\n", "\n", "```bash\n", "pip3 install ctc-decoders\n", "```\n", "\n", "We are unable to upload Linux wheels to the PyPI repository, so if you use Linux, download a Linux wheel from [malaya-speech/ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders#available-whl).\n", "\n", "#### From source\n", "\n", "See [malaya-speech/ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders#from-source) for how to build from source in case no wheel is available for your operating system.\n", "\n", "Building from source should only take a few minutes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load ctc-decoders" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "from ctc_decoders import Scorer\n", "from malaya_speech.utils.char import CTC_VOCAB" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```text\n", "Init signature: Scorer(alpha, beta, model_path, vocabulary)\n", "Docstring: \n", "Wrapper for Scorer.\n", "\n", ":param alpha: Parameter associated with language model. Don't use\n", " language model when alpha = 0.\n", ":type alpha: float\n", ":param beta: Parameter associated with word count. 
Don't use word\n", " count when beta = 0.\n", ":type beta: float\n", ":model_path: Path to load language model.\n", ":type model_path: basestring\n", "```" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " >" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)\n", "scorer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from ctc_decoders import ctc_greedy_decoder, ctc_beam_search_decoder\n", "import numpy as np\n", "import malaya_speech" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# https://github.com/PaddlePaddle/DeepSpeech/blob/master/decoders/tests/test_decoders.py\n", "\n", "vocab_list = [\"\\'\", ' ', 'a', 'b', 'c', 'dk ']\n", "beam_size = 20\n", "probs_seq1 = [[\n", " 0.06390443, 0.21124858, 0.27323887, 0.06870235, 0.0361254,\n", " 0.18184413, 0.16493624\n", "], [\n", " 0.03309247, 0.22866108, 0.24390638, 0.09699597, 0.31895462,\n", " 0.0094893, 0.06890021\n", "], [\n", " 0.218104, 0.19992557, 0.18245131, 0.08503348, 0.14903535,\n", " 0.08424043, 0.08120984\n", "], [\n", " 0.12094152, 0.19162472, 0.01473646, 0.28045061, 0.24246305,\n", " 0.05206269, 0.09772094\n", "], [\n", " 0.1333387, 0.00550838, 0.00301669, 0.21745861, 0.20803985,\n", " 0.41317442, 0.01946335\n", "], [\n", " 0.16468227, 0.1980699, 0.1906545, 0.18963251, 0.19860937,\n", " 0.04377724, 0.01457421\n", "]]\n", "probs_seq2 = [[\n", " 0.08034842, 0.22671944, 0.05799633, 0.36814645, 0.11307441,\n", " 0.04468023, 0.10903471\n", "], [\n", " 0.09742457, 0.12959763, 0.09435383, 0.21889204, 0.15113123,\n", " 0.10219457, 0.20640612\n", "], [\n", " 0.45033529, 0.09091417, 0.15333208, 0.07939558, 0.08649316,\n", " 0.12298585, 0.01654384\n", "], [\n", " 0.02512238, 0.22079203, 0.19664364, 0.11906379, 
0.07816055,\n", " 0.22538587, 0.13483174\n", "], [\n", " 0.17928453, 0.06065261, 0.41153005, 0.1172041, 0.11880313,\n", " 0.07113197, 0.04139363\n", "], [\n", " 0.15882358, 0.1235788, 0.23376776, 0.20510435, 0.00279306,\n", " 0.05294827, 0.22298418\n", "]]\n", "greedy_result = [\"ac'bdk c\", \"b'dk a\"]\n", "beam_search_result = ['acdk c', \"b'a\"]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ctc_greedy_decoder(np.array(probs_seq1), vocab_list) == greedy_result[0]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ctc_greedy_decoder(np.array(probs_seq2), vocab_list) == greedy_result[1]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(-6.480283737182617, 'acdk c'),\n", " (-6.483003616333008, 'acdk '),\n", " (-6.52116060256958, 'acdk a'),\n", " (-6.526535511016846, 'acdk b'),\n", " (-6.570488452911377, 'a dk c'),\n", " (-6.573208332061768, 'a dk '),\n", " (-6.61136531829834, 'a dk a'),\n", " (-6.6167402267456055, 'a dk b'),\n", " (-6.630837440490723, 'acbc'),\n", " (-6.63310432434082, 'acb'),\n", " (-6.633557319641113, 'acb '),\n", " (-6.644730091094971, 'a bc'),\n", " (-6.647449970245361, 'a b '),\n", " (-6.650537490844727, 'a b'),\n", " (-6.667605400085449, \"acdk '\"),\n", " (-6.6717143058776855, 'acba'),\n", " (-6.685606956481934, 'a ba'),\n", " (-6.686768531799316, ' cdk c'),\n", " (-6.689488410949707, ' cdk '),\n", " (-6.709468364715576, 'a c')]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ctc_beam_search_decoder(probs_seq = np.array(probs_seq1), \n", " beam_size = beam_size,\n", " vocabulary = vocab_list)" ] }, { "cell_type": 
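"markdown", "metadata": {}, "source": [ "The greedy checks above can also be reproduced by hand. Below is a minimal pure-Python sketch of greedy CTC decoding, assuming the blank symbol is the extra last column (index `len(vocabulary)`), which is consistent with the 7-column rows used with the 6-symbol `vocab_list`. The name `greedy_ctc_sketch` and the toy inputs are hypothetical and are not part of `ctc_decoders`.

```python
def greedy_ctc_sketch(probs_seq, vocabulary):
    # Assumption: blank is the extra last column (index len(vocabulary)).
    blank = len(vocabulary)
    # 1. Pick the most probable index at every time step.
    best_path = [max(range(len(row)), key=row.__getitem__) for row in probs_seq]
    # 2. Collapse consecutive repeats and 3. drop blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            decoded.append(vocabulary[index])
        previous = index
    return ''.join(decoded)


# Toy input: 3 symbols plus one blank column, 5 time steps.
toy_vocab = ['a', 'b', ' ']
toy_probs = [
    [0.7, 0.1, 0.1, 0.1],    # 'a'
    [0.6, 0.2, 0.1, 0.1],    # 'a' again, collapsed into the previous one
    [0.1, 0.1, 0.1, 0.7],    # blank, dropped
    [0.8, 0.1, 0.05, 0.05],  # 'a', kept because a blank separates it
    [0.1, 0.7, 0.1, 0.1],    # 'b'
]
print(greedy_ctc_sketch(toy_probs, toy_vocab))
```

Beam search differs in that it keeps several candidate prefixes per time step instead of a single best path, which gives a language model scorer candidates to re-rank." ] }, { "cell_type": 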
"code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[(-4.989980220794678, \"b'a\"),\n", " (-5.298550128936768, \"b'dk a\"),\n", " (-5.3370184898376465, \"b' a\"),\n", " (-5.585845470428467, \"b'a'\"),\n", " (-5.652693271636963, \" 'a\"),\n", " (-5.7635698318481445, \"b'ab\"),\n", " (-5.788026332855225, \"b'ba\"),\n", " (-6.0385026931762695, 'bdk a'),\n", " (-6.132683753967285, \"b'ca\"),\n", " (-6.137714385986328, \" 'dk a\"),\n", " (-6.158307075500488, \" ' a\"),\n", " (-6.171831130981445, \"b'dk '\"),\n", " (-6.221673011779785, \"b' '\"),\n", " (-6.240574359893799, 'b a'),\n", " (-6.270209312438965, \"b'a \"),\n", " (-6.2848052978515625, \"b'dk ab\"),\n", " (-6.304642200469971, 'ba'),\n", " (-6.305397987365723, \"b' ab\"),\n", " (-6.426036834716797, \" 'ab\"),\n", " (-6.505356311798096, \"b'b\")]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ctc_beam_search_decoder(probs_seq = np.array(probs_seq2), \n", " beam_size = beam_size,\n", " vocabulary = vocab_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use pyctcdecode\n", "\n", "#### From PYPI\n", "\n", "```bash\n", "pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121\n", "```\n", "\n", "#### From source\n", "\n", "Check https://github.com/kensho-technologies/pyctcdecode how to build from source incase there is no available wheel for your operating system.\n", "\n", "Building from source should only take a few minutes." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import kenlm\n", "from pyctcdecode import build_ctcdecoder\n", "\n", "kenlm_model = kenlm.Model(lm)\n", "decoder = build_ctcdecoder(\n", " CTC_VOCAB,\n", " kenlm_model,\n", " alpha=0.5,\n", " beta=1.0,\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }