{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Language Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [malaya-speech/example/ctc-language-model](https://github.com/huseinzol05/malaya-speech/tree/master/example/ctc-language-model).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Purpose"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When doing CTC or RNN-T beam decoding, we want to add a language model bias while searching for the optimum alignment."
   ]
  },
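  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of what this bias looks like, decoders in the DeepSpeech 2 family commonly rank each candidate transcript $y$ for audio $x$ with a shallow-fusion score. The symbols below are illustrative assumptions, not taken from the malaya-speech source:\n",
    "\n",
    "$$\\mathrm{score}(y) = \\log P_{\\mathrm{CTC}}(y \\mid x) + \\alpha \\log P_{\\mathrm{LM}}(y) + \\beta \\, \\mathrm{wc}(y)$$\n",
    "\n",
    "Here $\\alpha$ weights the language model probability and $\\beta$ weights the word count $\\mathrm{wc}(y)$, so $\\alpha = 0$ disables the language model and $\\beta = 0$ disables the word-count term."
   ]
  },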
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List available Language Models\n",
    "\n",
    "We provide language models for our ASR models:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import malaya_speech"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Size (MB)</th>\n",
       "      <th>LM order</th>\n",
       "      <th>Description</th>\n",
       "      <th>Command</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>bahasa</th>\n",
       "      <td>17</td>\n",
       "      <td>3</td>\n",
       "      <td>Gathered from malaya-speech ASR bahasa transcript</td>\n",
       "      <td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bahasa-news</th>\n",
       "      <td>24</td>\n",
       "      <td>3</td>\n",
       "      <td>Gathered from malaya-speech bahasa ASR transcr...</td>\n",
       "      <td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bahasa-combined</th>\n",
       "      <td>29</td>\n",
       "      <td>3</td>\n",
       "      <td>Gathered from malaya-speech ASR bahasa transcr...</td>\n",
       "      <td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>redape-community</th>\n",
       "      <td>887.1</td>\n",
       "      <td>4</td>\n",
       "      <td>Mirror for https://github.com/redapesolutions/...</td>\n",
       "      <td>[./lmplz --text text.txt --arpa out.arpa -o 4 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dump-combined</th>\n",
       "      <td>310</td>\n",
       "      <td>3</td>\n",
       "      <td>Academia + News + IIUM + Parliament + Watpadd ...</td>\n",
       "      <td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>manglish</th>\n",
       "      <td>202</td>\n",
       "      <td>3</td>\n",
       "      <td>Manglish News + Manglish Reddit + Manglish for...</td>\n",
       "      <td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bahasa-manglish-combined</th>\n",
       "      <td>608</td>\n",
       "      <td>3</td>\n",
       "      <td>Combined `dump-combined` and `manglish`.</td>\n",
       "      <td>[./lmplz --text text.txt --arpa out.arpa -o 3 ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                         Size (MB) LM order  \\\n",
       "bahasa                          17        3   \n",
       "bahasa-news                     24        3   \n",
       "bahasa-combined                 29        3   \n",
       "redape-community             887.1        4   \n",
       "dump-combined                  310        3   \n",
       "manglish                       202        3   \n",
       "bahasa-manglish-combined       608        3   \n",
       "\n",
       "                                                                Description  \\\n",
       "bahasa                    Gathered from malaya-speech ASR bahasa transcript   \n",
       "bahasa-news               Gathered from malaya-speech bahasa ASR transcr...   \n",
       "bahasa-combined           Gathered from malaya-speech ASR bahasa transcr...   \n",
       "redape-community          Mirror for https://github.com/redapesolutions/...   \n",
       "dump-combined             Academia + News + IIUM + Parliament + Watpadd ...   \n",
       "manglish                  Manglish News + Manglish Reddit + Manglish for...   \n",
       "bahasa-manglish-combined           Combined `dump-combined` and `manglish`.   \n",
       "\n",
       "                                                                    Command  \n",
       "bahasa                    [./lmplz --text text.txt --arpa out.arpa -o 3 ...  \n",
       "bahasa-news               [./lmplz --text text.txt --arpa out.arpa -o 3 ...  \n",
       "bahasa-combined           [./lmplz --text text.txt --arpa out.arpa -o 3 ...  \n",
       "redape-community          [./lmplz --text text.txt --arpa out.arpa -o 4 ...  \n",
       "dump-combined             [./lmplz --text text.txt --arpa out.arpa -o 3 ...  \n",
       "manglish                  [./lmplz --text text.txt --arpa out.arpa -o 3 ...  \n",
       "bahasa-manglish-combined  [./lmplz --text text.txt --arpa out.arpa -o 3 ...  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya_speech.stt.available_language_model()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `redape-community` model comes from https://github.com/redapesolutions/suara-kami-community, another good Malay speech-to-text repository."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Language Model\n",
    "\n",
    "```python\n",
    "def language_model(\n",
    "    model: str = 'dump-combined', **kwargs\n",
    "):\n",
    "    \"\"\"\n",
    "    Load KenLM language model.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    model : str, optional (default='dump-combined')\n",
    "        Model architecture supported. Allowed values:\n",
    "\n",
    "        * ``'bahasa'`` - Gathered from malaya-speech ASR bahasa transcript.\n",
    "        * ``'bahasa-news'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).\n",
    "        * ``'bahasa-combined'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).\n",
    "        * ``'redape-community'`` - Mirror for https://github.com/redapesolutions/suara-kami-community\n",
    "        * ``'dump-combined'`` - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.\n",
    "        * ``'manglish'`` - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.\n",
    "        * ``'bahasa-manglish-combined'`` - Combined `dump-combined` and `manglish`.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result : str\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'/Users/huseinzolkepli/Malaya-Speech/language-model/bahasa/model.trie.klm'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm = malaya_speech.stt.language_model(model = 'bahasa')\n",
    "lm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build custom Language Model\n",
    "\n",
    "1. Build KenLM:\n",
    "\n",
    "```bash\n",
    "wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz\n",
    "mkdir kenlm/build\n",
    "cd kenlm/build\n",
    "cmake ..\n",
    "make -j2\n",
    "```\n",
    "\n",
    "2. Prepare a newline-delimited text file (one sentence per line). Feel free to use some text from https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping. Then, from the directory that contains `kenlm`, train the n-gram model with `lmplz` and convert it to a binary trie with `build_binary`:\n",
    "\n",
    "```bash\n",
    "kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1\n",
    "kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm\n",
    "```\n",
    "\n",
    "3. Once you have `out.trie.klm`, you can load it into the scorer interface:\n",
    "\n",
    "```python\n",
    "from ctc_decoders import Scorer\n",
    "from malaya_speech.utils.char import CTC_VOCAB\n",
    "\n",
    "# alpha weights the language model, beta weights the word count.\n",
    "alpha, beta = 0.5, 1.0\n",
    "scorer = Scorer(alpha, beta, 'out.trie.klm', CTC_VOCAB)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use ctc-decoders\n",
    "\n",
    "#### From PyPI\n",
    "\n",
    "```bash\n",
    "pip3 install ctc-decoders\n",
    "```\n",
    "\n",
    "We are unable to upload Linux wheels to the PyPI repository, so on Linux download the wheel from [malaya-speech/ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders#available-whl).\n",
    "\n",
    "#### From source\n",
    "\n",
    "See [malaya-speech/ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders#from-source) for how to build from source in case there is no wheel available for your operating system.\n",
    "\n",
    "Building from source should only take a few minutes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load ctc-decoders"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ctc_decoders import Scorer\n",
    "from malaya_speech.utils.char import CTC_VOCAB"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```text\n",
    "Init signature: Scorer(alpha, beta, model_path, vocabulary)\n",
    "Docstring:     \n",
    "Wrapper for Scorer.\n",
    "\n",
    ":param alpha: Parameter associated with language model. Don't use\n",
    "              language model when alpha = 0.\n",
    ":type alpha: float\n",
    ":param beta: Parameter associated with word count. Don't use word\n",
    "             count when beta = 0.\n",
    ":type beta: float\n",
    ":model_path: Path to load language model.\n",
    ":type model_path: basestring\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<ctc_decoders.Scorer; proxy of <Swig Object of type 'Scorer *' at 0x14ffe3c00> >"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)\n",
    "scorer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ctc_decoders import ctc_greedy_decoder, ctc_beam_search_decoder\n",
    "import numpy as np\n",
    "import malaya_speech"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://github.com/PaddlePaddle/DeepSpeech/blob/master/decoders/tests/test_decoders.py\n",
    "\n",
    "vocab_list = [\"\\'\", ' ', 'a', 'b', 'c', 'dk ']\n",
    "beam_size = 20\n",
    "probs_seq1 = [[\n",
    "    0.06390443, 0.21124858, 0.27323887, 0.06870235, 0.0361254,\n",
    "    0.18184413, 0.16493624\n",
    "], [\n",
    "    0.03309247, 0.22866108, 0.24390638, 0.09699597, 0.31895462,\n",
    "    0.0094893, 0.06890021\n",
    "], [\n",
    "    0.218104, 0.19992557, 0.18245131, 0.08503348, 0.14903535,\n",
    "    0.08424043, 0.08120984\n",
    "], [\n",
    "    0.12094152, 0.19162472, 0.01473646, 0.28045061, 0.24246305,\n",
    "    0.05206269, 0.09772094\n",
    "], [\n",
    "    0.1333387, 0.00550838, 0.00301669, 0.21745861, 0.20803985,\n",
    "    0.41317442, 0.01946335\n",
    "], [\n",
    "    0.16468227, 0.1980699, 0.1906545, 0.18963251, 0.19860937,\n",
    "    0.04377724, 0.01457421\n",
    "]]\n",
    "probs_seq2 = [[\n",
    "    0.08034842, 0.22671944, 0.05799633, 0.36814645, 0.11307441,\n",
    "    0.04468023, 0.10903471\n",
    "], [\n",
    "    0.09742457, 0.12959763, 0.09435383, 0.21889204, 0.15113123,\n",
    "    0.10219457, 0.20640612\n",
    "], [\n",
    "    0.45033529, 0.09091417, 0.15333208, 0.07939558, 0.08649316,\n",
    "    0.12298585, 0.01654384\n",
    "], [\n",
    "    0.02512238, 0.22079203, 0.19664364, 0.11906379, 0.07816055,\n",
    "    0.22538587, 0.13483174\n",
    "], [\n",
    "    0.17928453, 0.06065261, 0.41153005, 0.1172041, 0.11880313,\n",
    "    0.07113197, 0.04139363\n",
    "], [\n",
    "    0.15882358, 0.1235788, 0.23376776, 0.20510435, 0.00279306,\n",
    "    0.05294827, 0.22298418\n",
    "]]\n",
    "greedy_result = [\"ac'bdk c\", \"b'dk a\"]\n",
    "beam_search_result = ['acdk c', \"b'a\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ctc_greedy_decoder(np.array(probs_seq1), vocab_list) == greedy_result[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ctc_greedy_decoder(np.array(probs_seq2), vocab_list) == greedy_result[1]"
   ]
  },
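  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, greedy (best-path) CTC decoding can be sketched in plain NumPy: take the most probable label in every frame, merge consecutive repeats, then drop blanks. This is a minimal sketch and not the `ctc_decoders` implementation; `greedy_ctc_decode` and `toy_probs` are hypothetical names, and the code assumes the blank label is the extra last column (index `len(vocabulary)`).\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def greedy_ctc_decode(probs_seq, vocabulary):\n",
    "    # Assumption: the blank label is the extra last column of probs_seq.\n",
    "    blank = len(vocabulary)\n",
    "    # Most probable label index for every time frame.\n",
    "    best_path = np.argmax(np.asarray(probs_seq), axis=1)\n",
    "    decoded = []\n",
    "    previous = None\n",
    "    for index in best_path:\n",
    "        # Emit a label only if it differs from the previous frame and is not blank.\n",
    "        if index != previous and index != blank:\n",
    "            decoded.append(vocabulary[index])\n",
    "        previous = index\n",
    "    return ''.join(decoded)\n",
    "\n",
    "\n",
    "# Four frames over the labels ['a', 'b'] plus a trailing blank column.\n",
    "toy_probs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8], [0.2, 0.7, 0.1]]\n",
    "print(greedy_ctc_decode(toy_probs, ['a', 'b']))  # prints: ab\n",
    "```\n",
    "\n",
    "The two leading `a` frames collapse into one label, while a blank frame between two identical labels keeps them separate."
   ]
  },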
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(-6.480283737182617, 'acdk c'),\n",
       " (-6.483003616333008, 'acdk  '),\n",
       " (-6.52116060256958, 'acdk a'),\n",
       " (-6.526535511016846, 'acdk b'),\n",
       " (-6.570488452911377, 'a dk c'),\n",
       " (-6.573208332061768, 'a dk  '),\n",
       " (-6.61136531829834, 'a dk a'),\n",
       " (-6.6167402267456055, 'a dk b'),\n",
       " (-6.630837440490723, 'acbc'),\n",
       " (-6.63310432434082, 'acb'),\n",
       " (-6.633557319641113, 'acb '),\n",
       " (-6.644730091094971, 'a bc'),\n",
       " (-6.647449970245361, 'a b '),\n",
       " (-6.650537490844727, 'a b'),\n",
       " (-6.667605400085449, \"acdk '\"),\n",
       " (-6.6717143058776855, 'acba'),\n",
       " (-6.685606956481934, 'a ba'),\n",
       " (-6.686768531799316, ' cdk c'),\n",
       " (-6.689488410949707, ' cdk  '),\n",
       " (-6.709468364715576, 'a c')]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ctc_beam_search_decoder(probs_seq = np.array(probs_seq1), \n",
    "                        beam_size = beam_size,\n",
    "                        vocabulary = vocab_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(-4.989980220794678, \"b'a\"),\n",
       " (-5.298550128936768, \"b'dk a\"),\n",
       " (-5.3370184898376465, \"b' a\"),\n",
       " (-5.585845470428467, \"b'a'\"),\n",
       " (-5.652693271636963, \" 'a\"),\n",
       " (-5.7635698318481445, \"b'ab\"),\n",
       " (-5.788026332855225, \"b'ba\"),\n",
       " (-6.0385026931762695, 'bdk a'),\n",
       " (-6.132683753967285, \"b'ca\"),\n",
       " (-6.137714385986328, \" 'dk a\"),\n",
       " (-6.158307075500488, \" ' a\"),\n",
       " (-6.171831130981445, \"b'dk '\"),\n",
       " (-6.221673011779785, \"b' '\"),\n",
       " (-6.240574359893799, 'b a'),\n",
       " (-6.270209312438965, \"b'a \"),\n",
       " (-6.2848052978515625, \"b'dk ab\"),\n",
       " (-6.304642200469971, 'ba'),\n",
       " (-6.305397987365723, \"b' ab\"),\n",
       " (-6.426036834716797, \" 'ab\"),\n",
       " (-6.505356311798096, \"b'b\")]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ctc_beam_search_decoder(probs_seq = np.array(probs_seq2), \n",
    "                        beam_size = beam_size,\n",
    "                        vocabulary = vocab_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use pyctcdecode\n",
    "\n",
    "#### From PyPI\n",
    "\n",
    "```bash\n",
    "pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121\n",
    "```\n",
    "\n",
    "#### From source\n",
    "\n",
    "See https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no wheel available for your operating system.\n",
    "\n",
    "Building from source should only take a few minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "import kenlm\n",
    "from pyctcdecode import build_ctcdecoder\n",
    "\n",
    "kenlm_model = kenlm.Model(lm)\n",
    "decoder = build_ctcdecoder(\n",
    "    CTC_VOCAB,\n",
    "    kenlm_model,\n",
    "    alpha=0.5,\n",
    "    beta=1.0,\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}