{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Speech-to-Text CTC HuggingFace + CTC Decoders"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finetuned hyperlocal languages on pretrained HuggingFace models + CTC Decoders with KenLM, https://huggingface.co/mesolitica"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-ctc-huggingface-ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-ctc-huggingface-ctc-decoders).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "This module is not language independent, so it not save to use on different languages. Pretrained models trained on hyperlocal languages.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.\n"
     ]
    }
   ],
   "source": [
    "import malaya_speech\n",
    "import numpy as np\n",
    "from malaya_speech import Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "logging.basicConfig(level=logging.INFO)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install ctc-decoders\n",
    "\n",
    "#### From PYPI\n",
    "\n",
    "```bash\n",
    "pip3 install ctc-decoders\n",
    "```\n",
    "\n",
    "But if you use linux, we unable to upload linux wheels to pypi repository, so download linux wheel at [malaya-speech/ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders#available-whl).\n",
    "\n",
    "#### From source\n",
    "\n",
    "Check [malaya-speech/ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders#from-source) how to build from source incase there is no available wheel for your operating system.\n",
    "\n",
    "Building from source should only take a few minutes.\n",
    "\n",
    "#### Benefit\n",
    "\n",
    "1. ctc-decoders faster than pyctcdecode, ~26x faster based on husein benchmark, but very slightly less accurate than pyctcdecode."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List available HuggingFace model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n",
      "INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n",
      "INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Size (MB)</th>\n",
       "      <th>malay-malaya</th>\n",
       "      <th>malay-fleur102</th>\n",
       "      <th>singlish</th>\n",
       "      <th>Language</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>mesolitica/wav2vec2-xls-r-300m-mixed</th>\n",
       "      <td>1180</td>\n",
       "      <td>{'WER': 0.194655128, 'CER': 0.04775798, 'WER-L...</td>\n",
       "      <td>{'WER': 0.2373861259, 'CER': 0.07055478, 'WER-...</td>\n",
       "      <td>{'WER': 0.127588595, 'CER': 0.0494924979, 'WER...</td>\n",
       "      <td>[malay, singlish]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mesolitica/wav2vec2-xls-r-300m-mixed-v2</th>\n",
       "      <td>1180</td>\n",
       "      <td>{'WER': 0.154782923, 'CER': 0.035164031, 'WER-...</td>\n",
       "      <td>{'WER': 0.2013994374, 'CER': 0.0518170369, 'WE...</td>\n",
       "      <td>{'WER': 0.2258822139, 'CER': 0.082982312, 'WER...</td>\n",
       "      <td>[malay, singlish]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mesolitica/wav2vec2-xls-r-300m-12layers-ms</th>\n",
       "      <td>657</td>\n",
       "      <td>{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...</td>\n",
       "      <td>{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[malay]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mesolitica/wav2vec2-xls-r-300m-6layers-ms</th>\n",
       "      <td>339</td>\n",
       "      <td>{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...</td>\n",
       "      <td>{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[malay]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mesolitica/wav2vec2-xls-r-300m-3layers-ms</th>\n",
       "      <td>195</td>\n",
       "      <td>{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...</td>\n",
       "      <td>{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[malay]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           Size (MB)  \\\n",
       "mesolitica/wav2vec2-xls-r-300m-mixed            1180   \n",
       "mesolitica/wav2vec2-xls-r-300m-mixed-v2         1180   \n",
       "mesolitica/wav2vec2-xls-r-300m-12layers-ms       657   \n",
       "mesolitica/wav2vec2-xls-r-300m-6layers-ms        339   \n",
       "mesolitica/wav2vec2-xls-r-300m-3layers-ms        195   \n",
       "\n",
       "                                                                                 malay-malaya  \\\n",
       "mesolitica/wav2vec2-xls-r-300m-mixed        {'WER': 0.194655128, 'CER': 0.04775798, 'WER-L...   \n",
       "mesolitica/wav2vec2-xls-r-300m-mixed-v2     {'WER': 0.154782923, 'CER': 0.035164031, 'WER-...   \n",
       "mesolitica/wav2vec2-xls-r-300m-12layers-ms  {'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...   \n",
       "mesolitica/wav2vec2-xls-r-300m-6layers-ms   {'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...   \n",
       "mesolitica/wav2vec2-xls-r-300m-3layers-ms   {'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...   \n",
       "\n",
       "                                                                               malay-fleur102  \\\n",
       "mesolitica/wav2vec2-xls-r-300m-mixed        {'WER': 0.2373861259, 'CER': 0.07055478, 'WER-...   \n",
       "mesolitica/wav2vec2-xls-r-300m-mixed-v2     {'WER': 0.2013994374, 'CER': 0.0518170369, 'WE...   \n",
       "mesolitica/wav2vec2-xls-r-300m-12layers-ms  {'WER': 0.217107489, 'CER': 0.0546614199, 'WER...   \n",
       "mesolitica/wav2vec2-xls-r-300m-6layers-ms   {'WER': 0.217107489, 'CER': 0.0546614199, 'WER...   \n",
       "mesolitica/wav2vec2-xls-r-300m-3layers-ms   {'WER': 0.217107489, 'CER': 0.0546614199, 'WER...   \n",
       "\n",
       "                                                                                     singlish  \\\n",
       "mesolitica/wav2vec2-xls-r-300m-mixed        {'WER': 0.127588595, 'CER': 0.0494924979, 'WER...   \n",
       "mesolitica/wav2vec2-xls-r-300m-mixed-v2     {'WER': 0.2258822139, 'CER': 0.082982312, 'WER...   \n",
       "mesolitica/wav2vec2-xls-r-300m-12layers-ms                                                NaN   \n",
       "mesolitica/wav2vec2-xls-r-300m-6layers-ms                                                 NaN   \n",
       "mesolitica/wav2vec2-xls-r-300m-3layers-ms                                                 NaN   \n",
       "\n",
       "                                                     Language  \n",
       "mesolitica/wav2vec2-xls-r-300m-mixed        [malay, singlish]  \n",
       "mesolitica/wav2vec2-xls-r-300m-mixed-v2     [malay, singlish]  \n",
       "mesolitica/wav2vec2-xls-r-300m-12layers-ms            [malay]  \n",
       "mesolitica/wav2vec2-xls-r-300m-6layers-ms             [malay]  \n",
       "mesolitica/wav2vec2-xls-r-300m-3layers-ms             [malay]  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya_speech.stt.ctc.available_huggingface()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load HuggingFace model\n",
    "\n",
    "```python\n",
    "def huggingface(\n",
    "    model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed',\n",
    "    force_check: bool = True,\n",
    "    **kwargs,\n",
    "):\n",
    "    \"\"\"\n",
    "    Load Finetuned models from HuggingFace.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    model : str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')\n",
    "        Check available models at `malaya_speech.stt.ctc.available_huggingface()`.\n",
    "    force_check: bool, optional (default=True)\n",
    "        Force check model one of malaya model.\n",
    "        Set to False if you have your own huggingface model.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result : malaya_speech.torch_model.huggingface.CTC class\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = malaya_speech.stt.ctc.huggingface(model = 'mesolitica/wav2vec2-xls-r-300m-mixed')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n",
    "record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n",
    "record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')\n",
    "singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')\n",
    "singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')\n",
    "singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')\n",
    "mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')\n",
    "mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')\n",
    "mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict logits\n",
    "\n",
    "```python\n",
    "def predict_logits(self, inputs, norm_func=softmax):\n",
    "    \"\"\"\n",
    "    Predict logits from inputs.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    input: List[np.array]\n",
    "        List[np.array] or List[malaya_speech.model.frame.Frame].\n",
    "    norm_func: Callable, optional (default=malaya.utils.activation.softmax)\n",
    "\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[np.array]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 36.4 s, sys: 3.22 s, total: 39.6 s\n",
      "Wall time: 3.67 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "logits = model.predict_logits([ceramah, record1, record2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3, 499, 40)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "logits.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load ctc-decoders\n",
    "\n",
    "I will use `dump-combined` for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ctc_decoders import Scorer\n",
    "from ctc_decoders import ctc_beam_search_decoder\n",
    "from malaya_speech.utils.char import HF_CTC_VOCAB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "lm = malaya_speech.language_model.kenlm(model = 'dump-combined')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "scorer = Scorer(0.5, 1.0, lm, HF_CTC_VOCAB)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ma ini'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "o = ctc_beam_search_decoder(logits[0], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]\n",
    "o"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'hello nama saya husin saya tak skema ke tiap saya masam'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "o = ctc_beam_search_decoder(logits[1], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]\n",
    "o"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'hello nama saya hussein saya sekoman saya mandi dia tiap hari'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "o = ctc_beam_search_decoder(logits[2], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]\n",
    "o"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}