{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text HuggingFace + CTC Decoders" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finetuned hyperlocal languages on pretrained HuggingFace models + CTC Decoders with KenLM, https://huggingface.co/mesolitica" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-huggingface-ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-huggingface-ctc-decoders).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it not save to use on different languages. Pretrained models trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Required Tensorflow >= 2.0 due to group convolution is not available for Tensorflow 1.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install ctc-decoders\n", "\n", "#### From PYPI\n", "\n", "```bash\n", "pip3 install ctc-decoders\n", "```\n", "\n", "But if you use linux, we unable to upload linux wheels to pypi repository, so download linux wheel at [malaya-speech/ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders#available-whl).\n", "\n", "#### From source\n", "\n", "Check [malaya-speech/ctc-decoders](https://github.com/huseinzol05/malaya-speech/tree/master/ctc-decoders#from-source) how to build from source incase there is no available wheel for your operating system.\n", "\n", "Building from source should only take a few minutes.\n", "\n", "#### Benefit\n", "\n", "1. ctc-decoders faster than pyctcdecode, ~26x faster based on husein benchmark, but very slightly less accurate than pyctcdecode." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace model" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CERCER-LMLanguageSize (MB)WERWER-LM
mesolitica/wav2vec2-xls-r-300m-mixed0.0481050.041196[malay, singlish, mandarin]11800.132220.098802
\n", "
" ], "text/plain": [ " CER CER-LM \\\n", "mesolitica/wav2vec2-xls-r-300m-mixed 0.048105 0.041196 \n", "\n", " Language Size (MB) \\\n", "mesolitica/wav2vec2-xls-r-300m-mixed [malay, singlish, mandarin] 1180 \n", "\n", " WER WER-LM \n", "mesolitica/wav2vec2-xls-r-300m-mixed 0.13222 0.098802 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.available_huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load HuggingFace model\n", "\n", "```python\n", "def huggingface(model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed', **kwargs):\n", " \"\"\"\n", " Load Finetuned models from HuggingFace. Required Tensorflow >= 2.0.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'mesolitica/wav2vec2-xls-r-300m-mixed'`` - wav2vec2 XLS-R 300M finetuned on (Malay + Singlish + Mandarin) languages.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.huggingface.CTC class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.stt.huggingface(model = 'mesolitica/wav2vec2-xls-r-300m-mixed')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n", "record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n", "record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')\n", "singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')\n", "singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')\n", "singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')\n", "mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')\n", "mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')\n", "mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict logits\n", "\n", "```python\n", "def predict_logits(self, inputs, norm_func=softmax):\n", " \"\"\"\n", " Predict logits from inputs.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", " norm_func: Callable, optional (default=malaya.utils.activation.softmax)\n", "\n", "\n", " Returns\n", " -------\n", " result: List[np.array]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 36 s, sys: 19.7 s, total: 55.7 s\n", "Wall time: 10.6 s\n" ] } ], "source": [ "%%time\n", "\n", "logits = model.predict_logits([ceramah, record1, record2])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3, 499, 40)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logits.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load ctc-decoders\n", "\n", "I will use `dump-combined` for this example." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from ctc_decoders import Scorer\n", "from ctc_decoders import ctc_beam_search_decoder\n", "from malaya_speech.utils.char import HF_CTC_VOCAB" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "033c3f9b86204a31b30acc7c166d9e6c", "version_major": 2, "version_minor": 0 }, "text/plain": [ "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=325929726.0, style=ProgressStyle(descri…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "lm = malaya_speech.language_model.kenlm(model = 'dump-combined')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "scorer = Scorer(0.5, 1.0, lm, HF_CTC_VOCAB)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ma ini'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "o = ctc_beam_search_decoder(logits[0], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]\n", "o" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'hello nama saya husin saya tak skema ke tiap saya masam'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "o = ctc_beam_search_decoder(logits[1], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]\n", "o" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'hello nama saya hussein saya sekoman saya mandi dia tiap hari'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "o = ctc_beam_search_decoder(logits[2], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]\n", "o" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }