{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text CTC HuggingFace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretrained HuggingFace models finetuned on hyperlocal languages, https://huggingface.co/mesolitica" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-ctc-huggingface](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-ctc-huggingface).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline; read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
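To illustrate the general idea behind a pipeline, here is a minimal plain-Python sketch in which each step's output feeds the next step. It does not use or mirror the malaya-speech `Pipeline` API; `run_pipeline`, `normalize` and `frame` are hypothetical names chosen only for this example.

```python
from functools import reduce


def run_pipeline(steps, value):
    # Hypothetical helper: apply each step in order,
    # feeding the previous result into the next step.
    return reduce(lambda acc, step: step(acc), steps, value)


def normalize(samples):
    # Scale samples so the largest absolute value becomes 1.0
    # (an all-zero input is returned unchanged).
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


def frame(samples, size=2):
    # Split samples into fixed-size, non-overlapping frames.
    return [samples[i:i + size] for i in range(0, len(samples), size)]


frames = run_pipeline([normalize, frame], [0.5, -1.0, 0.25, 0.0])
print(frames)
```

Keeping each step a plain function makes it easy to check a step in isolation before chaining it with the others.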
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.\n" ] } ], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n", "INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n", "INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n" ] }, { "data": { "text/html": [ "
\n", "<div>\n", "<table>\n",
" <thead>\n",
" <tr>\n", " <th></th>\n", " <th>Size (MB)</th>\n", " <th>malay-malaya</th>\n", " <th>malay-fleur102</th>\n", " <th>singlish</th>\n", " <th>Language</th>\n", " </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n", " <th>mesolitica/wav2vec2-xls-r-300m-mixed</th>\n", " <td>1180</td>\n", " <td>{'WER': 0.194655128, 'CER': 0.04775798, 'WER-L...</td>\n", " <td>{'WER': 0.2373861259, 'CER': 0.07055478, 'WER-...</td>\n", " <td>{'WER': 0.127588595, 'CER': 0.0494924979, 'WER...</td>\n", " <td>[malay, singlish]</td>\n", " </tr>\n",
" <tr>\n", " <th>mesolitica/wav2vec2-xls-r-300m-mixed-v2</th>\n", " <td>1180</td>\n", " <td>{'WER': 0.154782923, 'CER': 0.035164031, 'WER-...</td>\n", " <td>{'WER': 0.2013994374, 'CER': 0.0518170369, 'WE...</td>\n", " <td>{'WER': 0.2258822139, 'CER': 0.082982312, 'WER...</td>\n", " <td>[malay, singlish]</td>\n", " </tr>\n",
" <tr>\n", " <th>mesolitica/wav2vec2-xls-r-300m-12layers-ms</th>\n", " <td>657</td>\n", " <td>{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...</td>\n", " <td>{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...</td>\n", " <td>NaN</td>\n", " <td>[malay]</td>\n", " </tr>\n",
" <tr>\n", " <th>mesolitica/wav2vec2-xls-r-300m-6layers-ms</th>\n", " <td>339</td>\n", " <td>{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...</td>\n", " <td>{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...</td>\n", " <td>NaN</td>\n", " <td>[malay]</td>\n", " </tr>\n",
" <tr>\n", " <th>mesolitica/wav2vec2-xls-r-300m-3layers-ms</th>\n", " <td>195</td>\n", " <td>{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...</td>\n", " <td>{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...</td>\n", " <td>NaN</td>\n", " <td>[malay]</td>\n", " </tr>\n",
" </tbody>\n",
"</table>\n", "</div>\n", "
" ], "text/plain": [ " Size (MB) \\\n", "mesolitica/wav2vec2-xls-r-300m-mixed 1180 \n", "mesolitica/wav2vec2-xls-r-300m-mixed-v2 1180 \n", "mesolitica/wav2vec2-xls-r-300m-12layers-ms 657 \n", "mesolitica/wav2vec2-xls-r-300m-6layers-ms 339 \n", "mesolitica/wav2vec2-xls-r-300m-3layers-ms 195 \n", "\n", " malay-malaya \\\n", "mesolitica/wav2vec2-xls-r-300m-mixed {'WER': 0.194655128, 'CER': 0.04775798, 'WER-L... \n", "mesolitica/wav2vec2-xls-r-300m-mixed-v2 {'WER': 0.154782923, 'CER': 0.035164031, 'WER-... \n", "mesolitica/wav2vec2-xls-r-300m-12layers-ms {'WER': 0.1494983789, 'CER': 0.0342059992, 'WE... \n", "mesolitica/wav2vec2-xls-r-300m-6layers-ms {'WER': 0.1494983789, 'CER': 0.0342059992, 'WE... \n", "mesolitica/wav2vec2-xls-r-300m-3layers-ms {'WER': 0.1494983789, 'CER': 0.0342059992, 'WE... \n", "\n", " malay-fleur102 \\\n", "mesolitica/wav2vec2-xls-r-300m-mixed {'WER': 0.2373861259, 'CER': 0.07055478, 'WER-... \n", "mesolitica/wav2vec2-xls-r-300m-mixed-v2 {'WER': 0.2013994374, 'CER': 0.0518170369, 'WE... \n", "mesolitica/wav2vec2-xls-r-300m-12layers-ms {'WER': 0.217107489, 'CER': 0.0546614199, 'WER... \n", "mesolitica/wav2vec2-xls-r-300m-6layers-ms {'WER': 0.217107489, 'CER': 0.0546614199, 'WER... \n", "mesolitica/wav2vec2-xls-r-300m-3layers-ms {'WER': 0.217107489, 'CER': 0.0546614199, 'WER... \n", "\n", " singlish \\\n", "mesolitica/wav2vec2-xls-r-300m-mixed {'WER': 0.127588595, 'CER': 0.0494924979, 'WER... \n", "mesolitica/wav2vec2-xls-r-300m-mixed-v2 {'WER': 0.2258822139, 'CER': 0.082982312, 'WER... 
\n", "mesolitica/wav2vec2-xls-r-300m-12layers-ms NaN \n", "mesolitica/wav2vec2-xls-r-300m-6layers-ms NaN \n", "mesolitica/wav2vec2-xls-r-300m-3layers-ms NaN \n", "\n", " Language \n", "mesolitica/wav2vec2-xls-r-300m-mixed [malay, singlish] \n", "mesolitica/wav2vec2-xls-r-300m-mixed-v2 [malay, singlish] \n", "mesolitica/wav2vec2-xls-r-300m-12layers-ms [malay] \n", "mesolitica/wav2vec2-xls-r-300m-6layers-ms [malay] \n", "mesolitica/wav2vec2-xls-r-300m-3layers-ms [malay] " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.ctc.available_huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load HuggingFace model\n", "\n", "```python\n", "def huggingface(\n", " model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed',\n", " force_check: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Load Finetuned models from HuggingFace.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')\n", " Check available models at `malaya_speech.stt.ctc.available_huggingface()`.\n", " force_check: bool, optional (default=True)\n", " Force check model one of malaya model.\n", " Set to False if you have your own huggingface model.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.torch_model.huggingface.CTC class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.stt.ctc.huggingface(model = 'mesolitica/wav2vec2-xls-r-300m-mixed')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n", "record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n", "record2, sr = 
malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')\n", "singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')\n", "singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')\n", "singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')\n", "mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')\n", "mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')\n", "mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(ceramah, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can hear, the speaker uses the Kedahan dialect mixed with some Arabic words; let's see how well our model handles it." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record1, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using greedy decoder\n", "\n", "```python\n", "def greedy_decoder(self, inputs):\n", " \"\"\"\n", " Transcribe inputs using greedy decoder.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 30s, sys: 35.7 s, total: 2min 6s\n", "Wall time: 11.6 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin 
jabal tadi ini allah maaini',\n", " 'hello nama saya husin saya tak beskemandi ketiap saya masam',\n", " 'hello nama saya hussein saya sukomandi saya mandi diatia hari',\n", " 'and then see how they roll it in film okay actually',\n", " 'atat to your eyes',\n", " 'sa versa in bal',\n", " 'gei wo lai ge zhang jie zui xin de ge',\n", " 'wo xiang shou kan zhiang shu ying shi pin dao de jie mu',\n", " 'qiu yi shou ge de ming zhe ge ci li you zhuan sheng yi meng wang si qin shi xiu']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.greedy_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using beam decoder\n", "\n", "The model has no native `beam_decoder`, so we pass the output of `predict_logits` to `ctc_decoders` instead:\n", "\n", "```python\n", "def predict_logits(self, inputs, norm_func=softmax):\n", " \"\"\"\n", " Predict logits from inputs.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", " norm_func: Callable, optional (default=malaya.utils.activation.softmax)\n", "\n", "\n", " Returns\n", " -------\n", " result: List[np.array]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from ctc_decoders import ctc_beam_search_decoder\n", "from malaya_speech.utils.char import HF_CTC_VOCAB" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 34s, sys: 27.7 s, total: 2min 2s\n", "Wall time: 10.9 s\n" ] } ], "source": [ "%%time\n", "\n", "logits = model.predict_logits([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "code", 
"execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((9, 499, 40), 39)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logits.shape, len(HF_CTC_VOCAB)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ini allah maa ini\n", "1 hello nama saya husin saya tak beskemandi ketiap saya masam\n", "2 hello nama saya hussein saya sukomandi saya mandi diatia hari\n", "3 and then see how they roll it in film okay actually\n", "4 an tat to your eyes\n", "5 sa versa in bal\n", "6 gei wo lai ge zhang jie zui xin de ge\n", "7 wo xiang shou kan ziang shu ying shi pin dao de jie mu\n", "8 qiu yi shou ge de ming zhe ge ci li you zhuan sheng yi meng wang si qin shi xiu\n" ] } ], "source": [ "for no, l in enumerate(logits):\n", " o = ctc_beam_search_decoder(l, HF_CTC_VOCAB, 20)[0][1]\n", " print(no, o)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }