{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text HuggingFace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretrained HuggingFace models finetuned on hyperlocal languages, https://huggingface.co/mesolitica" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-huggingface](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-huggingface).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of malaya-speech Pipeline; read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Requires TensorFlow >= 2.0, because group convolution is not available in TensorFlow 1.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace model" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>CER</th><th>CER-LM</th><th>Language</th><th>Size (MB)</th><th>WER</th><th>WER-LM</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>mesolitica/wav2vec2-xls-r-300m-mixed</th><td>0.048105</td><td>0.041196</td><td>[malay, singlish, mandarin]</td><td>1180</td><td>0.13222</td><td>0.098802</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " CER CER-LM \\\n", "mesolitica/wav2vec2-xls-r-300m-mixed 0.048105 0.041196 \n", "\n", " Language Size (MB) \\\n", "mesolitica/wav2vec2-xls-r-300m-mixed [malay, singlish, mandarin] 1180 \n", "\n", " WER WER-LM \n", "mesolitica/wav2vec2-xls-r-300m-mixed 0.13222 0.098802 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.available_huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load HuggingFace model\n", "\n", "```python\n", "def huggingface(model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed', **kwargs):\n", " \"\"\"\n", " Load Finetuned models from HuggingFace. Required Tensorflow >= 2.0.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'mesolitica/wav2vec2-xls-r-300m-mixed'`` - wav2vec2 XLS-R 300M finetuned on (Malay + Singlish + Mandarin) languages.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.huggingface.CTC class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.stt.huggingface(model = 'mesolitica/wav2vec2-xls-r-300m-mixed')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')\n", "record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')\n", "record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')\n", "singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')\n", "singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')\n", "singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')\n", "mandarin0, sr = 
malaya_speech.load('speech/mandarin/597.wav')\n", "mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')\n", "mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(ceramah, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can hear, the speaker uses the Kedahan dialect along with some Arabic words; let's see how well our model handles it." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(record1, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using greedy decoder\n", "\n", "```python\n", "def greedy_decoder(self, inputs):\n", " \"\"\"\n", " Transcribe inputs using greedy decoder.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", "\n", " Returns\n", " -------\n", " result: List[str]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 52s, sys: 58.6 s, total: 2min 51s\n", "Wall time: 33.7 s\n" ] }, { "data": { "text/plain": [ "['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ini allah maaini',\n", " 'hello nama saya husin saya tak beskemandi ketiap saya masam',\n", " 'hello nama saya hussein saya sukomandi saya mandi diatia hari',\n", " 'and then see how they roll it in film okay actually',\n", " 'atat to your eyes',\n", " 'sa versa in bal',\n", " 'gei wo lai ge zhang jie 
zui xin de ge',\n", " 'wo xiang shou kan zhiang shu ying shi pin dao de jie mu',\n", " 'qiu yi shou ge de ming zhe ge ci li you zhuan sheng yi meng wang si qin shi xiu']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "model.greedy_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict using beam decoder\n", "\n", "The model does not provide a `beam_decoder` natively, so we run `ctc_decoders` on the output of `predict_logits`:\n", "\n", "```python\n", "def predict_logits(self, inputs, norm_func=softmax):\n", " \"\"\"\n", " Predict logits from inputs.\n", "\n", " Parameters\n", " ----------\n", " input: List[np.array]\n", " List[np.array] or List[malaya_speech.model.frame.Frame].\n", " norm_func: Callable, optional (default=malaya.utils.activation.softmax)\n", "\n", "\n", " Returns\n", " -------\n", " result: List[np.array]\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from ctc_decoders import ctc_beam_search_decoder\n", "from malaya_speech.utils.char import HF_CTC_VOCAB" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 54s, sys: 53 s, total: 2min 47s\n", "Wall time: 29.1 s\n" ] } ], "source": [ "%%time\n", "\n", "logits = model.predict_logits([ceramah, record1, record2, singlish0, singlish1, singlish2, \n", " mandarin0, mandarin1, mandarin2])" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((9, 499, 40), 39)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logits.shape, len(HF_CTC_VOCAB)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "0 jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ini allah maa ini\n", "1 hello nama saya husin saya tak beskemandi ketiap saya masam\n", "2 hello nama saya hussein saya sukomandi saya mandi diatia hari\n", "3 and then see how they roll it in film okay actually\n", "4 an tat to your eyes\n", "5 sa versa in bal\n", "6 gei wo lai ge zhang jie zui xin de ge\n", "7 wo xiang shou kan ziang shu ying shi pin dao de jie mu\n", "8 qiu yi shou ge de ming zhe ge ci li you zhuan sheng yi meng wang si qin shi xiu\n" ] } ], "source": [ "for no, l in enumerate(logits):\n", " o = ctc_beam_search_decoder(l, HF_CTC_VOCAB, 20)[0][1]\n", " print(no, o)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }