{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speech-to-Text Seq2Seq Whisper" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretrained HuggingFace models finetuned on hyperlocal languages, https://huggingface.co/mesolitica" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/stt-seq2seq-whisper](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-seq2seq-whisper).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of the malaya-speech Pipeline; read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why official OpenAI Whisper instead HuggingFace?\n", "\n", "Some implementation from official repository is much better and evolved into better features, eg, https://github.com/m-bain/whisperX" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install OpenAI Whisper\n", "\n", "Simply,\n", "\n", "```bash\n", "pip install openai-whisper\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.\n" ] } ], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available Whisper model" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n", "INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n", "INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt\n" ] }, { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>Size (MB)</th><th>malay-malaya</th><th>malay-fleur102</th><th>singlish</th><th>Language</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>mesolitica/finetune-whisper-tiny-ms-singlish</th><td>151</td><td>{'WER': 0.20141585, 'CER': 0.071964908}</td><td>{'WER': 0.235680975, 'CER': 0.0986880877}</td><td>{'WER': 0.09045121, 'CER': 0.0481965}</td><td>[malay, singlish]</td></tr>\n",
"<tr><th>mesolitica/finetune-whisper-tiny-ms-singlish-v2</th><td>151</td><td>{'WER': 0.20141585, 'CER': 0.071964908}</td><td>{'WER': 0.22459602, 'CER': 0.089406469}</td><td>{'WER': 0.138882971, 'CER': 0.074929807}</td><td>[malay, singlish]</td></tr>\n",
"<tr><th>mesolitica/finetune-whisper-base-ms-singlish-v2</th><td>290</td><td>{'WER': 0.172632664, 'CER': 0.0680027682}</td><td>{'WER': 0.1837319118, 'CER': 0.0599804251}</td><td>{'WER': 0.111506313, 'CER': 0.05852830724}</td><td>[malay, singlish]</td></tr>\n",
"<tr><th>mesolitica/finetune-whisper-small-ms-singlish-v2</th><td>967</td><td>{'WER': 0.13189875561, 'CER': 0.0434602169}</td><td>{'WER': 0.13277694, 'CER': 0.0478108612}</td><td>{'WER': 0.09489335668, 'CER': 0.05045327551}</td><td>[malay, singlish]</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " Size (MB) \\\n", "mesolitica/finetune-whisper-tiny-ms-singlish 151 \n", "mesolitica/finetune-whisper-tiny-ms-singlish-v2 151 \n", "mesolitica/finetune-whisper-base-ms-singlish-v2 290 \n", "mesolitica/finetune-whisper-small-ms-singlish-v2 967 \n", "\n", " malay-malaya \\\n", "mesolitica/finetune-whisper-tiny-ms-singlish {'WER': 0.20141585, 'CER': 0.071964908} \n", "mesolitica/finetune-whisper-tiny-ms-singlish-v2 {'WER': 0.20141585, 'CER': 0.071964908} \n", "mesolitica/finetune-whisper-base-ms-singlish-v2 {'WER': 0.172632664, 'CER': 0.0680027682} \n", "mesolitica/finetune-whisper-small-ms-singlish-v2 {'WER': 0.13189875561, 'CER': 0.0434602169} \n", "\n", " malay-fleur102 \\\n", "mesolitica/finetune-whisper-tiny-ms-singlish {'WER': 0.235680975, 'CER': 0.0986880877} \n", "mesolitica/finetune-whisper-tiny-ms-singlish-v2 {'WER': 0.22459602, 'CER': 0.089406469} \n", "mesolitica/finetune-whisper-base-ms-singlish-v2 {'WER': 0.1837319118, 'CER': 0.0599804251} \n", "mesolitica/finetune-whisper-small-ms-singlish-v2 {'WER': 0.13277694, 'CER': 0.0478108612} \n", "\n", " singlish \\\n", "mesolitica/finetune-whisper-tiny-ms-singlish {'WER': 0.09045121, 'CER': 0.0481965} \n", "mesolitica/finetune-whisper-tiny-ms-singlish-v2 {'WER': 0.138882971, 'CER': 0.074929807} \n", "mesolitica/finetune-whisper-base-ms-singlish-v2 {'WER': 0.111506313, 'CER': 0.05852830724} \n", "mesolitica/finetune-whisper-small-ms-singlish-v2 {'WER': 0.09489335668, 'CER': 0.05045327551} \n", "\n", " Language \n", "mesolitica/finetune-whisper-tiny-ms-singlish [malay, singlish] \n", "mesolitica/finetune-whisper-tiny-ms-singlish-v2 [malay, singlish] \n", "mesolitica/finetune-whisper-base-ms-singlish-v2 [malay, singlish] \n", "mesolitica/finetune-whisper-small-ms-singlish-v2 [malay, singlish] " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.stt.seq2seq.available_whisper()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"### Load Whisper model\n", "\n", "```python\n", "def whisper(\n", "    model: str = 'mesolitica/finetune-whisper-base-ms-singlish-v2',\n", "    force_check: bool = True,\n", "    **kwargs,\n", "):\n", "    \"\"\"\n", "    Load finetuned models from HuggingFace.\n", "\n", "    Parameters\n", "    ----------\n", "    model : str, optional (default='mesolitica/finetune-whisper-base-ms-singlish-v2')\n", "        Check available models at `malaya_speech.stt.seq2seq.available_whisper()`.\n", "    force_check: bool, optional (default=True)\n", "        Force check that the model is one of the malaya models.\n", "        Set to False if you have your own huggingface model.\n", "\n", "    Returns\n", "    -------\n", "    result : whisper.model.Whisper class\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model = malaya_speech.stt.seq2seq.whisper(model = 'mesolitica/finetune-whisper-base-ms-singlish-v2')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate\n", "\n", "You can read more at the official repository, https://github.com/openai/whisper" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "model = model.to('cpu')" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import whisper" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah maha'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the audio, then pad or trim it to Whisper's 30-second input window.\n", "audio = whisper.load_audio('speech/khutbah/wadi-annuar.wav')\n", "audio = whisper.pad_or_trim(audio)\n", "\n", "# Compute the log-Mel spectrogram on the same device as the model.\n", "mel = whisper.log_mel_spectrogram(audio).to(model.device)\n", "# Decode in FP32 because the model was moved to CPU above.\n", "options = whisper.DecodingOptions(fp16 = False)\n", "result = whisper.decode(model, mel, options)\n", "result.text" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": 
{ "text/plain": [ "'how they roll it in film okay actually'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "audio = whisper.load_audio('speech/singlish/singlish0.wav')\n", "audio = whisper.pad_or_trim(audio)\n", "\n", "mel = whisper.log_mel_spectrogram(audio).to(model.device)\n", "options = whisper.DecodingOptions(fp16 = False)\n", "result = whisper.decode(model, mel, options)\n", "result.text" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }