{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Realtime Alignment\n", "\n", "Let's say you want to align a realtime recording or input; malaya-speech is able to do that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/realtime-alignment](https://github.com/huseinzol05/malaya-speech/tree/master/example/realtime-alignment).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on different languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of the malaya-speech Pipeline; read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Cannot import beam_search_ops from Tensorflow Addons, ['malaya.jawi_rumi.deep_model', 'malaya.phoneme.deep_model', 'malaya.rumi_jawi.deep_model', 'malaya.stem.deep_model'] will not available to use, make sure Tensorflow Addons version >= 0.12.0\n", "check compatible Tensorflow version with Tensorflow Addons at https://github.com/tensorflow/addons/releases\n", "/Users/huseinzolkepli/.pyenv/versions/3.9.4/lib/python3.9/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n", "torchaudio.io.StreamReader exception: FFmpeg libraries are not found. Please install FFmpeg.\n", "`torchaudio.io.StreamReader` is not available, `malaya_speech.streaming.torchaudio.stream` is not able to use.\n", "`openai-whisper` is not available, native whisper processor is not available, will use huggingface processor instead.\n", "`torchaudio.io.StreamReader` is not available, `malaya_speech.streaming.torchaudio` is not able to use.\n" ] } ], "source": [ "import malaya_speech\n", "from malaya_speech import Pipeline\n", "from malaya_speech.utils.astype import float_to_int" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load VAD model\n", "\n", "We are going to use the WebRTC VAD model; read more about VAD at https://malaya-speech.readthedocs.io/en/latest/load-vad.html" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [], "source": [ "vad_model = malaya_speech.vad.webrtc()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p_vad = Pipeline()\n", "pipeline = (\n", " 
p_vad.map(lambda x: float_to_int(x, divide_max_abs=False))\n", " .map(vad_model)\n", ")\n", "p_vad.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Starting with malaya-speech 1.4.0, streaming always returns a float32 array with values between -1 and +1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Streaming interface\n", "\n", "```python\n", "def stream(\n", " vad_model=None,\n", " asr_model=None,\n", " classification_model=None,\n", " sample_rate: int = 16000,\n", " segment_length: int = 2560,\n", " num_padding_frames: int = 20,\n", " ratio: float = 0.75,\n", " min_length: float = 0.1,\n", " max_length: float = 10.0,\n", " realtime_print: bool = True,\n", " **kwargs,\n", "):\n", " \"\"\"\n", " Stream an audio using pyaudio library.\n", "\n", " Parameters\n", " ----------\n", " vad_model: object, optional (default=None)\n", " vad model / pipeline.\n", " asr_model: object, optional (default=None)\n", " ASR model / pipeline, will transcribe each subsamples realtime.\n", " classification_model: object, optional (default=None)\n", " classification pipeline, will classify each subsamples realtime.\n", " device: None, optional (default=None)\n", " `device` parameter for pyaudio, check available devices from `sounddevice.query_devices()`.\n", " sample_rate: int, optional (default = 16000)\n", " output sample rate.\n", " segment_length: int, optional (default=2560)\n", " usually derived from asr_model.segment_length * asr_model.hop_length,\n", " size of audio chunks, actual size in term of second is `segment_length` / `sample_rate`.\n", " ratio: float, optional (default = 0.75)\n", " if 75% of the queue is positive, assumed it is a voice activity.\n", " min_length: float, optional (default=0.1)\n", " minimum length (second) to accept a subsample.\n", " max_length: float, optional (default=10.0)\n", " maximum length (second) to accept a subsample.\n", " realtime_print: bool, optional (default=True)\n", " Will print results for ASR.\n", 
" **kwargs: vector argument\n", " vector argument pass to malaya_speech.streaming.pyaudio.Audio interface.\n", "\n", " Returns\n", " -------\n", " result : List[dict]\n", " \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check available devices" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "> 0 MacBook Air Microphone, Core Audio (1 in, 0 out)\n", "< 1 MacBook Air Speakers, Core Audio (0 in, 2 out)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sounddevice\n", "\n", "sounddevice.query_devices()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the device at index `0` is used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load ASR model" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.stt.transducer.transformer(model = 'conformer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Force Alignment Pipeline\n", "\n", "Feel free to add speech enhancement or any other function, but in this example, I keep it simple.\n", "\n", "Right now, only the transducer TensorFlow models support the `force_alignment` method." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p_asr = Pipeline()\n", "pipeline_asr = (\n", " p_asr.map(lambda x: model.predict_alignment(x), name = 'speech-to-text')\n", ")\n", "p_asr.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Make sure the last output is named `speech-to-text`, or else the realtime engine will throw an error**." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start Recording" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Once you run the code below, it will start recording your voice straight away**. \n", "\n", "If you are running in a Jupyter notebook, press the stop button in the toolbar to stop recording; in a terminal, press `CTRL + c`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'text': 'helo', 'start': 0.040000003, 'end': 0.1300000047683716}] [{'text': 'nama', 'start': 0.080000006, 'end': 0.21000000298023225}, {'text': 'saya', 'start': 0.28, 'end': 0.49000001907348634}] [] [{'text': 'hari', 'start': 0.040000003, 'end': 0.2500000095367432}, {'text': 'ini', 'start': 0.28, 'end': 0.2900000011920929}, {'text': 'saya', 'start': 0.52000004, 'end': 0.65000004529953}, {'text': 'nak', 'start': 0.76000005, 'end': 0.7700000500679016}, {'text': 'cakap', 'start': 1.0, 'end': 1.2100000476837158}, {'text': 'tentang', 'start': 1.24, 'end': 1.610000023841858}] [{'text': 'ini', 'start': 0.080000006, 'end': 0.09000000566244125}, {'text': 'saya', 'start': 0.32000002, 'end': 0.5700000023841858}] [{'text': 'saya', 'start': 0.24000001, 'end': 0.37000001430511475}, {'text': 'macam', 'start': 0.44000003, 'end': 0.65000004529953}, {'text': 'tu', 'start': 1.08, 'end': 1.0900000429153442}] [{'text': 'dan', 'start': 0.120000005, 'end': 0.1300000047683716}, {'text': 'saya', 'start': 0.4, 'end': 0.49000001907348634}, {'text': 'tak', 'start': 0.56, 'end': 0.5700000023841858}, {'text': 'suka', 'start': 0.76000005, 'end': 0.9300000166893005}, {'text': 'mandi', 'start': 1.0, 'end': 1.1700000858306885}] [{'text': 'terima', 'start': 0.040000003, 'end': 0.09000000566244125}, {'text': 'kasih', 'start': 0.24000001, 'end': 0.49000001907348634}] " ] } ], "source": [ "samples = malaya_speech.streaming.pyaudio.stream(vad_model = p_vad, asr_model = p_asr,\n", " segment_length 
= 320)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It works pretty nicely. As you can see, it is able to transcribe in realtime; you can try it yourself." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import IPython.display as ipd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'wav_data': array([-0.00130833, -0.00126185, -0.00101168, ..., 0.00048238,\n", " 0.00061575, 0.00031507], dtype=float32),\n", " 'timestamp': datetime.datetime(2023, 2, 17, 1, 42, 21, 411091),\n", " 'asr_model': [{'text': 'dan',\n", " 'start': 0.120000005,\n", " 'end': 0.1300000047683716},\n", " {'text': 'saya', 'start': 0.4, 'end': 0.49000001907348634},\n", " {'text': 'tak', 'start': 0.56, 'end': 0.5700000023841858},\n", " {'text': 'suka', 'start': 0.76000005, 'end': 0.9300000166893005},\n", " {'text': 'mandi', 'start': 1.0, 'end': 1.1700000858306885}]}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "samples[-2]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(samples[-2]['wav_data'], rate = 16000)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { 
"delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }