{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Put comma using Force Alignment\n", "\n", "Put commas on ASR output using a Force Alignment model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/put-comma-force-alignment](https://github.com/huseinzol05/malaya-speech/tree/master/example/put-comma-force-alignment).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline\n", "import IPython.display as ipd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available Force Aligner model" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>Language</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>conformer-transducer</th><td>120</td><td>32.3</td><td>[malay]</td></tr>\n", "<tr><th>conformer-transducer-mixed</th><td>120</td><td>32.3</td><td>[mixed]</td></tr>\n", "<tr><th>conformer-transducer-singlish</th><td>120</td><td>32.3</td><td>[singlish]</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " Size (MB) Quantized Size (MB) Language\n", "conformer-transducer 120 32.3 [malay]\n", "conformer-transducer-mixed 120 32.3 [mixed]\n", "conformer-transducer-singlish 120 32.3 [singlish]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.force_alignment.available_aligner()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Force Aligner model\n", "\n", "```python\n", "def deep_aligner(\n", " model: str = 'conformer-transducer', quantized: bool = False, **kwargs\n", "):\n", " \"\"\"\n", " Load Deep Aligner model.\n", "\n", " Parameters\n", " ----------\n", " model : str, optional (default='conformer-transducer')\n", " Model architecture supported. Allowed values:\n", "\n", " * ``'conformer-transducer'`` - Conformer + RNNT trained on Malay STT dataset.\n", " * ``'conformer-transducer-mixed'`` - Conformer + RNNT trained on Mixed STT dataset.\n", " * ``'conformer-transducer-singlish'`` - Conformer + RNNT trained on Singlish STT dataset.\n", "\n", " quantized : bool, optional (default=False)\n", " if True, will load 8-bit quantized model.\n", " Quantized model not necessary faster, totally depends on the machine.\n", "\n", " Returns\n", " -------\n", " result : malaya_speech.model.tf.TransducerAligner class\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "model = malaya_speech.force_alignment.deep_aligner(model = 'conformer-transducer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Malay samples" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "malay1, sr = malaya_speech.load('speech/example-speaker/shafiqah-idayu.wav')\n", "malay2, sr = malaya_speech.load('speech/example-speaker/haqkiem.wav')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, 
"outputs": [], "source": [ "texts = ['nama saya shafiqah idayu',\n", "         'sebagai pembangkang yang matang dan sejahtera pas akan menghadapi pilihan raya umum dan tidak menumbang kerajaan dari pintu belakang']" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(malay2, rate=sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predict\n", "\n", "```python\n", "def predict(self, input, transcription: str):\n", "    \"\"\"\n", "    Align the transcription with the input audio, will return a dict.\n", "\n", "    Parameters\n", "    ----------\n", "    input: np.array\n", "        np.array or malaya_speech.model.frame.Frame.\n", "    transcription: str\n", "        transcription of input audio\n", "\n", "    Returns\n", "    -------\n", "    result: Dict[words_alignment, subwords_alignment, subwords, alignment]\n", "    \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict Malay\n", "\n", "Our original text is: 'sebagai pembangkang yang matang dan sejahtera pas akan menghadapi pilihan raya umum dan tidak menumbang kerajaan dari pintu belakang'" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "results = model.predict(malay2, texts[1])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "dict_keys(['words_alignment', 'subwords_alignment', 'subwords', 'alignment'])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.keys()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'text': 'sebagai', 'start': 0.08, 'end': 0.45},\n", " {'text': 'pembangkang', 'start': 0.56, 'end': 1.05},\n", " {'text': 'yang', 'start': 1.16, 'end': 1.29},\n", " {'text': 'matang', 'start': 
1.4, 'end': 1.69},\n", " {'text': 'dan', 'start': 1.84, 'end': 1.85},\n", " {'text': 'sejahtera', 'start': 2.08, 'end': 2.57},\n", " {'text': 'pas', 'start': 2.84, 'end': 2.85},\n", " {'text': 'akan', 'start': 3.12, 'end': 3.33},\n", " {'text': 'menghadapi', 'start': 3.4, 'end': 3.93},\n", " {'text': 'pilihan', 'start': 4.04, 'end': 4.45},\n", " {'text': 'raya', 'start': 4.56, 'end': 4.81},\n", " {'text': 'umum', 'start': 4.88, 'end': 5.17},\n", " {'text': 'dan', 'start': 5.36, 'end': 5.37},\n", " {'text': 'tidak', 'start': 5.56, 'end': 5.73},\n", " {'text': 'menumbang', 'start': 5.84, 'end': 6.25},\n", " {'text': 'kerajaan', 'start': 6.36, 'end': 6.85},\n", " {'text': 'dari', 'start': 7.04, 'end': 7.25},\n", " {'text': 'pintu', 'start': 7.36, 'end': 7.53},\n", " {'text': 'belakang', 'start': 7.68, 'end': 8.05}]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results['words_alignment']" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig = plt.figure(figsize=(10, 8))\n", "ax = fig.add_subplot(111)\n", "ax.set_title('Alignment steps')\n", "im = ax.imshow(\n", "    results['alignment'],\n", "    aspect='auto',\n", "    origin='lower',\n", "    interpolation='none')\n", "# label each decoder step (row) with its subword\n", "ax.set_yticks(range(len(results['subwords'])))\n", "ax.set_yticklabels(results['subwords'])\n", "fig.colorbar(im, ax=ax)\n", "xlabel = 'Encoder timestep'\n", "plt.xlabel(xlabel)\n", "plt.ylabel('Decoder timestep')\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you listen to the audio, there is a silence between `pas` and `akan`, so we can assume a comma between those words. We can use `malaya_speech.aligner.put_comma` to help us. The call below lowers `min_threshold` from the default 0.5 to 0.25 seconds.\n", "\n", "```python\n", "def put_comma(alignment, min_threshold: float = 0.5):\n", "    \"\"\"\n", "    Put commas in the alignment from a force alignment model.\n", "\n", "    Parameters\n", "    -----------\n", "    alignment: List[Dict[text, start, end]]\n", "    min_threshold: float, optional (default=0.5)\n", "        minimum threshold, in seconds, to assume a comma.\n", "\n", "    Returns\n", "    --------\n", "    result: List[str]\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sebagai',\n", " 'pembangkang',\n", " 'yang',\n", " 'matang',\n", " 'dan',\n", " 'sejahtera',\n", " ',',\n", " 'pas',\n", " ',',\n", " 'akan',\n", " 'menghadapi',\n", " 'pilihan',\n", " 'raya',\n", " 'umum',\n", " 'dan',\n", " 'tidak',\n", " 'menumbang',\n", " 'kerajaan',\n", " 'dari',\n", " 'pintu',\n", " 'belakang',\n", " '.']" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.aligner.put_comma(results['words_alignment'], 0.25)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": 
"python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }