{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Split utterances using VAD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say you have a long audio sample and you want to cut it into smaller samples based on utterances. Malaya-speech can help you!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/split-utterances](https://github.com/huseinzol05/malaya-speech/tree/master/example/utterances).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is language independent, so it is safe to use on different languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of the malaya-speech Pipeline. Read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import malaya_speech\n", "import numpy as np\n", "from malaya_speech import Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available VAD models" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
 "  <thead>\n",
 "    <tr><th></th><th>Size (MB)</th><th>Quantized Size (MB)</th><th>Accuracy</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>vggvox-v1</th><td>70.8</td><td>17.70</td><td>0.9500</td></tr>\n",
 "    <tr><th>vggvox-v2</th><td>31.1</td><td>7.92</td><td>0.9594</td></tr>\n",
 "    <tr><th>speakernet</th><td>20.3</td><td>5.18</td><td>0.9000</td></tr>\n",
 "  </tbody>\n",
 "</table>
\n", "
" ], "text/plain": [ "            Size (MB)  Quantized Size (MB)  Accuracy\n", "vggvox-v1        70.8                17.70    0.9500\n", "vggvox-v2        31.1                 7.92    0.9594\n", "speakernet       20.3                 5.18    0.9000" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.vad.available_model()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load deep model\n", "\n", "I will load the quantized model. We found that quantized VAD models have the same accuracy as the normal models. Read more about VAD at https://malaya-speech.readthedocs.io/en/latest/load-vad.html" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Load quantized model will cause accuracy drop.\n" ] } ], "source": [ "vad = malaya_speech.vad.deep_model(model = 'vggvox-v2', quantized = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load long samples" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "294.504" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y, sr = malaya_speech.load('speech/podcast/2x5%20Ep%2010.wav')\n", "len(y) / sr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "294 seconds!" 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(y[:sr * 10], rate = sr)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from pydub import AudioSegment\n", "import numpy as np\n", "\n", "sr = 16000\n", "sound = AudioSegment.from_file('speech/video/70_Peratus_Gaji_Rakyat_Malaysia_Dibelanjakan_Untuk_Barang_Keperluan.mp3')\n", "samples = sound.set_frame_rate(sr).set_channels(1).get_array_of_samples()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "samples = np.array(samples)\n", "samples = malaya_speech.utils.astype.int_to_float(samples)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "110.106125" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(samples) / sr" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initiate pipeline\n", "\n", "Read more about how to use the Malaya-Speech VAD model at https://malaya-speech.readthedocs.io/en/latest/load-vad.html#How-to-detect-Voice-Activity." 
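, "\n", "\n", "
As a rough illustration of what the first two pipeline stages below operate on, the following sketch cuts a signal into 30 ms frames and groups them into batches of 5 using plain NumPy. The helper names `split_into_frames` and `batch` are hypothetical and are not part of malaya-speech; the real `malaya_speech.utils.generator.frames` returns frame objects and may handle the final partial frame differently, so treat this only as a way to check the arithmetic.

```python
import numpy as np

SAMPLE_RATE = 16000        # Hz, the rate used throughout this notebook
FRAME_DURATION_MS = 30     # same value passed as frame_duration_ms in the pipeline
BATCH_SIZE = 5             # same value passed to .batching(...)


def split_into_frames(signal, sample_rate, frame_duration_ms):
    # Hypothetical helper: samples per frame, 16000 * 30 / 1000 = 480.
    frame_size = sample_rate * frame_duration_ms // 1000
    # Consecutive, non-overlapping slices; the last one may be shorter.
    return [signal[i:i + frame_size] for i in range(0, len(signal), frame_size)]


def batch(items, size):
    # Hypothetical helper: group a list into chunks of at most `size` elements.
    return [items[i:i + size] for i in range(0, len(items), size)]


# Silence standing in for the 294.504 second podcast sample (4,712,064 samples).
signal = np.zeros(4712064, dtype=np.float32)

frames = split_into_frames(signal, SAMPLE_RATE, FRAME_DURATION_MS)
batches = batch(frames, BATCH_SIZE)

print(len(frames))       # 9817 frames in total
print(len(frames[0]))    # 480 samples in a full frame
print(len(frames[-1]))   # 384 samples in the final, shorter frame
print(len(batches))      # 1964 batches of up to 5 frames
```

Under these assumptions, the 480 and 384 sample lengths match the two input lengths reported in the librosa `n_fft` warnings when the pipeline runs, which suggests the final partial frame is kept rather than dropped.
"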
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p = Pipeline()\n", "\n", "pipeline = (\n", "    p.map(malaya_speech.utils.generator.frames, frame_duration_ms = 30)\n", "    .batching(5)\n", "    .foreach_map(vad.predict)\n", "    .flatten()\n", ")\n", "p.visualize()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=512 is too small for input signal of length=480\n", " n_fft, y.shape[-1]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5min 34s, sys: 52.3 s, total: 6min 27s\n", "Wall time: 1min 22s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=512 is too small for input signal of length=384\n", " n_fft, y.shape[-1]\n" ] }, { "data": { "text/plain": [ "dict_keys(['frames', 'batching', 'predict', 'flatten'])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "result = p(y)\n", "result.keys()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "frames = result['frames']\n", "frames_vad = [\n", "    (frame, result['flatten'][no]) for no, frame in enumerate(frames)\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split utterances based on the size of negative VAD\n", "\n", "To split based on negative VAD, use `malaya_speech.split.split_vad`:\n", "\n", "```python\n", "def split_vad(\n", "    frames,\n", "    n: int = 3,\n", "    negative_threshold: float = 0.1,\n", "    silent_trail: int = 500,\n", "    sample_rate: int = 16000,\n", "    use_negative_as_silent: bool = False,\n", "):\n", "    \"\"\"\n", "    Split a sample into multiple samples based `n` size of negative VAD.\n", "\n", "    Parameters\n", "    ----------\n", "    frames: List[Tuple[Frame, label]]\n", "    n: int, optional (default=3)\n", "        `n` size of negative VAD to assume in one subsample.\n", "    negative_threshold: float, optional (default = 0.1)\n", "        If `negative_threshold` is 0.1, means that, length negative samples must at least 0.1 second.\n", "    silent_trail: int, optional (default = 500)\n", "        If an element is not a voice activity, append with `silent_trail` frame size. \n", "    sample_rate: int, optional (default = 16000)\n", "        sample rate for frames.\n", "    use_negative_as_silent: bool, optional (default = False)\n", "        If True, will use negative VAD as silent, else, use zeros array size of `silent_trail`.\n", "\n", "    Returns\n", "    -------\n", "    result : List[Frame]\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "splitted = malaya_speech.split.split_vad(frames_vad)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(splitted[0].array, rate = sr)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(splitted[1].array, rate = sr)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(splitted[2].array, rate = sr)" ] }, { "cell_type": "code", "execution_count": 16, 
"metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(splitted[3].array, rate = sr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split utterances based on maximum duration VAD\n", "\n", "To split based on the maximum duration of voice activity, use `malaya_speech.split.split_vad_duration`:\n", "\n", "```python\n", "def split_vad_duration(\n", "    frames,\n", "    max_duration: float = 5.0,\n", "    negative_threshold: float = 0.1,\n", "    silent_trail = 500,\n", "    sample_rate: int = 16000,\n", "    use_negative_as_silent: bool = False,\n", "):\n", "    \"\"\"\n", "    Split a sample into multiple samples based maximum duration of voice activities.\n", "\n", "    Parameters\n", "    ----------\n", "    frames: List[Tuple[Frame, label]]\n", "    max_duration: float, optional (default = 5.0)\n", "        Maximum duration to assume one sample combined from voice activities.\n", "    negative_threshold: float, optional (default = 0.1)\n", "        If `negative_threshold` is 0.1, means that, length negative samples must at least 0.1 second.\n", "    silent_trail: int, optional (default = 500)\n", "        If an element is not a voice activity, append with `silent_trail` frame size.\n", "    sample_rate: int, optional (default = 16000)\n", "        sample rate for frames.\n", "    use_negative_as_silent: bool, optional (default = False)\n", "        If True, will use negative VAD as silent, else, use zeros array size of `silent_trail`.\n", "\n", "    Returns\n", "    -------\n", "    result : List[Frame]\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "splitted = malaya_speech.split.split_vad_duration(frames_vad, negative_threshold = 0.3)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "ipd.Audio(splitted[0].array, rate = sr)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(splitted[1].array, rate = sr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }