Realtime VAD
Contents
Realtime VAD#
Let say you want to cut your realtime recording audio by using VAD, malaya-speech able to do that.
This tutorial is available as an IPython notebook at malaya-speech/example/realtime-vad.
This module is language independent, so it save to use on different languages. Pretrained models trained on multilanguages.
This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.
[1]:
import malaya_speech
from malaya_speech import Pipeline
from malaya_speech.utils.astype import float_to_int
Cannot import beam_search_ops from Tensorflow Addons, ['malaya.jawi_rumi.deep_model', 'malaya.phoneme.deep_model', 'malaya.rumi_jawi.deep_model', 'malaya.stem.deep_model'] will not available to use, make sure Tensorflow Addons version >= 0.12.0
check compatible Tensorflow version with Tensorflow Addons at https://github.com/tensorflow/addons/releases
/Users/huseinzolkepli/.pyenv/versions/3.9.4/lib/python3.9/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
torchaudio.io.StreamReader exception: FFmpeg libraries are not found. Please install FFmpeg.
`torchaudio.io.StreamReader` is not available, `malaya_speech.streaming.torchaudio.stream` is not able to use.
`openai-whisper` is not available, native whisper processor is not available, will use huggingface processor instead.
`torchaudio.io.StreamReader` is not available, `malaya_speech.streaming.torchaudio` is not able to use.
Load VAD model#
Fastest and common model people use, is webrtc. Read more about VAD at https://malaya-speech.readthedocs.io/en/latest/load-vad.html
[2]:
webrtc = malaya_speech.vad.webrtc()
[3]:
p_vad = Pipeline()
pipeline = (
p_vad.map(lambda x: float_to_int(x, divide_max_abs=False))
.map(webrtc)
)
p_vad.visualize()
[3]:
Starting malaya-speech 1.4.0, streaming always returned a float32 array between -1 and +1 values.
Streaming interface#
def stream(
vad_model=None,
asr_model=None,
classification_model=None,
sample_rate: int = 16000,
segment_length: int = 2560,
num_padding_frames: int = 20,
ratio: float = 0.75,
min_length: float = 0.1,
max_length: float = 10.0,
realtime_print: bool = True,
**kwargs,
):
"""
Stream an audio using pyaudio library.
Parameters
----------
vad_model: object, optional (default=None)
vad model / pipeline.
asr_model: object, optional (default=None)
ASR model / pipeline, will transcribe each subsamples realtime.
classification_model: object, optional (default=None)
classification pipeline, will classify each subsamples realtime.
device: None, optional (default=None)
`device` parameter for pyaudio, check available devices from `sounddevice.query_devices()`.
sample_rate: int, optional (default = 16000)
output sample rate.
segment_length: int, optional (default=2560)
usually derived from asr_model.segment_length * asr_model.hop_length,
size of audio chunks, actual size in term of second is `segment_length` / `sample_rate`.
ratio: float, optional (default = 0.75)
if 75% of the queue is positive, assumed it is a voice activity.
min_length: float, optional (default=0.1)
minimum length (second) to accept a subsample.
max_length: float, optional (default=10.0)
maximum length (second) to accept a subsample.
realtime_print: bool, optional (default=True)
Will print results for ASR.
**kwargs: vector argument
vector argument pass to malaya_speech.streaming.pyaudio.Audio interface.
Returns
-------
result : List[dict]
"""
Once you start to run the code below, it will straight away recording your voice. Right now I am using built-in microphone.
If you run in jupyter notebook, press button stop up there to stop recording, if in terminal, press CTRL + c.
[4]:
samples = malaya_speech.streaming.pyaudio.stream(vad_model = p_vad, segment_length = 320)
[5]:
len(samples)
[5]:
4
[6]:
import IPython.display as ipd
import numpy as np
[7]:
ipd.Audio(samples[-1]['wav_data'], rate = 16000)
[7]:
[8]:
ipd.Audio(np.concatenate([s['wav_data'] for s in samples]), rate = 16000)
[8]:
Use deep learning#
We know, webrtc does not work really good in noisy environment, so to improve that, we can use VAD deep model from malaya-speech.
[9]:
vad_model = malaya_speech.vad.deep_model(model = 'vggvox-v2')
/Users/huseinzolkepli/Documents/dev/malaya-speech/malaya_speech/utils/featurization.py:38: FutureWarning: Pass sr=16000, n_fft=512 as keyword args. From version 0.10 passing these as positional arguments will result in an error
self.mel_basis = librosa.filters.mel(
Once you start to run the code below, it will straight away recording your voice. Right now I am using built-in microphone.
If you run in jupyter notebook, press button stop up there to stop recording, if in terminal, press CTRL + c.
deep learning model trained on 50 ms and bigger frames, so, 16000 * 0.05 = 800
[20]:
samples = malaya_speech.streaming.pyaudio.stream(vad_model, segment_length = 800)
Use Silero VAD#
Originally from https://github.com/snakers4/silero-vad
[23]:
import numpy as np
import torch
model, _ = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
Using cache found in /Users/huseinzolkepli/.cache/torch/hub/snakers4_silero-vad_master
[24]:
p = Pipeline()
pipeline = (
p.map(lambda x: model(torch.tensor(x.astype(np.float32)), 16000).numpy()[0][0] >= 0.5, name = 'vad')
)
p.visualize()
[24]:
If you are using vilero VAD, ``segment_length`` must >= 640.
[25]:
samples = malaya_speech.streaming.pyaudio.stream(p, segment_length = 640)
[26]:
len(samples)
[26]:
1
[27]:
ipd.Audio(samples[0]['wav_data'], rate = 16000)
[27]: