RNNT Streaming#
malaya-speech is able to do transducer (RNNT) streaming using TorchAudio.
This tutorial is available as an IPython notebook at malaya-speech/example/rnnt-streaming-torchaudio.
This module is not language independent, so it is not safe to use on different languages. The pretrained models were trained on hyperlocal languages.
This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.
[2]:
import malaya_speech
Starting from malaya-speech 1.4.0, streaming always returns a float32 array with values between -1 and +1.
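As a quick illustration of that convention, here is a hypothetical helper (not part of malaya-speech) that converts raw int16 PCM, the usual microphone format, into the float32 [-1, +1] range the streaming interface returns:

```python
import numpy as np

def int16_to_float32(pcm: np.ndarray) -> np.ndarray:
    """Scale 16-bit PCM samples into float32 values between -1 and +1."""
    # int16 spans [-32768, 32767], so dividing by 32768 maps it into [-1, +1).
    return pcm.astype(np.float32) / 32768.0

chunk = np.array([0, 16384, -32768, 32767], dtype=np.int16)
normalized = int16_to_float32(chunk)
```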
Streaming interface#
def stream_rnnt(
    src,
    asr_model=None,
    classification_model=None,
    format=None,
    option=None,
    beam_width: int = 10,
    buffer_size: int = 4096,
    sample_rate: int = 16000,
    segment_length: int = 2560,
    context_length: int = 640,
    realtime_print: bool = True,
    **kwargs,
):
    """
    Parameters
    ----------
    src: str
        Supported `src` for `torchaudio.io.StreamReader`.
        Read more at https://pytorch.org/audio/stable/tutorials/streamreader_basic_tutorial.html#sphx-glr-tutorials-streamreader-basic-tutorial-py
        or https://pytorch.org/audio/stable/tutorials/streamreader_advanced_tutorial.html#sphx-glr-tutorials-streamreader-advanced-tutorial-py
    asr_model: object, optional (default=None)
        ASR model / pipeline, will transcribe each subsample in realtime.
        Must be an object of `malaya_speech.torch_model.torchaudio.Conformer`.
    classification_model: object, optional (default=None)
        classification pipeline, will classify each subsample in realtime.
    format: str, optional (default=None)
        Supported `format` for `torchaudio.io.StreamReader`,
        https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
    option: dict, optional (default=None)
        Supported `option` for `torchaudio.io.StreamReader`,
        https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
    buffer_size: int, optional (default=4096)
        Supported `buffer_size` for `torchaudio.io.StreamReader`, buffer size in bytes. Used only when `src` is a file-like object,
        https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
    sample_rate: int, optional (default=16000)
        sample rate of the input device; the audio will be resampled automatically.
    segment_length: int, optional (default=2560)
        usually derived from asr_model.segment_length * asr_model.hop_length,
        size of audio chunks; the actual size in seconds is `segment_length` / `sample_rate`.
    context_length: int, optional (default=640)
        usually derived from asr_model.right_context_length * asr_model.hop_length,
        size of appended context chunks, only useful for streaming RNNT.
    beam_width: int, optional (default=10)
        width for beam decoding.
    realtime_print: bool, optional (default=True)
        will print results for ASR.
    """
Load ASR model#
[3]:
malaya_speech.stt.transducer.available_pt_transformer()
[3]:
| Model | Size (MB) | malay-malaya | malay-fleur102 | Language | singlish |
|---|---|---|---|---|---|
| mesolitica/conformer-tiny | 38.5 | {'WER': 0.17341180814, 'CER': 0.05957485024} | {'WER': 0.19524478979, 'CER': 0.0830808938} | [malay] | NaN |
| mesolitica/conformer-base | 121 | {'WER': 0.122076123261, 'CER': 0.03879606324} | {'WER': 0.1326737206665, 'CER': 0.05032914857} | [malay] | NaN |
| mesolitica/conformer-medium | 243 | {'WER': 0.12777757303, 'CER': 0.0393998776} | {'WER': 0.1379928549, 'CER': 0.05876827088} | [malay] | NaN |
| mesolitica/emformer-base | 162 | {'WER': 0.175762423786, 'CER': 0.06233919000537} | {'WER': 0.18303839134, 'CER': 0.0773853362} | [malay] | NaN |
| mesolitica/conformer-singlish | 121 | NaN | NaN | [singlish] | {'WER': 0.08535878149, 'CER': 0.0452357273822,... |
| mesolitica/conformer-medium-mixed | 243 | {'WER': 0.122076123261, 'CER': 0.03879606324} | {'WER': 0.1326737206665, 'CER': 0.05032914857} | [malay, singlish] | {'WER': 0.08535878149, 'CER': 0.0452357273822,... |
RNNT streaming only supports Emformer; any other architecture will make TorchAudio throw an error.
[4]:
model = malaya_speech.stt.transducer.pt_transformer(model = 'mesolitica/emformer-base')
[5]:
_ = model.eval()
You need to make sure the last output is named ``speech-to-text``, or the streaming interface will throw an error.
Start streaming#
[8]:
samples = malaya_speech.streaming.torchaudio.stream_rnnt('speech/podcast/toodia.mp3',
asr_model = model)
amalan kuling tapi kalau boleh aku nak kena buat dulu mandi jah kan semalam tu jah dah habisan ke tengok kita dah mai orang yang kita nak sihat yalah premia dengan awak sho aku suka pergi ya aku suka
[9]:
len(samples)
[9]:
375
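If you want to stitch the streamed results back into a single waveform for playback or saving, a minimal sketch, assuming each element of `samples` is a float32 numpy chunk as the streaming interface returns since 1.4.0 (the stand-in chunks below are silence, purely for illustration):

```python
import numpy as np

# Stand-in for the chunks returned by stream_rnnt: 375 segments of
# 2560 float32 samples each, here filled with silence for illustration.
samples = [np.zeros(2560, dtype=np.float32) for _ in range(375)]

# Concatenate the chunks back into one contiguous waveform.
audio = np.concatenate(samples)
duration_seconds = len(audio) / 16000
```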