RNNT Streaming#

malaya-speech supports streaming transducer (RNNT) decoding using TorchAudio.

This tutorial is available as an IPython notebook at malaya-speech/example/rnnt-streaming-torchaudio.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[2]:
import malaya_speech

Starting from malaya-speech 1.4.0, streaming always returns a float32 array with values between -1 and +1.

Streaming interface#

def stream_rnnt(
    src,
    asr_model=None,
    classification_model=None,
    format=None,
    option=None,
    beam_width: int = 10,
    buffer_size: int = 4096,
    sample_rate: int = 16000,
    segment_length: int = 2560,
    context_length: int = 640,
    realtime_print: bool = True,
    **kwargs,
):
    """
    Parameters
    -----------
    src: str
        Supported `src` for `torchaudio.io.StreamReader`
        Read more at https://pytorch.org/audio/stable/tutorials/streamreader_basic_tutorial.html#sphx-glr-tutorials-streamreader-basic-tutorial-py
        or https://pytorch.org/audio/stable/tutorials/streamreader_advanced_tutorial.html#sphx-glr-tutorials-streamreader-advanced-tutorial-py
    asr_model: object, optional (default=None)
        ASR model / pipeline, will transcribe each subsample in realtime.
        must be an object of `malaya_speech.torch_model.torchaudio.Conformer`.
    classification_model: object, optional (default=None)
        classification pipeline, will classify each subsample in realtime.
    format: str, optional (default=None)
        Supported `format` for `torchaudio.io.StreamReader`,
        https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
    option: dict, optional (default=None)
        Supported `option` for `torchaudio.io.StreamReader`,
        https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
    buffer_size: int, optional (default=4096)
        Supported `buffer_size` for `torchaudio.io.StreamReader`, buffer size in byte. Used only when src is file-like object,
        https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
    sample_rate: int, optional (default=16000)
        sample rate of the input device; the stream will be resampled automatically.
    segment_length: int, optional (default=2560)
        usually derived from asr_model.segment_length * asr_model.hop_length,
        size of audio chunks; actual size in seconds is `segment_length` / `sample_rate`.
    context_length: int, optional (default=640)
        usually derived from asr_model.right_context_length * asr_model.hop_length,
        size of appended context chunks, only useful for streaming RNNT.
    beam_width: int, optional (default=10)
        width for beam decoding.
    realtime_print: bool, optional (default=True)
        Will print results for ASR.
    """

Load ASR model#

[3]:
malaya_speech.stt.transducer.available_pt_transformer()
[3]:
| Model | Size (MB) | malay-malaya | malay-fleur102 | Language | singlish |
|---|---|---|---|---|---|
| mesolitica/conformer-tiny | 38.5 | {'WER': 0.17341180814, 'CER': 0.05957485024} | {'WER': 0.19524478979, 'CER': 0.0830808938} | [malay] | NaN |
| mesolitica/conformer-base | 121 | {'WER': 0.122076123261, 'CER': 0.03879606324} | {'WER': 0.1326737206665, 'CER': 0.05032914857} | [malay] | NaN |
| mesolitica/conformer-medium | 243 | {'WER': 0.12777757303, 'CER': 0.0393998776} | {'WER': 0.1379928549, 'CER': 0.05876827088} | [malay] | NaN |
| mesolitica/emformer-base | 162 | {'WER': 0.175762423786, 'CER': 0.06233919000537} | {'WER': 0.18303839134, 'CER': 0.0773853362} | [malay] | NaN |
| mesolitica/conformer-singlish | 121 | NaN | NaN | [singlish] | {'WER': 0.08535878149, 'CER': 0.0452357273822, ...} |
| mesolitica/conformer-medium-mixed | 243 | {'WER': 0.122076123261, 'CER': 0.03879606324} | {'WER': 0.1326737206665, 'CER': 0.05032914857} | [malay, singlish] | {'WER': 0.08535878149, 'CER': 0.0452357273822, ...} |

RNNT streaming only supports Emformer; otherwise TorchAudio will throw an error.

[4]:
model = malaya_speech.stt.transducer.pt_transformer(model = 'mesolitica/emformer-base')
[5]:
_ = model.eval()

You need to make sure the last output is named ``speech-to-text`` or else the streaming interface will throw an error.

Start streaming#

[8]:
samples = malaya_speech.streaming.torchaudio.stream_rnnt('speech/podcast/toodia.mp3',
                                                    asr_model = model)
 amalan kuling tapi kalau boleh aku nak kena buat dulu mandi jah kan semalam tu jah dah habisan ke tengok kita dah mai orang yang kita nak sihat yalah premia dengan awak sho aku suka pergi ya aku suka
[9]:
len(samples)
[9]:
375
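As noted earlier, each streamed sample is a float32 chunk with values in [-1, +1]. A hedged sketch (using zero-filled stand-ins rather than the real stream) of how the 375 chunks could be concatenated back into a single waveform and timed, assuming each chunk holds `segment_length` (2560) samples:

```python
import numpy as np

# Stand-ins for the 375 streamed chunks; in the tutorial these come from
# malaya_speech.streaming.torchaudio.stream_rnnt. We assume each chunk is
# a float32 array of 2560 samples at 16 kHz.
sample_rate = 16000
samples = [np.zeros(2560, dtype=np.float32) for _ in range(375)]

audio = np.concatenate(samples)
duration_seconds = len(audio) / sample_rate
print(duration_seconds)  # 375 * 2560 / 16000 = 60.0
```

Under that assumption, 375 chunks correspond to roughly a minute of streamed audio.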