RNNT Streaming#

malaya-speech supports streaming transducer (RNNT) decoding using TorchAudio.

This tutorial is available as an IPython notebook at malaya-speech/example/rnnt-streaming-torchaudio.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[2]:
import malaya_speech

Starting from malaya-speech 1.4.0, streaming always returns a float32 array with values between -1 and +1.

Streaming interface#

def stream_rnnt(
    src,
    asr_model=None,
    classification_model=None,
    format=None,
    option=None,
    beam_width: int = 10,
    buffer_size: int = 4096,
    sample_rate: int = 16000,
    segment_length: int = 2560,
    context_length: int = 640,
    realtime_print: bool = True,
    **kwargs,
):
    """
    Parameters
    -----------
    src: str
        Supported `src` for `torchaudio.io.StreamReader`
        Read more at https://pytorch.org/audio/stable/tutorials/streamreader_basic_tutorial.html#sphx-glr-tutorials-streamreader-basic-tutorial-py
        or https://pytorch.org/audio/stable/tutorials/streamreader_advanced_tutorial.html#sphx-glr-tutorials-streamreader-advanced-tutorial-py
    asr_model: object, optional (default=None)
        ASR model / pipeline, will transcribe each subsample in realtime.
        must be an object of `malaya_speech.torch_model.torchaudio.Conformer`.
    classification_model: object, optional (default=None)
        classification pipeline, will classify each subsample in realtime.
    format: str, optional (default=None)
        Supported `format` for `torchaudio.io.StreamReader`,
        https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
    option: dict, optional (default=None)
        Supported `option` for `torchaudio.io.StreamReader`,
        https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
    buffer_size: int, optional (default=4096)
        Supported `buffer_size` for `torchaudio.io.StreamReader`, buffer size in byte. Used only when src is file-like object,
        https://pytorch.org/audio/stable/generated/torchaudio.io.StreamReader.html#torchaudio.io.StreamReader
    sample_rate: int, optional (default=16000)
        sample rate of the input device; the stream will be resampled automatically.
    segment_length: int, optional (default=2560)
        usually derived from asr_model.segment_length * asr_model.hop_length,
        size of audio chunks; actual size in seconds is `segment_length` / `sample_rate`.
    context_length: int, optional (default=640)
        usually derived from asr_model.right_context_length * asr_model.hop_length,
        size of appended context chunks, only useful for streaming RNNT.
    beam_width: int, optional (default=10)
        width for beam decoding.
    realtime_print: bool, optional (default=True)
        Will print results for ASR.
    """

Load ASR model#

[3]:
malaya_speech.stt.transducer.available_pt_transformer()
[3]:
| Model | Size (MB) | malay-malaya | malay-fleur102 | Language | singlish |
|---|---|---|---|---|---|
| mesolitica/conformer-tiny | 38.5 | {'WER': 0.17341180814, 'CER': 0.05957485024} | {'WER': 0.19524478979, 'CER': 0.0830808938} | [malay] | NaN |
| mesolitica/conformer-base | 121 | {'WER': 0.122076123261, 'CER': 0.03879606324} | {'WER': 0.1326737206665, 'CER': 0.05032914857} | [malay] | NaN |
| mesolitica/conformer-medium | 243 | {'WER': 0.12777757303, 'CER': 0.0393998776} | {'WER': 0.1379928549, 'CER': 0.05876827088} | [malay] | NaN |
| mesolitica/emformer-base | 162 | {'WER': 0.175762423786, 'CER': 0.06233919000537} | {'WER': 0.18303839134, 'CER': 0.0773853362} | [malay] | NaN |
| mesolitica/conformer-singlish | 121 | NaN | NaN | [singlish] | {'WER': 0.08535878149, 'CER': 0.0452357273822, ...} |
| mesolitica/conformer-medium-mixed | 243 | {'WER': 0.122076123261, 'CER': 0.03879606324} | {'WER': 0.1326737206665, 'CER': 0.05032914857} | [malay, singlish] | {'WER': 0.08535878149, 'CER': 0.0452357273822, ...} |

RNNT streaming only supports Emformer; otherwise TorchAudio will throw an error.

[4]:
model = malaya_speech.stt.transducer.pt_transformer(model = 'mesolitica/emformer-base')
[5]:
_ = model.eval()

You need to make sure the last output is named ``speech-to-text`` or else the streaming interface will throw an error.

Start streaming#

[8]:
samples = malaya_speech.streaming.torchaudio.stream_rnnt('speech/podcast/toodia.mp3',
                                                    asr_model = model)
 amalan kuling tapi kalau boleh aku nak kena buat dulu mandi jah kan semalam tu jah dah habisan ke tengok kita dah mai orang yang kita nak sihat yalah premia dengan awak sho aku suka pergi ya aku suka
[9]:
len(samples)
[9]:
375
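As noted earlier, each streamed sample is a float32 chunk with values in [-1, +1]. A hedged sketch (using zero-filled stand-ins rather than the real stream) of how the 375 chunks could be concatenated back into a single waveform and timed, assuming each chunk holds `segment_length` (2560) samples:

```python
import numpy as np

# Stand-ins for the 375 streamed chunks; in the tutorial these come from
# malaya_speech.streaming.torchaudio.stream_rnnt. We assume each chunk is
# a float32 array of 2560 samples at 16 kHz.
sample_rate = 16000
samples = [np.zeros(2560, dtype=np.float32) for _ in range(375)]

audio = np.concatenate(samples)
duration_seconds = len(audio) / sample_rate
print(duration_seconds)  # 375 * 2560 / 16000 = 60.0
```

Under that assumption, 375 chunks correspond to roughly a minute of streamed audio.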