Speech-to-Text RNNT web inference using Gradio#

Encoder model + RNNT loss web inference using Gradio

This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-gradio.

This module is not language independent, so it is not safe to use it on other languages. The pretrained models were trained on hyperlocal languages.

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline

List available RNNT model#

[2]:
malaya_speech.stt.available_transducer()
[2]:
Model                     Size (MB)  Quantized Size (MB)  WER       CER       WER-LM    CER-LM    Language
tiny-conformer            24.4       9.14                 0.212811  0.081369  0.199683  0.077004  [malay]
small-conformer           49.2       18.1                 0.198533  0.074495  0.185361  0.071143  [malay]
conformer                 125        37.1                 0.163602  0.058744  0.156182  0.05719   [malay]
large-conformer           404        107                  0.156684  0.061971  0.148622  0.05901   [malay]
conformer-stack-2mixed    130        38.5                 0.103608  0.050069  0.102911  0.050201  [malay, singlish]
conformer-stack-3mixed    130        38.5                 0.234768  0.133944  0.229241  0.130702  [malay, singlish, mandarin]
small-conformer-singlish  49.2       18.1                 0.087831  0.045686  0.087333  0.045317  [singlish]
conformer-singlish        125        37.1                 0.077792  0.040362  0.077186  0.03987   [singlish]
large-conformer-singlish  404        107                  0.070147  0.035872  0.069812  0.035723  [singlish]

Lower is better. The mixed models were tested on different datasets, so their scores are not directly comparable to the single-language models.
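Given the table above, one practical way to choose a model is to pick the lowest WER that fits a size budget. A minimal sketch, using a few rows from the table copied into a plain dict (the dict and the 150 MB budget are illustrative, not part of the library API):

```python
# A few (size_mb, wer) rows from the benchmark table above.
models = {
    'tiny-conformer': (24.4, 0.212811),
    'small-conformer': (49.2, 0.198533),
    'conformer': (125, 0.163602),
    'large-conformer': (404, 0.156684),
}

# Pick the most accurate (lowest WER) model within a 150 MB size budget.
budget_mb = 150
best = min(
    (name for name, (size, wer) in models.items() if size <= budget_mb),
    key=lambda name: models[name][1],
)
print(best)  # conformer
```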

Load RNNT model#

def deep_transducer(
    model: str = 'conformer', quantized: bool = False, **kwargs
):
    """
    Load Encoder-Transducer ASR model.

    Parameters
    ----------
    model : str, optional (default='conformer')
        Model architecture supported. Allowed values:

        * ``'tiny-conformer'`` - TINY size Google Conformer.
        * ``'small-conformer'`` - SMALL size Google Conformer.
        * ``'conformer'`` - BASE size Google Conformer.
        * ``'large-conformer'`` - LARGE size Google Conformer.
        * ``'conformer-stack-2mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish) languages.
        * ``'conformer-stack-3mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish + Mandarin) languages.
        * ``'small-conformer-singlish'`` - SMALL size Google Conformer for singlish language.
        * ``'conformer-singlish'`` - BASE size Google Conformer for singlish language.
        * ``'large-conformer-singlish'`` - LARGE size Google Conformer for singlish language.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.Transducer class
    """
[4]:
model = malaya_speech.stt.deep_transducer(model = 'conformer')

web inference using Gradio#

def gradio(self, record_mode: bool = True, **kwargs):
    """
    Transcribe an input using the beam decoder on a Gradio interface.

    Parameters
    ----------
    record_mode: bool, optional (default=True)
        if True, Gradio will use record mode; otherwise, file upload mode.

    **kwargs: additional keyword arguments for the beam decoder and `iface.launch`.
    """

record mode#

[6]:
model.gradio(record_mode = True)
[10]:
from IPython.core.display import Image, display

display(Image('record-mode.png', width=800))
_images/stt-transducer-gradio_13_0.png

upload mode#

[9]:
model.gradio(record_mode = False)
[11]:
from IPython.core.display import Image, display

display(Image('upload-mode.png', width=800))
_images/stt-transducer-gradio_16_0.png