Speech-to-Text RNNT web inference using Gradio#

Encoder model + RNNT loss web inference using Gradio

This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-gradio.

This module is not language independent, so it is not safe to use it on other languages. The pretrained models were trained on hyperlocal languages.

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline

List available RNNT model#

[2]:
malaya_speech.stt.available_transducer()
[2]:
Model                     Size (MB)  Quantized Size (MB)  WER       CER       WER-LM    CER-LM    Language
tiny-conformer            24.4       9.14                 0.212811  0.081369  0.199683  0.077004  [malay]
small-conformer           49.2       18.1                 0.198533  0.074495  0.185361  0.071143  [malay]
conformer                 125        37.1                 0.163602  0.058744  0.156182  0.05719   [malay]
large-conformer           404        107                  0.156684  0.061971  0.148622  0.05901   [malay]
conformer-stack-2mixed    130        38.5                 0.103608  0.050069  0.102911  0.050201  [malay, singlish]
conformer-stack-3mixed    130        38.5                 0.234768  0.133944  0.229241  0.130702  [malay, singlish, mandarin]
small-conformer-singlish  49.2       18.1                 0.087831  0.045686  0.087333  0.045317  [singlish]
conformer-singlish        125        37.1                 0.077792  0.040362  0.077186  0.03987   [singlish]
large-conformer-singlish  404        107                  0.070147  0.035872  0.069812  0.035723  [singlish]

Lower is better. The mixed models were tested on different datasets, so their scores are not directly comparable to the single-language models.
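Given the table above, one practical way to choose a model is to pick the lowest WER that fits a size budget. A minimal sketch, using a few rows from the table copied into a plain dict (the dict and the 150 MB budget are illustrative, not part of the library API):

```python
# A few (size_mb, wer) rows from the benchmark table above.
models = {
    'tiny-conformer': (24.4, 0.212811),
    'small-conformer': (49.2, 0.198533),
    'conformer': (125, 0.163602),
    'large-conformer': (404, 0.156684),
}

# Pick the most accurate (lowest WER) model within a 150 MB size budget.
budget_mb = 150
best = min(
    (name for name, (size, wer) in models.items() if size <= budget_mb),
    key=lambda name: models[name][1],
)
print(best)  # conformer
```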

Load RNNT model#

def deep_transducer(
    model: str = 'conformer', quantized: bool = False, **kwargs
):
    """
    Load Encoder-Transducer ASR model.

    Parameters
    ----------
    model : str, optional (default='conformer')
        Model architecture supported. Allowed values:

        * ``'tiny-conformer'`` - TINY size Google Conformer.
        * ``'small-conformer'`` - SMALL size Google Conformer.
        * ``'conformer'`` - BASE size Google Conformer.
        * ``'large-conformer'`` - LARGE size Google Conformer.
        * ``'conformer-stack-2mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish) languages.
        * ``'conformer-stack-3mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish + Mandarin) languages.
        * ``'small-conformer-singlish'`` - SMALL size Google Conformer for singlish language.
        * ``'conformer-singlish'`` - BASE size Google Conformer for singlish language.
        * ``'large-conformer-singlish'`` - LARGE size Google Conformer for singlish language.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.Transducer class
    """
[4]:
model = malaya_speech.stt.deep_transducer(model = 'conformer')

web inference using Gradio#

def gradio(self, record_mode: bool = True, **kwargs):
    """
    Transcribe an input using the beam decoder on a Gradio interface.

    Parameters
    ----------
    record_mode: bool, optional (default=True)
        if True, Gradio will use record mode; otherwise, file upload mode.

    **kwargs: additional keyword arguments for the beam decoder and `iface.launch`.
    """

record mode#

[6]:
model.gradio(record_mode = True)
[10]:
from IPython.core.display import Image, display

display(Image('record-mode.png', width=800))
_images/stt-transducer-gradio_13_0.png

upload mode#

[9]:
model.gradio(record_mode = False)
[11]:
from IPython.core.display import Image, display

display(Image('upload-mode.png', width=800))
_images/stt-transducer-gradio_16_0.png