Speech-to-Text RNNT Singlish#

Encoder model + RNNT loss for the Singlish language, trained on the Singapore National Speech Corpus, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus

This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-model-singlish.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline

List available RNNT model#

[2]:
malaya_speech.stt.available_transducer()
[2]:
model                     Size (MB)  Quantized Size (MB)  WER       CER       WER-LM    CER-LM    Language
tiny-conformer            24.4       9.14                 0.212811  0.081369  0.199683  0.077004  [malay]
small-conformer           49.2       18.1                 0.198533  0.074495  0.185361  0.071143  [malay]
conformer                 125        37.1                 0.163602  0.058744  0.156182  0.05719   [malay]
large-conformer           404        107                  0.156684  0.061971  0.148622  0.05901   [malay]
conformer-stack-2mixed    130        38.5                 0.103608  0.050069  0.102911  0.050201  [malay, singlish]
conformer-stack-3mixed    130        38.5                 0.234768  0.133944  0.229241  0.130702  [malay, singlish, mandarin]
small-conformer-singlish  49.2       18.1                 0.087831  0.045686  0.087333  0.045317  [singlish]
conformer-singlish        125        37.1                 0.077792  0.040362  0.077186  0.03987   [singlish]
large-conformer-singlish  404        107                  0.070147  0.035872  0.069812  0.035723  [singlish]

Lower is better. The mixed models were tested on a different dataset.
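If the listing above is returned as a pandas DataFrame (an assumption here, not confirmed by this tutorial), you can also filter and sort it programmatically, for example to pick the lowest-WER Singlish model:

# a minimal sketch, assuming available_transducer() returns a pandas DataFrame
df = malaya_speech.stt.available_transducer()
singlish_only = df[df['Language'].astype(str).str.contains('singlish')]
# index label of the row with the lowest WER among Singlish-capable models
print(singlish_only['WER'].astype(float).idxmin())  # 'large-conformer-singlish' based on the table above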

Google Speech-to-Text accuracy#

We tested on the same dataset to compare malaya-speech models and Google Speech-to-Text; check the notebook at benchmark-google-speech-singlish-dataset.ipynb.

[15]:
malaya_speech.stt.google_accuracy
[15]:
{'malay': {'WER': 0.164775, 'CER': 0.059732},
 'singlish': {'WER': 0.4941349, 'CER': 0.3026296}}
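As a rough illustration using the figures above, the conformer-singlish model (WER 0.077792 from the table) makes far fewer word errors than Google Speech-to-Text on the Singlish test set:

# illustrative arithmetic only, using the numbers shown above
google_singlish_wer = 0.4941349
conformer_singlish_wer = 0.077792
print(google_singlish_wer / conformer_singlish_wer)  # roughly 6x higher WER for Google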

Load RNNT model#

def deep_transducer(
    model: str = 'conformer', quantized: bool = False, **kwargs
):
    """
    Load Encoder-Transducer ASR model.

    Parameters
    ----------
    model : str, optional (default='conformer')
        Model architecture supported. Allowed values:

        * ``'tiny-conformer'`` - TINY size Google Conformer.
        * ``'small-conformer'`` - SMALL size Google Conformer.
        * ``'conformer'`` - BASE size Google Conformer.
        * ``'large-conformer'`` - LARGE size Google Conformer.
        * ``'conformer-stack-2mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish) languages.
        * ``'conformer-stack-3mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish + Mandarin) languages.
        * ``'small-conformer-singlish'`` - SMALL size Google Conformer for singlish language.
        * ``'conformer-singlish'`` - BASE size Google Conformer for singlish language.
        * ``'large-conformer-singlish'`` - LARGE size Google Conformer for singlish language.

    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.Transducer class
    """
[2]:
model = malaya_speech.stt.deep_transducer(model = 'conformer-singlish')

Load Quantized deep model#

To load an 8-bit quantized model, simply pass quantized = True; the default is False.

We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it depends entirely on the machine.

[3]:
quantized_model = malaya_speech.stt.deep_transducer(model = 'conformer-singlish', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Load sample#

[6]:
singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')
singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')
imda0, sr = malaya_speech.load('speech/imda/221931702.WAV')
imda1, sr = malaya_speech.load('speech/imda/221931727.WAV')
imda2, sr = malaya_speech.load('speech/imda/221931818.WAV')
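malaya_speech.load returns the waveform together with its sampling rate, so a quick sanity check on the loaded samples is straightforward (the loop below is purely illustrative):

# quick sanity check on the loaded audio
for name, y in [('singlish0', singlish0), ('imda0', imda0)]:
    print(name, 'sample rate:', sr, 'duration (s):', len(y) / sr)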
[7]:
import IPython.display as ipd

ipd.Audio(singlish0, rate = sr)
[7]:
[12]:
ipd.Audio(singlish1, rate = sr)
[12]:
[13]:
ipd.Audio(singlish2, rate = sr)
[13]:
[8]:
ipd.Audio(imda0, rate = sr)
[8]:
[10]:
ipd.Audio(imda1, rate = sr)
[10]:
[14]:
ipd.Audio(imda2, rate = sr)
[14]:

Predict#

We can choose:

  1. greedy decoder.

  2. beam decoder; the default beam_size is 5, feel free to change it.

def predict(
    self, inputs, decoder: str = 'greedy', beam_size: int = 5, **kwargs
):
    """
    Transcribe inputs, will return list of strings.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    decoder: str, optional (default='greedy')
        decoder mode, allowed values:

        * ``'greedy'`` - will call self.greedy_decoder
        * ``'beam'`` - will call self.beam_decoder
    beam_size: int, optional (default=5)
        beam size for beam decoder.

    Returns
    -------
    result: List[str]
    """

Greedy decoder#

The greedy decoder can utilize batch processing and is faster than the beam decoder.

def greedy_decoder(self, inputs):
    """
    Transcribe inputs, will return list of strings.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].

    Returns
    -------
    result: List[str]
    """
[11]:
%%time

model.greedy_decoder([singlish0, singlish1, singlish2, imda0, imda1, imda2])
CPU times: user 10.3 s, sys: 2.65 s, total: 12.9 s
Wall time: 9.4 s
[11]:
['and then see how they roll it in film okay actually',
 'then you tap to your eyes',
 'sembawang seven in mal',
 'wantan mee is a traditional local cuisine',
 'saravanan gopinathan george yeo yong boon and tay kheng soon',
 'ahmad khan adelene wee chin suan and robert ibbetson']
[16]:
%%time

quantized_model.greedy_decoder([singlish0, singlish1, singlish2, imda0, imda1, imda2])
CPU times: user 9.92 s, sys: 2.41 s, total: 12.3 s
Wall time: 8.76 s
[16]:
['and then see how they roll it in film okay actually',
 'then you tap to your eyes',
 'sembawang seven in mal',
 'wantan mee is a traditional local cuisine',
 'saravanan gopinathan george yeo yong boon and tay kheng soon',
 'ahmad khan adelene wee chin suan and robert ibbetson']

Beam decoder#

To get better results, use the beam decoder with an optimal beam size.

def beam_decoder(self, inputs, beam_size: int = 5):
    """
    Transcribe inputs, will return list of strings.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    beam_size: int, optional (default=5)
        beam size for beam decoder.

    Returns
    -------
    result: List[str]
    """
[17]:
%%time

model.beam_decoder([singlish0, singlish1, singlish2, imda0, imda1, imda2], beam_size = 3)
CPU times: user 27.1 s, sys: 2.52 s, total: 29.6 s
Wall time: 21 s
[17]:
['and then see how they roll it in film okay actually',
 'okay then you tap to your eyes',
 'sembawang seven in male',
 'wantan mee is a traditional local cuisine',
 'saravanan gopinathan george yeo yong boon and tay kheng soon',
 'ahmad khan adelene wee chin suan and robert ibbetson']

The RNNT beam decoder is not able to utilise batch processing; if fed a batch, it will process the inputs one by one.
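Because of that, decoding a list with beam_decoder scales linearly with the number of inputs; an equivalent sketch that loops explicitly (useful if you want to log progress) is:

# mirrors what beam_decoder does internally when given a batch: one input at a time
beam_results = []
for audio in [singlish0, singlish1, singlish2]:
    beam_results.extend(model.beam_decoder([audio], beam_size = 3))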

Predict alignment#

We want to know when the speaker speaks certain words, so we can use predict_alignment,

def predict_alignment(self, input, combined = True):
    """
    Transcribe the input and get timestamps; only the greedy decoder is supported.

    Parameters
    ----------
    input: np.array
        np.array or malaya_speech.model.frame.Frame.
    combined: bool, optional (default=True)
        If True, subwords will be combined into words.

    Returns
    -------
    result: List[Dict[text, start, end]]
    """
[18]:
%%time

model.predict_alignment(singlish0)
CPU times: user 6.08 s, sys: 2.01 s, total: 8.09 s
Wall time: 7.22 s
[18]:
[{'text': 'and', 'start': 0.2, 'end': 0.21},
 {'text': 'then', 'start': 0.36, 'end': 0.45},
 {'text': 'see', 'start': 0.6, 'end': 0.61},
 {'text': 'how', 'start': 0.88, 'end': 0.89},
 {'text': 'they', 'start': 1.36, 'end': 1.49},
 {'text': 'roll', 'start': 1.96, 'end': 2.09},
 {'text': 'it', 'start': 2.16, 'end': 2.17},
 {'text': 'in', 'start': 2.4, 'end': 2.41},
 {'text': 'film', 'start': 2.6, 'end': 2.85},
 {'text': 'okay', 'start': 3.68, 'end': 3.85},
 {'text': 'actually', 'start': 3.92, 'end': 4.21}]
[19]:
%%time

model.predict_alignment(singlish0, combined = False)
CPU times: user 1.12 s, sys: 200 ms, total: 1.32 s
Wall time: 306 ms
[19]:
[{'text': 'and', 'start': 0.2, 'end': 0.21},
 {'text': ' ', 'start': 0.28, 'end': 0.29},
 {'text': 'the', 'start': 0.36, 'end': 0.37},
 {'text': 'n_', 'start': 0.44, 'end': 0.45},
 {'text': 'see', 'start': 0.6, 'end': 0.61},
 {'text': ' ', 'start': 0.76, 'end': 0.77},
 {'text': 'how', 'start': 0.88, 'end': 0.89},
 {'text': ' ', 'start': 1.08, 'end': 1.09},
 {'text': 'the', 'start': 1.36, 'end': 1.37},
 {'text': 'y_', 'start': 1.48, 'end': 1.49},
 {'text': 'ro', 'start': 1.96, 'end': 1.97},
 {'text': 'll_', 'start': 2.08, 'end': 2.09},
 {'text': 'it_', 'start': 2.2, 'end': 2.21},
 {'text': 'in_', 'start': 2.4, 'end': 2.41},
 {'text': 'fil', 'start': 2.6, 'end': 2.61},
 {'text': 'm_', 'start': 2.84, 'end': 2.85},
 {'text': 'oka', 'start': 3.68, 'end': 3.69},
 {'text': 'y_', 'start': 3.84, 'end': 3.85},
 {'text': 'act', 'start': 3.92, 'end': 3.93},
 {'text': 'ual', 'start': 4.0, 'end': 4.01},
 {'text': 'ly', 'start': 4.2, 'end': 4.21}]
[20]:
%%time

model.predict_alignment(imda0)
CPU times: user 1.05 s, sys: 192 ms, total: 1.25 s
Wall time: 278 ms
[20]:
[{'text': 'wantan', 'start': 0.92, 'end': 1.05},
 {'text': 'mee', 'start': 1.4, 'end': 1.53},
 {'text': 'is', 'start': 1.64, 'end': 1.65},
 {'text': 'a', 'start': 1.84, 'end': 1.85},
 {'text': 'traditional', 'start': 2.08, 'end': 2.69},
 {'text': 'local', 'start': 2.8, 'end': 2.93},
 {'text': 'cuisine', 'start': 3.12, 'end': 3.45}]
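
The alignment output is simply a list of {'text', 'start', 'end'} dicts, so post-processing is easy; for example, a small sketch that prints each word with its timestamps:

# format the word-level alignment returned by predict_alignment
alignment = model.predict_alignment(imda0)
for word in alignment:
    print(f"{word['start']:6.2f}s - {word['end']:6.2f}s  {word['text']}")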