Speech-to-Text RNNT Malay + Singlish + Mandarin#

Encoder model + RNNT loss for Malay + Singlish + Mandarin

This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-model-3mixed.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[ ]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline

List available RNNT model#

[ ]:
malaya_speech.stt.available_transducer()

Lower is better. The test sets are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt

The Malay models were trained on the Malaya Speech dataset, the Singlish models on the IMDA dataset, and the Mandarin models on https://openslr.org/68/ and https://openslr.org/38/

Google Speech-to-Text only supports a single language at a time, so we are not able to compare accuracy.
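"Lower is better" refers to the error-rate metrics reported for each model. As a reference for how word error rate (WER) is computed, here is a minimal sketch; `wer` is an illustrative helper, not part of malaya-speech:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer('saya tak suka mandi', 'saya tak suka makan'))  # 0.25
```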

Load RNNT model#

def deep_transducer(
    model: str = 'conformer', quantized: bool = False, **kwargs
):
    """
    Load Encoder-Transducer ASR model.

    Parameters
    ----------
    model : str, optional (default='conformer')
        Model architecture supported. Allowed values:

        * ``'tiny-conformer'`` - TINY size Google Conformer.
        * ``'small-conformer'`` - SMALL size Google Conformer.
        * ``'conformer'`` - BASE size Google Conformer.
        * ``'large-conformer'`` - LARGE size Google Conformer.
        * ``'conformer-stack-2mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish) languages.
        * ``'conformer-stack-3mixed'`` - BASE size Stacked Google Conformer for (Malay + Singlish + Mandarin) languages.
        * ``'small-conformer-singlish'`` - SMALL size Google Conformer for singlish language.
        * ``'conformer-singlish'`` - BASE size Google Conformer for singlish language.
        * ``'large-conformer-singlish'`` - LARGE size Google Conformer for singlish language.

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.Transducer class
    """
[17]:
model = malaya_speech.stt.deep_transducer(model = 'conformer-stack-3mixed')

Load Quantized deep model#

To load an 8-bit quantized model, simply pass quantized = True; the default is False.

We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it depends entirely on the machine.

[4]:
quantized_model = malaya_speech.stt.deep_transducer(model = 'conformer-stack-3mixed', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
37.0MB [00:05, 6.65MB/s]
1.00MB [00:00, 233MB/s]
1.00MB [00:00, 227MB/s]
1.00MB [00:00, 296MB/s]

Load sample#

[5]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')
singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')
mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')
mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')
mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')
[6]:
import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)
[6]:
[7]:
ipd.Audio(record1, rate = sr)
[7]:
[8]:
ipd.Audio(record2, rate = sr)
[8]:
[9]:
ipd.Audio(singlish0, rate = sr)
[9]:
[10]:
ipd.Audio(singlish1, rate = sr)
[10]:
[11]:
ipd.Audio(singlish2, rate = sr)
[11]:
[12]:
ipd.Audio(mandarin0, rate = sr)
[12]:
[13]:
ipd.Audio(mandarin1, rate = sr)
[13]:
[14]:
ipd.Audio(mandarin2, rate = sr)
[14]:

Predict using greedy decoder#

def greedy_decoder(self, inputs):
    """
    Transcribe inputs using greedy decoder.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].

    Returns
    -------
    result: List[str]
    """
[15]:
%%time

model.greedy_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2,
                      mandarin0, mandarin1, mandarin2])
CPU times: user 13.8 s, sys: 3.86 s, total: 17.6 s
Wall time: 10.3 s
[15]:
['jadi dalam perjalanan ini ini yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah',
 'kalau nama saya musim saya tak suka mandi kata saya masak',
 'hello im sorry sorry so',
 'and then see how they bring and film okay actually',
 'later to your s',
 'seven seven more',
 'gei wo lai ge zhang jie zui xin de ge',
 'wo xiang ting shu ying do',
 'qiu yi shou ge de ming zi ge ci li you zhuan sheng ji meng wang shi son su']
[16]:
%%time

quantized_model.greedy_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2,
                      mandarin0, mandarin1, mandarin2])
CPU times: user 13.2 s, sys: 3.55 s, total: 16.7 s
Wall time: 9.94 s
[16]:
['jadi dalam perjalanan ini yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah',
 'kalau nama saya musim saya tak suka mandi kata saya masak',
 'hello im sorry saya using saya sekemandi saya mandi jp',
 'and then see how they break and film okay actually',
 'later to your as',
 'seven seven more',
 'gei wo lai ge zhang jie zui xin de ge',
 'wo xiang ting shu ying do',
 'qiu yi shou ge de ming zi ge ci li you zhuan sheng di meng wang shi son su']

Predict using beam decoder#

def beam_decoder(self, inputs, beam_width: int = 5,
                 temperature: float = 0.0,
                 score_norm: bool = True):
    """
    Transcribe inputs using beam decoder.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    beam_width: int, optional (default=5)
        beam size for beam decoder.
    temperature: float, optional (default=0.0)
        apply temperature function for logits, can help for certain case,
        logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
    score_norm: bool, optional (default=True)
        descending sort beam based on score / length of decoded.

    Returns
    -------
    result: List[str]
    """
[18]:
%%time

model.beam_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2,
                      mandarin0, mandarin1, mandarin2], beam_width = 5)
CPU times: user 31.3 s, sys: 4.14 s, total: 35.5 s
Wall time: 17.1 s
[18]:
['jadi dalam perjalanan ini yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah',
 'kalau nama saya musim saya tak suka mandi kata saya masak',
 'hello n saya sekemandi saya mandi jeti hari',
 'and then see how they broad and film okay actually',
 'later to you as',
 'seven seven eight more',
 'gei wo lai ge zhang jie zui xin de ge',
 'wo xiang shou kan jiang su ying shi pin dao de jie mu',
 'qiu yi shou ge de ming zi ge ci li you zhuan shen ti meng wang shi qing shi xiu']
[19]:
%%time

quantized_model.beam_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2,
                      mandarin0, mandarin1, mandarin2], beam_width = 5)
CPU times: user 30.8 s, sys: 3.87 s, total: 34.7 s
Wall time: 16.6 s
[19]:
['jadi dalam perjalanan ini yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah',
 'kalau nama saya musim saya tak suka mandi kata saya masak',
 'hello n saya sekemandi saya mandi jeti hari',
 'and then see how they break in film okay actually',
 'later to you as',
 'seven seven eight more',
 'gei wo lai ge zhang jie zui xin de ge',
 'wo xiang shou kan jiang su ying shi pin dao de jie mu',
 'qiu yi shou ge de ming zi ge ci li you zhuan shen ti meng wang shi qing shi xiu']

The RNNT beam decoder cannot utilise batch processing; if fed a batch, it will process the inputs one by one.

Predict alignment#

We want to know when the speakers say certain words, so we can use predict_alignment,

def predict_alignment(self, input, combined = True):
    """
    Transcribe input and get timestamps; only supports greedy decoder.

    Parameters
    ----------
    input: np.array
        np.array or malaya_speech.model.frame.Frame.
    combined: bool, optional (default=True)
        If True, will combine subwords into words.

    Returns
    -------
    result: List[Dict[text, start, end]]
    """
[20]:
%%time

model.predict_alignment(mandarin0)
CPU times: user 6.12 s, sys: 2.21 s, total: 8.34 s
Wall time: 7.64 s
[20]:
[{'text': 'gei', 'start': 2.6, 'end': 2.61},
 {'text': 'wo', 'start': 2.76, 'end': 2.77},
 {'text': 'lai', 'start': 2.88, 'end': 2.89},
 {'text': 'ge', 'start': 3.12, 'end': 3.13},
 {'text': 'zhang', 'start': 3.24, 'end': 3.49},
 {'text': 'jie', 'start': 3.52, 'end': 3.53},
 {'text': 'zui', 'start': 3.76, 'end': 3.77},
 {'text': 'xin', 'start': 3.88, 'end': 3.89},
 {'text': 'de', 'start': 4.0, 'end': 4.01},
 {'text': 'ge', 'start': 4.04, 'end': 4.05}]
[21]:
%%time

model.predict_alignment(mandarin0, combined = False)
CPU times: user 799 ms, sys: 133 ms, total: 933 ms
Wall time: 288 ms
[21]:
[{'text': 'gei', 'start': 2.6, 'end': 2.61},
 {'text': ' ', 'start': 2.72, 'end': 2.73},
 {'text': 'wo_', 'start': 2.76, 'end': 2.77},
 {'text': 'lai', 'start': 2.88, 'end': 2.89},
 {'text': ' ', 'start': 3.08, 'end': 3.09},
 {'text': 'ge_', 'start': 3.12, 'end': 3.13},
 {'text': 'zha', 'start': 3.24, 'end': 3.25},
 {'text': 'ng_', 'start': 3.48, 'end': 3.49},
 {'text': 'jie', 'start': 3.52, 'end': 3.53},
 {'text': ' ', 'start': 3.72, 'end': 3.73},
 {'text': 'zui', 'start': 3.76, 'end': 3.77},
 {'text': ' ', 'start': 3.84, 'end': 3.85},
 {'text': 'xin', 'start': 3.88, 'end': 3.89},
 {'text': ' ', 'start': 3.96, 'end': 3.97},
 {'text': 'de_', 'start': 4.0, 'end': 4.01},
 {'text': 'ge', 'start': 4.04, 'end': 4.05}]
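The alignment output is plain `List[Dict]`, so it is easy to post-process, e.g. into a per-word timeline for subtitling. A minimal sketch using the first few entries of the combined output above (`to_timeline` is a hypothetical helper, not part of malaya-speech):

```python
# First few entries of the combined alignment output shown above.
alignment = [
    {'text': 'gei', 'start': 2.6, 'end': 2.61},
    {'text': 'wo', 'start': 2.76, 'end': 2.77},
    {'text': 'lai', 'start': 2.88, 'end': 2.89},
]

def to_timeline(alignment):
    # One line per word: "start-end<TAB>word", timestamps in seconds.
    return '\n'.join(
        f"{a['start']:.2f}-{a['end']:.2f}\t{a['text']}" for a in alignment
    )

print(to_timeline(alignment))
```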