Speech-to-Text RNNT + MLM

Encoder model + RNNT loss + MLM

This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-model-lm-mlm.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
import malaya_speech
import numpy as np
from malaya_speech import Pipeline

List available RNNT models

Model                     Size (MB)  Quantized Size (MB)  WER       CER       WER-LM    CER-LM    Language
tiny-conformer            24.4       9.14                 0.212811  0.081369  0.199683  0.077004  [malay]
small-conformer           49.2       18.1                 0.198533  0.074495  0.185361  0.071143  [malay]
conformer                 125        37.1                 0.163602  0.058744  0.156182  0.05719   [malay]
large-conformer           404        107                  0.156684  0.061971  0.148622  0.05901   [malay]
conformer-stack-2mixed    130        38.5                 0.103608  0.050069  0.102911  0.050201  [malay, singlish]
conformer-stack-3mixed    130        38.5                 0.234768  0.133944  0.229241  0.130702  [malay, singlish, mandarin]
small-conformer-singlish  49.2       18.1                 0.087831  0.045686  0.087333  0.045317  [singlish]
conformer-singlish        125        37.1                 0.077792  0.040362  0.077186  0.03987   [singlish]
large-conformer-singlish  404        107                  0.070147  0.035872  0.069812  0.035723  [singlish]
xs-squeezeformer          51.9       23.4                 0.198092  0.079035  0.198842  0.078122  [malay]
sm-squeezeformer          147        47.4                 0.176127  0.068079  0.16873   0.061468  [malay]
m-squeezeformer           261        78.5                 0.167008  0.059728  0.156185  0.053639  [malay]

For WER and CER, lower is better. The mixed models were tested on a different dataset.
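For readers unfamiliar with the metrics in the table, here is a minimal sketch of how WER and CER are commonly computed (edit distance divided by reference length), plus the relative WER reduction obtained with the language model. The function names and the reference/hypothesis strings are illustrative assumptions, not part of malaya-speech, and the library's own evaluation script may differ in normalisation details.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped."""
    return (before - after) / before


# Made-up strings, only to exercise the functions.
print(wer('saya suka makan nasi', 'saya suka makan'))
print(cer('saya suka makan nasi', 'saya suka makan'))

# 'conformer' row of the table: WER 0.163602 -> WER-LM 0.156182.
print(relative_reduction(0.163602, 0.156182))
```

The last line expresses the gain from the language model as a fraction of the original error, which is easier to compare across models of different sizes than the absolute difference.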

Load RNNT model

def deep_transducer(
    model: str = 'conformer', quantized: bool = False, **kwargs
):
    """
    Load Encoder-Transducer ASR model.

    Parameters
    ----------
    model : str, optional (default='conformer')
        Check available models at `malaya_speech.stt.available_transducer()`.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.model.transducer.Transducer class
    """

small_model = malaya_speech.stt.deep_transducer(model = 'small-conformer')

Load sample

ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
shafiqah_idayu, sr = malaya_speech.load('speech/example-speaker/shafiqah-idayu.wav')
mas_aisyah, sr = malaya_speech.load('speech/example-speaker/mas-aisyah.wav')
khalil, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav')
import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)

As we can hear, the speaker uses the Kedahan dialect along with some Arabic words. Let's see how well our model handles it.

ipd.Audio(record1, rate = sr)
ipd.Audio(record2, rate = sr)
ipd.Audio(shafiqah_idayu, rate = sr)
ipd.Audio(mas_aisyah, rate = sr)
ipd.Audio(khalil, rate = sr)

Load MLM

To get better performance, you need a really good masked language model. We are doing our best to release one.

language_model = malaya_speech.language_model.mlm(alpha = 0.01, beta = 0.2)
<malaya_speech.torch_model.mask_lm.LM at 0x7f23015f5f70>
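The text does not spell out how `alpha` and `beta` enter the decoder score. The sketch below assumes the shallow-fusion convention used by pyctcdecode-style decoders, where `alpha` weights the language model log-probability and `beta` is a per-word bonus. This is an assumption for illustration only; `rescore` is a hypothetical helper and the scores are made-up numbers.

```python
def rescore(acoustic_logp, lm_logp, n_words, alpha=0.01, beta=0.2):
    """Combine acoustic and language-model scores (assumed convention).

    alpha scales the language-model log-probability; beta rewards each
    word so that longer hypotheses are not unfairly penalised.
    """
    return acoustic_logp + alpha * lm_logp + beta * n_words


# Two made-up hypotheses: (text, acoustic log-prob, LM log-prob).
candidates = [
    ('tolong sebut anti kata', -4.0, -30.0),
    ('tolong sebut anti', -3.9, -55.0),
]

for text, acoustic_logp, lm_logp in candidates:
    score = rescore(acoustic_logp, lm_logp, n_words=len(text.split()))
    print(f'{score:.3f}  {text}')
```

With a small `alpha` such as 0.01 the language model only nudges the ranking, which is a cautious choice when the language model itself is still being improved.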

Predict using beam decoder language model

def beam_decoder_lm(self, inputs, language_model,
                    beam_width: int = 5,
                    token_min_logp: float = -20.0,
                    beam_prune_logp: float = -5.0,
                    temperature: float = 0.0,
                    score_norm: bool = True):
    """
    Transcribe inputs using beam decoder + KenLM.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    language_model: pyctcdecode.language_model.LanguageModel
        pyctcdecode language model, load from `LanguageModel(kenlm_model, alpha = alpha, beta = beta)`.
    beam_width: int, optional (default=5)
        beam size for beam decoder.
    token_min_logp: float, optional (default=-20.0)
        minimum log probability to select a token.
    beam_prune_logp: float, optional (default=-5.0)
        filter candidates >= max score lm + `beam_prune_logp`.
    temperature: float, optional (default=0.0)
        apply temperature function for logits, can help for certain case,
        logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
    score_norm: bool, optional (default=True)
        descending sort beam based on score / length of decoded.

    Returns
    -------
    result: List[str]
    """
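
To make the decoder parameters documented above more concrete, here is a minimal pure-Python sketch of what each one describes: token filtering (`token_min_logp`), beam pruning (`beam_prune_logp`), the Gumbel-style temperature noise, and length-normalised ranking (`score_norm`). The helper names and the toy beams are hypothetical and this is not the malaya-speech implementation, only an illustration of the documented formulas.

```python
import math
import random


def keep_tokens(token_logps, token_min_logp=-20.0):
    """Drop tokens whose log probability is below token_min_logp."""
    return {t: lp for t, lp in token_logps.items() if lp >= token_min_logp}


def prune_beams(beams, beam_prune_logp=-5.0):
    """Keep beams scoring >= best score + beam_prune_logp."""
    best = max(score for _, score in beams)
    return [(toks, s) for toks, s in beams if s >= best + beam_prune_logp]


def add_gumbel_noise(logits, temperature, rng):
    """logits += -log(-log(u)) * temperature, with u ~ Uniform(0, 1)."""
    if temperature == 0.0:
        return list(logits)  # default: no noise is added
    noisy = []
    for x in logits:
        u = rng.uniform(1e-9, 1.0 - 1e-9)  # keep u strictly inside (0, 1)
        noisy.append(x + -math.log(-math.log(u)) * temperature)
    return noisy


def rank_beams(beams, beam_width=5, score_norm=True):
    """Sort beams in descending order, optionally by score / length."""
    def key(beam):
        tokens, score = beam
        return score / max(len(tokens), 1) if score_norm else score
    return sorted(beams, key=key, reverse=True)[:beam_width]


# Toy beams: (decoded tokens, cumulative log score).
beams = [(['a', 'b'], -2.0), (['a', 'b', 'c', 'd'], -3.0), (['x'], -9.0)]
kept = prune_beams(beams)  # the -9.0 beam falls below -2.0 + -5.0
print(rank_beams(kept, score_norm=True))
print(rank_beams(kept, score_norm=False))
print(add_gumbel_noise([0.1, 0.2], 0.5, random.Random(0)))
```

With `score_norm=True` the longer beam ranks first because its per-token score is higher, while the raw score favours the shorter beam. This is the usual reason to normalise by length.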

small_model.beam_decoder_lm([khalil], language_model, beam_width = 3)
CPU times: user 12min 3s, sys: 339 ms, total: 12min 3s
Wall time: 1min 2s
['tolong sebut anti kata']

The RNNT beam decoder with a language model cannot utilise batch processing; if you feed it a batch, it processes the inputs one by one.
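
Because a batch is decoded sequentially anyway, an explicit loop costs nothing extra and lets you report progress per file. The sketch below uses a hypothetical stand-in class, `EchoTransducer`, so that it runs without the library; with a real model you would pass the object returned by `malaya_speech.stt.deep_transducer` and the loaded language model instead. `transcribe_each` is an illustrative helper, not a malaya-speech function.

```python
class EchoTransducer:
    """Hypothetical stand-in exposing the same method name as the real model."""

    def beam_decoder_lm(self, inputs, language_model, **kwargs):
        return [f'transcript of {len(x)} samples' for x in inputs]


def transcribe_each(model, language_model, audios, **decoder_kwargs):
    """Decode one input at a time, printing progress after each item."""
    results = []
    for index, audio in enumerate(audios, start=1):
        text = model.beam_decoder_lm([audio], language_model, **decoder_kwargs)[0]
        print(f'[{index}/{len(audios)}] done')
        results.append(text)
    return results


# Fake audio arrays of 3 and 5 samples, only to exercise the loop.
outputs = transcribe_each(EchoTransducer(), None, [[0.0] * 3, [0.0] * 5], beam_width=3)
print(outputs)
```

Total decoding time should therefore grow roughly linearly with the number of inputs, so plan long runs accordingly.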