Speech-to-Text RNNT + MLM#

Encoder model + RNNT loss + MLM

This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-model-lm-mlm.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.
[3]:
import logging

logging.basicConfig(level=logging.INFO)

List available RNNT model#

[4]:
malaya_speech.stt.transducer.available_transformer()
INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
[4]:
| Model | Size (MB) | Quantized Size (MB) | malay-malaya | malay-fleur102 | Language | singlish |
|---|---|---|---|---|---|---|
| tiny-conformer | 24.4 | 9.14 | {'WER': 0.2128108, 'CER': 0.08136871, 'WER-LM'... | {'WER': 0.2682816, 'CER': 0.13052725, 'WER-LM'... | [malay] | NaN |
| small-conformer | 49.2 | 18.1 | {'WER': 0.19853302, 'CER': 0.07449528, 'WER-LM... | {'WER': 0.23412149, 'CER': 0.1138314813, 'WER-... | [malay] | NaN |
| conformer | 125 | 37.1 | {'WER': 0.16340855635999124, 'CER': 0.05897205... | {'WER': 0.20090442596, 'CER': 0.09616901, 'WER... | [malay] | NaN |
| large-conformer | 404 | 107 | {'WER': 0.1566839, 'CER': 0.0619715, 'WER-LM':... | {'WER': 0.1711028238, 'CER': 0.077953559, 'WER... | [malay] | NaN |
| conformer-stack-2mixed | 130 | 38.5 | {'WER': 0.1889883954, 'CER': 0.0726845531, 'WE... | {'WER': 0.244836948, 'CER': 0.117409327, 'WER-... | [malay, singlish] | {'WER': 0.08535878149, 'CER': 0.0452357273822,... |
| small-conformer-singlish | 49.2 | 18.1 | NaN | NaN | [singlish] | {'WER': 0.087831, 'CER': 0.0456859, 'WER-LM': ... |
| conformer-singlish | 125 | 37.1 | NaN | NaN | [singlish] | {'WER': 0.07779246, 'CER': 0.0403616, 'WER-LM'... |
| large-conformer-singlish | 404 | 107 | NaN | NaN | [singlish] | {'WER': 0.07014733, 'CER': 0.03587201, 'WER-LM... |
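
The WER and CER figures in the table are word and character error rates: the edit distance between the reference transcript and the model output, divided by the reference length. The sketch below shows one common way to compute them; it is not the evaluation script used to produce the table, and the example sentences are made up for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev holds the distances for the previous reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError('reference must contain at least one word')
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError('reference must not be empty')
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical reference and hypothesis: one word is missing from four.
print(wer('saya suka makan nasi', 'saya suka nasi'))  # 0.25
```

Lower is better for both metrics, so a WER of 0.16 means roughly 16 word-level edits per 100 reference words.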

Load RNNT model#

def transformer(
    model: str = 'conformer',
    quantized: bool = False,
    **kwargs,
):
    """
    Load Encoder-Transducer ASR model.

    Parameters
    ----------
    model : str, optional (default='conformer')
        Check available models at `malaya_speech.stt.transducer.available_transformer()`.
    quantized : bool, optional (default=False)
        If True, load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends on the machine.

    Returns
    -------
    result : malaya_speech.model.transducer.Transducer class
    """
[5]:
small_model = malaya_speech.stt.transducer.transformer(model = 'small-conformer')
2023-02-01 11:51:27.598482: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-01 11:51:27.606945: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-02-01 11:51:27.606962: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2023-02-01 11:51:27.606966: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2023-02-01 11:51:27.607054: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2023-02-01 11:51:27.607074: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.161.3
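
Because the docstring warns that a quantized model may or may not be faster on a given machine, it can be worth measuring both variants yourself. The helper below is a minimal, generic timing sketch using only the standard library; `median_runtime` and the names in the commented usage (`fp32_model`, `int8_model`, `audio`) are hypothetical and not part of malaya-speech.

```python
import time
from statistics import median
from typing import Callable, List


def median_runtime(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the median wall-clock time in seconds over `repeats` calls of fn()."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return median(timings)


# Illustrative usage, assuming an audio array and a language model are already loaded:
# fp32_model = malaya_speech.stt.transducer.transformer(model = 'small-conformer')
# int8_model = malaya_speech.stt.transducer.transformer(model = 'small-conformer', quantized = True)
# for name, m in [('fp32', fp32_model), ('int8', int8_model)]:
#     seconds = median_runtime(lambda: m.beam_decoder_lm([audio], language_model, beam_width = 3))
#     print(name, seconds)
```

Keep `repeats` small here, since a single beam-decoding call with a language model can take a minute or more on CPU.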

Load sample#

[5]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
shafiqah_idayu, sr = malaya_speech.load('speech/example-speaker/shafiqah-idayu.wav')
mas_aisyah, sr = malaya_speech.load('speech/example-speaker/mas-aisyah.wav')
khalil, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav')
[6]:
import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)
[6]:

As we can hear, the speaker uses a Kedahan dialect mixed with some Arabic words. Let's see how well our model handles it.

[7]:
ipd.Audio(record1, rate = sr)
[7]:
[8]:
ipd.Audio(record2, rate = sr)
[8]:
[9]:
ipd.Audio(shafiqah_idayu, rate = sr)
[9]:
[10]:
ipd.Audio(mas_aisyah, rate = sr)
[10]:
[11]:
ipd.Audio(khalil, rate = sr)
[11]:

Load MLM#

To get better performance, you need a really good masked language model. We are doing our best to release one.

[12]:
language_model = malaya_speech.language_model.mlm(alpha = 0.01, beta = 0.2)
language_model
/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
[12]:
<malaya_speech.torch_model.mask_lm.LM at 0x7f23015f5f70>
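
The `alpha` and `beta` arguments are not documented in this tutorial. The sketch below assumes they play the same roles as in pyctcdecode-style shallow fusion, where `alpha` weights the language model score and `beta` adds a bonus per decoded word; the masked LM in malaya-speech may combine them differently. The function name, candidate sentences and scores are all made up for illustration.

```python
def fused_score(acoustic_logp: float, lm_logp: float, num_words: int,
                alpha: float = 0.01, beta: float = 0.2) -> float:
    """Assumed shallow-fusion score: acoustic + alpha * LM + beta * word count."""
    return acoustic_logp + alpha * lm_logp + beta * num_words


# Hypothetical candidates: (text, acoustic log-probability, LM log-probability).
candidates = [
    ('saya suka makan nasi', -4.0, -30.0),
    ('saya suka makan nasik', -3.9, -60.0),
]

# Rank candidates by the fused score, best first.
ranked = sorted(
    candidates,
    key=lambda c: fused_score(c[1], c[2], len(c[0].split())),
    reverse=True,
)
print(ranked[0][0])
```

Under these assumptions, a larger `alpha` lets the language model overrule the acoustic model more often, while a larger `beta` favours transcripts with more words.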

Predict using beam decoder language model#

def beam_decoder_lm(self, inputs, language_model,
                    beam_width: int = 5,
                    token_min_logp: float = -20.0,
                    beam_prune_logp: float = -5.0,
                    temperature: float = 0.0,
                    score_norm: bool = True):
    """
    Transcribe inputs using beam decoder + KenLM.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    language_model: pyctcdecode.language_model.LanguageModel
        pyctcdecode language model, load from `LanguageModel(kenlm_model, alpha = alpha, beta = beta)`.
    beam_width: int, optional (default=5)
        beam size for beam decoder.
    token_min_logp: float, optional (default=-20.0)
        minimum log probability to select a token.
    beam_prune_logp: float, optional (default=-5.0)
        filter candidates >= max score lm + `beam_prune_logp`.
    temperature: float, optional (default=0.0)
        apply temperature function for logits, can help for certain case,
        logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
    score_norm: bool, optional (default=True)
        descending sort beam based on score / length of decoded.

    Returns
    -------
    result: List[str]
    """
[13]:
%%time

small_model.beam_decoder_lm([khalil], language_model, beam_width = 3)
CPU times: user 12min 3s, sys: 339 ms, total: 12min 3s
Wall time: 1min 2s
[13]:
['tolong sebut anti kata']
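
The `temperature` formula in the docstring, `-np.log(-np.log(u)) * temperature` with uniform `u`, is the usual way to draw Gumbel noise, so a non-zero temperature randomly perturbs the logits before decoding. The sketch below reproduces that formula with the standard library only; `add_gumbel_noise` is a hypothetical helper, not the library's implementation.

```python
import math
import random
from typing import List, Optional


def add_gumbel_noise(logits: List[float], temperature: float,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Return a copy of `logits` perturbed by Gumbel noise scaled by `temperature`."""
    if temperature == 0.0:
        # The default temperature of 0.0 leaves the logits unchanged.
        return list(logits)
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp away from 0 so log(u) is defined; random() is always below 1.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append(logit + gumbel * temperature)
    return noisy


logits = [2.0, 0.5, -1.0]
print(add_gumbel_noise(logits, temperature=0.0))  # [2.0, 0.5, -1.0]
print(add_gumbel_noise(logits, temperature=0.5, rng=random.Random(0)))
```

Higher temperatures make lower-scoring tokens more likely to outrank the top token, which adds diversity at the cost of determinism.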

The RNNT beam decoder with a language model cannot use batch processing; if you feed it a batch, it processes the inputs one by one.
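
The `beam_prune_logp` and `score_norm` parameters in the `beam_decoder_lm` docstring can be illustrated with a small standalone sketch: drop beams that fall too far below the best score, then rank the rest by score divided by length. This follows the docstring's wording only; `prune_and_rank` is a hypothetical helper, the beams are made up, and "length" is assumed to mean the number of decoded tokens.

```python
from typing import List, Tuple

Beam = Tuple[List[str], float]  # (decoded tokens, total log-probability score)


def prune_and_rank(beams: List[Beam], beam_width: int = 5,
                   beam_prune_logp: float = -5.0,
                   score_norm: bool = True) -> List[Beam]:
    """Keep beams within `beam_prune_logp` of the best score, then sort descending."""
    if not beams:
        return []
    best = max(score for _, score in beams)
    # beam_prune_logp is negative, so this keeps scores >= best - |beam_prune_logp|.
    kept = [beam for beam in beams if beam[1] >= best + beam_prune_logp]

    def sort_key(beam: Beam) -> float:
        tokens, score = beam
        # With score_norm, divide by length so longer outputs are not penalised.
        return score / max(len(tokens), 1) if score_norm else score

    return sorted(kept, key=sort_key, reverse=True)[:beam_width]


# Hypothetical beams for illustration.
beams = [(['a'], -2.0), (['a', 'b', 'c'], -3.0), (['x'], -9.0)]
print(prune_and_rank(beams))                    # longer beam ranks first
print(prune_and_rank(beams, score_norm=False))  # raw score order
```

Without normalisation, summed log-probabilities favour short outputs because every extra token adds a negative term; dividing by length compensates for that.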