Speech-to-Text RNNT + MLM#

Encoder model + RNNT loss + MLM

This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-model-lm-mlm.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

[1]:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''

[2]:

import malaya_speech
import numpy as np
from malaya_speech import Pipeline

List available RNNT models#

[3]:

malaya_speech.stt.available_transducer()

[3]:

	Size (MB)	Quantized Size (MB)	WER	CER	WER-LM	CER-LM	Language
tiny-conformer	24.4	9.14	0.212811	0.081369	0.199683	0.077004	[malay]
small-conformer	49.2	18.1	0.198533	0.074495	0.185361	0.071143	[malay]
conformer	125	37.1	0.163602	0.058744	0.156182	0.05719	[malay]
large-conformer	404	107	0.156684	0.061971	0.148622	0.05901	[malay]
conformer-stack-2mixed	130	38.5	0.103608	0.050069	0.102911	0.050201	[malay, singlish]
conformer-stack-3mixed	130	38.5	0.234768	0.133944	0.229241	0.130702	[malay, singlish, mandarin]
small-conformer-singlish	49.2	18.1	0.087831	0.045686	0.087333	0.045317	[singlish]
conformer-singlish	125	37.1	0.077792	0.040362	0.077186	0.03987	[singlish]
large-conformer-singlish	404	107	0.070147	0.035872	0.069812	0.035723	[singlish]
xs-squeezeformer	51.9	23.4	0.198092	0.079035	0.198842	0.078122	[malay]
sm-squeezeformer	147	47.4	0.176127	0.068079	0.16873	0.061468	[malay]
m-squeezeformer	261	78.5	0.167008	0.059728	0.156185	0.053639	[malay]

Lower is better. The mixed models were tested on a different dataset.
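To see how much the language model changes each result, the relative WER reduction can be computed from the WER and WER-LM columns. This is a minimal sketch using a few rows copied from the table; the helper name `relative_reduction` is only for illustration and is not part of malaya_speech.

```python
# (WER, WER-LM) pairs copied from the table above.
wer_table = {
    'tiny-conformer': (0.212811, 0.199683),
    'conformer': (0.163602, 0.156182),
    'conformer-singlish': (0.077792, 0.077186),
    'xs-squeezeformer': (0.198092, 0.198842),
}

def relative_reduction(wer, wer_lm):
    """Fraction of the original WER removed by adding the language model."""
    return (wer - wer_lm) / wer

for name, (wer, wer_lm) in wer_table.items():
    # A positive value means WER-LM is lower (better) than plain WER.
    print(f'{name}: {relative_reduction(wer, wer_lm):+.2%}')
```

A negative value, as for xs-squeezeformer, means the WER-LM figure in the table is slightly higher than the plain WER.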

Load RNNT model#

def deep_transducer(
    model: str = 'conformer', quantized: bool = False, **kwargs
):
    """
    Load Encoder-Transducer ASR model.

    Parameters
    ----------
    model : str, optional (default='conformer')
        Check available models at `malaya_speech.stt.available_transducer()`.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.model.transducer.Transducer class
    """

[4]:

small_model = malaya_speech.stt.deep_transducer(model = 'small-conformer')

2022-09-17 15:11:14.009127: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-17 15:11:14.012540: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-09-17 15:11:14.012561: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2022-09-17 15:11:14.012565: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2022-09-17 15:11:14.012661: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2022-09-17 15:11:14.012682: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3

Load sample#

[5]:

ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
shafiqah_idayu, sr = malaya_speech.load('speech/example-speaker/shafiqah-idayu.wav')
mas_aisyah, sr = malaya_speech.load('speech/example-speaker/mas-aisyah.wav')
khalil, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav')

[6]:

import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)

As we can hear, the speaker uses the Kedahan dialect mixed with some Arabic words. Let's see how well our model handles it.

[7]:

ipd.Audio(record1, rate = sr)

[8]:

ipd.Audio(record2, rate = sr)

[9]:

ipd.Audio(shafiqah_idayu, rate = sr)

[10]:

ipd.Audio(mas_aisyah, rate = sr)

[11]:

ipd.Audio(khalil, rate = sr)

Load MLM#

To get better performance, you need a really good masked language model. We are doing our best to release one.

[12]:

language_model = malaya_speech.language_model.mlm(alpha = 0.01, beta = 0.2)
language_model

/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879
  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))

[12]:

<malaya_speech.torch_model.mask_lm.LM at 0x7f23015f5f70>
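The tutorial does not spell out what `alpha` and `beta` do. The sketch below assumes they play the roles they usually have in shallow fusion (as in pyctcdecode): `alpha` weights the language model score and `beta` rewards each decoded word. The function `fused_score` and the numbers are purely illustrative, and the actual scoring inside `malaya_speech.language_model.mlm` may differ.

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha=0.01, beta=0.2):
    """Hypothetical shallow-fusion score: acoustic + alpha * LM + beta * word count."""
    return acoustic_logp + alpha * lm_logp + beta * n_words

# Illustrative hypotheses: (acoustic log-prob, LM log-prob, number of words).
# These values are invented for the example, not produced by any model.
hypotheses = {
    'hypothesis_a': (-4.0, -35.0, 4),
    'hypothesis_b': (-4.2, -20.0, 3),
}

for name, (acoustic_logp, lm_logp, n_words) in hypotheses.items():
    print(name, round(fused_score(acoustic_logp, lm_logp, n_words), 3))
```

Under this assumption, a small `alpha` such as 0.01 lets the language model nudge the ranking without overriding the acoustic score.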

Predict using beam decoder language model#

def beam_decoder_lm(self, inputs, language_model,
                    beam_width: int = 5,
                    token_min_logp: float = -20.0,
                    beam_prune_logp: float = -5.0,
                    temperature: float = 0.0,
                    score_norm: bool = True):
    """
    Transcribe inputs using beam decoder + KenLM.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    language_model: pyctcdecode.language_model.LanguageModel
        pyctcdecode language model, load from `LanguageModel(kenlm_model, alpha = alpha, beta = beta)`.
    beam_width: int, optional (default=5)
        beam size for beam decoder.
    token_min_logp: float, optional (default=-20.0)
        minimum log probability to select a token.
    beam_prune_logp: float, optional (default=-5.0)
        filter candidates >= max score lm + `beam_prune_logp`.
    temperature: float, optional (default=0.0)
        apply temperature function for logits, can help for certain case,
        logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
    score_norm: bool, optional (default=True)
        descending sort beam based on score / length of decoded.

    Returns
    -------
    result: List[str]
    """

[13]:

%%time

small_model.beam_decoder_lm([khalil], language_model, beam_width = 3)

CPU times: user 12min 3s, sys: 339 ms, total: 12min 3s
Wall time: 1min 2s

[13]:

['tolong sebut anti kata']
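The `%%time` output reports far more CPU time than wall time. Dividing one by the other gives the average number of CPU threads kept busy during decoding. A quick calculation from the figures printed above (`to_seconds` is just an illustrative helper):

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair from the %%time output to seconds."""
    return minutes * 60 + seconds

cpu_seconds = to_seconds(12, 3)   # CPU times: total 12min 3s
wall_seconds = to_seconds(1, 2)   # Wall time: 1min 2s

# A ratio above 1 means the work was spread across several CPU threads.
average_busy_threads = cpu_seconds / wall_seconds
print(f'{cpu_seconds} s CPU / {wall_seconds} s wall = {average_busy_threads:.1f} threads on average')
```

So even with the work parallelised over many threads, decoding this one clip with the masked language model took about a minute on CPU.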

The RNNT beam decoder with a language model cannot utilise batch processing; if you feed it a batch, it processes the inputs one by one.
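Because inputs are decoded one at a time anyway, looping over them yourself costs nothing and lets you report progress on long jobs. A minimal sketch with a stand-in transcriber so it runs without a model; `transcribe_sequentially` and `fake_transcribe` are hypothetical names, and the commented lambda shows how the real call from this tutorial would plug in.

```python
import time

def transcribe_sequentially(inputs, transcribe_one):
    """Call `transcribe_one` on each input, printing progress and elapsed time."""
    results = []
    for index, audio in enumerate(inputs, start=1):
        started = time.perf_counter()
        results.append(transcribe_one(audio))
        elapsed = time.perf_counter() - started
        print(f'{index}/{len(inputs)} done in {elapsed:.2f} s')
    return results

# Stand-in transcriber so the sketch is runnable on its own. With the objects
# loaded in this tutorial, the callable would instead be:
#   lambda audio: small_model.beam_decoder_lm([audio], language_model, beam_width = 3)[0]
def fake_transcribe(audio):
    return f'<{len(audio)} samples>'

print(transcribe_sequentially([[0.0] * 3, [0.0] * 5], fake_transcribe))
```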
