Speech-to-Text RNNT + MLM
Encoder model + RNNT loss + MLM
This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-model-lm-mlm.
This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
[1]:
import os
# Hide all GPUs so inference runs on CPU; set this before TensorFlow initialises CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
List available RNNT models
[3]:
malaya_speech.stt.available_transducer()
[3]:
Model | Size (MB) | Quantized Size (MB) | WER | CER | WER-LM | CER-LM | Language |
---|---|---|---|---|---|---|---|
tiny-conformer | 24.4 | 9.14 | 0.212811 | 0.081369 | 0.199683 | 0.077004 | [malay] |
small-conformer | 49.2 | 18.1 | 0.198533 | 0.074495 | 0.185361 | 0.071143 | [malay] |
conformer | 125 | 37.1 | 0.163602 | 0.058744 | 0.156182 | 0.05719 | [malay] |
large-conformer | 404 | 107 | 0.156684 | 0.061971 | 0.148622 | 0.05901 | [malay] |
conformer-stack-2mixed | 130 | 38.5 | 0.103608 | 0.050069 | 0.102911 | 0.050201 | [malay, singlish] |
conformer-stack-3mixed | 130 | 38.5 | 0.234768 | 0.133944 | 0.229241 | 0.130702 | [malay, singlish, mandarin] |
small-conformer-singlish | 49.2 | 18.1 | 0.087831 | 0.045686 | 0.087333 | 0.045317 | [singlish] |
conformer-singlish | 125 | 37.1 | 0.077792 | 0.040362 | 0.077186 | 0.03987 | [singlish] |
large-conformer-singlish | 404 | 107 | 0.070147 | 0.035872 | 0.069812 | 0.035723 | [singlish] |
xs-squeezeformer | 51.9 | 23.4 | 0.198092 | 0.079035 | 0.198842 | 0.078122 | [malay] |
sm-squeezeformer | 147 | 47.4 | 0.176127 | 0.068079 | 0.16873 | 0.061468 | [malay] |
m-squeezeformer | 261 | 78.5 | 0.167008 | 0.059728 | 0.156185 | 0.053639 | [malay] |
Lower is better. The mixed models were tested on a different dataset.
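The WER and CER columns are error rates, so one way to read the table is to compute such rates yourself and to compare the plain and `-LM` columns. The block below is a minimal sketch, assuming the standard edit-distance definitions of WER and CER (the evaluation script behind the table is not shown here). `edit_distance`, `wer`, `cer` and `relative_reduction` are illustrative helper names, not part of malaya_speech, and the example sentences are made up.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError('reference must contain at least one word')
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError('reference must not be empty')
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Relative change from `before` to `after`; positive means `after` is lower."""
    return (before - after) / before


# Made-up reference/hypothesis pair: one word is missing from the hypothesis.
print(wer('saya suka makan nasi', 'saya suka nasi'))

# Relative WER change taken from the table above (WER -> WER-LM).
print(relative_reduction(0.212811, 0.199683))  # tiny-conformer
print(relative_reduction(0.198092, 0.198842))  # xs-squeezeformer
```

A positive value means the `-LM` column is lower. For `tiny-conformer` the reduction is roughly 6%, while for `xs-squeezeformer` the WER-LM figure is slightly higher than the plain WER.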
Load RNNT model
def deep_transducer(
    model: str = 'conformer', quantized: bool = False, **kwargs
):
    """
    Load Encoder-Transducer ASR model.

    Parameters
    ----------
    model : str, optional (default='conformer')
        Check available models at `malaya_speech.stt.available_transducer()`.
    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.model.transducer.Transducer class
    """
[4]:
small_model = malaya_speech.stt.deep_transducer(model = 'small-conformer')
2022-09-17 15:11:14.009127: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-17 15:11:14.012540: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-09-17 15:11:14.012561: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2022-09-17 15:11:14.012565: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2022-09-17 15:11:14.012661: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2022-09-17 15:11:14.012682: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3
Load sample
[5]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
shafiqah_idayu, sr = malaya_speech.load('speech/example-speaker/shafiqah-idayu.wav')
mas_aisyah, sr = malaya_speech.load('speech/example-speaker/mas-aisyah.wav')
khalil, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav')
[6]:
import IPython.display as ipd
ipd.Audio(ceramah, rate = sr)
As we can hear, the speaker uses the Kedahan dialect along with some Arabic words; let's see how well our model handles it.
[7]:
ipd.Audio(record1, rate = sr)
[8]:
ipd.Audio(record2, rate = sr)
[9]:
ipd.Audio(shafiqah_idayu, rate = sr)
[10]:
ipd.Audio(mas_aisyah, rate = sr)
[11]:
ipd.Audio(khalil, rate = sr)
Load MLM
To get better performance, you need a really good masked language model (MLM); we are doing our best to release one.
[12]:
language_model = malaya_speech.language_model.mlm(alpha = 0.01, beta = 0.2)
language_model
/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3361
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:202: FutureWarning: Possible nested set at position 3879
self.tok = re.compile(r'({})'.format('|'.join(pipeline)))
[12]:
<malaya_speech.torch_model.mask_lm.LM at 0x7f23015f5f70>
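The `alpha` and `beta` arguments are not described in this tutorial. In pyctcdecode-style shallow fusion, `alpha` usually weights the language model score and `beta` adds a per-word bonus. The sketch below illustrates that convention only; whether the MLM wrapper combines scores in exactly this way is an assumption. `fused_score` is a hypothetical helper, and the input numbers are made up.

```python
def fused_score(
    acoustic_logp: float,
    lm_logp: float,
    num_words: int,
    alpha: float = 0.01,
    beta: float = 0.2,
) -> float:
    """Combine an acoustic score with a weighted LM score and a word-count bonus.

    Assumed convention (not confirmed for malaya_speech):
        score = acoustic_logp + alpha * lm_logp + beta * num_words
    """
    return acoustic_logp + alpha * lm_logp + beta * num_words


# Made-up scores for a four-word hypothesis.
print(fused_score(acoustic_logp=-5.0, lm_logp=-40.0, num_words=4))
```

Under this convention, a larger `alpha` lets the language model influence the ranking more, and a larger `beta` favours hypotheses with more words.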
Predict using beam decoder language model
def beam_decoder_lm(self, inputs, language_model,
                    beam_width: int = 5,
                    token_min_logp: float = -20.0,
                    beam_prune_logp: float = -5.0,
                    temperature: float = 0.0,
                    score_norm: bool = True):
    """
    Transcribe inputs using beam decoder + KenLM.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    language_model: pyctcdecode.language_model.LanguageModel
        pyctcdecode language model, load from `LanguageModel(kenlm_model, alpha = alpha, beta = beta)`.
    beam_width: int, optional (default=5)
        beam size for beam decoder.
    token_min_logp: float, optional (default=-20.0)
        minimum log probability to select a token.
    beam_prune_logp: float, optional (default=-5.0)
        filter candidates >= max score lm + `beam_prune_logp`.
    temperature: float, optional (default=0.0)
        apply temperature function for logits, can help for certain case,
        logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
    score_norm: bool, optional (default=True)
        descending sort beam based on score / length of decoded.

    Returns
    -------
    result: List[str]
    """
[13]:
%%time
small_model.beam_decoder_lm([khalil], language_model, beam_width = 3)
CPU times: user 12min 3s, sys: 339 ms, total: 12min 3s
Wall time: 1min 2s
[13]:
['tolong sebut anti kata']
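The `temperature`, `beam_prune_logp` and `score_norm` options documented for `beam_decoder_lm` can be illustrated with a small standalone sketch. It is one reading of the docstring, not the library's implementation: `add_gumbel_noise`, `prune_beams` and `rank_beams` are hypothetical helpers, "length of decoded" is assumed to mean the number of decoded tokens, and the beams are made up.

```python
import math
import random
from typing import List, Tuple

Beam = Tuple[List[str], float]  # (decoded tokens, log-probability score)


def add_gumbel_noise(logits: List[float], temperature: float,
                     rng: random.Random) -> List[float]:
    """logits += -log(-log(u)) * temperature, with u drawn uniformly from (0, 1)."""
    if temperature == 0.0:
        return list(logits)  # default setting: no noise is added
    noisy = []
    for value in logits:
        # Stay away from 0 and 1 so both logarithms remain finite.
        u = rng.uniform(1e-9, 1.0 - 1e-9)
        noisy.append(value - math.log(-math.log(u)) * temperature)
    return noisy


def prune_beams(beams: List[Beam], beam_prune_logp: float = -5.0) -> List[Beam]:
    """Keep beams whose score is >= best score + beam_prune_logp."""
    best = max(score for _, score in beams)
    return [beam for beam in beams if beam[1] >= best + beam_prune_logp]


def rank_beams(beams: List[Beam], score_norm: bool = True) -> List[Beam]:
    """Sort beams in descending order, optionally by score / decoded length."""
    def key(beam: Beam) -> float:
        tokens, score = beam
        return score / max(len(tokens), 1) if score_norm else score
    return sorted(beams, key=key, reverse=True)


# Made-up beams for illustration only.
beams = [(['tolong'], -1.0), (['tolong', 'sebut'], -1.5), (['kata'], -9.0)]
kept = prune_beams(beams, beam_prune_logp=-5.0)
print(rank_beams(kept, score_norm=True))
print(rank_beams(kept, score_norm=False))
```

With length normalisation, the two-token beam ranks first because its per-token score is higher; without it, the shorter beam with the higher raw score wins.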
The RNNT beam decoder with a language model cannot utilise batch processing; if you feed it a batch, the inputs are processed one by one.
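Since a batch is decoded sequentially anyway, an explicit loop gives the same result and lets you report progress per input. The helper below is a minimal sketch; `decode_one_by_one` is a hypothetical name, and `decode_fn` stands for any callable that takes a list of inputs and returns a list of transcripts.

```python
from typing import Callable, List, Sequence


def decode_one_by_one(
    decode_fn: Callable[[list], List[str]],
    inputs: Sequence,
) -> List[str]:
    """Call `decode_fn` on single-item batches and collect transcripts in order."""
    transcripts: List[str] = []
    for index, audio in enumerate(inputs):
        # Each call receives a batch of size 1, mirroring the sequential behaviour.
        transcripts.extend(decode_fn([audio]))
        print(f'decoded {index + 1}/{len(inputs)}')
    return transcripts
```

With the objects defined earlier, `decode_fn` could be `lambda batch: small_model.beam_decoder_lm(batch, language_model, beam_width = 3)` and `inputs` could be `[record1, record2, khalil]`.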