Speech-to-Text RNNT + KenLM#
Encoder model + RNNT loss + KenLM
This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-model-lm.
This module is not language independent, so it is not safe to use on other languages. The pretrained models are trained on hyperlocal languages.
[2]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
[3]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
[4]:
import logging
logging.basicConfig(level=logging.INFO)
List available RNNT model#
[6]:
malaya_speech.stt.transducer.available_transformer()
INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
[6]:
| | Size (MB) | Quantized Size (MB) | malay-malaya | malay-fleur102 | Language | singlish |
|---|---|---|---|---|---|---|
| tiny-conformer | 24.4 | 9.14 | {'WER': 0.2128108, 'CER': 0.08136871, 'WER-LM'... | {'WER': 0.2682816, 'CER': 0.13052725, 'WER-LM'... | [malay] | NaN |
| small-conformer | 49.2 | 18.1 | {'WER': 0.19853302, 'CER': 0.07449528, 'WER-LM... | {'WER': 0.23412149, 'CER': 0.1138314813, 'WER-... | [malay] | NaN |
| conformer | 125 | 37.1 | {'WER': 0.16340855635999124, 'CER': 0.05897205... | {'WER': 0.20090442596, 'CER': 0.09616901, 'WER... | [malay] | NaN |
| large-conformer | 404 | 107 | {'WER': 0.1566839, 'CER': 0.0619715, 'WER-LM':... | {'WER': 0.1711028238, 'CER': 0.077953559, 'WER... | [malay] | NaN |
| conformer-stack-2mixed | 130 | 38.5 | {'WER': 0.1889883954, 'CER': 0.0726845531, 'WE... | {'WER': 0.244836948, 'CER': 0.117409327, 'WER-... | [malay, singlish] | {'WER': 0.08535878149, 'CER': 0.0452357273822,... |
| small-conformer-singlish | 49.2 | 18.1 | NaN | NaN | [singlish] | {'WER': 0.087831, 'CER': 0.0456859, 'WER-LM': ... |
| conformer-singlish | 125 | 37.1 | NaN | NaN | [singlish] | {'WER': 0.07779246, 'CER': 0.0403616, 'WER-LM'... |
| large-conformer-singlish | 404 | 107 | NaN | NaN | [singlish] | {'WER': 0.07014733, 'CER': 0.03587201, 'WER-LM... |
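The WER and CER numbers in the table above are word- and character-level edit distances normalized by reference length. A minimal pure-Python sketch of how these metrics are computed (the example strings are taken from the transcripts later in this tutorial):

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via a single-row dynamic program
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (or match)
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    # word error rate: word-level edit distance / reference word count
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    # character error rate: character-level edit distance / reference length
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer('nama saya syafiqah idayu', 'nama saya syafiqah hidayah'))  # 0.25
```

One substituted word out of four gives a WER of 0.25; the `WER-LM` variants in the table are the same metric computed on the language-model-rescored output.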
Load RNNT model#
def transformer(
model: str = 'conformer',
quantized: bool = False,
**kwargs,
):
"""
Load Encoder-Transducer ASR model.
Parameters
----------
model : str, optional (default='conformer')
Check available models at `malaya_speech.stt.transducer.available_transformer()`.
quantized : bool, optional (default=False)
if True, will load 8-bit quantized model.
Quantized model is not necessarily faster; it totally depends on the machine.
Returns
-------
result : malaya_speech.model.transducer.Transducer class
"""
[3]:
small_model = malaya_speech.stt.transducer.transformer(model = 'small-conformer')
model = malaya_speech.stt.transducer.transformer(model = 'conformer')
Load Quantized deep model#
To load the 8-bit quantized model, simply pass quantized = True; the default is False.
Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.
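The accuracy drop comes from rounding float weights to 256 integer levels. A minimal numpy sketch of symmetric per-tensor int8 quantization (not the actual malaya-speech quantization scheme, just an illustration of the idea):

```python
import numpy as np

# hypothetical float32 weight tensor
w = np.random.default_rng(1).normal(size=1000).astype(np.float32)

scale = np.abs(w).max() / 127.0           # symmetric per-tensor scale
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = q.astype(np.float32) * scale  # values the runtime effectively computes with

max_err = np.abs(w - w_dequant).max()
print(max_err <= scale / 2 + 1e-6)  # rounding error bounded by half a quantization step
```

Storage shrinks roughly 4x (float32 to int8, matching the table's size columns), but whether inference is faster depends on whether the hardware has fast int8 kernels.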
[4]:
quantized_small_model = malaya_speech.stt.transducer.transformer(model = 'small-conformer', quantized = True)
quantized_model = malaya_speech.stt.transducer.transformer(model = 'conformer', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:root:Load quantized model will cause accuracy drop.
Load sample#
[5]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
shafiqah_idayu, sr = malaya_speech.load('speech/example-speaker/shafiqah-idayu.wav')
mas_aisyah, sr = malaya_speech.load('speech/example-speaker/mas-aisyah.wav')
khalil, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav')
[6]:
import IPython.display as ipd
ipd.Audio(ceramah, rate = sr)
[6]:
As we can hear, the speaker speaks in a Kedahan dialect plus some Arabic words; let us see how well our model does.
[7]:
ipd.Audio(record1, rate = sr)
[7]:
[8]:
ipd.Audio(record2, rate = sr)
[8]:
[9]:
ipd.Audio(shafiqah_idayu, rate = sr)
[9]:
[10]:
ipd.Audio(mas_aisyah, rate = sr)
[10]:
[11]:
ipd.Audio(khalil, rate = sr)
[11]:
Install pyctcdecode#
From PYPI#
pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121
From source#
Check https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no wheel available for your operating system.
Building from source should only take a few minutes.
Load pyctcdecode#
I will use dump-combined for this example.
[12]:
import kenlm
from pyctcdecode.language_model import LanguageModel
[13]:
lm = malaya_speech.language_model.kenlm(model = 'dump-combined')
[14]:
kenlm_model = kenlm.Model(lm)
language_model = LanguageModel(kenlm_model, alpha = 0.01, beta = 0.5)
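In pyctcdecode, alpha weights the KenLM log score and beta is a per-word insertion bonus, combined with the acoustic score during beam search (shallow fusion). A toy sketch of that combined score, using made-up numbers rather than real model outputs:

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha=0.01, beta=0.5):
    # acoustic log-prob + weighted LM log-prob + word insertion bonus
    return acoustic_logp + alpha * lm_logp + beta * n_words

# two hypothetical beams with identical acoustic scores but different
# LM scores: the LM term nudges the search toward fluent word sequences
fluent = fused_score(acoustic_logp=-12.3, lm_logp=-20.0, n_words=4)
disfluent = fused_score(acoustic_logp=-12.3, lm_logp=-45.0, n_words=4)
print(fluent > disfluent)  # True
```

A small alpha (0.01 here) keeps the LM as a gentle tie-breaker, while beta counteracts the LM's inherent preference for shorter outputs.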
Predict using beam decoder language model#
def beam_decoder_lm(self, inputs, language_model,
beam_width: int = 5,
token_min_logp: float = -20.0,
beam_prune_logp: float = -5.0,
temperature: float = 0.0,
score_norm: bool = True):
"""
Transcribe inputs using beam decoder + KenLM.
Parameters
----------
inputs: List[np.array]
List[np.array] or List[malaya_speech.model.frame.Frame].
language_model: pyctcdecode.language_model.LanguageModel
pyctcdecode language model, load from `LanguageModel(kenlm_model, alpha = alpha, beta = beta)`.
beam_width: int, optional (default=5)
beam size for beam decoder.
token_min_logp: float, optional (default=-20.0)
minimum log probability to select a token.
beam_prune_logp: float, optional (default=-5.0)
prune candidates whose score falls below the best LM score plus `beam_prune_logp`.
temperature: float, optional (default=0.0)
apply temperature function for logits, can help for certain case,
logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
score_norm: bool, optional (default=True)
sort beams in descending order by score divided by decoded length.
Returns
-------
result: List[str]
"""
[15]:
%%time
small_model.beam_decoder_lm([ceramah, record1, record2, shafiqah_idayu, mas_aisyah, khalil],
language_model)
CPU times: user 25.6 s, sys: 2.13 s, total: 27.7 s
Wall time: 21.7 s
[15]:
['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah maha ini',
'helo nama saya pusing saya tak suka mandi ketat saya masak',
'helo nama saya husin saya suka mandi saya mandi tetek hari',
'nama saya syafiqah hidayah',
'sebut perkataan uncle',
'tolong sebut anti kata']
[16]:
%%time
model.beam_decoder_lm([ceramah, record1, record2, shafiqah_idayu, mas_aisyah, khalil],
language_model)
CPU times: user 33.5 s, sys: 3.34 s, total: 36.8 s
Wall time: 24.7 s
[16]:
['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah maaf ini',
'helo nama saya pusing saya tak suka mandi ke tak saya masam',
'helo nama saya husin saya suka mandi saya mandi tiap tiap hari',
'nama saya syafiqah idayu',
'sebut perkataan angka',
'tolong sebut antika']
The RNNT beam decoder with a language model is not able to utilise batch processing; if fed a batch, it will process the inputs one by one.
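In other words, passing a list is only a convenience; the wall time grows linearly with the number of utterances. A toy sketch of this behaviour, with a stand-in decode function rather than the real model:

```python
def decode_one(audio):
    # stand-in for the per-utterance work beam_decoder_lm does internally
    return f'transcript of {audio}'

def batch_decode(inputs):
    # a "batch" is just processed sequentially, one input at a time
    return [decode_one(x) for x in inputs]

batch = ['ceramah', 'record1', 'record2']
print(batch_decode(batch) == [decode_one(x) for x in batch])  # True
```

So splitting a batch across processes yourself is the only way to get real parallelism with this decoder.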