Speech-to-Text RNNT + GPT2
Encoder model + RNNT loss + GPT2
This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-model-lm-gpt2.
This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.
[3]:
import logging
logging.basicConfig(level=logging.INFO)
List available RNNT model#
[4]:
malaya_speech.stt.transducer.available_transformer()
INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
[4]:
Model | Size (MB) | Quantized Size (MB) | malay-malaya | malay-fleur102 | Language | singlish |
---|---|---|---|---|---|---|
tiny-conformer | 24.4 | 9.14 | {'WER': 0.2128108, 'CER': 0.08136871, 'WER-LM'... | {'WER': 0.2682816, 'CER': 0.13052725, 'WER-LM'... | [malay] | NaN |
small-conformer | 49.2 | 18.1 | {'WER': 0.19853302, 'CER': 0.07449528, 'WER-LM... | {'WER': 0.23412149, 'CER': 0.1138314813, 'WER-... | [malay] | NaN |
conformer | 125 | 37.1 | {'WER': 0.16340855635999124, 'CER': 0.05897205... | {'WER': 0.20090442596, 'CER': 0.09616901, 'WER... | [malay] | NaN |
large-conformer | 404 | 107 | {'WER': 0.1566839, 'CER': 0.0619715, 'WER-LM':... | {'WER': 0.1711028238, 'CER': 0.077953559, 'WER... | [malay] | NaN |
conformer-stack-2mixed | 130 | 38.5 | {'WER': 0.1889883954, 'CER': 0.0726845531, 'WE... | {'WER': 0.244836948, 'CER': 0.117409327, 'WER-... | [malay, singlish] | {'WER': 0.08535878149, 'CER': 0.0452357273822,... |
small-conformer-singlish | 49.2 | 18.1 | NaN | NaN | [singlish] | {'WER': 0.087831, 'CER': 0.0456859, 'WER-LM': ... |
conformer-singlish | 125 | 37.1 | NaN | NaN | [singlish] | {'WER': 0.07779246, 'CER': 0.0403616, 'WER-LM'... |
large-conformer-singlish | 404 | 107 | NaN | NaN | [singlish] | {'WER': 0.07014733, 'CER': 0.03587201, 'WER-LM... |
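The `WER` and `CER` entries in the table are word and character error rates. As a rough illustration of what those numbers measure, here is a minimal sketch that computes both from the Levenshtein edit distance. The sentence pair is made up for the example, and this is not the evaluation script used by malaya-speech.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical reference/hypothesis pair, for illustration only.
reference = 'saya suka makan nasi'
hypothesis = 'saya suka makan nasik'
print(wer(reference, hypothesis))
print(cer(reference, hypothesis))
```

One wrong word out of four gives a much higher WER than the single extra character gives in CER, which is why the table's CER values are consistently lower than the WER values.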
Load RNNT model#
def transformer(
    model: str = 'conformer',
    quantized: bool = False,
    **kwargs,
):
    """
    Load Encoder-Transducer ASR model.

    Parameters
    ----------
    model : str, optional (default='conformer')
        Check available models at `malaya_speech.stt.transducer.available_transformer()`.
    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.model.transducer.Transducer class
    """
[4]:
small_model = malaya_speech.stt.transducer.transformer(model = 'small-conformer')
2022-09-15 13:15:52.398400: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-15 13:15:52.402512: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-09-15 13:15:52.402530: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2022-09-15 13:15:52.402533: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2022-09-15 13:15:52.402613: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2022-09-15 13:15:52.402633: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.141.3
Load sample#
[5]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
shafiqah_idayu, sr = malaya_speech.load('speech/example-speaker/shafiqah-idayu.wav')
mas_aisyah, sr = malaya_speech.load('speech/example-speaker/mas-aisyah.wav')
khalil, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav')
[6]:
import IPython.display as ipd
ipd.Audio(ceramah, rate = sr)
[6]:
As we can hear, the speaker uses the Kedahan dialect with some Arabic words mixed in. Let's see how well our model handles it.
[7]:
ipd.Audio(record1, rate = sr)
[7]:
[8]:
ipd.Audio(record2, rate = sr)
[8]:
[9]:
ipd.Audio(shafiqah_idayu, rate = sr)
[9]:
[10]:
ipd.Audio(mas_aisyah, rate = sr)
[10]:
[11]:
ipd.Audio(khalil, rate = sr)
[11]:
Load GPT2#
To get better performance, you need a really good GPT2 model. We are doing our best to release one.
[12]:
language_model = malaya_speech.language_model.gpt2(alpha = 0.01, beta = 0.2)
language_model
[12]:
<malaya_speech.torch_model.gpt2_lm.LM at 0x7f78a84d1610>
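The `alpha` and `beta` arguments follow the pyctcdecode naming, where `alpha` weights the language-model log-probability and `beta` adds a per-word bonus. Below is a minimal sketch of that kind of shallow-fusion score, assuming the GPT2 wrapper combines the terms in the same way. The function name and all numbers are hypothetical; this is not the malaya-speech implementation.

```python
def fused_score(acoustic_logp, lm_logp, num_words, alpha=0.01, beta=0.2):
    """Combine transducer and language-model scores for one hypothesis.

    Assumed form: acoustic + alpha * lm + beta * num_words.
    """
    return acoustic_logp + alpha * lm_logp + beta * num_words


# Hypothetical numbers: hypothesis b has the better acoustic score,
# but the language model considers it far less likely than a.
a = fused_score(acoustic_logp=-4.0, lm_logp=-30.0, num_words=4)
b = fused_score(acoustic_logp=-3.9, lm_logp=-80.0, num_words=4)
print(a, b)
```

With these toy values the language-model term is enough to rank `a` above `b`, even though `b` is slightly ahead acoustically. A larger `alpha` makes the decoder trust the language model more.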
Predict using beam decoder language model#
def beam_decoder_lm(self, inputs, language_model,
                    beam_width: int = 5,
                    token_min_logp: float = -20.0,
                    beam_prune_logp: float = -5.0,
                    temperature: float = 0.0,
                    score_norm: bool = True):
    """
    Transcribe inputs using beam decoder + KenLM.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    language_model: pyctcdecode.language_model.LanguageModel
        pyctcdecode language model, load from `LanguageModel(kenlm_model, alpha = alpha, beta = beta)`.
    beam_width: int, optional (default=5)
        beam size for beam decoder.
    token_min_logp: float, optional (default=-20.0)
        minimum log probability to select a token.
    beam_prune_logp: float, optional (default=-5.0)
        filter candidates >= max score lm + `beam_prune_logp`.
    temperature: float, optional (default=0.0)
        apply temperature function to the logits, can help in certain cases,
        logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
    score_norm: bool, optional (default=True)
        descending sort beam based on score / length of decoded.

    Returns
    -------
    result: List[str]
    """
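To make the docstring parameters more concrete, the sketch below applies them to a toy list of beams in plain Python. It is an illustration under assumptions, not the library's decoder: the helper names (`add_gumbel_noise`, `prune_beams`, `sort_beams`) and the beam tuples are hypothetical, length is measured in characters as a guess at "length of decoded", and only the formulas quoted in the docstring come from the source.

```python
import math
import random


def add_gumbel_noise(logits, temperature, rng=random):
    """logits += -log(-log(u)) * temperature, with u ~ Uniform(0, 1)."""
    if temperature == 0.0:
        return list(logits)  # default setting: decoding stays deterministic
    noisy = []
    for x in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append(x - math.log(-math.log(u)) * temperature)
    return noisy


def prune_beams(beams, beam_prune_logp=-5.0):
    """Keep beams whose score is >= best score + beam_prune_logp."""
    best = max(score for _, score in beams)
    return [(text, score) for text, score in beams
            if score >= best + beam_prune_logp]


def sort_beams(beams, score_norm=True):
    """Sort beams in descending order, optionally by score / decoded length."""
    def key(beam):
        text, score = beam
        return score / max(len(text), 1) if score_norm else score
    return sorted(beams, key=key, reverse=True)


# Hypothetical beams: (decoded text, total log score).
beams = [('tolong sebut', -6.0), ('tolong sebut anti kata', -7.0), ('to', -13.0)]
kept = prune_beams(beams)  # drops 'to', which is more than 5.0 below the best
print(sort_beams(kept, score_norm=False)[0][0])
print(sort_beams(kept, score_norm=True)[0][0])
```

With these toy numbers, raw scores favour the shorter hypothesis, while length normalization lets the longer one rank first. This is the bias that `score_norm = True` is meant to counter, since longer transcriptions accumulate more negative log-probability.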
[13]:
%%time
small_model.beam_decoder_lm([khalil], language_model, beam_width = 3)
/home/husein/dev/malaya-speech/malaya_speech/torch_model/gpt2_lm.py:42: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
context = to_tensor_cuda(torch.tensor(tokenized)[0], cuda)
/home/husein/dev/malaya-speech/malaya_speech/torch_model/gpt2_lm.py:56: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
context = to_tensor_cuda(torch.tensor(tokenized), cuda)
CPU times: user 11min 13s, sys: 11.3 s, total: 11min 25s
Wall time: 1min 1s
[13]:
['tolong sebut anti kata']
The RNNT beam decoder with a language model cannot use batch processing. If you feed it a batch, it processes the inputs one by one.
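Since a batch is decoded sequentially anyway, looping over the inputs yourself should cost little extra and lets you report progress or time each utterance. A minimal sketch, where `transcribe_each` is a hypothetical helper and `decode_one` stands in for a call such as `lambda y: small_model.beam_decoder_lm([y], language_model, beam_width = 3)[0]`:

```python
import time


def transcribe_each(inputs, decode_one):
    """Decode inputs one at a time, returning (text, seconds) per input."""
    results = []
    for i, audio in enumerate(inputs):
        start = time.perf_counter()
        text = decode_one(audio)
        elapsed = time.perf_counter() - start
        print(f'[{i + 1}/{len(inputs)}] {elapsed:.2f}s: {text}')
        results.append((text, elapsed))
    return results


# Stand-in decoder so the sketch runs without a model; swap in the real call.
def fake_decode(audio):
    return f'{len(audio)} samples'


results = transcribe_each([[0.0] * 16000, [0.0] * 8000], fake_decode)
```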