Speech-to-Text RNNT Singlish
Contents
Speech-to-Text RNNT Singlish#
Encoder model + RNNT loss for Singlish language, trained on Singapore National Speech Corpus, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus
This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-model-singlish.
This module is language dependent, so it is not safe to use on other languages. The pretrained models are trained on hyperlocal languages.
This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.
[1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.
[6]:
import logging
logging.basicConfig(level=logging.INFO)
List available RNNT model#
[7]:
malaya_speech.stt.transducer.available_transformer()
INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
[7]:
| | Size (MB) | Quantized Size (MB) | malay-malaya | malay-fleur102 | Language | singlish |
|---|---|---|---|---|---|---|
| tiny-conformer | 24.4 | 9.14 | {'WER': 0.2128108, 'CER': 0.08136871, 'WER-LM'... | {'WER': 0.2682816, 'CER': 0.13052725, 'WER-LM'... | [malay] | NaN |
| small-conformer | 49.2 | 18.1 | {'WER': 0.19853302, 'CER': 0.07449528, 'WER-LM... | {'WER': 0.23412149, 'CER': 0.1138314813, 'WER-... | [malay] | NaN |
| conformer | 125 | 37.1 | {'WER': 0.16340855635999124, 'CER': 0.05897205... | {'WER': 0.20090442596, 'CER': 0.09616901, 'WER... | [malay] | NaN |
| large-conformer | 404 | 107 | {'WER': 0.1566839, 'CER': 0.0619715, 'WER-LM':... | {'WER': 0.1711028238, 'CER': 0.077953559, 'WER... | [malay] | NaN |
| conformer-stack-2mixed | 130 | 38.5 | {'WER': 0.1889883954, 'CER': 0.0726845531, 'WE... | {'WER': 0.244836948, 'CER': 0.117409327, 'WER-... | [malay, singlish] | {'WER': 0.08535878149, 'CER': 0.0452357273822,... |
| small-conformer-singlish | 49.2 | 18.1 | NaN | NaN | [singlish] | {'WER': 0.087831, 'CER': 0.0456859, 'WER-LM': ... |
| conformer-singlish | 125 | 37.1 | NaN | NaN | [singlish] | {'WER': 0.07779246, 'CER': 0.0403616, 'WER-LM'... |
| large-conformer-singlish | 404 | 107 | NaN | NaN | [singlish] | {'WER': 0.07014733, 'CER': 0.03587201, 'WER-LM... |
[4]:
malaya_speech.stt.google_accuracy
[4]:
{'malay-malaya': {'WER': 0.16477548774, 'CER': 0.05973209121},
'malay-fleur102': {'WER': 0.109588779, 'CER': 0.047891527},
'singlish': {'WER': 0.4941349, 'CER': 0.3026296}}
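WER and CER above are standard edit-distance metrics; a minimal pure-Python sketch of how such numbers can be computed (assuming whitespace tokenisation for WER and character tokenisation for CER, which may differ from the exact evaluation script used):

```python
def edit_distance(ref, hyp):
    # single-row dynamic-programming Levenshtein distance
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref, hyp):
    # word error rate: word-level edit distance over reference word count
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    # character error rate: character-level edit distance over reference length
    return edit_distance(ref, hyp) / len(ref)

# two transcripts that differ by one inserted word: WER = 1 / 6
print(wer('then you tap to your eyes', 'okay then you tap to your eyes'))
```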
Load RNNT model#
def transformer(
model: str = 'conformer',
quantized: bool = False,
**kwargs,
):
"""
Load Encoder-Transducer ASR model.
Parameters
----------
model : str, optional (default='conformer')
Check available models at `malaya_speech.stt.transducer.available_transformer()`.
quantized : bool, optional (default=False)
if True, will load 8-bit quantized model.
A quantized model is not necessarily faster; it totally depends on the machine.
Returns
-------
result : malaya_speech.model.transducer.Transducer class
"""
[5]:
model = malaya_speech.stt.transducer.transformer(model = 'conformer-singlish')
2023-02-01 11:50:21.762787: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-01 11:50:21.771009: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-02-01 11:50:21.771036: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2023-02-01 11:50:21.771042: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2023-02-01 11:50:21.771146: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2023-02-01 11:50:21.771178: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.161.3
Load Quantized deep model#
To load an 8-bit quantized model, simply pass `quantized = True`; the default is `False`.

We can expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.
[3]:
quantized_model = malaya_speech.stt.transducer.transformer(model = 'conformer-singlish', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
Load sample#
[6]:
singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')
singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')
imda0, sr = malaya_speech.load('speech/imda/221931702.WAV')
imda1, sr = malaya_speech.load('speech/imda/221931727.WAV')
imda2, sr = malaya_speech.load('speech/imda/221931818.WAV')
[7]:
import IPython.display as ipd
ipd.Audio(singlish0, rate = sr)
[7]:
[12]:
ipd.Audio(singlish1, rate = sr)
[12]:
[13]:
ipd.Audio(singlish2, rate = sr)
[13]:
[8]:
ipd.Audio(imda0, rate = sr)
[8]:
[10]:
ipd.Audio(imda1, rate = sr)
[10]:
[14]:
ipd.Audio(imda2, rate = sr)
[14]:
Predict#
We can choose:

1. `greedy` decoder.
2. `beam` decoder, `beam_size` is 5 by default, feel free to edit it.
def predict(
self, inputs, decoder: str = 'greedy', beam_size: int = 5, **kwargs
):
"""
Transcribe inputs, will return list of strings.
Parameters
----------
inputs: List[np.array]
List[np.array] or List[malaya_speech.model.frame.Frame].
decoder: str, optional (default='greedy')
decoder mode, allowed values:
* ``'greedy'`` - will call self.greedy_decoder
* ``'beam'`` - will call self.beam_decoder
beam_size: int, optional (default=5)
beam size for beam decoder.
Returns
-------
result: List[str]
"""
Greedy decoder#
The greedy decoder is able to utilize batch processing and is faster than the beam decoder.
def greedy_decoder(self, inputs):
"""
Transcribe inputs, will return list of strings.
Parameters
----------
inputs: List[np.array]
List[np.array] or List[malaya_speech.model.frame.Frame].
Returns
-------
result: List[str]
"""
[11]:
%%time
model.greedy_decoder([singlish0, singlish1, singlish2, imda0, imda1, imda2])
CPU times: user 10.3 s, sys: 2.65 s, total: 12.9 s
Wall time: 9.4 s
[11]:
['and then see how they roll it in film okay actually',
'then you tap to your eyes',
'sembawang seven in mal',
'wantan mee is a traditional local cuisine',
'saravanan gopinathan george yeo yong boon and tay kheng soon',
'ahmad khan adelene wee chin suan and robert ibbetson']
[16]:
%%time
quantized_model.greedy_decoder([singlish0, singlish1, singlish2, imda0, imda1, imda2])
CPU times: user 9.92 s, sys: 2.41 s, total: 12.3 s
Wall time: 8.76 s
[16]:
['and then see how they roll it in film okay actually',
'then you tap to your eyes',
'sembawang seven in mal',
'wantan mee is a traditional local cuisine',
'saravanan gopinathan george yeo yong boon and tay kheng soon',
'ahmad khan adelene wee chin suan and robert ibbetson']
Beam decoder#
To get better results, use the beam decoder with an optimum beam size.
def beam_decoder(self, inputs, beam_size: int = 5):
"""
Transcribe inputs, will return list of strings.
Parameters
----------
inputs: List[np.array]
List[np.array] or List[malaya_speech.model.frame.Frame].
beam_size: int, optional (default=5)
beam size for beam decoder.
Returns
-------
result: List[str]
"""
[17]:
%%time
model.beam_decoder([singlish0, singlish1, singlish2, imda0, imda1, imda2], beam_size = 3)
CPU times: user 27.1 s, sys: 2.52 s, total: 29.6 s
Wall time: 21 s
[17]:
['and then see how they roll it in film okay actually',
'okay then you tap to your eyes',
'sembawang seven in male',
'wantan mee is a traditional local cuisine',
'saravanan gopinathan george yeo yong boon and tay kheng soon',
'ahmad khan adelene wee chin suan and robert ibbetson']
The RNNT beam decoder is not able to utilise batch processing; if fed a batch, it will process the inputs one by one.
Predict alignment#
We want to know when the speakers speak certain words, so we can use `predict_alignment`,
def predict_alignment(self, input, combined = True):
"""
Transcribe input and get timestamp, only support greedy decoder.
Parameters
----------
input: np.array
np.array or malaya_speech.model.frame.Frame.
combined: bool, optional (default=True)
If True, will combine subwords into words.
Returns
-------
result: List[Dict[text, start, end]]
"""
[18]:
%%time
model.predict_alignment(singlish0)
CPU times: user 6.08 s, sys: 2.01 s, total: 8.09 s
Wall time: 7.22 s
[18]:
[{'text': 'and', 'start': 0.2, 'end': 0.21},
{'text': 'then', 'start': 0.36, 'end': 0.45},
{'text': 'see', 'start': 0.6, 'end': 0.61},
{'text': 'how', 'start': 0.88, 'end': 0.89},
{'text': 'they', 'start': 1.36, 'end': 1.49},
{'text': 'roll', 'start': 1.96, 'end': 2.09},
{'text': 'it', 'start': 2.16, 'end': 2.17},
{'text': 'in', 'start': 2.4, 'end': 2.41},
{'text': 'film', 'start': 2.6, 'end': 2.85},
{'text': 'okay', 'start': 3.68, 'end': 3.85},
{'text': 'actually', 'start': 3.92, 'end': 4.21}]
[19]:
%%time
model.predict_alignment(singlish0, combined = False)
CPU times: user 1.12 s, sys: 200 ms, total: 1.32 s
Wall time: 306 ms
[19]:
[{'text': 'and', 'start': 0.2, 'end': 0.21},
{'text': ' ', 'start': 0.28, 'end': 0.29},
{'text': 'the', 'start': 0.36, 'end': 0.37},
{'text': 'n_', 'start': 0.44, 'end': 0.45},
{'text': 'see', 'start': 0.6, 'end': 0.61},
{'text': ' ', 'start': 0.76, 'end': 0.77},
{'text': 'how', 'start': 0.88, 'end': 0.89},
{'text': ' ', 'start': 1.08, 'end': 1.09},
{'text': 'the', 'start': 1.36, 'end': 1.37},
{'text': 'y_', 'start': 1.48, 'end': 1.49},
{'text': 'ro', 'start': 1.96, 'end': 1.97},
{'text': 'll_', 'start': 2.08, 'end': 2.09},
{'text': 'it_', 'start': 2.2, 'end': 2.21},
{'text': 'in_', 'start': 2.4, 'end': 2.41},
{'text': 'fil', 'start': 2.6, 'end': 2.61},
{'text': 'm_', 'start': 2.84, 'end': 2.85},
{'text': 'oka', 'start': 3.68, 'end': 3.69},
{'text': 'y_', 'start': 3.84, 'end': 3.85},
{'text': 'act', 'start': 3.92, 'end': 3.93},
{'text': 'ual', 'start': 4.0, 'end': 4.01},
{'text': 'ly', 'start': 4.2, 'end': 4.21}]
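With `combined = False`, the output keeps raw subwords: standalone `' '` tokens and a trailing `_` both mark word boundaries. As an illustration of how such subwords could be merged back into word-level timestamps (a sketch inferred from the output above, not the library's actual implementation):

```python
def combine_subwords(tokens):
    """Merge subword timestamps into word-level timestamps.

    A word ends at a standalone ' ' token or at a subword ending with '_'.
    """
    words, current = [], []

    def flush():
        if current:
            words.append({
                'text': ''.join(t['text'] for t in current).rstrip('_'),
                'start': current[0]['start'],
                'end': current[-1]['end'],
            })
            current.clear()

    for t in tokens:
        if t['text'] == ' ':  # separator token, not part of any word
            flush()
            continue
        current.append(t)
        if t['text'].endswith('_'):  # subword closing a word
            flush()
    flush()  # flush any trailing subwords
    return words

# first few subwords from the output above
subwords = [
    {'text': 'and', 'start': 0.2, 'end': 0.21},
    {'text': ' ', 'start': 0.28, 'end': 0.29},
    {'text': 'the', 'start': 0.36, 'end': 0.37},
    {'text': 'n_', 'start': 0.44, 'end': 0.45},
]
print(combine_subwords(subwords))
# [{'text': 'and', 'start': 0.2, 'end': 0.21}, {'text': 'then', 'start': 0.36, 'end': 0.45}]
```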
[20]:
%%time
model.predict_alignment(imda0)
CPU times: user 1.05 s, sys: 192 ms, total: 1.25 s
Wall time: 278 ms
[20]:
[{'text': 'wantan', 'start': 0.92, 'end': 1.05},
{'text': 'mee', 'start': 1.4, 'end': 1.53},
{'text': 'is', 'start': 1.64, 'end': 1.65},
{'text': 'a', 'start': 1.84, 'end': 1.85},
{'text': 'traditional', 'start': 2.08, 'end': 2.69},
{'text': 'local', 'start': 2.8, 'end': 2.93},
{'text': 'cuisine', 'start': 3.12, 'end': 3.45}]
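Word-level timestamps like these make it straightforward to emit subtitles. A minimal sketch that converts `predict_alignment` output into SRT format (`alignment_to_srt` and the grouping of words per cue are hypothetical helpers for illustration, not part of malaya-speech):

```python
def srt_timestamp(seconds):
    # SRT uses HH:MM:SS,mmm with a comma before the milliseconds
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f'{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}'.replace('.', ',')

def alignment_to_srt(alignment, words_per_cue=4):
    # group every `words_per_cue` words into one numbered subtitle cue
    cues = []
    for idx in range(0, len(alignment), words_per_cue):
        chunk = alignment[idx:idx + words_per_cue]
        cues.append(
            f'{idx // words_per_cue + 1}\n'
            f'{srt_timestamp(chunk[0]["start"])} --> {srt_timestamp(chunk[-1]["end"])}\n'
            + ' '.join(w['text'] for w in chunk)
        )
    return '\n\n'.join(cues)

# alignment of `imda0` from the output above
alignment = [
    {'text': 'wantan', 'start': 0.92, 'end': 1.05},
    {'text': 'mee', 'start': 1.4, 'end': 1.53},
    {'text': 'is', 'start': 1.64, 'end': 1.65},
    {'text': 'a', 'start': 1.84, 'end': 1.85},
    {'text': 'traditional', 'start': 2.08, 'end': 2.69},
    {'text': 'local', 'start': 2.8, 'end': 2.93},
    {'text': 'cuisine', 'start': 3.12, 'end': 3.45},
]
print(alignment_to_srt(alignment))
```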