Speech-to-Text HuggingFace + pyctcdecode
Contents
Speech-to-Text HuggingFace + pyctcdecode#
Pretrained HuggingFace models finetuned on hyperlocal languages + pyctcdecode with KenLM, https://huggingface.co/mesolitica
This tutorial is available as an IPython notebook at malaya-speech/example/stt-huggingface-pyctcdecode.
This module is not language independent, so it is not safe to use on other languages. The pretrained models are trained on hyperlocal languages.
This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.
Requires Tensorflow >= 2.0, because group convolution is not available in Tensorflow 1.
[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
Install pyctcdecode#
From PYPI#
pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121
From source#
Check https://github.com/kensho-technologies/pyctcdecode for how to build from source, in case there is no wheel available for your operating system.
Building from source should only take a few minutes.
Benefit#
pyctcdecode is more accurate than ctc-decoders for certain cases, but slower than ctc-decoders.
pip install and done, no need to compile.
List available HuggingFace model#
[2]:
malaya_speech.stt.available_huggingface()
[2]:
| | CER | CER-LM | Language | Size (MB) | WER | WER-LM |
|---|---|---|---|---|---|---|
| mesolitica/wav2vec2-xls-r-300m-mixed | 0.048105 | 0.041196 | [malay, singlish, mandarin] | 1180 | 0.13222 | 0.098802 |
Load HuggingFace model#
def huggingface(model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed', **kwargs):
"""
Load Finetuned models from HuggingFace. Required Tensorflow >= 2.0.
Parameters
----------
model : str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')
Model architecture supported. Allowed values:
* ``'mesolitica/wav2vec2-xls-r-300m-mixed'`` - wav2vec2 XLS-R 300M finetuned on (Malay + Singlish + Mandarin) languages.
Returns
-------
result : malaya_speech.model.huggingface.CTC class
"""
[3]:
model = malaya_speech.stt.huggingface(model = 'mesolitica/wav2vec2-xls-r-300m-mixed')
Load sample#
[4]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')
singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')
mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')
mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')
mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')
Predict logits#
def predict_logits(self, inputs, norm_func=softmax):
"""
Predict logits from inputs.
Parameters
----------
inputs: List[np.array]
List[np.array] or List[malaya_speech.model.frame.Frame].
norm_func: Callable, optional (default=malaya.utils.activation.softmax)
    Normalization function applied to the output logits.
Returns
-------
result: List[np.array]
"""
[5]:
%%time
logits = model.predict_logits([ceramah, record1, record2])
CPU times: user 35.2 s, sys: 16.7 s, total: 51.9 s
Wall time: 9.79 s
[6]:
logits.shape
[6]:
(3, 499, 40)
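The logits have shape `(batch, frames, vocab)`: 3 utterances, 499 frames each, and 40 output classes (39 characters in `HF_CTC_VOCAB` plus one CTC blank). Before bringing in a language model, it helps to see what plain best-path (greedy) CTC decoding looks like. Below is a minimal, self-contained sketch of that rule (argmax per frame, collapse repeats, drop blanks); it is an illustration with a toy vocabulary, not the decoder used later in this tutorial.

```python
import numpy as np

def greedy_ctc_decode(logits, vocab, blank_idx):
    # Best-path CTC decoding: take the most likely class at each frame,
    # collapse consecutive repeats, then drop the blank token.
    ids = np.argmax(logits, axis=-1)
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_idx:
            out.append(vocab[i])
        prev = i
    return ''.join(out)

# Toy example: vocab ['h', 'i'] plus blank '_' at index 2.
# Frame-level argmax path [h, h, _, i, i] collapses to 'hi'.
frames = np.eye(3)[[0, 0, 2, 1, 1]]
print(greedy_ctc_decode(frames, ['h', 'i', '_'], blank_idx=2))  # hi
```

Beam search with a language model, shown next, usually beats this greedy baseline because it can recover from locally wrong argmax choices.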
Load pyctcdecode#
I will use the dump-combined language model for this example.
[7]:
from pyctcdecode import build_ctcdecoder
from malaya_speech.utils.char import HF_CTC_VOCAB
import kenlm
[8]:
lm = malaya_speech.language_model.kenlm(model = 'dump-combined')
[42]:
kenlm_model = kenlm.Model(lm)
decoder = build_ctcdecoder(
HF_CTC_VOCAB + ['_'],
kenlm_model,
alpha=0.2,
beta=1.0,
ctc_token_idx=len(HF_CTC_VOCAB)
)
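Here `alpha` weights the KenLM log-probability against the acoustic score and `beta` adds a per-word insertion bonus. The sketch below is a conceptual picture of that shallow fusion, not pyctcdecode's exact internal scoring; the function name and values are illustrative only.

```python
def combined_score(logit_score, lm_score, n_words, alpha=0.2, beta=1.0):
    # Conceptual shallow fusion: acoustic log-score, plus the LM log-score
    # weighted by alpha, plus a bonus of beta for every word emitted.
    return logit_score + alpha * lm_score + beta * n_words

# A hypothesis with acoustic score -10.0, LM score -5.0, and 3 words:
print(combined_score(-10.0, -5.0, 3))  # -8.0
```

Raising `alpha` trusts the language model more; raising `beta` counteracts the LM's bias toward shorter outputs.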
[34]:
len(HF_CTC_VOCAB)
[34]:
39
[43]:
out = decoder.decode_beams(logits[0], prune_history=True)
d_lm, lm_state, timesteps, logit_score, lm_score = out[0]
d_lm
[43]:
'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ma ini'
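`decode_beams` returns a list of beams sorted best-first, and each beam unpacks into `(text, lm_state, timesteps, logit_score, lm_score)` as shown above. The toy sketch below, with made-up transcripts and scores, shows how you might inspect alternative hypotheses beyond the top one.

```python
# Toy beams mimicking the shape of pyctcdecode's decode_beams output:
# (text, lm_state, timesteps, logit_score, lm_score). Values are made up.
beams = [
    ('jadi dalam perjalanan', None, [], -1.2, -3.4),
    ('jadi dalem perjalanan', None, [], -1.5, -4.1),
]

# Beams come back sorted best-first, so the top transcript is beams[0][0].
best_text, _, _, best_logit_score, best_lm_score = beams[0]
print(best_text)  # jadi dalam perjalanan
```

Keeping a few runner-up beams around is useful for n-best rescoring with a larger language model, or for surfacing alternatives in an interactive UI.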