Speech-to-Text CTC HuggingFace + CTC Decoders
Pretrained HuggingFace models finetuned on hyperlocal languages, combined with CTC decoders and a KenLM language model. The models are published at https://huggingface.co/mesolitica
This tutorial is available as an IPython notebook at malaya-speech/example/stt-ctc-huggingface-ctc-decoders.
This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
This is an application of the malaya-speech Pipeline; read more about it at malaya-speech/example/pipeline.
[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.
[2]:
import logging
logging.basicConfig(level=logging.INFO)
Install ctc-decoders#
From PyPI#
pip3 install ctc-decoders
Linux wheels could not be uploaded to the PyPI repository, so on Linux download the wheel from malaya-speech/ctc-decoders instead.
From source#
See malaya-speech/ctc-decoders for how to build from source in case no wheel is available for your operating system.
Building from source should only take a few minutes.
Benefit#
ctc-decoders is about 26x faster than pyctcdecode according to Husein's benchmark, but very slightly less accurate.
List available HuggingFace model#
[3]:
malaya_speech.stt.ctc.available_huggingface()
INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
[3]:
Model | Size (MB) | malay-malaya | malay-fleur102 | singlish | Language |
---|---|---|---|---|---|
mesolitica/wav2vec2-xls-r-300m-mixed | 1180 | {'WER': 0.194655128, 'CER': 0.04775798, 'WER-L... | {'WER': 0.2373861259, 'CER': 0.07055478, 'WER-... | {'WER': 0.127588595, 'CER': 0.0494924979, 'WER... | [malay, singlish] |
mesolitica/wav2vec2-xls-r-300m-mixed-v2 | 1180 | {'WER': 0.154782923, 'CER': 0.035164031, 'WER-... | {'WER': 0.2013994374, 'CER': 0.0518170369, 'WE... | {'WER': 0.2258822139, 'CER': 0.082982312, 'WER... | [malay, singlish] |
mesolitica/wav2vec2-xls-r-300m-12layers-ms | 657 | {'WER': 0.1494983789, 'CER': 0.0342059992, 'WE... | {'WER': 0.217107489, 'CER': 0.0546614199, 'WER... | NaN | [malay] |
mesolitica/wav2vec2-xls-r-300m-6layers-ms | 339 | {'WER': 0.1494983789, 'CER': 0.0342059992, 'WE... | {'WER': 0.217107489, 'CER': 0.0546614199, 'WER... | NaN | [malay] |
mesolitica/wav2vec2-xls-r-300m-3layers-ms | 195 | {'WER': 0.1494983789, 'CER': 0.0342059992, 'WE... | {'WER': 0.217107489, 'CER': 0.0546614199, 'WER... | NaN | [malay] |
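To make the table easier to act on, here is a minimal sketch of picking the model with the lowest WER for a given test set. The WER values are copied from the table above; the `best_model` helper and the plain-dict layout are illustrative assumptions, not part of malaya-speech, and `None` stands in for the `NaN` cells.

```python
# WER per test set, copied from the table above (None = NaN, not evaluated).
# The dict layout and the helper are illustrative, not a malaya-speech API.
WER = {
    'mesolitica/wav2vec2-xls-r-300m-mixed': {
        'malay-fleur102': 0.2373861259, 'singlish': 0.127588595},
    'mesolitica/wav2vec2-xls-r-300m-mixed-v2': {
        'malay-fleur102': 0.2013994374, 'singlish': 0.2258822139},
    'mesolitica/wav2vec2-xls-r-300m-12layers-ms': {
        'malay-fleur102': 0.217107489, 'singlish': None},
}


def best_model(table, test_set):
    """Return the model name with the lowest WER on `test_set`."""
    scored = {
        name: scores[test_set]
        for name, scores in table.items()
        if scores.get(test_set) is not None
    }
    return min(scored, key=scored.get)


best_singlish = best_model(WER, 'singlish')
best_fleurs = best_model(WER, 'malay-fleur102')
print(best_singlish)
print(best_fleurs)
```

Lower WER is better, so on these numbers the mixed model is the stronger choice for Singlish, while mixed-v2 does better on the FLEURS Malay test set.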
Load HuggingFace model#
def huggingface(
    model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed',
    force_check: bool = True,
    **kwargs,
):
    """
    Load finetuned models from HuggingFace.

    Parameters
    ----------
    model : str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')
        Check available models at `malaya_speech.stt.ctc.available_huggingface()`.
    force_check: bool, optional (default=True)
        Verify that the model is one of the malaya-speech models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result : malaya_speech.torch_model.huggingface.CTC class
    """
[3]:
model = malaya_speech.stt.ctc.huggingface(model = 'mesolitica/wav2vec2-xls-r-300m-mixed')
Load sample#
[4]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')
singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')
mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')
mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')
mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')
Predict logits#
def predict_logits(self, inputs, norm_func=softmax):
    """
    Predict logits from inputs.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    norm_func: Callable, optional (default=malaya.utils.activation.softmax)

    Returns
    -------
    result: List[np.array]
    """
[5]:
%%time
logits = model.predict_logits([ceramah, record1, record2])
CPU times: user 36.4 s, sys: 3.22 s, total: 39.6 s
Wall time: 3.67 s
[6]:
logits.shape
[6]:
(3, 499, 40)
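Because `norm_func` defaults to softmax, the returned array already holds per-frame probabilities rather than raw scores. The shape reads most naturally as (batch, time frames, vocabulary size), which is an interpretation of the output above rather than something the docstring states. The sketch below shows a numerically stable softmax on a toy input in plain Python; the toy values and the 4-symbol vocabulary are made up for illustration.

```python
import math


def softmax(frame):
    """Numerically stable softmax over one frame of raw scores."""
    peak = max(frame)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in frame]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: 2 time frames over a hypothetical 4-symbol vocabulary.
raw = [[2.0, 1.0, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]]
probs = [softmax(frame) for frame in raw]

for frame in probs:
    print([round(p, 3) for p in frame], 'sum =', round(sum(frame), 6))
```

Each row sums to 1, which is the form of input a CTC beam search decoder works with.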
Load ctc-decoders#
I will use the dump-combined KenLM language model for this example.
[7]:
from ctc_decoders import Scorer
from ctc_decoders import ctc_beam_search_decoder
from malaya_speech.utils.char import HF_CTC_VOCAB
[8]:
lm = malaya_speech.language_model.kenlm(model = 'dump-combined')
[9]:
scorer = Scorer(0.5, 1.0, lm, HF_CTC_VOCAB)
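The first two `Scorer` arguments are not named in the call above. In DeepSpeech-style CTC decoders, which this sketch assumes ctc-decoders follows, they are the language model weight (alpha) and the word insertion weight (beta), and a hypothesis is ranked by combining acoustic and language model log-probabilities. The helpers and all numbers below are hypothetical and only illustrate that combination; they are not the library's internals.

```python
def combined_score(ctc_logp, lm_logp, n_words, alpha=0.5, beta=1.0):
    """Assumed fusion score: acoustic + alpha * LM + beta * word count."""
    return ctc_logp + alpha * lm_logp + beta * n_words


# Made-up hypotheses: (text, CTC log-probability, LM log-probability).
hypotheses = [
    ('hello nama saya husin', -4.0, -9.0),
    ('hello nama saya hussein', -4.2, -6.0),
]


def pick(hyps, alpha, beta):
    """Return the text of the highest-scoring hypothesis."""
    return max(
        hyps,
        key=lambda h: combined_score(h[1], h[2], len(h[0].split()), alpha, beta),
    )[0]


with_lm = pick(hypotheses, alpha=0.5, beta=1.0)     # LM can overturn the acoustic ranking
without_lm = pick(hypotheses, alpha=0.0, beta=1.0)  # acoustic score alone decides
print(with_lm, '|', without_lm)
```

Under this assumption, raising alpha lets the language model pull the result toward more plausible word sequences, while beta offsets the decoder's tendency to prefer fewer words.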
[10]:
o = ctc_beam_search_decoder(logits[0], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]
o
[10]:
'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ma ini'
[11]:
o = ctc_beam_search_decoder(logits[1], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]
o
[11]:
'hello nama saya husin saya tak skema ke tiap saya masam'
[12]:
o = ctc_beam_search_decoder(logits[2], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]
o
[12]:
'hello nama saya hussein saya sekoman saya mandi dia tiap hari'
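For intuition about what a CTC decoder does with the probability matrix, here is a minimal greedy (best-path) decoder in plain Python. It is a deliberately simpler technique than the beam search with KenLM scoring used above: take the most likely symbol per frame, collapse consecutive repeats, then drop the blank symbol. The toy vocabulary and the choice of the last index as blank are assumptions for illustration; they are not taken from `HF_CTC_VOCAB`.

```python
def greedy_ctc_decode(probs, vocab, blank_id):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    chars = []
    previous = None
    for idx in best_ids:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if idx != previous and idx != blank_id:
            chars.append(vocab[idx])
        previous = idx
    return ''.join(chars)


def one_hot(idx, size):
    """Build a toy probability frame that puts all mass on `idx`."""
    return [1.0 if i == idx else 0.0 for i in range(size)]


toy_vocab = ['a', 'b', ' ']  # hypothetical vocabulary
BLANK = len(toy_vocab)       # assumed: blank is the extra last index
frame_ids = [0, 0, BLANK, 0, 1, 1, BLANK, 2, 0]
toy_probs = [one_hot(i, len(toy_vocab) + 1) for i in frame_ids]

decoded = greedy_ctc_decode(toy_probs, toy_vocab, BLANK)
print(decoded)
```

Note how the blank between the first two runs of `a` keeps them as two separate characters, while the repeated `b` frames collapse into one. Beam search keeps several such candidate paths alive at once, which is what allows a language model to rescore them.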