Speech-to-Text CTC HuggingFace + pyctcdecode#

Pretrained CTC HuggingFace models finetuned on hyperlocal languages, decoded with pyctcdecode and KenLM. The models are hosted at https://huggingface.co/mesolitica

This tutorial is available as an IPython notebook at malaya-speech/example/stt-ctc-huggingface-pyctcdecode.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of the malaya-speech Pipeline; read more about it at malaya-speech/example/pipeline.

[1]:

import malaya_speech
import numpy as np
from malaya_speech import Pipeline

`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.

[2]:

import logging

logging.basicConfig(level=logging.INFO)

Install pyctcdecode#

From PYPI#

pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121

From source#

See https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no wheel available for your operating system.

Building from source should only take a few minutes.

Benefit#

pyctcdecode is more accurate than ctc-decoders in certain cases, but slower than ctc-decoders.
Installation is a single pip install, with no need to compile.

List available HuggingFace model#

[3]:

malaya_speech.stt.ctc.available_huggingface()

INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt

[3]:

	Size (MB)	malay-malaya	malay-fleur102	singlish	Language
mesolitica/wav2vec2-xls-r-300m-mixed	1180	{'WER': 0.194655128, 'CER': 0.04775798, 'WER-L...	{'WER': 0.2373861259, 'CER': 0.07055478, 'WER-...	{'WER': 0.127588595, 'CER': 0.0494924979, 'WER...	[malay, singlish]
mesolitica/wav2vec2-xls-r-300m-mixed-v2	1180	{'WER': 0.154782923, 'CER': 0.035164031, 'WER-...	{'WER': 0.2013994374, 'CER': 0.0518170369, 'WE...	{'WER': 0.2258822139, 'CER': 0.082982312, 'WER...	[malay, singlish]
mesolitica/wav2vec2-xls-r-300m-12layers-ms	657	{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...	{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...	NaN	[malay]
mesolitica/wav2vec2-xls-r-300m-6layers-ms	339	{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...	{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...	NaN	[malay]
mesolitica/wav2vec2-xls-r-300m-3layers-ms	195	{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...	{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...	NaN	[malay]
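
The table above can be used to shortlist a model before loading it. The snippet below is a minimal sketch, not part of malaya-speech: it copies the `malay-malaya` WER and size figures exactly as displayed in the table into plain dictionaries and ranks the models by WER, breaking ties by size. The helper name `rank_models` is hypothetical.

```python
# WER on the malay-malaya test set, copied as displayed in the table above.
malay_malaya_wer = {
    'mesolitica/wav2vec2-xls-r-300m-mixed': 0.194655128,
    'mesolitica/wav2vec2-xls-r-300m-mixed-v2': 0.154782923,
    'mesolitica/wav2vec2-xls-r-300m-12layers-ms': 0.1494983789,
    'mesolitica/wav2vec2-xls-r-300m-6layers-ms': 0.1494983789,
    'mesolitica/wav2vec2-xls-r-300m-3layers-ms': 0.1494983789,
}

# Model size in MB, copied from the same table.
size_mb = {
    'mesolitica/wav2vec2-xls-r-300m-mixed': 1180,
    'mesolitica/wav2vec2-xls-r-300m-mixed-v2': 1180,
    'mesolitica/wav2vec2-xls-r-300m-12layers-ms': 657,
    'mesolitica/wav2vec2-xls-r-300m-6layers-ms': 339,
    'mesolitica/wav2vec2-xls-r-300m-3layers-ms': 195,
}


def rank_models(wer, size):
    """Hypothetical helper: sort model names by WER, then by size (both ascending)."""
    return sorted(wer, key=lambda name: (wer[name], size[name]))


ranking = rank_models(malay_malaya_wer, size_mb)
best = ranking[0]
print(best)
```

The chosen name can then be passed as the `model` argument when loading a model.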

Load HuggingFace model#

def huggingface(
    model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed',
    force_check: bool = True,
    **kwargs,
):
    """
    Load Finetuned models from HuggingFace.

    Parameters
    ----------
    model : str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')
        Check available models at `malaya_speech.stt.ctc.available_huggingface()`.
    force_check: bool, optional (default=True)
        Force check model one of malaya model.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result : malaya_speech.torch_model.huggingface.CTC class
    """

[3]:

model = malaya_speech.stt.ctc.huggingface(model = 'mesolitica/wav2vec2-xls-r-300m-mixed')

Load sample#

[4]:

ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')
singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')
mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')
mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')
mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')

Predict logits#

def predict_logits(self, inputs, norm_func=softmax):
    """
    Predict logits from inputs.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    norm_func: Callable, optional (default=malaya.utils.activation.softmax)


    Returns
    -------
    result: List[np.array]
    """
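
The `norm_func` argument defaults to softmax. As a rough illustration of what that normalisation does to a single frame of scores, here is a pure-Python sketch; it is not the library's implementation, and the function and variable names are assumptions made for this example.

```python
import math


def softmax(row):
    """Turn one frame of raw scores into probabilities that sum to 1."""
    # Subtract the maximum first for numerical stability.
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


# One toy frame with three label scores (made-up numbers).
frame = [2.0, 1.0, 0.1]
probs = softmax(frame)
print(probs)
```

The ordering of the scores is preserved, so the most likely label in each frame is the same before and after normalisation.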

[5]:

%%time

logits = model.predict_logits([ceramah, record1, record2])

CPU times: user 35.7 s, sys: 3.31 s, total: 39 s
Wall time: 3.63 s

[6]:

logits.shape

[6]:

(3, 499, 40)
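
The shape can be read as (batch, frames, labels): three inputs, and 40 labels per frame. Before adding a language model, it helps to see what plain greedy CTC decoding does with such a matrix. The sketch below is an illustration on toy data, not malaya-speech or pyctcdecode code; `greedy_ctc_decode`, the toy labels and the toy scores are all assumptions. The blank label is placed last, mirroring the `ctc_token_idx` used later in this tutorial.

```python
from itertools import groupby


def greedy_ctc_decode(frames, labels, blank_idx):
    """Greedy CTC decoding: best label per frame, collapse repeats, drop blanks."""
    # Index of the highest score in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frames]
    # Collapse runs of the same index into a single occurrence.
    collapsed = [idx for idx, _ in groupby(best)]
    # Remove the blank label and map the rest to characters.
    return ''.join(labels[idx] for idx in collapsed if idx != blank_idx)


toy_labels = ['a', 'b', ' ', '_']   # '_' plays the role of the CTC blank
blank = len(toy_labels) - 1
toy_frames = [
    [0.9, 0.0, 0.0, 0.1],  # a
    [0.8, 0.1, 0.0, 0.1],  # a again, merged with the previous frame
    [0.1, 0.0, 0.0, 0.9],  # blank, separates the two runs of 'a'
    [0.7, 0.1, 0.1, 0.1],  # a
    [0.0, 0.9, 0.0, 0.1],  # b
]

text = greedy_ctc_decode(toy_frames, toy_labels, blank)
print(text)
```

Greedy decoding only looks at one frame at a time, which is why a beam search with a language model can recover words that greedy decoding gets wrong.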

Load pyctcdecode#

I will use the `dump-combined` language model for this example.

[7]:

from pyctcdecode import build_ctcdecoder
from malaya_speech.utils.char import HF_CTC_VOCAB
import kenlm

[8]:

lm = malaya_speech.language_model.kenlm(model = 'dump-combined')

[9]:

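# Load the KenLM model from the file downloaded above.
kenlm_model = kenlm.Model(lm)
decoder = build_ctcdecoder(
    HF_CTC_VOCAB + ['_'],  # append '_' as the CTC blank label
    kenlm_model,
    alpha=0.2,  # weight of the language model score
    beta=1.0,  # weight of the length score adjustment
    ctc_token_idx=len(HF_CTC_VOCAB)  # index of the blank label, the last entry
)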
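
To build some intuition for `alpha` and `beta`, the sketch below shows a simplified form of shallow fusion: each beam's acoustic score is combined with a weighted language model score plus a per-word bonus. This is an assumption-laden illustration, not pyctcdecode's exact formula (the library also deals with log base conversion, unknown words and partial words). The function `fused_score` and all numbers are made up for this example.

```python
def fused_score(logit_score, word_lm_scores, alpha=0.2, beta=1.0):
    """Simplified shallow fusion: acoustic score plus (alpha * LM score + beta) per word."""
    lm_part = sum(alpha * score + beta for score in word_lm_scores)
    return logit_score + lm_part


# Beam A: slightly worse acoustic score, but words the language model finds likely.
beam_a = fused_score(-10.0, [-2.0, -3.0])

# Beam B: better acoustic score, but words the language model finds unlikely.
beam_b = fused_score(-9.5, [-8.0, -9.0])

print(beam_a, beam_b)
print('winner:', 'A' if beam_a > beam_b else 'B')
```

In this toy case beam A ends up ahead even though beam B has the better acoustic score. Raising `alpha` makes the language model more influential, while `beta` rewards or penalises hypotheses with more words.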

[10]:

len(HF_CTC_VOCAB)

[10]:

[14]:

out = decoder.decode_beams(logits[0], prune_history=True)
d_lm, lm_state, timesteps, logit_score, lm_score = out[0]
d_lm

[14]:

'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ma ini'
