Speech-to-Text HuggingFace + CTC Decoders#

Pretrained HuggingFace models finetuned on hyperlocal languages, combined with CTC decoders and a KenLM language model. The models are published at https://huggingface.co/mesolitica

This tutorial is available as an IPython notebook at malaya-speech/example/stt-huggingface-ctc-decoders.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

Requires TensorFlow >= 2.0, because group convolution is not available in TensorFlow 1.

[1]:

import malaya_speech
import numpy as np
from malaya_speech import Pipeline

Install ctc-decoders#

From PyPI#

pip3 install ctc-decoders

However, we are unable to upload Linux wheels to the PyPI repository, so if you use Linux, download the Linux wheel from malaya-speech/ctc-decoders.

From source#

See malaya-speech/ctc-decoders for how to build from source in case no wheel is available for your operating system.

Building from source should only take a few minutes.

Benefit#

ctc-decoders is faster than pyctcdecode, about 26x faster based on Husein's benchmark, but very slightly less accurate.

List available HuggingFace model#

[2]:

malaya_speech.stt.available_huggingface()

[2]:

Model                                 CER       CER-LM    Language                     Size (MB)  WER      WER-LM
mesolitica/wav2vec2-xls-r-300m-mixed  0.048105  0.041196  [malay, singlish, mandarin]  1180       0.13222  0.098802
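
The CER and WER columns are character and word error rates, and the -LM columns presumably report the same metrics after decoding with a language model. The exact text normalization used for the benchmark is not stated here, so the following is only a minimal sketch of the usual edit-distance definition; `edit_distance`, `wer` and `cer` are illustrative helpers, not part of malaya-speech.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError('reference must contain at least one word')
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, counting spaces as characters (an assumption)."""
    if not reference:
        raise ValueError('reference must not be empty')
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Relative WER reduction from the table row above (WER vs WER-LM).
relative_wer_reduction = (0.13222 - 0.098802) / 0.13222
```

On the table row above, the language model lowers WER by roughly a quarter in relative terms.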

Load HuggingFace model#

def huggingface(model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed', **kwargs):
    """
    Load Finetuned models from HuggingFace. Required Tensorflow >= 2.0.

    Parameters
    ----------
    model : str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')
        Model architecture supported. Allowed values:

        * ``'mesolitica/wav2vec2-xls-r-300m-mixed'`` - wav2vec2 XLS-R 300M finetuned on (Malay + Singlish + Mandarin) languages.

    Returns
    -------
    result : malaya_speech.model.huggingface.CTC class
    """

[3]:

model = malaya_speech.stt.huggingface(model = 'mesolitica/wav2vec2-xls-r-300m-mixed')

Load sample#

[6]:

ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')
singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')
mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')
mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')
mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')

Predict logits#

def predict_logits(self, inputs, norm_func=softmax):
    """
    Predict logits from inputs.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    norm_func: Callable, optional (default=malaya.utils.activation.softmax)


    Returns
    -------
    result: List[np.array]
    """

[7]:

%%time

logits = model.predict_logits([ceramah, record1, record2])

CPU times: user 36 s, sys: 19.7 s, total: 55.7 s
Wall time: 10.6 s
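
Because `norm_func` defaults to softmax, the returned values are presumably per-frame probabilities over the vocabulary rather than raw scores. As a rough illustration of that normalization step, here is a plain-Python sketch; it is not the library's implementation, and `softmax` and `normalize_frames` below are stand-in names.

```python
import math


def softmax(row):
    """Numerically stable softmax over one frame of raw scores."""
    highest = max(row)
    # Subtracting the maximum avoids overflow in exp() for large scores.
    exps = [math.exp(x - highest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_frames(frames):
    """Apply softmax independently to every frame of one utterance."""
    return [softmax(frame) for frame in frames]


probs = normalize_frames([[1.0, 2.0, 3.0], [1000.0, 1000.0]])
```

Each normalized frame sums to 1, which is the form a beam-search decoder that works with probabilities expects.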

[8]:

logits.shape

[8]:

(3, 499, 40)
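
The shape reads as 3 utterances, 499 frames and 40 scores per frame, presumably one per vocabulary entry. Before bringing in a language model, it can help to see the simplest way to turn such an array into text: greedy (best-path) CTC decoding. The sketch below uses a toy vocabulary and assumes the blank symbol sits at index `len(vocab)`; the blank position for `HF_CTC_VOCAB` is not stated here and should be checked before applying this to real output.

```python
def greedy_ctc_decode(frames, vocab, blank_id):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    decoded = []
    previous = None
    for frame in frames:
        best = max(range(len(frame)), key=lambda i: frame[i])
        # Emit a symbol only when it differs from the previous frame's
        # symbol and is not the blank.
        if best != previous and best != blank_id:
            decoded.append(vocab[best])
        previous = best
    return ''.join(decoded)


# Toy example (hypothetical vocabulary, blank assumed at the last index).
toy_vocab = ['a', 'b', ' ']
toy_blank = 3
toy_frames = [
    [0.1, 0.1, 0.1, 0.7],  # blank
    [0.7, 0.1, 0.1, 0.1],  # a
    [0.7, 0.1, 0.1, 0.1],  # a again, collapsed into the previous one
    [0.1, 0.1, 0.1, 0.7],  # blank separates the two a's
    [0.7, 0.1, 0.1, 0.1],  # a
    [0.1, 0.7, 0.1, 0.1],  # b
    [0.1, 0.7, 0.1, 0.1],  # b again, collapsed
]
toy_text = greedy_ctc_decode(toy_frames, toy_vocab, toy_blank)
```

Greedy decoding looks only at the single best symbol per frame, which is why the beam search with a KenLM scorer used in this tutorial can recover words that the best path misses.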

Load ctc-decoders#

This example uses the dump-combined KenLM language model.

[9]:

from ctc_decoders import Scorer
from ctc_decoders import ctc_beam_search_decoder
from malaya_speech.utils.char import HF_CTC_VOCAB

[10]:

lm = malaya_speech.language_model.kenlm(model = 'dump-combined')

[11]:

scorer = Scorer(0.5, 1.0, lm, HF_CTC_VOCAB)
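
The two numbers passed to `Scorer` are not explained here. Assuming they follow the DeepSpeech-style convention, where the first is the language model weight (alpha) and the second is a word insertion bonus (beta), a candidate transcript would be ranked roughly as in this sketch. `combined_score` is a hypothetical helper for illustration, not a function of ctc-decoders, and the log probabilities are made-up numbers.

```python
def combined_score(ctc_log_prob, lm_log_prob, n_words, alpha=0.5, beta=1.0):
    """Acoustic score plus weighted LM score plus a per-word bonus.

    Assumed convention: alpha scales the language model log probability,
    beta rewards each completed word.
    """
    return ctc_log_prob + alpha * lm_log_prob + beta * n_words


# Two made-up candidates with three words each.
# B is slightly better acoustically, A is much more plausible to the LM.
score_a = combined_score(ctc_log_prob=-4.0, lm_log_prob=-12.0, n_words=3)
score_b = combined_score(ctc_log_prob=-3.5, lm_log_prob=-16.0, n_words=3)
```

Under these assumptions, candidate A ends up ahead of B even though B has the better acoustic score, which is the effect the language model is meant to have.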

[13]:

o = ctc_beam_search_decoder(logits[0], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]
o

[13]:

'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ma ini'

[14]:

o = ctc_beam_search_decoder(logits[1], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]
o

[14]:

'hello nama saya husin saya tak skema ke tiap saya masam'

[15]:

o = ctc_beam_search_decoder(logits[2], HF_CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]
o

[15]:

'hello nama saya hussein saya sekoman saya mandi dia tiap hari'
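
The three cells above repeat the same call for each utterance. The `[0][1]` indexing suggests that the decoder returns a list of (score, transcript) pairs with the best beam first; under that assumption, a small wrapper can decode a whole batch. `best_transcript` and `decode_batch` are illustrative helpers, and `fake_decoder` is a stand-in so the sketch runs without ctc-decoders installed.

```python
def best_transcript(beams):
    """Return the text of the top beam from a list of (score, text) pairs."""
    if not beams:
        return ''
    return beams[0][1]


def decode_batch(batch_logits, decode_one):
    """Run a single-utterance decoder on every item and keep the best text."""
    return [best_transcript(decode_one(utterance)) for utterance in batch_logits]


# Stand-in decoder for demonstration only. In the notebook, decode_one could
# wrap the call from the cells above, for example:
#   lambda l: ctc_beam_search_decoder(l, HF_CTC_VOCAB, 20, ext_scoring_func=scorer)
def fake_decoder(utterance):
    return [(-1.0, 'frames-%d' % len(utterance)), (-2.0, 'runner-up')]


texts = decode_batch([[0, 0], [0, 0, 0]], fake_decoder)
```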
