Speech-to-Text CTC + CTC Decoders#

Encoder model + CTC loss + CTC Decoders with KenLM

This tutorial is available as an IPython notebook at malaya-speech/example/stt-ctc-model-ctc-decoders.

This module is language dependent, so it is not safe to use on languages it was not trained on. The pretrained models are trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline

Install ctc-decoders#

From PyPI#

pip3 install ctc-decoders

But if you use Linux, we are unable to upload Linux wheels to the PyPI repository, so download a Linux wheel from malaya-speech/ctc-decoders.
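
After downloading, install the wheel directly with pip. The filename below is only an illustration; use the wheel that matches your Python version and platform,

pip3 install ctc_decoders-1.0-cp37-cp37m-linux_x86_64.whl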

From source#

Check malaya-speech/ctc-decoders for how to build from source, in case there is no wheel available for your operating system.

Building from source should only take a few minutes.

Benefit#

  1. ctc-decoders is faster than pyctcdecode, ~26x faster based on Husein's benchmark, but very slightly less accurate than pyctcdecode.

List available CTC models#

[2]:
malaya_speech.stt.available_ctc()
[2]:
Model                          Size (MB)  Quantized Size (MB)  WER       CER        WER-LM    CER-LM     Language
hubert-conformer-tiny          36.6       10.3                 0.335968  0.0882573  0.199227  0.0635223  [malay]
hubert-conformer               115        31.1                 0.238714  0.0608998  0.141479  0.0450751  [malay]
hubert-conformer-large         392        100                  0.220314  0.054927   0.128006  0.0385329  [malay]
hubert-conformer-large-3mixed  392        100                  0.241126  0.0787939  0.132761  0.057482   [malay, singlish, mandarin]
best-rq-conformer-tiny         36.6       10.3                 0.319291  0.078988   0.179582  0.055521   [malay]
best-rq-conformer              115        31.1                 0.253678  0.0658045  0.154206  0.0482278  [malay]
best-rq-conformer-large        392        100                  0.234651  0.0601605  0.130082  0.044521   [malay]
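
The output of available_ctc appears to be a pandas DataFrame, as the tabular output above suggests, so you can also filter it programmatically; for example, to keep only models that support Malay,

df = malaya_speech.stt.available_ctc()
# keep only rows whose Language column mentions malay
df[df['Language'].astype(str).str.contains('malay')]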

Load CTC model#

def deep_ctc(
    model: str = 'hubert-conformer', quantized: bool = False, **kwargs
):
    """
    Load Encoder-CTC ASR model.

    Parameters
    ----------
    model : str, optional (default='hubert-conformer')
        Model architecture supported. Allowed values:

        * ``'hubert-conformer-tiny'`` - Finetuned HuBERT Conformer TINY.
        * ``'hubert-conformer'`` - Finetuned HuBERT Conformer.
        * ``'hubert-conformer-large'`` - Finetuned HuBERT Conformer LARGE.
        * ``'hubert-conformer-large-3mixed'`` - Finetuned HuBERT Conformer LARGE for (Malay + Singlish + Mandarin) languages.
        * ``'best-rq-conformer-tiny'`` - Finetuned BEST-RQ Conformer TINY.
        * ``'best-rq-conformer'`` - Finetuned BEST-RQ Conformer.
        * ``'best-rq-conformer-large'`` - Finetuned BEST-RQ Conformer LARGE.


    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.Wav2Vec2_CTC class
    """
[3]:
model = malaya_speech.stt.deep_ctc(model = 'hubert-conformer-large')
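
The same loader also accepts the quantized flag described in the docstring above; as noted there, the 8-bit model is not guaranteed to be faster,

quantized_model = malaya_speech.stt.deep_ctc(model = 'hubert-conformer-large', quantized = True)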

Load sample#

[4]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
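
malaya_speech.load returns the waveform together with its sample rate (16000 Hz by default), so you can sanity-check the clip durations before feeding them to the model,

print(sr, len(ceramah) / sr, len(record1) / sr, len(record2) / sr)
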
[5]:
import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)
[5]:

As we can hear, the speaker speaks in the Kedahan dialect plus some Arabic words; let's see how good our model is.

[6]:
ipd.Audio(record1, rate = sr)
[6]:
[7]:
ipd.Audio(record2, rate = sr)
[7]:

As you can see, below is the output from the beam decoder without a language model:

['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah ma ini',
 'helo nama saya esin saya tak suka mandi ketak saya masak',
 'helo nama saya musin saya suka mandi saya mandi titiap hari']

Predict logits#

def predict_logits(self, inputs):
    """
    Predict logits from inputs.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].


    Returns
    -------
    result: List[np.array]
    """
[8]:
%%time

logits = model.predict_logits([ceramah, record1, record2])
CPU times: user 26.5 s, sys: 10.4 s, total: 36.9 s
Wall time: 20.3 s
[9]:
logits[0].shape
[9]:
(499, 39)
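
The shape (499, 39) means 499 time frames, each with a distribution over 39 output classes, the CTC vocabulary plus the blank token (the blank occupies the final column in the ctc-decoders convention). The no-LM transcripts shown earlier can be reproduced by running a beam search without an external scorer, using the same ctc_beam_search_decoder loaded in the next section,

from ctc_decoders import ctc_beam_search_decoder
from malaya_speech.utils.char import CTC_VOCAB

# beam search over the raw logits, no language model attached
ctc_beam_search_decoder(logits[0], CTC_VOCAB, 20)[0][1]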

Load ctc-decoders#

I will use the dump-combined language model for this example.

[10]:
from ctc_decoders import Scorer
from ctc_decoders import ctc_beam_search_decoder
from malaya_speech.utils.char import CTC_VOCAB
[11]:
lm = malaya_speech.language_model.kenlm(model = 'dump-combined')
[12]:
scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)
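
The first two Scorer arguments are the language model weight (commonly called alpha) and the word insertion score (beta); 0.5 and 1.0 are just reasonable starting points. A minimal grid-search sketch for tuning them on your own data, reusing the lm and CTC_VOCAB loaded above, and assuming a hypothetical dev_set of (logits, reference) pairs plus a hypothetical wer function you supply,

import itertools

# dev_set and wer are hypothetical: supply your own list of
# (logits, reference transcript) pairs and a word-error-rate function
best = None
for alpha, beta in itertools.product([0.3, 0.5, 0.7], [0.5, 1.0, 1.5]):
    candidate_scorer = Scorer(alpha, beta, lm, CTC_VOCAB)
    errors = [
        wer(ref, ctc_beam_search_decoder(l, CTC_VOCAB, 20, ext_scoring_func = candidate_scorer)[0][1])
        for l, ref in dev_set
    ]
    score = sum(errors) / len(errors)
    if best is None or score < best[0]:
        best = (score, alpha, beta)

print(best)
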
[13]:
o = ctc_beam_search_decoder(logits[0], CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]
o
[13]:
'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah maini'
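
ctc_beam_search_decoder returns the n-best list as (score, transcript) pairs sorted from best to worst, which is why these examples index [0][1]; the runner-up hypotheses are available too,

for score, text in ctc_beam_search_decoder(logits[0], CTC_VOCAB, 20, ext_scoring_func = scorer)[:3]:
    print(score, text)
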
[14]:
o = ctc_beam_search_decoder(logits[1], CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]
o
[14]:
'helo nama saya mesin saya tak suka mandi ketat saya masak'
[15]:
o = ctc_beam_search_decoder(logits[2], CTC_VOCAB, 20, ext_scoring_func = scorer)[0][1]
o
[15]:
'helo nama saya mesin saya suka mandi saya mandi titik hari'
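
To decode all three utterances in one call, ctc-decoders also provides a batched entry point that parallelises the beam search across worker processes; a minimal sketch, assuming ctc_beam_search_decoder_batch keeps the signature of the upstream DeepSpeech decoders it is built from,

from ctc_decoders import ctc_beam_search_decoder_batch

# logits is already a list of (T, vocab) arrays, one per utterance
results = ctc_beam_search_decoder_batch(
    logits, CTC_VOCAB, 20, num_processes = 2, ext_scoring_func = scorer
)
[r[0][1] for r in results]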