Speech-to-Text CTC + pyctcdecode + MLM#

Encoder model + CTC loss + pyctcdecode with a Masked Language Model

This tutorial is available as an IPython notebook at malaya-speech/example/stt-ctc-model-pyctcdecode-mlm.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of the malaya-speech Pipeline. Read more about the malaya-speech Pipeline at malaya-speech/example/pipeline.

import os
os.environ['CUDA_VISIBLE_DEVICES'] = ''
import malaya_speech
import numpy as np
from malaya_speech import Pipeline

Install pyctcdecode#

From PYPI#

pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121

From source#

See https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no wheel available for your operating system.

Building from source should only take a few minutes.


  1. pyctcdecode is more accurate than ctc-decoders in certain cases, but slower than ctc-decoders.

  2. Installation is a single pip install; there is no need to compile.
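After installing, it can help to confirm that both packages are importable before loading any model. The sketch below uses only the standard library. The import names `pyctcdecode` and `kenlm` are assumptions about what the pip command above installs.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import names for the packages installed above.
for name in ('pyctcdecode', 'kenlm'):
    status = 'found' if is_installed(name) else 'missing'
    print(f'{name}: {status}')
```

If either package is reported as missing, building from source as described above is the fallback.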

List available CTC model#

Model                          Size (MB)  Quantized Size (MB)  WER       CER       WER-LM    CER-LM    Language
hubert-conformer-tiny          36.6       10.3                 0.335968  0.088257  0.199227  0.063522  [malay]
hubert-conformer               115        31.1                 0.238714  0.0609    0.141479  0.045075  [malay]
hubert-conformer-large         392        100                  0.220314  0.054927  0.128006  0.038533  [malay]
hubert-conformer-large-3mixed  392        100                  0.241126  0.078794  0.132761  0.057482  [malay, singlish, mandarin]
best-rq-conformer-tiny         36.6       10.3                 0.319291  0.078988  0.179582  0.055521  [malay]
best-rq-conformer              115        31.1                 0.253678  0.065805  0.154206  0.048228  [malay]
best-rq-conformer-large        392        100                  0.234651  0.06016   0.130082  0.044521  [malay]
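The WER and CER columns are word and character error rates. The exact evaluation script is not shown here, so the following is only a sketch that assumes the standard edit-distance definition. The example strings are illustrative and not taken from the evaluation set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings only.
reference = 'helo nama saya husein'
hypothesis = 'helo nama saya musin'
print(wer(reference, hypothesis))
print(cer(reference, hypothesis))
```

Lower is better for both metrics. The WER-LM and CER-LM columns report the same rates when decoding with a language model.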

Load CTC model#

def deep_ctc(
    model: str = 'hubert-conformer', quantized: bool = False, **kwargs
):
    """
    Load Encoder-CTC ASR model.

    Parameters
    ----------
    model : str, optional (default='hubert-conformer')
        Check available models at `malaya_speech.stt.available_ctc()`.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.model.wav2vec.Wav2Vec2_CTC class
    """
model = malaya_speech.stt.deep_ctc(model = 'hubert-conformer-large')

Load sample#

ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)

As we can hear, the speaker uses the Kedahan dialect plus some Arabic words. Let's see how well our model handles it.

ipd.Audio(record1, rate = sr)
ipd.Audio(record2, rate = sr)

For comparison, below is the output from the beam decoder without a language model:

['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah ma ini',
 'helo nama saya esin saya tak suka mandi ketak saya masak',
 'helo nama saya musin saya suka mandi saya mandi titiap hari']

Predict logits#

def predict_logits(self, inputs):
    """
    Predict logits from inputs.

    Parameters
    ----------
    inputs : List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].

    Returns
    -------
    result : List[np.array]
    """

logits = model.predict_logits([ceramah, record1, record2])
CPU times: user 22.1 s, sys: 3.28 s, total: 25.4 s
Wall time: 5.43 s
(499, 39)
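Each logits matrix has one row per time step and one column per vocabulary symbol, with the CTC blank assumed to sit at the last index (this matches `ctc_token_idx = len(CTC_VOCAB)` used later in this tutorial). As a rough picture of what a decoder does with such a matrix, the pure-Python sketch below performs greedy (best-path) CTC decoding. The toy vocabulary and scores are made up for illustration. The decoding used in this tutorial is beam search with pyctcdecode, not this greedy variant.

```python
def greedy_ctc_decode(frames, vocab, blank_idx):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frames]
    out = []
    prev = None
    for idx in best:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank token.
        if idx != prev and idx != blank_idx:
            out.append(vocab[idx])
        prev = idx
    return ''.join(out)


# Toy vocabulary (hypothetical); the blank token sits at the last index.
vocab = ['a', 'h', 'i', 'r', ' ']
blank_idx = len(vocab)

# One row of scores per time step, one column per symbol plus blank.
frames = [
    [0.1, 0.9, 0.0, 0.0, 0.0, 0.0],  # h
    [0.1, 0.8, 0.0, 0.0, 0.0, 0.1],  # h again, merged with the previous frame
    [0.9, 0.0, 0.0, 0.0, 0.0, 0.1],  # a
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # blank
    [0.0, 0.0, 0.0, 0.9, 0.0, 0.1],  # r
    [0.0, 0.0, 0.8, 0.1, 0.0, 0.1],  # i
]
print(greedy_ctc_decode(frames, vocab, blank_idx))
```

Greedy decoding looks at each frame in isolation, which is why a beam search that also consults a language model can recover words the acoustic model alone gets wrong.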

Load pyctcdecode + MLM#

To get better performance, you need a really good masked language model. We are doing our best to release one.

lm = malaya_speech.language_model.mlm(alpha = 0.01, beta = 0.5)
<malaya_speech.torch_model.mask_lm.LM at 0x7f40e6921970>
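The `alpha` and `beta` arguments passed to `mlm` above follow the naming commonly used in LM-fused beam search, where `alpha` scales the language-model score and `beta` adds a per-word bonus. How malaya-speech applies them internally is not shown in this tutorial, so the following is only a sketch of the conventional shallow-fusion formula. The `fused_score` helper is hypothetical and the scores are made up.

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha=0.01, beta=0.5):
    """Conventional shallow-fusion score for one beam hypothesis.

    acoustic_logp : log-probability from the CTC model
    lm_logp       : log-probability from the language model
    n_words       : number of words in the hypothesis
    """
    return acoustic_logp + alpha * lm_logp + beta * n_words


# Made-up numbers: two candidate transcripts for the same audio.
candidates = {
    'titiap hari': {'acoustic_logp': -4.0, 'lm_logp': -60.0, 'n_words': 2},
    'titip hari': {'acoustic_logp': -4.2, 'lm_logp': -20.0, 'n_words': 2},
}
scores = {text: fused_score(**parts) for text, parts in candidates.items()}
best = max(scores, key=scores.get)
print(best)
```

With a small `alpha`, the language model only tips the balance between hypotheses whose acoustic scores are already close.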
from pyctcdecode import Alphabet, BeamSearchDecoderCTC
from malaya_speech.utils.char import CTC_VOCAB

labels = CTC_VOCAB + ['_']
ctc_token_idx = len(CTC_VOCAB)
alphabet = Alphabet.build_alphabet(labels, ctc_token_idx=ctc_token_idx)
decoder = BeamSearchDecoderCTC(alphabet, lm)
out = decoder.decode_beams(logits[0], prune_history=True, beam_width = 10)
d_lm, lm_state, timesteps, logit_score, lm_score = out[0]
'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah maini'
out = decoder.decode_beams(logits[1], prune_history=True, beam_width = 10)
d_lm, lm_state, timesteps, logit_score, lm_score = out[0]
'helo nama saya besin saya tak suka mandi ketat saya masak'
out = decoder.decode_beams(logits[2], prune_history=True, beam_width = 10)
d_lm, lm_state, timesteps, logit_score, lm_score = out[0]
'helo nama saya musin saya suka mandi saya mandi titip hari'
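To see exactly what the masked language model changed, the decoded strings can be compared word by word with the earlier output without a language model. This is a small sketch using the standard library's `difflib`, with both lists copied from the outputs in this tutorial. The `changed_words` helper is hypothetical.

```python
import difflib

without_lm = [
    'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah ma ini',
    'helo nama saya esin saya tak suka mandi ketak saya masak',
    'helo nama saya musin saya suka mandi saya mandi titiap hari',
]
with_mlm = [
    'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah maini',
    'helo nama saya besin saya tak suka mandi ketat saya masak',
    'helo nama saya musin saya suka mandi saya mandi titip hari',
]


def changed_words(before: str, after: str):
    """Return (old_words, new_words) pairs for every span that differs."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]


for before, after in zip(without_lm, with_mlm):
    print(changed_words(before, after))
```

The differences are confined to a few words per utterance, which is consistent with the language model rescoring only the hypotheses where the acoustic model was uncertain.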