This tutorial is available as an IPython notebook at malaya-speech/example/gpt2-lm.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.


When doing CTC or RNNT beam decoding, we want to add language bias while searching for the optimum alignment, using a GPT2 causal language model.
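To make the idea of language bias concrete, here is a minimal sketch of how a language-model score can change which beam candidate wins. It is not the malaya-speech or pyctcdecode implementation: the candidate transcripts, their scores, the `toy_lm_logprob` helper and the weights `alpha` and `beta` are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical beam candidates: (transcript, acoustic log-probability).
candidates = [
    ("saya suka makan nasi", -4.1),
    ("saya suka makan nasik", -3.9),
]

# Hypothetical stand-in for a causal LM: sentence -> log-probability.
TOY_LM = {
    "saya suka makan nasi": math.log(0.02),
    "saya suka makan nasik": math.log(0.0001),
}


def toy_lm_logprob(text: str) -> float:
    # Sentences the toy LM has never seen get a small floor probability.
    return TOY_LM.get(text, math.log(1e-6))


def fused_score(text: str, acoustic_logprob: float,
                alpha: float = 0.5, beta: float = 1.0) -> float:
    # Weighted sum: acoustic score + alpha * LM score + beta * word count.
    return acoustic_logprob + alpha * toy_lm_logprob(text) + beta * len(text.split())


# Without the LM, the misspelled candidate wins on acoustic score alone.
best_acoustic = max(candidates, key=lambda c: c[1])[0]
# With the LM term, the well-formed sentence wins.
best_fused = max(candidates, key=lambda c: fused_score(*c))[0]

print(best_acoustic)
print(best_fused)
```

In this toy setup the acoustic score alone prefers the misspelled transcript, while the fused score prefers the sentence the language model finds more likely.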

List available GPT2#

We provide a few GPT2 models for our ASR models:

import malaya_speech

malaya_speech.language_model.available_gpt2()

                                   Size (MB)
mesolitica/gpt2-117m-bahasa-cased        454

Load GPT2 Model#

def gpt2(model: str = 'mesolitica/gpt2-117m-bahasa-cased', force_check: bool = True, **kwargs):
    """
    Load GPT2 language model.

    Parameters
    ----------
    model: str, optional (default='mesolitica/gpt2-117m-bahasa-cased')
        Check available models at `malaya_speech.language_model.available_gpt2()`.
    force_check: bool, optional (default=True)
        Force a check that the model is one of the malaya-speech models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya_speech.torch_model.gpt2_lm.LM class
    """

lm = malaya_speech.language_model.gpt2()
<malaya_speech.torch_model.gpt2_lm.LM at 0x7f9ca3419f70>

The malaya-speech GPT2 LM needs to be combined with pyctcdecode to decode CTC logits.

Use pyctcdecode#

From PYPI#

pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121

From source#

Check how to build from source in case no wheel is available for your operating system.

Building from source should only take a few minutes.

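from pyctcdecode import Alphabet, BeamSearchDecoderCTC
from malaya_speech.utils.char import CTC_VOCAB

# Append the CTC blank token '_' as the last label.
labels = CTC_VOCAB + ['_']
# Index of the blank token within `labels`.
ctc_token_idx = len(CTC_VOCAB)
alphabet = Alphabet.build_alphabet(labels, ctc_token_idx=ctc_token_idx)
# Pass the GPT2 LM loaded above as the decoder's language model.
decoder = BeamSearchDecoderCTC(alphabet, lm)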
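The role of the extra '_' label and of `ctc_token_idx` may be easier to see in a simplified greedy CTC decode. The sketch below is not pyctcdecode's beam search and does not use a language model; `toy_vocab`, `frames` and `greedy_ctc_decode` are hypothetical names, and the toy vocabulary stands in for the real `CTC_VOCAB`.

```python
from itertools import groupby

# Toy vocabulary standing in for CTC_VOCAB (an assumption, not the real list).
toy_vocab = ["a", "b", "c", " "]
labels = toy_vocab + ["_"]        # blank token appended as the last label
ctc_token_idx = len(toy_vocab)    # index of the blank token in `labels`


def greedy_ctc_decode(frame_ids):
    # Step 1: merge consecutive repeats of the same label index.
    merged = [idx for idx, _ in groupby(frame_ids)]
    # Step 2: drop the blank token, then map indices to characters.
    return "".join(labels[idx] for idx in merged if idx != ctc_token_idx)


# Per-frame best label indices, as one might get from an argmax over logits.
frames = [0, 0, 4, 0, 1, 1, 4, 4, 2]
decoded = greedy_ctc_decode(frames)
print(decoded)
```

The blank between the first two runs of `0` is what keeps the repeated "a" from being merged into one character. The beam search decoder built above considers many such alignments at once and lets the language model influence which one is kept.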