GPT2 LM#

This tutorial is available as an IPython notebook at malaya-speech/example/gpt2-lm.

This module is not language independent, so it is not safe to use on different languages. The pretrained models are trained on hyperlocal languages.

Purpose#

When doing CTC or RNNT beam decoding, we want to add language bias while searching for the optimum alignment, using a GPT2 causal model.
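To make the idea concrete, below is a minimal sketch of shallow fusion, one common way to add such a bias; it is not malaya-speech's actual internals, and the function name and the alpha/beta weights are illustrative only. Each candidate beam's acoustic score is combined with the LM log-probability of its text:

def shallow_fusion_score(acoustic_score, lm_logprob, num_words, alpha=0.5, beta=1.0):
    # alpha controls how strongly the GPT2 LM biases the search;
    # beta is a word-insertion bonus that offsets the LM's
    # preference for shorter transcriptions
    return acoustic_score + alpha * lm_logprob + beta * num_words

Beams are then ranked by this combined score instead of the acoustic score alone.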

List available GPT2#

We provide a few GPT2 models for our ASR models,

[1]:
import malaya_speech
[2]:
malaya_speech.language_model.available_gpt2()
[2]:
                                   Size (MB)
mesolitica/gpt2-117m-bahasa-cased        454

Load GPT2 Model#

def gpt2(model: str = 'mesolitica/gpt2-117m-bahasa-cased', force_check: bool = True, **kwargs):
    """
    Load GPT2 language model.

    Parameters
    ----------
    model: str, optional (default='mesolitica/gpt2-117m-bahasa-cased')
        Check available models at `malaya_speech.language_model.available_gpt2()`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the malaya models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya.torch_model.gpt2_lm.LM class
    """
[3]:
lm = malaya_speech.language_model.gpt2()
lm
[3]:
<malaya_speech.torch_model.gpt2_lm.LM at 0x7f9ca3419f70>

The malaya-speech GPT2 LM needs to be combined with pyctcdecode to decode CTC logits.

Use pyctcdecode#

From PYPI#

pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121

From source#

Check https://github.com/kensho-technologies/pyctcdecode for instructions on building from source in case there is no wheel available for your operating system.

Building from source should only take a few minutes.

[5]:
from pyctcdecode import Alphabet, BeamSearchDecoderCTC
from malaya_speech.utils.char import CTC_VOCAB

# append the CTC blank token to the vocabulary and tell the alphabet
# which index is the blank
labels = CTC_VOCAB + ['_']
ctc_token_idx = len(CTC_VOCAB)
alphabet = Alphabet.build_alphabet(labels, ctc_token_idx=ctc_token_idx)
# build a beam search decoder that scores beams with the GPT2 LM
decoder = BeamSearchDecoderCTC(alphabet, lm)
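With the decoder built, decoding is a single call on a matrix of per-frame log-probabilities. The logits below are random, purely to show the call signature; in practice they come from a malaya-speech CTC model run over audio:

import numpy as np

# fabricate a (time_steps, vocab_size) log-probability matrix just to
# illustrate the API; real logits come from an ASR model over `labels`
probs = np.random.dirichlet(np.ones(len(labels)), size=50).astype(np.float32)
out = decoder.decode(np.log(probs))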