GPT2 LM#
This tutorial is available as an IPython notebook at malaya-speech/example/gpt2-lm.
This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
Purpose#
When doing CTC or RNNT beam decoding, we want to add a language bias while searching for the optimal alignment, using a GPT2 causal language model.
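To make the idea of "language bias" concrete, here is a minimal sketch of how a language model score can be mixed into a beam hypothesis score. It is only an illustration of the general technique (often called shallow fusion), not the malaya-speech or pyctcdecode implementation. The function `fused_score`, the weights `alpha` and `beta`, and the candidate scores are all hypothetical.

```python
def fused_score(acoustic_logp, lm_logp, n_words, alpha=0.5, beta=1.0):
    """Combine acoustic and LM log-probabilities for one hypothesis.

    alpha weights the language model and beta rewards each emitted word.
    Both defaults are illustrative assumptions, not library defaults.
    """
    return acoustic_logp + alpha * lm_logp + beta * n_words


# Hypothetical beam candidates: (text, acoustic log-prob, LM log-prob).
candidates = [
    ("saya suka makan nasi", -4.2, -9.0),
    ("saya suka makan nasik", -4.0, -15.0),
]

# Pick the candidate with the highest fused score.
best = max(
    candidates,
    key=lambda c: fused_score(c[1], c[2], n_words=len(c[0].split())),
)
print(best[0])
```

In this toy case the second candidate has the better acoustic score, but the language model penalizes the misspelled word enough that the first candidate is selected.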
List available GPT2#
We provide a few GPT2 models for our ASR models:
[1]:
import malaya_speech
[2]:
malaya_speech.language_model.available_gpt2()
[2]:
Model | Size (MB) |
---|---|
mesolitica/gpt2-117m-bahasa-cased | 454 |
Load GPT2 LM#
def gpt2(model: str = 'mesolitica/gpt2-117m-bahasa-cased', force_check: bool = True, **kwargs):
    """
    Load GPT2 language model.

    Parameters
    ----------
    model: str, optional (default='mesolitica/gpt2-117m-bahasa-cased')
        Check available models at `malaya_speech.language_model.available_gpt2()`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the malaya models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result: malaya_speech.torch_model.gpt2_lm.LM class
    """
[3]:
lm = malaya_speech.language_model.gpt2()
lm
[3]:
<malaya_speech.torch_model.gpt2_lm.LM at 0x7f9ca3419f70>
The malaya-speech GPT2 LM needs to be combined with pyctcdecode to decode CTC logits.
Use pyctcdecode#
From PyPI#
pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121
From source#
Check https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no wheel available for your operating system.
Building from source should only take a few minutes.
[5]:
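from pyctcdecode import Alphabet, BeamSearchDecoderCTC
from malaya_speech.utils.char import CTC_VOCAB

# Append '_' as the CTC token; its index is the last position in `labels`.
labels = CTC_VOCAB + ['_']
ctc_token_idx = len(CTC_VOCAB)
alphabet = Alphabet.build_alphabet(labels, ctc_token_idx=ctc_token_idx)
# Pass the GPT2 LM loaded above as the decoder's language model.
decoder = BeamSearchDecoderCTC(alphabet, lm)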
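To illustrate why the label list ends with `'_'` and why its index is passed separately, here is a small self-contained sketch of greedy CTC decoding. This is a simplification for illustration only: pyctcdecode performs beam search rather than greedy decoding, and `toy_vocab`, `frame_ids`, and `greedy_ctc_decode` are hypothetical names standing in for the real vocabulary and model output.

```python
def greedy_ctc_decode(frame_ids, labels, blank_idx):
    """Collapse consecutive repeated ids, then drop the blank token."""
    out = []
    prev = None
    for idx in frame_ids:
        # Emit a label only when it differs from the previous frame
        # and is not the blank token.
        if idx != prev and idx != blank_idx:
            out.append(labels[idx])
        prev = idx
    return "".join(out)


# Toy vocabulary standing in for CTC_VOCAB (hypothetical, much smaller).
toy_vocab = ["a", "k", "u", " "]
labels = toy_vocab + ["_"]
blank_idx = len(toy_vocab)  # the blank sits at the last index

# Hypothetical per-frame argmax ids: a a _ k k u _ u
frame_ids = [0, 0, 4, 1, 1, 2, 4, 2]
print(greedy_ctc_decode(frame_ids, labels, blank_idx))
```

The blank between the two `u` frames keeps them as separate characters, while the repeated `a` and `k` frames collapse into one character each.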