Masked LM
Contents
Masked LM#
This tutorial is available as an IPython notebook at malaya-speech/example/mlm.
This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
Purpose#
When doing CTC or RNNT beam decoding, we want to add a language bias while finding the optimum alignment, using a Masked language model.
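To make the idea concrete, here is a toy sketch (all names and scores are made up, not part of malaya-speech) of how a language-model score can bias beam search toward the more fluent of two acoustically similar candidates, using simple shallow fusion of log-probabilities:

```python
def combined_score(acoustic_logprob, lm_logprob, lm_weight=0.5):
    # shallow fusion: weighted sum of acoustic and LM log-probabilities
    return acoustic_logprob + lm_weight * lm_logprob

# two acoustically similar beam candidates (hypothetical scores)
candidates = {
    'saya makan nasi': {'acoustic': -4.0, 'lm': -2.0},
    'saya makan nasij': {'acoustic': -3.9, 'lm': -9.0},
}
best = max(
    candidates,
    key=lambda c: combined_score(candidates[c]['acoustic'], candidates[c]['lm']),
)
print(best)  # the fluent sentence wins despite a slightly worse acoustic score
```

The actual combination used by pyctcdecode is more involved (per-step scoring inside the beam search), but the biasing principle is the same.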
List available Masked models#
We provide a few Masked models for our ASR models,
[1]:
import malaya_speech
[2]:
malaya_speech.language_model.available_mlm()
[2]:
| | Size (MB) |
|---|---|
| mesolitica/bert-base-standard-bahasa-cased | 443 |
| mesolitica/roberta-base-standard-bahasa-cased | 310 |
Load Masked Model#
def mlm(model: str = 'mesolitica/bert-base-standard-bahasa-cased', force_check: bool = True, **kwargs):
"""
Load Masked language model.
Parameters
----------
model: str, optional (default='mesolitica/bert-base-standard-bahasa-cased')
Check available models at `malaya_speech.language_model.available_mlm()`.
force_check: bool, optional (default=True)
    Force check that the model is one of the malaya models.
    Set to False if you have your own Hugging Face model.
Returns
-------
result: malaya_speech.torch_model.mask_lm.LM class
"""
[4]:
lm = malaya_speech.language_model.mlm()
lm
The malaya-speech Masked LM needs to be combined with pyctcdecode
to decode CTC logits.
Use pyctcdecode#
From PYPI#
pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121
From source#
Check https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no wheel available for your operating system.
Building from source should only take a few minutes.
[5]:
from pyctcdecode import Alphabet, BeamSearchDecoderCTC
from malaya_speech.utils.char import CTC_VOCAB
labels = CTC_VOCAB + ['_']
ctc_token_idx = len(CTC_VOCAB)
alphabet = Alphabet.build_alphabet(labels, ctc_token_idx=ctc_token_idx)
decoder = BeamSearchDecoderCTC(alphabet, lm)
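As a side note, the `'_'` appended to the vocabulary above is the CTC blank token. A minimal sketch of the collapse rule the beam decoder applies to every candidate path (merge repeated symbols, then drop blanks) is shown below; this is illustrative only, since pyctcdecode performs it internally together with LM scoring:

```python
def ctc_collapse(path, blank='_'):
    # merge consecutive repeats, then drop the blank token
    out = []
    prev = None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return ''.join(out)

print(ctc_collapse('hh_e_ll__llo'))  # -> 'hello'
```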