Masked LM#

This tutorial is available as an IPython notebook at malaya-speech/example/mlm.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

Purpose#

When doing CTC or RNNT beam decoding, we want to add a language bias while finding the optimum alignment, using a Masked language model.
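To make the idea of "language bias" concrete, here is a minimal, self-contained sketch (not the malaya-speech API): each beam hypothesis gets an acoustic log-probability from the CTC/RNNT model, and the language model adds a weighted bias so the fluent hypothesis wins even when the acoustics are ambiguous. The hypotheses, scores, and the `alpha` weight below are invented for illustration.

```python
def combined_score(acoustic_logprob, lm_logprob, alpha=0.5):
    """Shallow-fusion style scoring: acoustic score plus weighted LM score."""
    return acoustic_logprob + alpha * lm_logprob

# Two hypotheses with near-identical acoustic scores; the LM strongly
# prefers the fluent one, so it wins after biasing.
beams = {
    'saya makan nasi': {'acoustic': -4.1, 'lm': -2.0},
    'says makan nasi': {'acoustic': -4.0, 'lm': -9.5},
}
best = max(beams, key=lambda h: combined_score(beams[h]['acoustic'], beams[h]['lm']))
print(best)  # -> 'saya makan nasi'
```

The acoustic model alone slightly prefers the misspelled hypothesis; the LM bias flips the ranking.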

List available Masked models#

We provide a few Masked language models for our ASR models,

[1]:
import malaya_speech
[2]:
malaya_speech.language_model.available_mlm()
[2]:
                                               Size (MB)
mesolitica/bert-base-standard-bahasa-cased           443
mesolitica/roberta-base-standard-bahasa-cased        310

Load Masked Model#

def mlm(model: str = 'mesolitica/bert-base-standard-bahasa-cased', force_check: bool = True, **kwargs):
    """
    Load Masked language model.

    Parameters
    ----------
    model: str, optional (default='mesolitica/bert-base-standard-bahasa-cased')
        Check available models at `malaya_speech.language_model.available_mlm()`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the Malaya models.
        Set to False if you have your own Hugging Face model.

    Returns
    -------
    result: malaya_speech.torch_model.mask_lm.LM class
    """
[4]:
lm = malaya_speech.language_model.mlm()
lm

The malaya-speech Masked LM needs to be combined with pyctcdecode to decode CTC logits.

Use pyctcdecode#

From PYPI#

pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121

From source#

Check https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no available wheel for your operating system.

Building from source should only take a few minutes.

[5]:
from pyctcdecode import Alphabet, BeamSearchDecoderCTC
from malaya_speech.utils.char import CTC_VOCAB

labels = CTC_VOCAB + ['_']
ctc_token_idx = len(CTC_VOCAB)
alphabet = Alphabet.build_alphabet(labels, ctc_token_idx=ctc_token_idx)
decoder = BeamSearchDecoderCTC(alphabet, lm)
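The decoder above handles the full beam search, folding the Masked LM scores into each beam. To make the role of the `'_'` blank token concrete, here is a minimal, self-contained sketch of the CTC collapse rule the decoder applies to per-frame predictions: merge consecutive repeats, then drop blanks. The tiny label set and frame sequence are invented for illustration.

```python
def ctc_greedy_collapse(frame_ids, labels, blank='_'):
    """Map per-frame label indices to text: merge repeats, drop blanks."""
    out = []
    prev = None
    for i in frame_ids:
        ch = labels[i]
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return ''.join(out)

labels = list('abc') + ['_']  # blank appended last, as with CTC_VOCAB above
frames = [0, 0, 3, 1, 1, 3, 2]  # a a _ b b _ c
print(ctc_greedy_collapse(frames, labels))  # -> 'abc'
```

The real beam search keeps many such partial collapses alive at once and rescores them with the language model, rather than committing to the single best label per frame.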