Language Model

This tutorial is available as an IPython notebook at malaya-speech/example/ctc-language-model.

This module is language dependent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

Purpose

When doing CTC or RNNT beam decoding, we want to add a language bias while searching for the optimal alignment.

List available Language Models

We provide language models for our ASR models:

[1]:
import malaya_speech
[2]:
malaya_speech.stt.available_language_model()
[2]:
Model                     Size (MB)  LM order  Description                                        Command
bahasa                    17         3         Gathered from malaya-speech ASR bahasa transcript  [./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-news               24         3         Gathered from malaya-speech bahasa ASR transcr...  [./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-combined           29         3         Gathered from malaya-speech ASR bahasa transcr...  [./lmplz --text text.txt --arpa out.arpa -o 3 ...
redape-community          887.1      4         Mirror for https://github.com/redapesolutions/...  [./lmplz --text text.txt --arpa out.arpa -o 4 ...
dump-combined             310        3         Academia + News + IIUM + Parliament + Watpadd ...  [./lmplz --text text.txt --arpa out.arpa -o 3 ...
manglish                  202        3         Manglish News + Manglish Reddit + Manglish for...  [./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-manglish-combined  608        3         Combined `dump-combined` and `manglish`.           [./lmplz --text text.txt --arpa out.arpa -o 3 ...

The redape-community model comes from https://github.com/redapesolutions/suara-kami-community, another good Malay speech-to-text repository.

Load Language Model

def language_model(
    model: str = 'dump-combined', **kwargs
):
    """
    Load KenLM language model.

    Parameters
    ----------
    model : str, optional (default='dump-combined')
        Model architecture supported. Allowed values:

        * ``'bahasa'`` - Gathered from malaya-speech ASR bahasa transcript.
        * ``'bahasa-news'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).
        * ``'bahasa-combined'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).
        * ``'redape-community'`` - Mirror for https://github.com/redapesolutions/suara-kami-community
        * ``'dump-combined'`` - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
        * ``'manglish'`` - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
        * ``'bahasa-manglish-combined'`` - Combined `dump-combined` and `manglish`.

    Returns
    -------
    result : str
    """
[3]:
lm = malaya_speech.stt.language_model(model = 'bahasa')
lm
[3]:
'/Users/huseinzolkepli/Malaya-Speech/language-model/bahasa/model.trie.klm'

Build custom Language Model

  1. Build KenLM:

wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
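  2. Prepare a newline-separated text file (one sentence per line), then train the n-gram model and convert it to a binary trie. Feel free to use some text from https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping.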

kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1
kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm
  3. Once you have out.trie.klm, you can load it into the Scorer interface.

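from ctc_decoders import Scorer
from malaya_speech.utils.char import CTC_VOCAB

# alpha weights the language model, beta the word count (example values)
alpha, beta = 0.5, 1.0
scorer = Scorer(alpha, beta, 'out.trie.klm', CTC_VOCAB)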
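The text file from the preparation step usually works best when it only contains characters the decoder can actually emit. Below is a minimal sketch of such a cleanup, assuming the ASR vocabulary is lowercase ASCII letters, digits and space. That character set, and the helper names `normalize_line` and `write_corpus`, are assumptions for illustration and are not part of malaya-speech.

```python
import re
import string

# Assumed character set: lowercase ASCII letters, digits and space.
# Adjust this to match the vocabulary of the ASR model you decode with.
ALLOWED = set(string.ascii_lowercase + string.digits + " ")


def normalize_line(line):
    """Lowercase a line, replace unsupported characters with spaces,
    and collapse runs of whitespace."""
    line = line.lower()
    line = "".join(ch if ch in ALLOWED else " " for ch in line)
    return re.sub(r"\s+", " ", line).strip()


def write_corpus(lines, path="text.txt"):
    """Write one cleaned sentence per line and return how many were kept."""
    kept = 0
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            clean = normalize_line(line)
            if clean:  # skip lines that become empty after cleaning
                f.write(clean + "\n")
                kept += 1
    return kept


print(normalize_line("Saya suka, Makan NASI!"))
```

The resulting file can then be passed to lmplz through --text as in the commands above.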

Use ctc-decoders

From PyPI

pip3 install ctc-decoders

If you use Linux, note that we are unable to upload Linux wheels to the PyPI repository, so download the Linux wheel from malaya-speech/ctc-decoders.

From source

Check malaya-speech/ctc-decoders for how to build from source in case there is no wheel available for your operating system.

Building from source should only take a few minutes.

Load ctc-decoders

[15]:
from ctc_decoders import Scorer
from malaya_speech.utils.char import CTC_VOCAB
Init signature: Scorer(alpha, beta, model_path, vocabulary)
Docstring:
Wrapper for Scorer.

:param alpha: Parameter associated with language model. Don't use
              language model when alpha = 0.
:type alpha: float
:param beta: Parameter associated with word count. Don't use word
             count when beta = 0.
:type beta: float
:model_path: Path to load language model.
:type model_path: basestring
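To make the roles of alpha and beta concrete, here is a minimal sketch of the usual DeepSpeech-style score combination: acoustic log-probability plus alpha times the language model log-probability plus beta times the word count. The exact computation inside ctc-decoders may differ in detail. The function name `fused_score` and all numbers in the example are made up for illustration.

```python
def fused_score(acoustic_logp, lm_logp, word_count, alpha, beta):
    """Combine an acoustic score with a language model score.

    alpha = 0 disables the language model term,
    beta = 0 disables the word count term.
    """
    return acoustic_logp + alpha * lm_logp + beta * word_count


# Hypothetical candidates: (text, acoustic log-prob, LM log-prob).
# These values are illustrative only, not real model output.
candidates = [
    ("saya suka makam", -3.8, -9.0),
    ("saya suka makan", -4.0, -6.0),
]

alpha, beta = 0.5, 1.0
rescored = [
    (fused_score(ac, lm, len(text.split()), alpha, beta), text)
    for text, ac, lm in candidates
]
best = max(rescored)[1]
print(best)
```

In this toy case the acoustic score alone prefers the first candidate, while the language model term moves the second candidate to the top.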
[16]:
scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)
scorer
[16]:
<ctc_decoders.Scorer; proxy of <Swig Object of type 'Scorer *' at 0x14ffe3c00> >

Test

[2]:
from ctc_decoders import ctc_greedy_decoder, ctc_beam_search_decoder
import numpy as np
import malaya_speech
[19]:
# https://github.com/PaddlePaddle/DeepSpeech/blob/master/decoders/tests/test_decoders.py

vocab_list = ["\'", ' ', 'a', 'b', 'c', 'dk ']
beam_size = 20
probs_seq1 = [[
    0.06390443, 0.21124858, 0.27323887, 0.06870235, 0.0361254,
    0.18184413, 0.16493624
], [
    0.03309247, 0.22866108, 0.24390638, 0.09699597, 0.31895462,
    0.0094893, 0.06890021
], [
    0.218104, 0.19992557, 0.18245131, 0.08503348, 0.14903535,
    0.08424043, 0.08120984
], [
    0.12094152, 0.19162472, 0.01473646, 0.28045061, 0.24246305,
    0.05206269, 0.09772094
], [
    0.1333387, 0.00550838, 0.00301669, 0.21745861, 0.20803985,
    0.41317442, 0.01946335
], [
    0.16468227, 0.1980699, 0.1906545, 0.18963251, 0.19860937,
    0.04377724, 0.01457421
]]
probs_seq2 = [[
    0.08034842, 0.22671944, 0.05799633, 0.36814645, 0.11307441,
    0.04468023, 0.10903471
], [
    0.09742457, 0.12959763, 0.09435383, 0.21889204, 0.15113123,
    0.10219457, 0.20640612
], [
    0.45033529, 0.09091417, 0.15333208, 0.07939558, 0.08649316,
    0.12298585, 0.01654384
], [
    0.02512238, 0.22079203, 0.19664364, 0.11906379, 0.07816055,
    0.22538587, 0.13483174
], [
    0.17928453, 0.06065261, 0.41153005, 0.1172041, 0.11880313,
    0.07113197, 0.04139363
], [
    0.15882358, 0.1235788, 0.23376776, 0.20510435, 0.00279306,
    0.05294827, 0.22298418
]]
greedy_result = ["ac'bdk c", "b'dk a"]
beam_search_result = ['acdk c', "b'a"]
[20]:
ctc_greedy_decoder(np.array(probs_seq1), vocab_list) == greedy_result[0]
[20]:
True
[21]:
ctc_greedy_decoder(np.array(probs_seq2), vocab_list) == greedy_result[1]
[21]:
True
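For intuition, greedy CTC decoding can be sketched in a few lines of plain Python: take the most likely symbol at each time step, collapse consecutive repeats, then drop blanks. This is an illustrative reimplementation, not the ctc-decoders source. It assumes the blank is the extra last column, as the 7-column rows against the 6-entry vocab_list above suggest.

```python
def greedy_decode(probs, vocab, blank=None):
    """Greedy CTC decoding over a list of per-step probability rows."""
    if blank is None:
        blank = len(vocab)  # assumption: blank is the last column
    # Most likely index at every time step.
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    out = []
    prev = None
    for idx in best:
        # Keep a symbol only if it differs from the previous step
        # and is not the blank.
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)


toy_probs = [
    [0.6, 0.3, 0.1],  # a
    [0.6, 0.3, 0.1],  # a (repeat, collapsed)
    [0.1, 0.2, 0.7],  # blank
    [0.7, 0.2, 0.1],  # a (new symbol after blank)
    [0.2, 0.7, 0.1],  # b
]
print(greedy_decode(toy_probs, ["a", "b"]))
```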
[22]:
ctc_beam_search_decoder(probs_seq = np.array(probs_seq1),
                        beam_size = beam_size,
                        vocabulary = vocab_list)
[22]:
[(-6.480283737182617, 'acdk c'),
 (-6.483003616333008, 'acdk  '),
 (-6.52116060256958, 'acdk a'),
 (-6.526535511016846, 'acdk b'),
 (-6.570488452911377, 'a dk c'),
 (-6.573208332061768, 'a dk  '),
 (-6.61136531829834, 'a dk a'),
 (-6.6167402267456055, 'a dk b'),
 (-6.630837440490723, 'acbc'),
 (-6.63310432434082, 'acb'),
 (-6.633557319641113, 'acb '),
 (-6.644730091094971, 'a bc'),
 (-6.647449970245361, 'a b '),
 (-6.650537490844727, 'a b'),
 (-6.667605400085449, "acdk '"),
 (-6.6717143058776855, 'acba'),
 (-6.685606956481934, 'a ba'),
 (-6.686768531799316, ' cdk c'),
 (-6.689488410949707, ' cdk  '),
 (-6.709468364715576, 'a c')]
[23]:
ctc_beam_search_decoder(probs_seq = np.array(probs_seq2),
                        beam_size = beam_size,
                        vocabulary = vocab_list)
[23]:
[(-4.989980220794678, "b'a"),
 (-5.298550128936768, "b'dk a"),
 (-5.3370184898376465, "b' a"),
 (-5.585845470428467, "b'a'"),
 (-5.652693271636963, " 'a"),
 (-5.7635698318481445, "b'ab"),
 (-5.788026332855225, "b'ba"),
 (-6.0385026931762695, 'bdk a'),
 (-6.132683753967285, "b'ca"),
 (-6.137714385986328, " 'dk a"),
 (-6.158307075500488, " ' a"),
 (-6.171831130981445, "b'dk '"),
 (-6.221673011779785, "b' '"),
 (-6.240574359893799, 'b a'),
 (-6.270209312438965, "b'a "),
 (-6.2848052978515625, "b'dk ab"),
 (-6.304642200469971, 'ba'),
 (-6.305397987365723, "b' ab"),
 (-6.426036834716797, " 'ab"),
 (-6.505356311798096, "b'b")]
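The beam search output differs from the greedy result because it sums the probability of every alignment that collapses to the same text. A minimal sketch of textbook CTC prefix beam search without a language model follows. It is not the ctc-decoders implementation, which applies additional pruning and optional scorer terms, so scores will not match the output above exactly. The blank index is passed explicitly, and the names are illustrative.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def log_add(*xs):
    """Stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def prefix_beam_search(probs, vocab, beam_size, blank):
    # prefix -> (log-prob of paths ending in blank,
    #            log-prob of paths ending in the last symbol)
    beams = {(): (0.0, NEG_INF)}
    for row in probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for idx, p in enumerate(row):
                if p <= 0.0:
                    continue
                lp = math.log(p)
                if idx == blank:
                    # Blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (log_add(b, p_b + lp, p_nb + lp), nb)
                elif prefix and prefix[-1] == idx:
                    # Repeat without a blank collapses into the same prefix.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, log_add(nb, p_nb + lp))
                    # Repeat after a blank extends the prefix.
                    ext = prefix + (idx,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, log_add(nb, p_b + lp))
                else:
                    ext = prefix + (idx,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, log_add(nb, p_b + lp, p_nb + lp))
        scored = [(log_add(*s), pre, s) for pre, s in nxt.items()]
        scored = [x for x in scored if x[0] > NEG_INF]
        scored.sort(key=lambda x: x[0], reverse=True)
        beams = {pre: s for _, pre, s in scored[:beam_size]}
    return [
        (log_add(*s), "".join(vocab[i] for i in pre))
        for pre, s in beams.items()
    ]


# Two time steps, one symbol 'a' plus blank (index 1).
toy = [[0.6, 0.4], [0.6, 0.4]]
results = prefix_beam_search(toy, ["a"], beam_size=4, blank=1)
print(results)
```

In the toy input, the alignments "aa", "a-" and "-a" all collapse to "a", so its total probability is 0.36 + 0.24 + 0.24 = 0.84, while the empty string keeps 0.16.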

Use pyctcdecode

From PyPI

pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121

From source

Check https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no wheel available for your operating system.

Building from source should only take a few minutes.

[17]:
import kenlm
from pyctcdecode import build_ctcdecoder

kenlm_model = kenlm.Model(lm)
decoder = build_ctcdecoder(
    CTC_VOCAB,
    kenlm_model,
    alpha=0.5,
    beta=1.0,
)