Language Model#

This tutorial is available as an IPython notebook at malaya-speech/example/ctc-language-model.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

Purpose#

When doing CTC or RNNT beam decoding, we want to add a language-model bias while searching for the optimum alignment.
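Concretely, decoders usually combine the acoustic log-probability with a weighted language-model score and a word-insertion bonus (the alpha and beta parameters seen later in this tutorial). A minimal sketch of that shallow-fusion formula; the candidate transcripts and log-probabilities below are made up for illustration, not produced by the actual models:

```python
def shallow_fusion_score(acoustic_logprob, lm_logprob, word_count,
                         alpha=0.5, beta=1.0):
    # total score = acoustic + alpha * language model + beta * word count
    return acoustic_logprob + alpha * lm_logprob + beta * word_count

# two hypothetical candidates: (acoustic log-prob, LM log-prob)
candidates = {
    'saya makan nasi': (-12.0, -4.0),
    'saya makan nasib': (-11.5, -9.0),
}
scored = {
    text: shallow_fusion_score(ac, lm, len(text.split()))
    for text, (ac, lm) in candidates.items()
}
best = max(scored, key=scored.get)
```

Even though the second candidate has a better acoustic score, the language model pulls the first one ahead once alpha > 0.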

List available Language Model#

We provide language models for our ASR models,

[1]:
import malaya_speech
[2]:
malaya_speech.stt.available_language_model()
[2]:
Size (MB) LM order Description Command
bahasa 17 3 Gathered from malaya-speech ASR bahasa transcript [./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-news 24 3 Gathered from malaya-speech bahasa ASR transcr... [./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-combined 29 3 Gathered from malaya-speech ASR bahasa transcr... [./lmplz --text text.txt --arpa out.arpa -o 3 ...
redape-community 887.1 4 Mirror for https://github.com/redapesolutions/... [./lmplz --text text.txt --arpa out.arpa -o 4 ...
dump-combined 310 3 Academia + News + IIUM + Parliament + Watpadd ... [./lmplz --text text.txt --arpa out.arpa -o 3 ...
manglish 202 3 Manglish News + Manglish Reddit + Manglish for... [./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-manglish-combined 608 3 Combined `dump-combined` and `manglish`. [./lmplz --text text.txt --arpa out.arpa -o 3 ...

redape-community is mirrored from https://github.com/redapesolutions/suara-kami-community, another good Malay speech-to-text repository.

Load Language Model#

def language_model(
    model: str = 'dump-combined', **kwargs
):
    """
    Load KenLM language model.

    Parameters
    ----------
    model : str, optional (default='dump-combined')
        Model architecture supported. Allowed values:

        * ``'bahasa'`` - Gathered from malaya-speech ASR bahasa transcript.
        * ``'bahasa-news'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).
        * ``'bahasa-combined'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).
        * ``'redape-community'`` - Mirror for https://github.com/redapesolutions/suara-kami-community
        * ``'dump-combined'`` - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
        * ``'manglish'`` - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
        * ``'bahasa-manglish-combined'`` - Combined `dump-combined` and `manglish`.

    Returns
    -------
    result : str
        Path to the downloaded KenLM binary model.
    """
[3]:
lm = malaya_speech.stt.language_model(model = 'bahasa')
lm
[3]:
'/Users/huseinzolkepli/Malaya-Speech/language-model/bahasa/model.trie.klm'

Build custom Language Model#

  1. Build KenLM,

wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
  2. Prepare a newline-delimited text file. Feel free to use some from https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping.

kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1
kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm
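Conceptually, `lmplz` counts n-grams up to the requested order and prunes rare higher-order ones: `--prune 0 1 1` keeps all unigrams but drops bigrams and trigrams seen only once. A toy count-and-prune sketch of that first stage (KenLM additionally applies modified Kneser-Ney smoothing, which is not shown here):

```python
from collections import Counter

def ngram_counts(sentences, order):
    # count all n-grams from 1 up to `order`, with sentence boundary markers
    counts = {n: Counter() for n in range(1, order + 1)}
    for sentence in sentences:
        tokens = ['<s>'] + sentence.split() + ['</s>']
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, thresholds):
    # thresholds[n-1] mirrors --prune: drop n-grams with count <= threshold
    return {
        n: Counter({g: c for g, c in counter.items() if c > thresholds[n - 1]})
        for n, counter in counts.items()
    }

corpus = ['saya suka makan', 'saya suka tidur', 'saya suka makan']
pruned = prune(ngram_counts(corpus, 3), (0, 1, 1))
```

Here the trigram `('saya', 'suka', 'tidur')` appears once, so it is pruned, while `('saya', 'suka', 'makan')` survives with count 2.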
  3. Once you have out.trie.klm, you can load it into the Scorer interface.

from ctc_decoders import Scorer

scorer = Scorer(alpha, beta, 'out.trie.klm', vocab_list)

Use ctc-decoders#

From PYPI#

pip3 install ctc-decoders

If you use Linux: we are unable to upload Linux wheels to the PyPI repository, so download a Linux wheel from malaya-speech/ctc-decoders.

From source#

Check malaya-speech/ctc-decoders for how to build from source in case there is no wheel available for your operating system.

Building from source should only take a few minutes.

Load ctc-decoders#

[15]:
from ctc_decoders import Scorer
from malaya_speech.utils.char import CTC_VOCAB
Init signature: Scorer(alpha, beta, model_path, vocabulary)
Docstring:
Wrapper for Scorer.

:param alpha: Parameter associated with language model. Don't use
              language model when alpha = 0.
:type alpha: float
:param beta: Parameter associated with word count. Don't use word
             count when beta = 0.
:type beta: float
:model_path: Path to load language model.
:type model_path: basestring
[16]:
scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)
scorer
[16]:
<ctc_decoders.Scorer; proxy of <Swig Object of type 'Scorer *' at 0x14ffe3c00> >

Test#

[2]:
from ctc_decoders import ctc_greedy_decoder, ctc_beam_search_decoder
import numpy as np
import malaya_speech
[19]:
# https://github.com/PaddlePaddle/DeepSpeech/blob/master/decoders/tests/test_decoders.py

vocab_list = ["\'", ' ', 'a', 'b', 'c', 'dk ']
beam_size = 20
probs_seq1 = [[
    0.06390443, 0.21124858, 0.27323887, 0.06870235, 0.0361254,
    0.18184413, 0.16493624
], [
    0.03309247, 0.22866108, 0.24390638, 0.09699597, 0.31895462,
    0.0094893, 0.06890021
], [
    0.218104, 0.19992557, 0.18245131, 0.08503348, 0.14903535,
    0.08424043, 0.08120984
], [
    0.12094152, 0.19162472, 0.01473646, 0.28045061, 0.24246305,
    0.05206269, 0.09772094
], [
    0.1333387, 0.00550838, 0.00301669, 0.21745861, 0.20803985,
    0.41317442, 0.01946335
], [
    0.16468227, 0.1980699, 0.1906545, 0.18963251, 0.19860937,
    0.04377724, 0.01457421
]]
probs_seq2 = [[
    0.08034842, 0.22671944, 0.05799633, 0.36814645, 0.11307441,
    0.04468023, 0.10903471
], [
    0.09742457, 0.12959763, 0.09435383, 0.21889204, 0.15113123,
    0.10219457, 0.20640612
], [
    0.45033529, 0.09091417, 0.15333208, 0.07939558, 0.08649316,
    0.12298585, 0.01654384
], [
    0.02512238, 0.22079203, 0.19664364, 0.11906379, 0.07816055,
    0.22538587, 0.13483174
], [
    0.17928453, 0.06065261, 0.41153005, 0.1172041, 0.11880313,
    0.07113197, 0.04139363
], [
    0.15882358, 0.1235788, 0.23376776, 0.20510435, 0.00279306,
    0.05294827, 0.22298418
]]
greedy_result = ["ac'bdk c", "b'dk a"]
beam_search_result = ['acdk c', "b'a"]
[20]:
ctc_greedy_decoder(np.array(probs_seq1), vocab_list) == greedy_result[0]
[20]:
True
[21]:
ctc_greedy_decoder(np.array(probs_seq2), vocab_list) == greedy_result[1]
[21]:
True
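Under the hood, greedy decoding just takes the argmax label at each timestep, collapses consecutive repeats, and drops the blank symbol (the extra last index, matching the convention of these test matrices: 6 vocabulary entries, 7 probability columns). A minimal NumPy sketch with a small made-up probability matrix:

```python
import numpy as np

def greedy_decode(probs, vocab):
    blank = len(vocab)                    # blank is the extra, last index
    best = np.argmax(np.asarray(probs), axis=1)
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:  # collapse repeats, skip blanks
            out.append(vocab[idx])
        prev = idx
    return ''.join(out)

vocab = ["'", ' ', 'a', 'b', 'c', 'dk ']
probs = [
    [0.1, 0.1, 0.6, 0.1, 0.05, 0.025, 0.025],   # argmax -> 'a'
    [0.1, 0.1, 0.6, 0.1, 0.05, 0.025, 0.025],   # repeated 'a' -> collapsed
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.7],  # blank -> dropped
    [0.1, 0.1, 0.1, 0.6, 0.05, 0.025, 0.025],   # argmax -> 'b'
]
decoded = greedy_decode(probs, vocab)  # 'ab'
```

A blank between two identical labels keeps them separate, which is why CTC collapses repeats before removing blanks.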
[22]:
ctc_beam_search_decoder(probs_seq = np.array(probs_seq1),
                        beam_size = beam_size,
                        vocabulary = vocab_list)
[22]:
[(-6.480283737182617, 'acdk c'),
 (-6.483003616333008, 'acdk  '),
 (-6.52116060256958, 'acdk a'),
 (-6.526535511016846, 'acdk b'),
 (-6.570488452911377, 'a dk c'),
 (-6.573208332061768, 'a dk  '),
 (-6.61136531829834, 'a dk a'),
 (-6.6167402267456055, 'a dk b'),
 (-6.630837440490723, 'acbc'),
 (-6.63310432434082, 'acb'),
 (-6.633557319641113, 'acb '),
 (-6.644730091094971, 'a bc'),
 (-6.647449970245361, 'a b '),
 (-6.650537490844727, 'a b'),
 (-6.667605400085449, "acdk '"),
 (-6.6717143058776855, 'acba'),
 (-6.685606956481934, 'a ba'),
 (-6.686768531799316, ' cdk c'),
 (-6.689488410949707, ' cdk  '),
 (-6.709468364715576, 'a c')]
[23]:
ctc_beam_search_decoder(probs_seq = np.array(probs_seq2),
                        beam_size = beam_size,
                        vocabulary = vocab_list)
[23]:
[(-4.989980220794678, "b'a"),
 (-5.298550128936768, "b'dk a"),
 (-5.3370184898376465, "b' a"),
 (-5.585845470428467, "b'a'"),
 (-5.652693271636963, " 'a"),
 (-5.7635698318481445, "b'ab"),
 (-5.788026332855225, "b'ba"),
 (-6.0385026931762695, 'bdk a'),
 (-6.132683753967285, "b'ca"),
 (-6.137714385986328, " 'dk a"),
 (-6.158307075500488, " ' a"),
 (-6.171831130981445, "b'dk '"),
 (-6.221673011779785, "b' '"),
 (-6.240574359893799, 'b a'),
 (-6.270209312438965, "b'a "),
 (-6.2848052978515625, "b'dk ab"),
 (-6.304642200469971, 'ba'),
 (-6.305397987365723, "b' ab"),
 (-6.426036834716797, " 'ab"),
 (-6.505356311798096, "b'b")]
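The beams above were produced without a language model; plugging a Scorer into the decoder (via the external scoring function argument, `ext_scoring_func` in the DeepSpeech implementation this test file comes from) shifts these rankings by alpha and beta. A toy re-ranking sketch using only the word-count bonus (beta) on the top three hypothetical beams above, to show how the ordering can flip:

```python
def rescore(beams, beta=1.0):
    # add a word-insertion bonus and re-sort; a real Scorer also
    # adds alpha * LM log-probability for each candidate
    rescored = [(score + beta * len(text.split()), text) for score, text in beams]
    return sorted(rescored, reverse=True)

beams = [(-4.99, "b'a"), (-5.30, "b'dk a"), (-5.34, "b' a")]
reranked = rescore(beams, beta=1.0)
```

With beta = 1.0, the two-word candidate "b'dk a" overtakes the one-word "b'a" even though its raw beam score was lower.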

Use pyctcdecode#

From PYPI#

pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121

From source#

Check https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no wheel available for your operating system.

Building from source should only take a few minutes.

[17]:
import kenlm
from pyctcdecode import build_ctcdecoder

kenlm_model = kenlm.Model(lm)
decoder = build_ctcdecoder(
    CTC_VOCAB,
    kenlm_model,
    alpha=0.5,
    beta=1.0,
)