Language Model¶

This tutorial is available as an IPython notebook at malaya-speech/example/ctc-language-model.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

Purpose¶

When doing CTC or RNNT beam decoding, we want to add a language bias while searching for the optimal alignment.

List available Language Model¶

We provide language models for our ASR models:

[1]:

import malaya_speech

[2]:

malaya_speech.stt.available_language_model()

[2]:

	Size (MB)	LM order	Description	Command
bahasa	17	3	Gathered from malaya-speech ASR bahasa transcript	[./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-news	24	3	Gathered from malaya-speech bahasa ASR transcr...	[./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-combined	29	3	Gathered from malaya-speech ASR bahasa transcr...	[./lmplz --text text.txt --arpa out.arpa -o 3 ...
redape-community	887.1	4	Mirror for https://github.com/redapesolutions/...	[./lmplz --text text.txt --arpa out.arpa -o 4 ...
dump-combined	310	3	Academia + News + IIUM + Parliament + Watpadd ...	[./lmplz --text text.txt --arpa out.arpa -o 3 ...
manglish	202	3	Manglish News + Manglish Reddit + Manglish for...	[./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-manglish-combined	608	3	Combined `dump-combined` and `manglish`.	[./lmplz --text text.txt --arpa out.arpa -o 3 ...

The redape-community model comes from https://github.com/redapesolutions/suara-kami-community, another good Malay speech-to-text repository.

Load Language Model¶

def language_model(
    model: str = 'dump-combined', **kwargs
):
    """
    Load KenLM language model.

    Parameters
    ----------
    model : str, optional (default='dump-combined')
        Model architecture supported. Allowed values:

        * ``'bahasa'`` - Gathered from malaya-speech ASR bahasa transcript.
        * ``'bahasa-news'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).
        * ``'bahasa-combined'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).
        * ``'redape-community'`` - Mirror for https://github.com/redapesolutions/suara-kami-community
        * ``'dump-combined'`` - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
        * ``'manglish'`` - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
        * ``'bahasa-manglish-combined'`` - Combined `dump-combined` and `manglish`.

    Returns
    -------
    result : str
    """

[3]:

lm = malaya_speech.stt.language_model(model = 'bahasa')
lm

[3]:

'/Users/huseinzolkepli/Malaya-Speech/language-model/bahasa/model.trie.klm'

Build custom Language Model¶

Build KenLM,

wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2

Prepare a newline-delimited text file with one sentence per line. Feel free to use some from https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping.
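A minimal sketch of preparing such a file, assuming you want lowercase text restricted to characters your acoustic model can emit. The `ALLOWED` character set and the `normalize_line` and `write_corpus` helpers are hypothetical names for illustration, not part of malaya-speech; match the character set to your own vocabulary.

```python
import re

# Hypothetical character set (an assumption); replace it with your ASR vocabulary.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789 ")


def normalize_line(line: str) -> str:
    """Lowercase, replace characters outside ALLOWED with spaces, collapse whitespace."""
    lowered = line.lower()
    kept = "".join(ch if ch in ALLOWED else " " for ch in lowered)
    return re.sub(r"\s+", " ", kept).strip()


def write_corpus(lines, path="text.txt"):
    """Write one normalized, non-empty sentence per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            cleaned = normalize_line(line)
            if cleaned:
                f.write(cleaned + "\n")
```

Text that the decoder can never produce (punctuation, uppercase, unusual symbols) only dilutes the n-gram counts, so it is usually dropped before training.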

kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1
kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm
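If you prefer to drive the two commands above from Python, one possible wrapper looks like this. It is only a sketch: `kenlm_commands` and `build_language_model` are hypothetical helper names, and the default paths simply mirror the commands shown above. In KenLM, `--prune 0 1 1` gives one count threshold per n-gram order, so here unigrams are kept as-is while bigrams and trigrams seen only once are dropped.

```python
import subprocess


def kenlm_commands(text="text.txt", arpa="out.arpa", binary="out.trie.klm",
                   order=3, bin_dir="kenlm/build/bin"):
    """Return two argv lists: estimate the ARPA model, then binarize it."""
    lmplz = [f"{bin_dir}/lmplz", "--text", text, "--arpa", arpa,
             "-o", str(order), "--prune", "0", "1", "1"]
    build_binary = [f"{bin_dir}/build_binary", "-q", "8", "-b", "7", "-a", "256",
                    "trie", arpa, binary]
    return [lmplz, build_binary]


def build_language_model(**kwargs):
    """Run both steps in order; raises CalledProcessError if either fails."""
    for argv in kenlm_commands(**kwargs):
        subprocess.run(argv, check=True)
```

Note that the `--prune` thresholds above are fixed for order 3; if you change `order`, you need to supply one threshold per order yourself.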

Once you have out.trie.klm, you can load it into the scorer interface.

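from ctc_decoders import Scorer
from malaya_speech.utils.char import CTC_VOCAB

# alpha weights the language model, beta weights the word count
alpha, beta = 0.5, 1.0
scorer = Scorer(alpha, beta, 'out.trie.klm', CTC_VOCAB)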

Use ctc-decoders¶

From PyPI¶

pip3 install ctc-decoders

If you use Linux, note that we are unable to upload Linux wheels to the PyPI repository, so download the Linux wheel from malaya-speech/ctc-decoders.

From source¶

See malaya-speech/ctc-decoders for how to build from source in case no wheel is available for your operating system.

Building from source should only take a few minutes.

Load ctc-decoders¶

[15]:

from ctc_decoders import Scorer
from malaya_speech.utils.char import CTC_VOCAB

Init signature: Scorer(alpha, beta, model_path, vocabulary)
Docstring:
Wrapper for Scorer.

:param alpha: Parameter associated with language model. Don't use
              language model when alpha = 0.
:type alpha: float
:param beta: Parameter associated with word count. Don't use word
             count when beta = 0.
:type beta: float
:model_path: Path to load language model.
:type model_path: basestring
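To get a feel for what alpha and beta do, here is a toy sketch of the scoring rule commonly used for CTC decoding with an external language model: acoustic log-probability, plus alpha times the LM log-probability, plus beta times the word count. This is an illustration under that assumption, not the actual ctc-decoders internals, and all the probabilities below are made up.

```python
import math


def fused_score(acoustic_logp, lm_logp, word_count, alpha, beta):
    """Acoustic log-probability plus weighted LM score and word-count bonus."""
    return acoustic_logp + alpha * lm_logp + beta * word_count


# Made-up numbers: two hypotheses the acoustic model can barely tell apart.
hyps = {
    "saya suka makan": {"acoustic": math.log(0.20), "lm": math.log(0.010)},
    "saya suka makam": {"acoustic": math.log(0.22), "lm": math.log(0.0001)},
}


def best(alpha, beta):
    """Return the hypothesis with the highest fused score."""
    return max(
        hyps,
        key=lambda text: fused_score(
            hyps[text]["acoustic"], hyps[text]["lm"], len(text.split()), alpha, beta
        ),
    )


print(best(alpha=0.0, beta=0.0))  # acoustic score only
print(best(alpha=0.5, beta=1.0))  # language model tips the balance
```

With alpha = 0 the language model is ignored and the acoustically stronger hypothesis wins; with alpha = 0.5 the hypothesis the language model finds more plausible wins instead.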

[16]:

scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)
scorer

[16]:

<ctc_decoders.Scorer; proxy of <Swig Object of type 'Scorer *' at 0x14ffe3c00> >

Test¶

[2]:

from ctc_decoders import ctc_greedy_decoder, ctc_beam_search_decoder
import numpy as np
import malaya_speech

[19]:

# https://github.com/PaddlePaddle/DeepSpeech/blob/master/decoders/tests/test_decoders.py

vocab_list = ["\'", ' ', 'a', 'b', 'c', 'dk ']
beam_size = 20
probs_seq1 = [[
    0.06390443, 0.21124858, 0.27323887, 0.06870235, 0.0361254,
    0.18184413, 0.16493624
], [
    0.03309247, 0.22866108, 0.24390638, 0.09699597, 0.31895462,
    0.0094893, 0.06890021
], [
    0.218104, 0.19992557, 0.18245131, 0.08503348, 0.14903535,
    0.08424043, 0.08120984
], [
    0.12094152, 0.19162472, 0.01473646, 0.28045061, 0.24246305,
    0.05206269, 0.09772094
], [
    0.1333387, 0.00550838, 0.00301669, 0.21745861, 0.20803985,
    0.41317442, 0.01946335
], [
    0.16468227, 0.1980699, 0.1906545, 0.18963251, 0.19860937,
    0.04377724, 0.01457421
]]
probs_seq2 = [[
    0.08034842, 0.22671944, 0.05799633, 0.36814645, 0.11307441,
    0.04468023, 0.10903471
], [
    0.09742457, 0.12959763, 0.09435383, 0.21889204, 0.15113123,
    0.10219457, 0.20640612
], [
    0.45033529, 0.09091417, 0.15333208, 0.07939558, 0.08649316,
    0.12298585, 0.01654384
], [
    0.02512238, 0.22079203, 0.19664364, 0.11906379, 0.07816055,
    0.22538587, 0.13483174
], [
    0.17928453, 0.06065261, 0.41153005, 0.1172041, 0.11880313,
    0.07113197, 0.04139363
], [
    0.15882358, 0.1235788, 0.23376776, 0.20510435, 0.00279306,
    0.05294827, 0.22298418
]]
greedy_result = ["ac'bdk c", "b'dk a"]
beam_search_result = ['acdk c', "b'a"]

[20]:

ctc_greedy_decoder(np.array(probs_seq1), vocab_list) == greedy_result[0]

[20]:

True

[21]:

ctc_greedy_decoder(np.array(probs_seq2), vocab_list) == greedy_result[1]

[21]:

True
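For intuition, greedy CTC decoding can be sketched in a few lines of plain Python: take the most probable label at each time step, merge consecutive repeats, then drop blanks. The sketch assumes the blank label is the extra last column (index `len(vocabulary)`), which is consistent with the 7-column probabilities and 6-entry vocabulary above. `greedy_decode` is a hypothetical helper, not the ctc-decoders implementation.

```python
def greedy_decode(probs_seq, vocabulary):
    """Argmax per time step, merge consecutive repeats, then drop blanks."""
    blank = len(vocabulary)  # assumption: blank is the extra last column
    best_ids = [max(range(len(row)), key=row.__getitem__) for row in probs_seq]
    out = []
    prev = None
    for idx in best_ids:
        if idx != prev and idx != blank:
            out.append(vocabulary[idx])
        prev = idx
    return "".join(out)


vocab_list = ["'", ' ', 'a', 'b', 'c', 'dk ']
probs_seq2 = [
    [0.08034842, 0.22671944, 0.05799633, 0.36814645, 0.11307441, 0.04468023, 0.10903471],
    [0.09742457, 0.12959763, 0.09435383, 0.21889204, 0.15113123, 0.10219457, 0.20640612],
    [0.45033529, 0.09091417, 0.15333208, 0.07939558, 0.08649316, 0.12298585, 0.01654384],
    [0.02512238, 0.22079203, 0.19664364, 0.11906379, 0.07816055, 0.22538587, 0.13483174],
    [0.17928453, 0.06065261, 0.41153005, 0.1172041, 0.11880313, 0.07113197, 0.04139363],
    [0.15882358, 0.1235788, 0.23376776, 0.20510435, 0.00279306, 0.05294827, 0.22298418],
]

print(greedy_decode(probs_seq2, vocab_list))  # b'dk a
```

The first two steps both pick 'b' and the last two both pick 'a', so each pair is merged into a single label.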

[22]:

ctc_beam_search_decoder(probs_seq = np.array(probs_seq1),
                        beam_size = beam_size,
                        vocabulary = vocab_list)

[22]:

[(-6.480283737182617, 'acdk c'),
 (-6.483003616333008, 'acdk  '),
 (-6.52116060256958, 'acdk a'),
 (-6.526535511016846, 'acdk b'),
 (-6.570488452911377, 'a dk c'),
 (-6.573208332061768, 'a dk  '),
 (-6.61136531829834, 'a dk a'),
 (-6.6167402267456055, 'a dk b'),
 (-6.630837440490723, 'acbc'),
 (-6.63310432434082, 'acb'),
 (-6.633557319641113, 'acb '),
 (-6.644730091094971, 'a bc'),
 (-6.647449970245361, 'a b '),
 (-6.650537490844727, 'a b'),
 (-6.667605400085449, "acdk '"),
 (-6.6717143058776855, 'acba'),
 (-6.685606956481934, 'a ba'),
 (-6.686768531799316, ' cdk c'),
 (-6.689488410949707, ' cdk  '),
 (-6.709468364715576, 'a c')]

[23]:

ctc_beam_search_decoder(probs_seq = np.array(probs_seq2),
                        beam_size = beam_size,
                        vocabulary = vocab_list)

[23]:

[(-4.989980220794678, "b'a"),
 (-5.298550128936768, "b'dk a"),
 (-5.3370184898376465, "b' a"),
 (-5.585845470428467, "b'a'"),
 (-5.652693271636963, " 'a"),
 (-5.7635698318481445, "b'ab"),
 (-5.788026332855225, "b'ba"),
 (-6.0385026931762695, 'bdk a'),
 (-6.132683753967285, "b'ca"),
 (-6.137714385986328, " 'dk a"),
 (-6.158307075500488, " ' a"),
 (-6.171831130981445, "b'dk '"),
 (-6.221673011779785, "b' '"),
 (-6.240574359893799, 'b a'),
 (-6.270209312438965, "b'a "),
 (-6.2848052978515625, "b'dk ab"),
 (-6.304642200469971, 'ba'),
 (-6.305397987365723, "b' ab"),
 (-6.426036834716797, " 'ab"),
 (-6.505356311798096, "b'b")]
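Unlike greedy decoding, beam search sums the probability of every alignment that collapses to the same text. A simplified CTC prefix beam search, without a language model and without the pruning options of the real decoder, can be sketched as follows. `prefix_beam_search` is a hypothetical helper that assumes the blank is the last column; it is not the ctc-decoders implementation, so do not expect identical scores.

```python
import math
from collections import defaultdict


def prefix_beam_search(probs_seq, vocabulary, beam_size):
    """Return (log-probability, text) pairs, best first."""
    blank = len(vocabulary)  # assumption: blank is the extra last column
    # prefix (tuple of label ids) -> [P(ends in blank), P(ends in a label)]
    beams = {(): [1.0, 0.0]}
    for row in probs_seq:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            total = p_b + p_nb
            # Emitting a blank keeps the prefix unchanged.
            nxt[prefix][0] += total * row[blank]
            for i in range(blank):
                if prefix and prefix[-1] == i:
                    # Repeating the last label without a blank merges into it.
                    nxt[prefix][1] += p_nb * row[i]
                    # A new copy of the label is only possible after a blank.
                    nxt[prefix + (i,)][1] += p_b * row[i]
                else:
                    nxt[prefix + (i,)][1] += total * row[i]
        ranked = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])
    results = [
        (math.log(sum(p)), "".join(vocabulary[i] for i in prefix))
        for prefix, p in beams.items()
        if sum(p) > 0
    ]
    return sorted(results, reverse=True)


# Two time steps, one label 'a' plus blank: "a" collects 0.36 + 0.24 + 0.24.
toy = prefix_beam_search([[0.6, 0.4], [0.6, 0.4]], ['a'], beam_size=10)
print(toy[0][1], round(math.exp(toy[0][0]), 2))
```

In the toy input, the alignments "aa", "a-" and "-a" all collapse to "a", which is why its total probability is higher than any single path.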

Use pyctcdecode¶

From PyPI¶

pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121

From source¶

See https://github.com/kensho-technologies/pyctcdecode for how to build from source in case no wheel is available for your operating system.

Building from source should only take a few minutes.

[17]:

import kenlm
from pyctcdecode import build_ctcdecoder

kenlm_model = kenlm.Model(lm)
decoder = build_ctcdecoder(
    CTC_VOCAB,
    kenlm_model,
    alpha=0.5,
    beta=1.0,
)