Language Model#

This tutorial is available as an IPython notebook at malaya-speech/example/ctc-language-model.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

Purpose#

When doing CTC or RNNT beam decoding, we want to add a language-model bias while searching for the optimum alignment.
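Concretely, decoders usually combine the acoustic log-probability with a weighted language-model score and a word-insertion bonus (the alpha and beta parameters seen later in this tutorial). A minimal sketch of that shallow-fusion formula; the candidate transcripts and log-probabilities below are made up for illustration, not produced by the actual models:

```python
def shallow_fusion_score(acoustic_logprob, lm_logprob, word_count,
                         alpha=0.5, beta=1.0):
    # total score = acoustic + alpha * language model + beta * word count
    return acoustic_logprob + alpha * lm_logprob + beta * word_count

# two hypothetical candidates: (acoustic log-prob, LM log-prob)
candidates = {
    'saya makan nasi': (-12.0, -4.0),
    'saya makan nasib': (-11.5, -9.0),
}
scored = {
    text: shallow_fusion_score(ac, lm, len(text.split()))
    for text, (ac, lm) in candidates.items()
}
best = max(scored, key=scored.get)
```

Even though the second candidate has a better acoustic score, the language model pulls the first one ahead once alpha > 0.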

List available Language Model#

We provide language models for our ASR models,

[1]:
import malaya_speech
[2]:
malaya_speech.stt.available_language_model()
[2]:
Size (MB) LM order Description Command
bahasa 17 3 Gathered from malaya-speech ASR bahasa transcript [./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-news 24 3 Gathered from malaya-speech bahasa ASR transcr... [./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-combined 29 3 Gathered from malaya-speech ASR bahasa transcr... [./lmplz --text text.txt --arpa out.arpa -o 3 ...
redape-community 887.1 4 Mirror for https://github.com/redapesolutions/... [./lmplz --text text.txt --arpa out.arpa -o 4 ...
dump-combined 310 3 Academia + News + IIUM + Parliament + Watpadd ... [./lmplz --text text.txt --arpa out.arpa -o 3 ...
manglish 202 3 Manglish News + Manglish Reddit + Manglish for... [./lmplz --text text.txt --arpa out.arpa -o 3 ...
bahasa-manglish-combined 608 3 Combined `dump-combined` and `manglish`. [./lmplz --text text.txt --arpa out.arpa -o 3 ...

redape-community is mirrored from https://github.com/redapesolutions/suara-kami-community, another good Malay speech-to-text repository.

Load Language Model#

def language_model(
    model: str = 'dump-combined', **kwargs
):
    """
    Load KenLM language model.

    Parameters
    ----------
    model : str, optional (default='dump-combined')
        Model architecture supported. Allowed values:

        * ``'bahasa'`` - Gathered from malaya-speech ASR bahasa transcript.
        * ``'bahasa-news'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).
        * ``'bahasa-combined'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).
        * ``'redape-community'`` - Mirror for https://github.com/redapesolutions/suara-kami-community
        * ``'dump-combined'`` - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
        * ``'manglish'`` - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
        * ``'bahasa-manglish-combined'`` - Combined `dump-combined` and `manglish`.

    Returns
    -------
    result : str
        Path to the downloaded KenLM binary model.
    """
[3]:
lm = malaya_speech.stt.language_model(model = 'bahasa')
lm
[3]:
'/Users/huseinzolkepli/Malaya-Speech/language-model/bahasa/model.trie.klm'

Build custom Language Model#

  1. Build KenLM,

wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
  2. Prepare a newline-delimited text file. Feel free to use some from https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping.

kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1
kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm
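Conceptually, `lmplz` counts n-grams up to the requested order and prunes rare higher-order ones: `--prune 0 1 1` keeps all unigrams but drops bigrams and trigrams seen only once. A toy count-and-prune sketch of that first stage (KenLM additionally applies modified Kneser-Ney smoothing, which is not shown here):

```python
from collections import Counter

def ngram_counts(sentences, order):
    # count all n-grams from 1 up to `order`, with sentence boundary markers
    counts = {n: Counter() for n in range(1, order + 1)}
    for sentence in sentences:
        tokens = ['<s>'] + sentence.split() + ['</s>']
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, thresholds):
    # thresholds[n-1] mirrors --prune: drop n-grams with count <= threshold
    return {
        n: Counter({g: c for g, c in counter.items() if c > thresholds[n - 1]})
        for n, counter in counts.items()
    }

corpus = ['saya suka makan', 'saya suka tidur', 'saya suka makan']
pruned = prune(ngram_counts(corpus, 3), (0, 1, 1))
```

Here the trigram `('saya', 'suka', 'tidur')` appears once, so it is pruned, while `('saya', 'suka', 'makan')` survives with count 2.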
  3. Once you have out.trie.klm, you can load it into the Scorer interface.

from ctc_decoders import Scorer

scorer = Scorer(alpha, beta, 'out.trie.klm', vocab_list)

Use ctc-decoders#

From PYPI#

pip3 install ctc-decoders

If you use Linux: we are unable to upload Linux wheels to the PyPI repository, so download a Linux wheel from malaya-speech/ctc-decoders.

From source#

Check malaya-speech/ctc-decoders for how to build from source in case there is no wheel available for your operating system.

Building from source should only take a few minutes.

Load ctc-decoders#

[15]:
from ctc_decoders import Scorer
from malaya_speech.utils.char import CTC_VOCAB
Init signature: Scorer(alpha, beta, model_path, vocabulary)
Docstring:
Wrapper for Scorer.

:param alpha: Parameter associated with language model. Don't use
              language model when alpha = 0.
:type alpha: float
:param beta: Parameter associated with word count. Don't use word
             count when beta = 0.
:type beta: float
:model_path: Path to load language model.
:type model_path: basestring
[16]:
scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)
scorer
[16]:
<ctc_decoders.Scorer; proxy of <Swig Object of type 'Scorer *' at 0x14ffe3c00> >

Test#

[2]:
from ctc_decoders import ctc_greedy_decoder, ctc_beam_search_decoder
import numpy as np
import malaya_speech
[19]:
# https://github.com/PaddlePaddle/DeepSpeech/blob/master/decoders/tests/test_decoders.py

vocab_list = ["\'", ' ', 'a', 'b', 'c', 'dk ']
beam_size = 20
probs_seq1 = [[
    0.06390443, 0.21124858, 0.27323887, 0.06870235, 0.0361254,
    0.18184413, 0.16493624
], [
    0.03309247, 0.22866108, 0.24390638, 0.09699597, 0.31895462,
    0.0094893, 0.06890021
], [
    0.218104, 0.19992557, 0.18245131, 0.08503348, 0.14903535,
    0.08424043, 0.08120984
], [
    0.12094152, 0.19162472, 0.01473646, 0.28045061, 0.24246305,
    0.05206269, 0.09772094
], [
    0.1333387, 0.00550838, 0.00301669, 0.21745861, 0.20803985,
    0.41317442, 0.01946335
], [
    0.16468227, 0.1980699, 0.1906545, 0.18963251, 0.19860937,
    0.04377724, 0.01457421
]]
probs_seq2 = [[
    0.08034842, 0.22671944, 0.05799633, 0.36814645, 0.11307441,
    0.04468023, 0.10903471
], [
    0.09742457, 0.12959763, 0.09435383, 0.21889204, 0.15113123,
    0.10219457, 0.20640612
], [
    0.45033529, 0.09091417, 0.15333208, 0.07939558, 0.08649316,
    0.12298585, 0.01654384
], [
    0.02512238, 0.22079203, 0.19664364, 0.11906379, 0.07816055,
    0.22538587, 0.13483174
], [
    0.17928453, 0.06065261, 0.41153005, 0.1172041, 0.11880313,
    0.07113197, 0.04139363
], [
    0.15882358, 0.1235788, 0.23376776, 0.20510435, 0.00279306,
    0.05294827, 0.22298418
]]
greedy_result = ["ac'bdk c", "b'dk a"]
beam_search_result = ['acdk c', "b'a"]
[20]:
ctc_greedy_decoder(np.array(probs_seq1), vocab_list) == greedy_result[0]
[20]:
True
[21]:
ctc_greedy_decoder(np.array(probs_seq2), vocab_list) == greedy_result[1]
[21]:
True
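Under the hood, greedy decoding just takes the argmax label at each timestep, collapses consecutive repeats, and drops the blank symbol (the extra last index, matching the convention of these test matrices: 6 vocabulary entries, 7 probability columns). A minimal NumPy sketch with a small made-up probability matrix:

```python
import numpy as np

def greedy_decode(probs, vocab):
    blank = len(vocab)                    # blank is the extra, last index
    best = np.argmax(np.asarray(probs), axis=1)
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:  # collapse repeats, skip blanks
            out.append(vocab[idx])
        prev = idx
    return ''.join(out)

vocab = ["'", ' ', 'a', 'b', 'c', 'dk ']
probs = [
    [0.1, 0.1, 0.6, 0.1, 0.05, 0.025, 0.025],   # argmax -> 'a'
    [0.1, 0.1, 0.6, 0.1, 0.05, 0.025, 0.025],   # repeated 'a' -> collapsed
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.7],  # blank -> dropped
    [0.1, 0.1, 0.1, 0.6, 0.05, 0.025, 0.025],   # argmax -> 'b'
]
decoded = greedy_decode(probs, vocab)  # 'ab'
```

A blank between two identical labels keeps them separate, which is why CTC collapses repeats before removing blanks.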
[22]:
ctc_beam_search_decoder(probs_seq = np.array(probs_seq1),
                        beam_size = beam_size,
                        vocabulary = vocab_list)
[22]:
[(-6.480283737182617, 'acdk c'),
 (-6.483003616333008, 'acdk  '),
 (-6.52116060256958, 'acdk a'),
 (-6.526535511016846, 'acdk b'),
 (-6.570488452911377, 'a dk c'),
 (-6.573208332061768, 'a dk  '),
 (-6.61136531829834, 'a dk a'),
 (-6.6167402267456055, 'a dk b'),
 (-6.630837440490723, 'acbc'),
 (-6.63310432434082, 'acb'),
 (-6.633557319641113, 'acb '),
 (-6.644730091094971, 'a bc'),
 (-6.647449970245361, 'a b '),
 (-6.650537490844727, 'a b'),
 (-6.667605400085449, "acdk '"),
 (-6.6717143058776855, 'acba'),
 (-6.685606956481934, 'a ba'),
 (-6.686768531799316, ' cdk c'),
 (-6.689488410949707, ' cdk  '),
 (-6.709468364715576, 'a c')]
[23]:
ctc_beam_search_decoder(probs_seq = np.array(probs_seq2),
                        beam_size = beam_size,
                        vocabulary = vocab_list)
[23]:
[(-4.989980220794678, "b'a"),
 (-5.298550128936768, "b'dk a"),
 (-5.3370184898376465, "b' a"),
 (-5.585845470428467, "b'a'"),
 (-5.652693271636963, " 'a"),
 (-5.7635698318481445, "b'ab"),
 (-5.788026332855225, "b'ba"),
 (-6.0385026931762695, 'bdk a'),
 (-6.132683753967285, "b'ca"),
 (-6.137714385986328, " 'dk a"),
 (-6.158307075500488, " ' a"),
 (-6.171831130981445, "b'dk '"),
 (-6.221673011779785, "b' '"),
 (-6.240574359893799, 'b a'),
 (-6.270209312438965, "b'a "),
 (-6.2848052978515625, "b'dk ab"),
 (-6.304642200469971, 'ba'),
 (-6.305397987365723, "b' ab"),
 (-6.426036834716797, " 'ab"),
 (-6.505356311798096, "b'b")]
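The beams above were produced without a language model; plugging a Scorer into the decoder (via the external scoring function argument, `ext_scoring_func` in the DeepSpeech implementation this test file comes from) shifts these rankings by alpha and beta. A toy re-ranking sketch using only the word-count bonus (beta) on the top three hypothetical beams above, to show how the ordering can flip:

```python
def rescore(beams, beta=1.0):
    # add a word-insertion bonus and re-sort; a real Scorer also
    # adds alpha * LM log-probability for each candidate
    rescored = [(score + beta * len(text.split()), text) for score, text in beams]
    return sorted(rescored, reverse=True)

beams = [(-4.99, "b'a"), (-5.30, "b'dk a"), (-5.34, "b' a")]
reranked = rescore(beams, beta=1.0)
```

With beta = 1.0, the two-word candidate "b'dk a" overtakes the one-word "b'a" even though its raw beam score was lower.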

Use pyctcdecode#

From PYPI#

pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121

From source#

Check https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no wheel available for your operating system.

Building from source should only take a few minutes.

[17]:
import kenlm
from pyctcdecode import build_ctcdecoder

kenlm_model = kenlm.Model(lm)
decoder = build_ctcdecoder(
    CTC_VOCAB,
    kenlm_model,
    alpha=0.5,
    beta=1.0,
)