Language Model¶
This tutorial is available as an IPython notebook at malaya-speech/example/ctc-language-model.
This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
Purpose¶
When doing CTC or RNNT beam decoding, we want to add a language bias while searching for the optimal alignment.
List available Language Model¶
We provide language models for our ASR models:
[1]:
import malaya_speech
[2]:
malaya_speech.stt.available_language_model()
[2]:
| Model | Size (MB) | LM order | Description | Command |
|---|---|---|---|---|
| bahasa | 17 | 3 | Gathered from malaya-speech ASR bahasa transcript | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| bahasa-news | 24 | 3 | Gathered from malaya-speech bahasa ASR transcr... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| bahasa-combined | 29 | 3 | Gathered from malaya-speech ASR bahasa transcr... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| redape-community | 887.1 | 4 | Mirror for https://github.com/redapesolutions/... | [./lmplz --text text.txt --arpa out.arpa -o 4 ... |
| dump-combined | 310 | 3 | Academia + News + IIUM + Parliament + Watpadd ... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| manglish | 202 | 3 | Manglish News + Manglish Reddit + Manglish for... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| bahasa-manglish-combined | 608 | 3 | Combined `dump-combined` and `manglish`. | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
The redape-community model comes from https://github.com/redapesolutions/suara-kami-community, another good Malay speech-to-text repository.
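The "LM order" column matches the `-o` value in the Command column: an order-3 model scores each word given the two words before it. As a rough illustration of what that means, the sketch below lists the n-grams of a sentence. The helper name `ngrams` and the example sentence are hypothetical, and the single `<s>` / `</s>` padding is only an approximation of how KenLM marks sentence boundaries.

```python
def ngrams(sentence, order):
    """Return every n-gram of the given order, with sentence boundary markers."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]


# An order-3 model sees windows of three tokens.
trigrams = ngrams("saya suka makan nasi", 3)
for gram in trigrams:
    print(gram)
```

A higher order captures longer context but produces more distinct n-grams, which is one reason larger-order models tend to be bigger on disk.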
Load Language Model¶
def language_model(
model: str = 'dump-combined', **kwargs
):
"""
Load KenLM language model.
Parameters
----------
model : str, optional (default='dump-combined')
Model architecture supported. Allowed values:
* ``'bahasa'`` - Gathered from malaya-speech ASR bahasa transcript.
* ``'bahasa-news'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).
* ``'bahasa-combined'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).
* ``'redape-community'`` - Mirror for https://github.com/redapesolutions/suara-kami-community
* ``'dump-combined'`` - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
* ``'manglish'`` - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
* ``'bahasa-manglish-combined'`` - Combined `dump-combined` and `manglish`.
Returns
-------
result : str
"""
[3]:
lm = malaya_speech.stt.language_model(model = 'bahasa')
lm
[3]:
'/Users/huseinzolkepli/Malaya-Speech/language-model/bahasa/model.trie.klm'
Build custom Language Model¶
Build KenLM:
wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
Prepare a newline-delimited text file with one sentence per line. Feel free to use some from https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping.
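One possible way to prepare such a file is sketched below. The character set in `ALLOWED` and the helper names are assumptions made for illustration; in practice the allowed characters should match the vocabulary of your acoustic model so the language model never sees tokens the decoder cannot emit.

```python
import re
import string

# Hypothetical character set; replace it with the vocabulary of your acoustic model.
ALLOWED = set(string.ascii_lowercase + string.digits + " ")


def normalize_line(line, allowed=ALLOWED):
    """Lowercase, replace characters outside `allowed` with spaces, collapse whitespace."""
    lowered = line.lower()
    kept = "".join(ch if ch in allowed else " " for ch in lowered)
    return re.sub(r"\s+", " ", kept).strip()


def write_corpus(sentences, path="text.txt"):
    """Write one normalized sentence per line, skipping lines that end up empty."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            cleaned = normalize_line(sentence)
            if cleaned:
                f.write(cleaned + "\n")
                written += 1
    return written
```

Calling `write_corpus(sentences)` produces a `text.txt` in the one-sentence-per-line layout that `lmplz --text` reads.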
kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1
kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm
Once you have out.trie.klm, you can load it into the Scorer interface.
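from ctc_decoders import Scorer
# alpha: language model weight, beta: word count weight, vocab_list: the CTC vocabulary of your model
scorer = Scorer(alpha, beta, 'out.trie.klm', vocab_list)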
Use ctc-decoders¶
From PyPI¶
pip3 install ctc-decoders
If you use Linux, note that we are unable to upload Linux wheels to the PyPI repository, so download the Linux wheel from malaya-speech/ctc-decoders.
From source¶
See malaya-speech/ctc-decoders for how to build from source in case no wheel is available for your operating system.
Building from source should only take a few minutes.
Load ctc-decoders¶
[15]:
from ctc_decoders import Scorer
from malaya_speech.utils.char import CTC_VOCAB
Init signature: Scorer(alpha, beta, model_path, vocabulary)
Docstring:
Wrapper for Scorer.
:param alpha: Parameter associated with language model. Don't use
language model when alpha = 0.
:type alpha: float
:param beta: Parameter associated with word count. Don't use word
count when beta = 0.
:type beta: float
:model_path: Path to load language model.
:type model_path: basestring
[16]:
scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)
scorer
[16]:
<ctc_decoders.Scorer; proxy of <Swig Object of type 'Scorer *' at 0x14ffe3c00> >
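To give an intuition for `alpha` and `beta`, here is a minimal sketch of a commonly used way to combine acoustic and language model scores during beam search. It is an assumption about the general technique, not a copy of what `ctc_decoders` computes internally; the function name `fused_score` and all numbers are hypothetical.

```python
def fused_score(acoustic_log_prob, lm_log_prob, word_count, alpha, beta):
    """Combine scores: alpha weights the language model, beta rewards each word."""
    return acoustic_log_prob + alpha * lm_log_prob + beta * word_count


# Hypothetical hypotheses: (text, acoustic log prob, LM log prob, word count).
hypotheses = [
    ("saya suka makan", -4.0, -3.0, 3),
    ("saya suke makan", -3.8, -9.0, 3),
]


def best(alpha, beta):
    """Return the text of the hypothesis with the highest combined score."""
    return max(hypotheses, key=lambda h: fused_score(h[1], h[2], h[3], alpha, beta))[0]


print(best(alpha=0.0, beta=0.0))  # acoustic score only
print(best(alpha=0.5, beta=1.0))  # language model included
```

With `alpha = 0` the misspelled hypothesis wins on acoustic score alone, while a positive `alpha` lets the language model pull the decoder towards the better-formed sentence.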
Test¶
[2]:
from ctc_decoders import ctc_greedy_decoder, ctc_beam_search_decoder
import numpy as np
import malaya_speech
[19]:
# https://github.com/PaddlePaddle/DeepSpeech/blob/master/decoders/tests/test_decoders.py
vocab_list = ["\'", ' ', 'a', 'b', 'c', 'dk ']
beam_size = 20
probs_seq1 = [[
0.06390443, 0.21124858, 0.27323887, 0.06870235, 0.0361254,
0.18184413, 0.16493624
], [
0.03309247, 0.22866108, 0.24390638, 0.09699597, 0.31895462,
0.0094893, 0.06890021
], [
0.218104, 0.19992557, 0.18245131, 0.08503348, 0.14903535,
0.08424043, 0.08120984
], [
0.12094152, 0.19162472, 0.01473646, 0.28045061, 0.24246305,
0.05206269, 0.09772094
], [
0.1333387, 0.00550838, 0.00301669, 0.21745861, 0.20803985,
0.41317442, 0.01946335
], [
0.16468227, 0.1980699, 0.1906545, 0.18963251, 0.19860937,
0.04377724, 0.01457421
]]
probs_seq2 = [[
0.08034842, 0.22671944, 0.05799633, 0.36814645, 0.11307441,
0.04468023, 0.10903471
], [
0.09742457, 0.12959763, 0.09435383, 0.21889204, 0.15113123,
0.10219457, 0.20640612
], [
0.45033529, 0.09091417, 0.15333208, 0.07939558, 0.08649316,
0.12298585, 0.01654384
], [
0.02512238, 0.22079203, 0.19664364, 0.11906379, 0.07816055,
0.22538587, 0.13483174
], [
0.17928453, 0.06065261, 0.41153005, 0.1172041, 0.11880313,
0.07113197, 0.04139363
], [
0.15882358, 0.1235788, 0.23376776, 0.20510435, 0.00279306,
0.05294827, 0.22298418
]]
greedy_result = ["ac'bdk c", "b'dk a"]
beam_search_result = ['acdk c', "b'a"]
[20]:
ctc_greedy_decoder(np.array(probs_seq1), vocab_list) == greedy_result[0]
[20]:
True
[21]:
ctc_greedy_decoder(np.array(probs_seq2), vocab_list) == greedy_result[1]
[21]:
True
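For reference, greedy CTC decoding can be sketched in a few lines of pure Python: take the most likely symbol at each time step, collapse consecutive repeats, then drop blanks. This is an illustrative reimplementation, not the `ctc_greedy_decoder` source. It assumes the blank symbol occupies the last column, which is consistent with the 7-column probabilities paired with a 6-entry vocabulary above; the toy inputs are hypothetical.

```python
def ctc_greedy_decode(probs_seq, vocabulary, blank_index=None):
    """Best-path CTC decoding: argmax per step, merge repeats, remove blanks."""
    # Assumption: blank is the extra column after the vocabulary entries.
    blank_index = len(vocabulary) if blank_index is None else blank_index
    best_path = [max(range(len(row)), key=row.__getitem__) for row in probs_seq]

    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous step and is not blank.
        if index != previous and index != blank_index:
            decoded.append(vocabulary[index])
        previous = index
    return "".join(decoded)


toy_vocab = ["a", "b", " "]
toy_probs = [
    [0.6, 0.1, 0.1, 0.2],  # a
    [0.5, 0.2, 0.1, 0.2],  # a again, merged with the previous step
    [0.1, 0.1, 0.1, 0.7],  # blank, separates the two "a" runs
    [0.7, 0.1, 0.1, 0.1],  # a
    [0.1, 0.6, 0.2, 0.1],  # b
]
print(ctc_greedy_decode(toy_probs, toy_vocab))  # aab
```

The blank between the two runs of `a` is what allows a doubled letter to survive the repeat-merging step.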
[22]:
ctc_beam_search_decoder(probs_seq = np.array(probs_seq1),
beam_size = beam_size,
vocabulary = vocab_list)
[22]:
[(-6.480283737182617, 'acdk c'),
(-6.483003616333008, 'acdk '),
(-6.52116060256958, 'acdk a'),
(-6.526535511016846, 'acdk b'),
(-6.570488452911377, 'a dk c'),
(-6.573208332061768, 'a dk '),
(-6.61136531829834, 'a dk a'),
(-6.6167402267456055, 'a dk b'),
(-6.630837440490723, 'acbc'),
(-6.63310432434082, 'acb'),
(-6.633557319641113, 'acb '),
(-6.644730091094971, 'a bc'),
(-6.647449970245361, 'a b '),
(-6.650537490844727, 'a b'),
(-6.667605400085449, "acdk '"),
(-6.6717143058776855, 'acba'),
(-6.685606956481934, 'a ba'),
(-6.686768531799316, ' cdk c'),
(-6.689488410949707, ' cdk '),
(-6.709468364715576, 'a c')]
[23]:
ctc_beam_search_decoder(probs_seq = np.array(probs_seq2),
beam_size = beam_size,
vocabulary = vocab_list)
[23]:
[(-4.989980220794678, "b'a"),
(-5.298550128936768, "b'dk a"),
(-5.3370184898376465, "b' a"),
(-5.585845470428467, "b'a'"),
(-5.652693271636963, " 'a"),
(-5.7635698318481445, "b'ab"),
(-5.788026332855225, "b'ba"),
(-6.0385026931762695, 'bdk a'),
(-6.132683753967285, "b'ca"),
(-6.137714385986328, " 'dk a"),
(-6.158307075500488, " ' a"),
(-6.171831130981445, "b'dk '"),
(-6.221673011779785, "b' '"),
(-6.240574359893799, 'b a'),
(-6.270209312438965, "b'a "),
(-6.2848052978515625, "b'dk ab"),
(-6.304642200469971, 'ba'),
(-6.305397987365723, "b' ab"),
(-6.426036834716797, " 'ab"),
(-6.505356311798096, "b'b")]
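Beam search differs from greedy decoding because it sums the probability of every alignment that collapses to the same text. The sketch below is a minimal CTC prefix beam search without a language model, written only to illustrate the idea. It is not the SWIG-wrapped `ctc_beam_search_decoder`, which may apply additional pruning, so its scores are not guaranteed to match the outputs above. All names are hypothetical and the blank is assumed to be the last column.

```python
import math
from collections import defaultdict

NEG_INF = -float("inf")


def logsumexp(*values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    if all(v == NEG_INF for v in values):
        return NEG_INF
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def prefix_beam_search(probs_seq, vocabulary, beam_size, blank_index=None):
    """Return (log_prob, text) pairs sorted from most to least likely."""
    blank_index = len(vocabulary) if blank_index is None else blank_index
    # Each prefix tracks (log prob ending in blank, log prob ending in non-blank).
    beams = [((), (0.0, NEG_INF))]
    for row in probs_seq:
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_blank, p_non_blank) in beams:
            for symbol, prob in enumerate(row):
                if prob <= 0.0:
                    continue
                log_p = math.log(prob)
                if symbol == blank_index:
                    # Blank keeps the prefix unchanged.
                    n_b, n_nb = next_beams[prefix]
                    next_beams[prefix] = (
                        logsumexp(n_b, p_blank + log_p, p_non_blank + log_p), n_nb)
                    continue
                extended = prefix + (symbol,)
                n_b, n_nb = next_beams[extended]
                if prefix and prefix[-1] == symbol:
                    # A repeat extends the prefix only after a blank ...
                    next_beams[extended] = (n_b, logsumexp(n_nb, p_blank + log_p))
                    # ... otherwise it merges into the same prefix.
                    n_b, n_nb = next_beams[prefix]
                    next_beams[prefix] = (n_b, logsumexp(n_nb, p_non_blank + log_p))
                else:
                    next_beams[extended] = (
                        n_b, logsumexp(n_nb, p_blank + log_p, p_non_blank + log_p))
        ranked = sorted(next_beams.items(),
                        key=lambda item: logsumexp(*item[1]), reverse=True)
        beams = ranked[:beam_size]
    return [(logsumexp(*scores), "".join(vocabulary[i] for i in prefix))
            for prefix, scores in beams]


# Toy case: blank is the most likely symbol at every step, yet "a" wins overall.
toy = prefix_beam_search([[0.4, 0.6], [0.4, 0.6]], ["a"], beam_size=10)
print(toy[0][1], round(math.exp(toy[0][0]), 2))  # a 0.64
```

In the toy case greedy decoding would return an empty string, since blank has probability 0.6 at both steps, but the three alignments that collapse to `a` together carry probability 0.64.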
Use pyctcdecode¶
From PyPI¶
pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121
From source¶
See https://github.com/kensho-technologies/pyctcdecode for how to build from source in case no wheel is available for your operating system.
Building from source should only take a few minutes.
[17]:
import kenlm
from pyctcdecode import build_ctcdecoder
kenlm_model = kenlm.Model(lm)
decoder = build_ctcdecoder(
CTC_VOCAB,
kenlm_model,
alpha=0.5,
beta=1.0,
)