Language Model#
This tutorial is available as an IPython notebook at malaya-speech/example/ctc-language-model.
This module is not language independent, so it is not safe to use on different languages. Pretrained models are trained on hyperlocal languages.
Purpose#
When doing CTC or RNNT beam decoding, we want to add language bias while finding the optimum alignment.
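Concretely, the language bias is usually added by shallow fusion: a weighted LM log-probability and a word-count bonus are added to the acoustic score of each beam candidate. A minimal sketch, where the function name is ours for illustration and `alpha`, `beta` match the Scorer parameters used later in this tutorial:

```python
# Minimal sketch of shallow-fusion scoring during beam decoding.
# alpha weights the language model, beta rewards longer word sequences;
# alpha = 0 disables the LM, beta = 0 disables the word bonus.
def fused_score(acoustic_logprob, lm_logprob, word_count, alpha=0.5, beta=1.0):
    return acoustic_logprob + alpha * lm_logprob + beta * word_count

# Two hypothetical candidates with the same acoustic score: the one the LM
# prefers wins after fusion.
print(fused_score(-10.0, -2.0, 2))  # -10.0 + 0.5 * -2.0 + 1.0 * 2 = -9.0
print(fused_score(-10.0, -6.0, 2))  # -10.0 + 0.5 * -6.0 + 1.0 * 2 = -11.0
```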
List available Language Model#
We provide language models for our ASR models,
[1]:
import malaya_speech
[2]:
malaya_speech.stt.available_language_model()
[2]:
| | Size (MB) | LM order | Description | Command |
|---|---|---|---|---|
| bahasa | 17 | 3 | Gathered from malaya-speech ASR bahasa transcript | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| bahasa-news | 24 | 3 | Gathered from malaya-speech bahasa ASR transcr... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| bahasa-combined | 29 | 3 | Gathered from malaya-speech ASR bahasa transcr... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| redape-community | 887.1 | 4 | Mirror for https://github.com/redapesolutions/... | [./lmplz --text text.txt --arpa out.arpa -o 4 ... |
| dump-combined | 310 | 3 | Academia + News + IIUM + Parliament + Watpadd ... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| manglish | 202 | 3 | Manglish News + Manglish Reddit + Manglish for... | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
| bahasa-manglish-combined | 608 | 3 | Combined `dump-combined` and `manglish`. | [./lmplz --text text.txt --arpa out.arpa -o 3 ... |
`redape-community` is obtained from https://github.com/redapesolutions/suara-kami-community, another good Malay speech-to-text repository.
Load Language Model#
def language_model(
model: str = 'dump-combined', **kwargs
):
"""
Load KenLM language model.
Parameters
----------
model : str, optional (default='dump-combined')
Model architecture supported. Allowed values:
* ``'bahasa'`` - Gathered from malaya-speech ASR bahasa transcript.
* ``'bahasa-news'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).
* ``'bahasa-combined'`` - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).
* ``'redape-community'`` - Mirror for https://github.com/redapesolutions/suara-kami-community
* ``'dump-combined'`` - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
* ``'manglish'`` - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malay-dataset/tree/master/dumping/clean.
* ``'bahasa-manglish-combined'`` - Combined `dump-combined` and `manglish`.
Returns
-------
result : str
"""
[3]:
lm = malaya_speech.stt.language_model(model = 'bahasa')
lm
[3]:
'/Users/huseinzolkepli/Malaya-Speech/language-model/bahasa/model.trie.klm'
Build custom Language Model#
Build KenLM,
wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
cmake ..
make -j2
Prepare a newline-delimited text file. Feel free to use some from https://github.com/huseinzol05/Malay-Dataset/tree/master/dumping.
kenlm/build/bin/lmplz --text text.txt --arpa out.arpa -o 3 --prune 0 1 1
kenlm/build/bin/build_binary -q 8 -b 7 -a 256 trie out.arpa out.trie.klm
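For intuition, `lmplz -o 3` estimates a model from n-gram counts up to order 3 (with Kneser-Ney smoothing, and `--prune 0 1 1` dropping singleton bigrams and trigrams). A toy pure-Python counter, ours for illustration only and no substitute for KenLM, shows what is being counted:

```python
from collections import Counter

def count_ngrams(sentences, order=3):
    """Toy illustration of the counts behind `lmplz -o 3`:
    every n-gram up to `order`, with sentence boundary markers."""
    counts = Counter()
    for sent in sentences:
        tokens = ['<s>'] + sent.split() + ['</s>']
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = count_ngrams(['saya suka makan nasi', 'saya suka minum teh'])
print(counts[('saya', 'suka')])           # 2, appears in both sentences
print(counts[('saya', 'suka', 'makan')])  # 1, appears in the first only
```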
Once you have `out.trie.klm`, you can load it into the scorer interface.
from ctc_decoders import Scorer
scorer = Scorer(alpha, beta, 'out.trie.klm', vocab_list)
Use ctc-decoders#
From PYPI#
pip3 install ctc-decoders
If you use Linux, note that we are unable to upload Linux wheels to the PyPI repository, so download a Linux wheel at malaya-speech/ctc-decoders.
From source#
Check malaya-speech/ctc-decoders for how to build from source in case there is no wheel available for your operating system.
Building from source should only take a few minutes.
Load ctc-decoders#
[15]:
from ctc_decoders import Scorer
from malaya_speech.utils.char import CTC_VOCAB
Init signature: Scorer(alpha, beta, model_path, vocabulary)
Docstring:
Wrapper for Scorer.
:param alpha: Parameter associated with language model. Don't use
language model when alpha = 0.
:type alpha: float
:param beta: Parameter associated with word count. Don't use word
count when beta = 0.
:type beta: float
:model_path: Path to load language model.
:type model_path: basestring
[16]:
scorer = Scorer(0.5, 1.0, lm, CTC_VOCAB)
scorer
[16]:
<ctc_decoders.Scorer; proxy of <Swig Object of type 'Scorer *' at 0x14ffe3c00> >
Test#
[2]:
from ctc_decoders import ctc_greedy_decoder, ctc_beam_search_decoder
import numpy as np
import malaya_speech
[19]:
# https://github.com/PaddlePaddle/DeepSpeech/blob/master/decoders/tests/test_decoders.py
vocab_list = ["\'", ' ', 'a', 'b', 'c', 'dk ']
beam_size = 20
probs_seq1 = [[
0.06390443, 0.21124858, 0.27323887, 0.06870235, 0.0361254,
0.18184413, 0.16493624
], [
0.03309247, 0.22866108, 0.24390638, 0.09699597, 0.31895462,
0.0094893, 0.06890021
], [
0.218104, 0.19992557, 0.18245131, 0.08503348, 0.14903535,
0.08424043, 0.08120984
], [
0.12094152, 0.19162472, 0.01473646, 0.28045061, 0.24246305,
0.05206269, 0.09772094
], [
0.1333387, 0.00550838, 0.00301669, 0.21745861, 0.20803985,
0.41317442, 0.01946335
], [
0.16468227, 0.1980699, 0.1906545, 0.18963251, 0.19860937,
0.04377724, 0.01457421
]]
probs_seq2 = [[
0.08034842, 0.22671944, 0.05799633, 0.36814645, 0.11307441,
0.04468023, 0.10903471
], [
0.09742457, 0.12959763, 0.09435383, 0.21889204, 0.15113123,
0.10219457, 0.20640612
], [
0.45033529, 0.09091417, 0.15333208, 0.07939558, 0.08649316,
0.12298585, 0.01654384
], [
0.02512238, 0.22079203, 0.19664364, 0.11906379, 0.07816055,
0.22538587, 0.13483174
], [
0.17928453, 0.06065261, 0.41153005, 0.1172041, 0.11880313,
0.07113197, 0.04139363
], [
0.15882358, 0.1235788, 0.23376776, 0.20510435, 0.00279306,
0.05294827, 0.22298418
]]
greedy_result = ["ac'bdk c", "b'dk a"]
beam_search_result = ['acdk c', "b'a"]
[20]:
ctc_greedy_decoder(np.array(probs_seq1), vocab_list) == greedy_result[0]
[20]:
True
[21]:
ctc_greedy_decoder(np.array(probs_seq2), vocab_list) == greedy_result[1]
[21]:
True
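For reference, `ctc_greedy_decoder` is simple enough to sketch in plain NumPy: take the argmax at each timestep, collapse consecutive repeats, then drop the blank token (the extra last index, following the ctc_decoders convention). The function and toy inputs below are ours, not part of the library:

```python
import numpy as np

def greedy_ctc_decode(probs, vocabulary):
    """Best-path decoding: argmax per frame, collapse repeats, drop blanks.
    The blank token occupies the extra last column, as in ctc_decoders."""
    blank = len(vocabulary)
    best_path = np.argmax(probs, axis=1)
    out, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:
            out.append(vocabulary[idx])
        prev = idx
    return ''.join(out)

toy_vocab = ["'", ' ', 'a', 'b', 'c', 'dk ']
# argmax path is [2, 2, 6, 3]: 'a', repeated 'a' (collapsed), blank, 'b'
toy_probs = np.array([
    [0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.1, 0.8, 0.0, 0.0, 0.1],
])
print(greedy_ctc_decode(toy_probs, toy_vocab))  # ab
```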
[22]:
ctc_beam_search_decoder(probs_seq = np.array(probs_seq1),
beam_size = beam_size,
vocabulary = vocab_list)
[22]:
[(-6.480283737182617, 'acdk c'),
(-6.483003616333008, 'acdk '),
(-6.52116060256958, 'acdk a'),
(-6.526535511016846, 'acdk b'),
(-6.570488452911377, 'a dk c'),
(-6.573208332061768, 'a dk '),
(-6.61136531829834, 'a dk a'),
(-6.6167402267456055, 'a dk b'),
(-6.630837440490723, 'acbc'),
(-6.63310432434082, 'acb'),
(-6.633557319641113, 'acb '),
(-6.644730091094971, 'a bc'),
(-6.647449970245361, 'a b '),
(-6.650537490844727, 'a b'),
(-6.667605400085449, "acdk '"),
(-6.6717143058776855, 'acba'),
(-6.685606956481934, 'a ba'),
(-6.686768531799316, ' cdk c'),
(-6.689488410949707, ' cdk '),
(-6.709468364715576, 'a c')]
[23]:
ctc_beam_search_decoder(probs_seq = np.array(probs_seq2),
beam_size = beam_size,
vocabulary = vocab_list)
[23]:
[(-4.989980220794678, "b'a"),
(-5.298550128936768, "b'dk a"),
(-5.3370184898376465, "b' a"),
(-5.585845470428467, "b'a'"),
(-5.652693271636963, " 'a"),
(-5.7635698318481445, "b'ab"),
(-5.788026332855225, "b'ba"),
(-6.0385026931762695, 'bdk a'),
(-6.132683753967285, "b'ca"),
(-6.137714385986328, " 'dk a"),
(-6.158307075500488, " ' a"),
(-6.171831130981445, "b'dk '"),
(-6.221673011779785, "b' '"),
(-6.240574359893799, 'b a'),
(-6.270209312438965, "b'a "),
(-6.2848052978515625, "b'dk ab"),
(-6.304642200469971, 'ba'),
(-6.305397987365723, "b' ab"),
(-6.426036834716797, " 'ab"),
(-6.505356311798096, "b'b")]
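The beam scores above are acoustic-only because no scorer was attached. With a Scorer, each candidate is rescored with the alpha/beta fusion described earlier, which can reorder the beam. A toy rescoring of the two top candidates, with made-up LM log-probabilities (not from a real KenLM model):

```python
# The two top candidates from the beam output above, acoustic score first.
candidates = [(-4.9900, "b'a"), (-5.2986, "b'dk a")]

# Made-up LM log-probabilities, for illustration only.
fake_lm_logprob = {"b'a": -8.0, "b'dk a": -2.0}
alpha, beta = 0.5, 1.0

def rescore(acoustic, text):
    return acoustic + alpha * fake_lm_logprob[text] + beta * len(text.split())

rescored = sorted(((rescore(s, t), t) for s, t in candidates), reverse=True)
print(rescored[0][1])  # the LM-preferred "b'dk a" now ranks first
```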
Use pyctcdecode#
From PYPI#
pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121
From source#
Check https://github.com/kensho-technologies/pyctcdecode for how to build from source in case there is no wheel available for your operating system.
Building from source should only take a few minutes.
[17]:
import kenlm
from pyctcdecode import build_ctcdecoder
kenlm_model = kenlm.Model(lm)
decoder = build_ctcdecoder(
CTC_VOCAB,
kenlm_model,
alpha=0.5,
beta=1.0,
)