Speech-to-Text CTC HuggingFace#

Models finetuned on hyperlocal languages from pretrained HuggingFace models, https://huggingface.co/mesolitica

This tutorial is available as an IPython notebook at malaya-speech/example/stt-ctc-huggingface.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline. Read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:

import malaya_speech
import numpy as np
from malaya_speech import Pipeline

`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.

[2]:

import logging

logging.basicConfig(level=logging.INFO)

List available HuggingFace models#

[3]:

malaya_speech.stt.ctc.available_huggingface()

INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt

[3]:

	Size (MB)	malay-malaya	malay-fleur102	singlish	Language
mesolitica/wav2vec2-xls-r-300m-mixed	1180	{'WER': 0.194655128, 'CER': 0.04775798, 'WER-L...	{'WER': 0.2373861259, 'CER': 0.07055478, 'WER-...	{'WER': 0.127588595, 'CER': 0.0494924979, 'WER...	[malay, singlish]
mesolitica/wav2vec2-xls-r-300m-mixed-v2	1180	{'WER': 0.154782923, 'CER': 0.035164031, 'WER-...	{'WER': 0.2013994374, 'CER': 0.0518170369, 'WE...	{'WER': 0.2258822139, 'CER': 0.082982312, 'WER...	[malay, singlish]
mesolitica/wav2vec2-xls-r-300m-12layers-ms	657	{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...	{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...	NaN	[malay]
mesolitica/wav2vec2-xls-r-300m-6layers-ms	339	{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...	{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...	NaN	[malay]
mesolitica/wav2vec2-xls-r-300m-3layers-ms	195	{'WER': 0.1494983789, 'CER': 0.0342059992, 'WE...	{'WER': 0.217107489, 'CER': 0.0546614199, 'WER...	NaN	[malay]
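The WER and CER entries in the table are word and character error rates. As a rough illustration of what those numbers measure, the sketch below computes both from a Levenshtein edit distance. The function names and the reference sentence are hypothetical and not part of malaya-speech, and the linked evaluation scripts may normalise text differently, so treat this as a sketch rather than the scoring code behind the table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical ground truth, used only to exercise the functions.
reference = 'hello nama saya hussein'
hypothesis = 'hello nama saya husin'
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

One wrong word out of four gives a WER of 0.25 here, while the CER is much smaller because only two characters differ.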

Load HuggingFace model#

def huggingface(
    model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed',
    force_check: bool = True,
    **kwargs,
):
    """
    Load Finetuned models from HuggingFace.

    Parameters
    ----------
    model : str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')
        Check available models at `malaya_speech.stt.ctc.available_huggingface()`.
    force_check: bool, optional (default=True)
        Check that the model is one of the malaya models.
        Set to False if you have your own huggingface model.

    Returns
    -------
    result : malaya_speech.torch_model.huggingface.CTC class
    """

[3]:

model = malaya_speech.stt.ctc.huggingface(model = 'mesolitica/wav2vec2-xls-r-300m-mixed')

Load sample#

[4]:

ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')
singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')
mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')
mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')
mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')

[5]:

import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)

As we can hear, the speaker uses a Kedahan dialect mixed with some Arabic words. Let us see how well our model handles it.

[6]:

ipd.Audio(record1, rate = sr)

Predict using greedy decoder#

def greedy_decoder(self, inputs):
    """
    Transcribe inputs using greedy decoder.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].

    Returns
    -------
    result: List[str]
    """

[7]:

%%time

model.greedy_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2,
                      mandarin0, mandarin1, mandarin2])

CPU times: user 1min 30s, sys: 35.7 s, total: 2min 6s
Wall time: 11.6 s

[7]:

['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ini allah maaini',
 'hello nama saya husin saya tak beskemandi ketiap saya masam',
 'hello nama saya hussein saya sukomandi saya mandi diatia hari',
 'and then see how they roll it in film okay actually',
 'atat to your eyes',
 'sa versa in bal',
 'gei wo lai ge zhang jie zui xin de ge',
 'wo xiang shou kan zhiang shu ying shi pin dao de jie mu',
 'qiu yi shou ge de ming zhe ge ci li you zhuan sheng yi meng wang si qin shi xiu']
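For readers unfamiliar with greedy CTC decoding, the idea can be sketched in a few lines: pick the most likely symbol in every frame, collapse consecutive repeats, then drop the blank symbol. This is a minimal sketch and not the code inside `model.greedy_decoder`; the function name is hypothetical, and it assumes the blank sits at the last index, which the 40-wide logits against the 39-symbol vocabulary later on this page suggest but do not confirm.

```python
def ctc_greedy_decode(frames, vocab):
    """Greedy CTC decoding over a list of per-frame score lists.

    Assumption: each frame has len(vocab) + 1 scores and the last
    index is the CTC blank.
    """
    blank = len(vocab)
    # Most likely symbol index in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frames]
    chars = []
    previous = None
    for idx in best:
        # Emit a symbol only when it changes and is not the blank.
        if idx != previous and idx != blank:
            chars.append(vocab[idx])
        previous = idx
    return ''.join(chars)


vocab = ['a', 'b']
frames = [
    [0.9, 0.05, 0.05],  # a
    [0.8, 0.1, 0.1],    # a (repeat, collapsed)
    [0.1, 0.1, 0.8],    # blank
    [0.7, 0.2, 0.1],    # a (new occurrence after the blank)
    [0.1, 0.8, 0.1],    # b
    [0.2, 0.7, 0.1],    # b (repeat, collapsed)
]
print(ctc_greedy_decode(frames, vocab))
```

The blank between the second and fourth frames is what lets the output keep two separate `a` characters.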

Predict using beam decoder#

The model cannot do beam decoding natively, so we feed the output of predict_logits to ctc_decoders:

def predict_logits(self, inputs, norm_func=softmax):
    """
    Predict logits from inputs.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    norm_func: Callable, optional (default=malaya.utils.activation.softmax)


    Returns
    -------
    result: List[np.array]
    """

[8]:

from ctc_decoders import ctc_beam_search_decoder
from malaya_speech.utils.char import HF_CTC_VOCAB

[9]:

%%time

logits = model.predict_logits([ceramah, record1, record2, singlish0, singlish1, singlish2,
                      mandarin0, mandarin1, mandarin2])

CPU times: user 1min 34s, sys: 27.7 s, total: 2min 2s
Wall time: 10.9 s

[10]:

logits.shape, len(HF_CTC_VOCAB)

[10]:

((9, 499, 40), 39)
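The shape reads as 9 utterances, 499 frames and 40 scores per frame. The last axis is one wider than the 39-symbol vocabulary, and the extra slot is presumably the CTC blank. The frame count can be sanity-checked with the sketch below, which assumes the standard wav2vec2 convolutional feature encoder (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2) and 16 kHz audio. Neither assumption is stated on this page, and the function name is hypothetical.

```python
# Assumed wav2vec2 feature encoder configuration (not taken from this page).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def wav2vec2_num_frames(num_samples):
    """Number of output frames for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


print(wav2vec2_num_frames(16000))   # one second at 16 kHz
print(wav2vec2_num_frames(160000))  # ten seconds at 16 kHz
```

Under these assumptions 160000 samples map to 499 frames, so the longest clip in the batch would be roughly 10 seconds long and the shorter clips are padded up to it.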

[11]:

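# Decode each utterance with a beam size of 20 and print the text of the first returned hypothesis.
for no, l in enumerate(logits):
    o = ctc_beam_search_decoder(l, HF_CTC_VOCAB, 20)[0][1]
    print(no, o)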

0 jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ini allah maa ini
1 hello nama saya husin saya tak beskemandi ketiap saya masam
2 hello nama saya hussein saya sukomandi saya mandi diatia hari
3 and then see how they roll it in film okay actually
4 an tat to your eyes
5 sa versa in bal
6 gei wo lai ge zhang jie zui xin de ge
7 wo xiang shou kan ziang shu ying shi pin dao de jie mu
8 qiu yi shou ge de ming zhe ge ci li you zhuan sheng yi meng wang si qin shi xiu
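To give a feel for why beam search can differ from picking the best symbol per frame, here is a minimal textbook-style CTC prefix beam search without a language model. It is a sketch under stated assumptions, not the implementation inside `ctc_decoders`: the function name and return format are hypothetical, the inputs are plain probabilities rather than log probabilities, and the blank is assumed to be the last index.

```python
from collections import defaultdict


def ctc_prefix_beam_search(probs, vocab, beam_size=20):
    """Return (text, probability) of the most probable prefix.

    probs: per-frame probability lists of length len(vocab) + 1,
    with the CTC blank assumed to sit at the last index.
    """
    blank = len(vocab)
    # prefix -> [P(prefix, path ends in blank), P(prefix, path ends in non-blank)]
    beams = {(): [1.0, 0.0]}
    for frame in probs:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beams.items():
            for idx, p in enumerate(frame):
                if idx == blank:
                    # A blank leaves the prefix unchanged.
                    candidates[prefix][0] += (p_blank + p_non_blank) * p
                elif prefix and prefix[-1] == idx:
                    # Repeating the last symbol collapses into the same prefix...
                    candidates[prefix][1] += p_non_blank * p
                    # ...unless a blank separated them, which starts a new occurrence.
                    candidates[prefix + (idx,)][1] += p_blank * p
                else:
                    candidates[prefix + (idx,)][1] += (p_blank + p_non_blank) * p
        # Keep only the most probable prefixes for the next frame.
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])
    best_prefix, best_scores = max(beams.items(), key=lambda kv: sum(kv[1]))
    return ''.join(vocab[i] for i in best_prefix), sum(best_scores)


# Two frames, one symbol: the blank wins each frame individually (0.6 vs 0.4).
toy_probs = [[0.4, 0.6], [0.4, 0.6]]
text, score = ctc_prefix_beam_search(toy_probs, ['a'])
print(text, score)
```

In this toy input the per-frame best path is two blanks, which decodes to an empty string with probability 0.36. The prefix `a` collects the probability of three different paths (symbol then blank, blank then symbol, symbol twice) for a total of about 0.64, so the prefix search returns `a` instead.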
