Speech-to-Text HuggingFace

Pretrained HuggingFace models finetuned on hyperlocal languages, https://huggingface.co/mesolitica

This tutorial is available as an IPython notebook at malaya-speech/example/stt-huggingface.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline; read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

Requires TensorFlow >= 2.0 because group convolution is not available in TensorFlow 1.

[1]:

import malaya_speech
import numpy as np
from malaya_speech import Pipeline

List available HuggingFace models

[2]:

malaya_speech.stt.available_huggingface()

[2]:

	CER	CER-LM	Language	Size (MB)	WER	WER-LM
mesolitica/wav2vec2-xls-r-300m-mixed	0.048105	0.041196	[malay, singlish, mandarin]	1180	0.13222	0.098802
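
The CER and WER columns are character and word error rates. The table does not say how they were computed, so the following is only a sketch of the usual definition: the Levenshtein edit distance between reference and hypothesis, divided by the reference length (in characters for CER, in words for WER). The helpers `edit_distance`, `cer` and `wer` are hypothetical names for illustration and are not part of malaya-speech; the sentence pair is made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] is the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    # Character error rate: edits per reference character.
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    # Word error rate: edits per reference word.
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Made-up pair for illustration; not taken from any evaluation set.
reference = 'hello nama saya husein'
hypothesis = 'hello nama saya husin'
print(wer(reference, hypothesis))  # one wrong word out of four
print(cer(reference, hypothesis))  # one missing character out of 22
```

Lower is better for both metrics. The `-LM` columns presumably report the same rates when decoding with a language model.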

Load HuggingFace model

def huggingface(model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed', **kwargs):
    """
    Load Finetuned models from HuggingFace. Required Tensorflow >= 2.0.

    Parameters
    ----------
    model : str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')
        Model architecture supported. Allowed values:

        * ``'mesolitica/wav2vec2-xls-r-300m-mixed'`` - wav2vec2 XLS-R 300M finetuned on (Malay + Singlish + Mandarin) languages.

    Returns
    -------
    result : malaya_speech.model.huggingface.CTC class
    """

[3]:

model = malaya_speech.stt.huggingface(model = 'mesolitica/wav2vec2-xls-r-300m-mixed')

Load sample

[4]:

ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')
singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')
mandarin0, sr = malaya_speech.load('speech/mandarin/597.wav')
mandarin1, sr = malaya_speech.load('speech/mandarin/584.wav')
mandarin2, sr = malaya_speech.load('speech/mandarin/509.wav')

[5]:

import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)

[5]:

As we can hear, the speaker speaks in the Kedahan dialect with some Arabic words; let's see how good our model is.

[6]:

ipd.Audio(record1, rate = sr)

[6]:

Predict using greedy decoder

def greedy_decoder(self, inputs):
    """
    Transcribe inputs using greedy decoder.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].

    Returns
    -------
    result: List[str]
    """

[7]:

%%time

model.greedy_decoder([ceramah, record1, record2, singlish0, singlish1, singlish2,
                      mandarin0, mandarin1, mandarin2])

CPU times: user 1min 52s, sys: 58.6 s, total: 2min 51s
Wall time: 33.7 s

[7]:

['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ini allah maaini',
 'hello nama saya husin saya tak beskemandi ketiap saya masam',
 'hello nama saya hussein saya sukomandi saya mandi diatia hari',
 'and then see how they roll it in film okay actually',
 'atat to your eyes',
 'sa versa in bal',
 'gei wo lai ge zhang jie zui xin de ge',
 'wo xiang shou kan zhiang shu ying shi pin dao de jie mu',
 'qiu yi shou ge de ming zhe ge ci li you zhuan sheng yi meng wang si qin shi xiu']
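
For intuition, greedy CTC decoding usually takes the most likely class in every frame, merges consecutive repeats, and drops the blank symbol. The sketch below shows this on a toy input. The toy vocabulary, the blank position (last index) and the function `greedy_ctc_decode` are assumptions for illustration, not the internals of `model.greedy_decoder`.

```python
def greedy_ctc_decode(frames, vocab, blank_id):
    """Decode a list of per-frame score lists (one score per class)."""
    # 1. Pick the best class index for every frame.
    best = [max(range(len(f)), key=f.__getitem__) for f in frames]
    out = []
    prev = None
    for idx in best:
        # 2. Collapse consecutive repeats and 3. skip the blank class.
        if idx != prev and idx != blank_id:
            out.append(vocab[idx])
        prev = idx
    return ''.join(out)


toy_vocab = ['a', 'b', ' ']
blank = len(toy_vocab)  # assumption: blank is the extra last class
frames = [
    [0.9, 0.0, 0.0, 0.1],  # a
    [0.8, 0.1, 0.0, 0.1],  # a again, merged with the previous frame
    [0.1, 0.0, 0.0, 0.9],  # blank
    [0.7, 0.1, 0.1, 0.1],  # a, kept because a blank separates it
    [0.1, 0.8, 0.0, 0.1],  # b
]
print(greedy_ctc_decode(frames, toy_vocab, blank))
```

Because each frame is decided independently, greedy decoding is fast but cannot recover from a locally wrong choice.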

Predict using beam decoder

The model does not natively support beam decoding, so we need to run ctc_decoders on the output of predict_logits:

def predict_logits(self, inputs, norm_func=softmax):
    """
    Predict logits from inputs.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    norm_func: Callable, optional (default=malaya.utils.activation.softmax)


    Returns
    -------
    result: List[np.array]
    """

[8]:

from ctc_decoders import ctc_beam_search_decoder
from malaya_speech.utils.char import HF_CTC_VOCAB

[9]:

%%time

logits = model.predict_logits([ceramah, record1, record2, singlish0, singlish1, singlish2,
                      mandarin0, mandarin1, mandarin2])

CPU times: user 1min 54s, sys: 53 s, total: 2min 47s
Wall time: 29.1 s
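
`predict_logits` takes a `norm_func` that defaults to softmax, presumably applied to each frame so that the class scores become a probability distribution. A minimal, numerically stable sketch in pure Python follows; `softmax_frame` is a hypothetical name and the library's own implementation may differ.

```python
import math


def softmax_frame(scores):
    """Turn one frame of raw scores into probabilities that sum to 1."""
    # Subtract the maximum first so exp() cannot overflow on large scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_frame([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Beam search decoders generally combine per-frame probabilities across time, which is why normalized scores are the natural input for them.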

[10]:

logits.shape, len(HF_CTC_VOCAB)

[10]:

((9, 499, 40), 39)
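
The shape reads as (utterances, frames, classes). The class dimension (40) is one larger than `len(HF_CTC_VOCAB)` (39), which is consistent with one extra class for the CTC blank, although the notebook does not state this. The frame count can be related to audio length: the sketch below assumes the model keeps the default wav2vec2 convolutional feature extractor (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2). `num_frames` is a hypothetical helper.

```python
# Default wav2vec2 feature-extractor settings (assumed unchanged here).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Number of output frames for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


print(num_frames(160000))
print(num_frames(16000))
```

Under these assumptions, 499 frames matches a batch padded to about 160000 samples, which would be 10 seconds if the audio is sampled at 16 kHz (also an assumption).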

[11]:

for no, logit in enumerate(logits):
    # Beam size 20; keep the text of the top-ranked hypothesis.
    o = ctc_beam_search_decoder(logit, HF_CTC_VOCAB, 20)[0][1]
    print(no, o)

0 jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ini allah maa ini
1 hello nama saya husin saya tak beskemandi ketiap saya masam
2 hello nama saya hussein saya sukomandi saya mandi diatia hari
3 and then see how they roll it in film okay actually
4 an tat to your eyes
5 sa versa in bal
6 gei wo lai ge zhang jie zui xin de ge
7 wo xiang shou kan ziang shu ying shi pin dao de jie mu
8 qiu yi shou ge de ming zhe ge ci li you zhuan sheng yi meng wang si qin shi xiu
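
To see where beam search changed the result, the two sets of transcripts printed above can be compared directly. This is a small standalone sketch using the strings copied from the outputs in this notebook; the variable names are illustrative only.

```python
greedy = [
    'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ini allah maaini',
    'hello nama saya husin saya tak beskemandi ketiap saya masam',
    'hello nama saya hussein saya sukomandi saya mandi diatia hari',
    'and then see how they roll it in film okay actually',
    'atat to your eyes',
    'sa versa in bal',
    'gei wo lai ge zhang jie zui xin de ge',
    'wo xiang shou kan zhiang shu ying shi pin dao de jie mu',
    'qiu yi shou ge de ming zhe ge ci li you zhuan sheng yi meng wang si qin shi xiu',
]
beam = [
    'jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ini allah maa ini',
    'hello nama saya husin saya tak beskemandi ketiap saya masam',
    'hello nama saya hussein saya sukomandi saya mandi diatia hari',
    'and then see how they roll it in film okay actually',
    'an tat to your eyes',
    'sa versa in bal',
    'gei wo lai ge zhang jie zui xin de ge',
    'wo xiang shou kan ziang shu ying shi pin dao de jie mu',
    'qiu yi shou ge de ming zhe ge ci li you zhuan sheng yi meng wang si qin shi xiu',
]

# Indices of utterances whose transcript differs between the two decoders.
changed = [i for i, (g, b) in enumerate(zip(greedy, beam)) if g != b]
for i in changed:
    print(i, repr(greedy[i]), '->', repr(beam[i]))
```

Only utterances 0, 4 and 7 differ in this run, and each by a few characters, so without a language model the beam decoder stays close to the greedy result here.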