Speech-to-Text Seq2Seq HuggingFace#

Finetuned pretrained HuggingFace models on hyperlocal languages, https://huggingface.co/mesolitica

This tutorial is available as an IPython notebook at malaya-speech/example/stt-seq2seq-huggingface.

This module is not language independent, so it is not safe to use on languages other than those it was trained on. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.
[2]:
import logging

logging.basicConfig(level=logging.INFO)

List available HuggingFace model#

[3]:
malaya_speech.stt.seq2seq.available_huggingface()
INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
[3]:
Size (MB) malay-malaya malay-fleur102 singlish Language
mesolitica/finetune-whisper-tiny-ms-singlish 151 {'WER': 0.20141585, 'CER': 0.071964908} {'WER': 0.235680975, 'CER': 0.0986880877} {'WER': 0.09045121, 'CER': 0.0481965} [malay, singlish]
mesolitica/finetune-whisper-tiny-ms-singlish-v2 151 {'WER': 0.20141585, 'CER': 0.071964908} {'WER': 0.22459602, 'CER': 0.089406469} {'WER': 0.138882971, 'CER': 0.074929807} [malay, singlish]
mesolitica/finetune-whisper-base-ms-singlish-v2 290 {'WER': 0.172632664, 'CER': 0.0680027682} {'WER': 0.1837319118, 'CER': 0.0599804251} {'WER': 0.111506313, 'CER': 0.05852830724} [malay, singlish]
mesolitica/finetune-whisper-small-ms-singlish-v2 967 {'WER': 0.13189875561, 'CER': 0.0434602169} {'WER': 0.13277694, 'CER': 0.0478108612} {'WER': 0.09489335668, 'CER': 0.05045327551} [malay, singlish]

Load HuggingFace model#

def huggingface(
    model: str = 'mesolitica/finetune-whisper-base-ms-singlish-v2',
    force_check: bool = True,
    **kwargs,
):
    """
    Load Finetuned models from HuggingFace.

    Parameters
    ----------
    model : str, optional (default='mesolitica/finetune-whisper-base-ms-singlish-v2')
        Check available models at `malaya_speech.stt.seq2seq.available_huggingface()`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the Malaya models.
        Set to False if you want to load your own HuggingFace model.

    Returns
    -------
    result : malaya_speech.model.huggingface.Seq2Seq class
    """
[1]:
model = malaya_speech.stt.seq2seq.huggingface(model = 'mesolitica/finetune-whisper-base-ms-singlish-v2')

Load sample#

[8]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
singlish0, sr = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, sr = malaya_speech.load('speech/singlish/singlish1.wav')
singlish2, sr = malaya_speech.load('speech/singlish/singlish2.wav')
[6]:
import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)
[6]:

As we can hear, the speaker speaks in the Kedahan dialect with some Arabic words; let us see how well our model performs.

[7]:
ipd.Audio(record1, rate = sr)
[7]:

Generate#

You can read more about the seq2seq `generate` function at https://huggingface.co/blog/how-to-generate

def generate(self, inputs, skip_special_tokens: bool = True, **kwargs):
    """
    Transcribe inputs.

    Returns
    -------
    result: List[str]

    Parameters
    ----------
    input: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    skip_special_tokens: bool, optional (default=True)
        skip special tokens during decoding.
    **kwargs: vector arguments pass to huggingface `generate` method.
        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation

    Returns
    -------
    result: List[str]
    """
[11]:
model.use_whisper_processor = False
[12]:
%%time

model.generate([ceramah, record1, record2, singlish0, singlish1, singlish2], max_length = 256)
CPU times: user 31.2 s, sys: 6.95 s, total: 38.1 s
Wall time: 3.26 s
[12]:
['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah maha',
 'hello nama saya hussein saya tak suka mandi ketat saya masam',
 'hello nama saya hussein saya suka mandi saya mandi tetek hari',
 'and then see how they roll it in film okay actually',
 'then you tell your eyes',
 'savanza in mal']
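The `generate` call above, with no extra decoding arguments, decodes greedily: at each step it commits to the single highest-scoring next token. A minimal pure-Python sketch of the idea, using hypothetical toy token scores (not the actual Whisper vocabulary or model):

```python
import math

# hypothetical next-token probabilities (toy values for illustration only)
LOGPROBS = {
    '<s>': {'saya': math.log(0.5), 'hello': math.log(0.45)},
    'saya': {'mandi': math.log(0.4)},
    'hello': {'nama': math.log(0.9)},
    'mandi': {'</s>': math.log(1.0)},
    'nama': {'</s>': math.log(1.0)},
}

def greedy_decode(max_length=10):
    # pick the single highest-scoring next token at every step
    tokens = ['<s>']
    while tokens[-1] != '</s>' and len(tokens) < max_length:
        step = LOGPROBS[tokens[-1]]
        tokens.append(max(step, key=step.get))
    # drop markers, analogous to skip_special_tokens=True
    return [t for t in tokens if t not in ('<s>', '</s>')]

print(greedy_decode())  # ['saya', 'mandi']
```

Note that greedy decoding commits to 'saya' because it scores highest locally, even though the sequence 'hello nama' has higher total probability (0.45 × 0.9 = 0.405 versus 0.5 × 0.4 = 0.2); this local-optimum behaviour is what beam search mitigates.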

Predict using beam decoder#

https://huggingface.co/blog/how-to-generate#beam-search
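Beam search keeps the `num_beams` highest-scoring partial hypotheses at every step instead of committing to one. A minimal pure-Python sketch with hypothetical toy token scores (not the actual Whisper vocabulary or model):

```python
import math

# hypothetical next-token probabilities (toy values for illustration only)
LOGPROBS = {
    '<s>': {'saya': math.log(0.5), 'hello': math.log(0.45)},
    'saya': {'mandi': math.log(0.4)},
    'hello': {'nama': math.log(0.9)},
    'mandi': {'</s>': math.log(1.0)},
    'nama': {'</s>': math.log(1.0)},
}

def beam_decode(num_beams=2, max_length=10):
    # each beam is (tokens, cumulative log-probability)
    beams = [(['<s>'], 0.0)]
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            for nxt, lp in LOGPROBS.get(tokens[-1], {}).items():
                cand = (tokens + [nxt], score + lp)
                # hypotheses ending in '</s>' are complete
                (finished if nxt == '</s>' else candidates).append(cand)
        if not candidates:
            break
        # keep only the top `num_beams` partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    best_tokens, _ = max(finished or beams, key=lambda c: c[1])
    return [t for t in best_tokens if t not in ('<s>', '</s>')]

print(beam_decode(num_beams=2))  # ['hello', 'nama']
print(beam_decode(num_beams=1))  # degenerates to greedy: ['saya', 'mandi']
```

With `num_beams=2` the decoder recovers the globally best sequence 'hello nama' (total probability 0.405), which a single greedy pass misses; with `num_beams=1` beam search reduces to greedy decoding. The real `generate` call below additionally supports `no_repeat_ngram_size` and `early_stopping`, which this sketch omits.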

[14]:
model.generate([ceramah, record1, record2, singlish0, singlish1, singlish2], max_length = 256,
               num_beams = 5,
               no_repeat_ngram_size = 2,
               early_stopping = True)
[14]:
['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah maha ini',
 'hello nama saya hussein saya tak suka mandi ketat saya masam',
 'hello nama saya hussin saya suka mandi semandi tetek hari',
 'and then see how they roll it in film okay actually',
 'then you tat to your eyes',
 'seversa limau']
[ ]: