Speech-to-Text Seq2Seq Whisper#

Finetuned on hyperlocal languages using pretrained HuggingFace models, https://huggingface.co/mesolitica

This tutorial is available as an IPython notebook at malaya-speech/example/stt-seq2seq-whisper.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

Why official OpenAI Whisper instead of HuggingFace?#

Some implementations built on the official repository are more mature and have evolved better features, eg, https://github.com/m-bain/whisperX

Install OpenAI Whisper#

Simply,

pip install openai-whisper
[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.
[2]:
import logging

logging.basicConfig(level=logging.INFO)

List available Whisper model#

[3]:
malaya_speech.stt.seq2seq.available_whisper()
INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
[3]:
Size (MB) malay-malaya malay-fleur102 singlish Language
mesolitica/finetune-whisper-tiny-ms-singlish 151 {'WER': 0.20141585, 'CER': 0.071964908} {'WER': 0.235680975, 'CER': 0.0986880877} {'WER': 0.09045121, 'CER': 0.0481965} [malay, singlish]
mesolitica/finetune-whisper-tiny-ms-singlish-v2 151 {'WER': 0.20141585, 'CER': 0.071964908} {'WER': 0.22459602, 'CER': 0.089406469} {'WER': 0.138882971, 'CER': 0.074929807} [malay, singlish]
mesolitica/finetune-whisper-base-ms-singlish-v2 290 {'WER': 0.172632664, 'CER': 0.0680027682} {'WER': 0.1837319118, 'CER': 0.0599804251} {'WER': 0.111506313, 'CER': 0.05852830724} [malay, singlish]
mesolitica/finetune-whisper-small-ms-singlish-v2 967 {'WER': 0.13189875561, 'CER': 0.0434602169} {'WER': 0.13277694, 'CER': 0.0478108612} {'WER': 0.09489335668, 'CER': 0.05045327551} [malay, singlish]
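The WER and CER columns above are word and character error rates on each test set (lower is better). As a reference point, WER is the word-level edit distance between the reference transcript and the hypothesis, divided by the reference length. A minimal sketch, not the exact scoring script used to produce the table:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dynamic-programming edit distance table over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer('makan nasi lemak', 'makan nasi ayam')` gives one substitution over three words, ie 0.333.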

Load Whisper model#

def whisper(
    model: str = 'mesolitica/finetune-whisper-base-ms-singlish-v2',
    force_check: bool = True,
    **kwargs,
):
    """
    Load Finetuned models from HuggingFace.

    Parameters
    ----------
    model : str, optional (default='mesolitica/finetune-whisper-base-ms-singlish-v2')
        Check available models at `malaya_speech.stt.seq2seq.available_whisper()`.
    force_check: bool, optional (default=True)
        Force check that the model is one of the malaya models.
        Set to False if you have your own HuggingFace model.

    Returns
    -------
    result : whisper.model.Whisper class
    """
[9]:
model = malaya_speech.stt.seq2seq.whisper(model = 'mesolitica/finetune-whisper-base-ms-singlish-v2')

Generate#

You can read more at official repository, https://github.com/openai/whisper

[11]:
model = model.to('cpu')
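The model is moved to CPU here; if a CUDA GPU is available, decoding is considerably faster on it. A small device-selection sketch, assuming PyTorch is installed:

```python
import torch

# pick a GPU when one is available, otherwise fall back to CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# model = model.to(device)
# mel tensors should then follow with .to(model.device), as in the cells below
```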
[13]:
import whisper
[14]:
audio = whisper.load_audio('speech/khutbah/wadi-annuar.wav')
audio = whisper.pad_or_trim(audio)

mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = whisper.DecodingOptions(fp16 = False)
result = whisper.decode(model, mel, options)
result.text
[14]:
'dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah maha'
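`whisper.pad_or_trim` above fixes every input to Whisper's 30-second decoding window at 16 kHz, ie 480,000 samples: short clips are zero-padded, long ones truncated. A rough numpy re-implementation of that behavior for illustration only; use the library function in practice:

```python
import numpy as np

SAMPLE_RATE = 16000           # Whisper expects 16 kHz audio
N_SAMPLES = 30 * SAMPLE_RATE  # one 30-second window = 480000 samples

def pad_or_trim(array: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    # truncate anything longer than the window ...
    if array.shape[-1] > length:
        return array[..., :length]
    # ... and zero-pad anything shorter up to exactly `length` samples
    return np.pad(array, (0, length - array.shape[-1]))
```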
[16]:
audio = whisper.load_audio('speech/singlish/singlish0.wav')
audio = whisper.pad_or_trim(audio)

mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = whisper.DecodingOptions(fp16 = False)
result = whisper.decode(model, mel, options)
result.text
[16]:
'how they roll it in film okay actually'
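Because `pad_or_trim` caps each input at 30 seconds, the single-window decoding above truncates longer recordings. The openai-whisper package also provides `model.transcribe`, which slides the 30-second window across the whole file. A hedged sketch of a small wrapper (not executed in this notebook):

```python
def transcribe_file(model, path: str) -> str:
    """Transcribe audio of arbitrary length with Whisper's built-in
    chunked decoding instead of a single 30-second window."""
    # model.transcribe handles loading, resampling to 16 kHz, and
    # sliding the decoding window across the full recording
    result = model.transcribe(path, fp16=False)
    return result['text']

# usage: transcribe_file(model, 'speech/khutbah/wadi-annuar.wav')
```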