Speech-to-Text Seq2Seq Whisper#
Pretrained HuggingFace models fine-tuned on hyperlocal languages, https://huggingface.co/mesolitica
This tutorial is available as an IPython notebook at malaya-speech/example/stt-seq2seq-whisper.
This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.
Why official OpenAI Whisper instead of HuggingFace?#
Some implementations built on the official repository have evolved better features than the HuggingFace port, e.g. https://github.com/m-bain/whisperX
Install OpenAI Whisper#
Simply,
pip install openai-whisper
[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.
[2]:
import logging
logging.basicConfig(level=logging.INFO)
List available Whisper model#
[3]:
malaya_speech.stt.seq2seq.available_whisper()
INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
[3]:
| Model | Size (MB) | malay-malaya | malay-fleur102 | singlish | Language |
|---|---|---|---|---|---|
| mesolitica/finetune-whisper-tiny-ms-singlish | 151 | {'WER': 0.20141585, 'CER': 0.071964908} | {'WER': 0.235680975, 'CER': 0.0986880877} | {'WER': 0.09045121, 'CER': 0.0481965} | [malay, singlish] |
| mesolitica/finetune-whisper-tiny-ms-singlish-v2 | 151 | {'WER': 0.20141585, 'CER': 0.071964908} | {'WER': 0.22459602, 'CER': 0.089406469} | {'WER': 0.138882971, 'CER': 0.074929807} | [malay, singlish] |
| mesolitica/finetune-whisper-base-ms-singlish-v2 | 290 | {'WER': 0.172632664, 'CER': 0.0680027682} | {'WER': 0.1837319118, 'CER': 0.0599804251} | {'WER': 0.111506313, 'CER': 0.05852830724} | [malay, singlish] |
| mesolitica/finetune-whisper-small-ms-singlish-v2 | 967 | {'WER': 0.13189875561, 'CER': 0.0434602169} | {'WER': 0.13277694, 'CER': 0.0478108612} | {'WER': 0.09489335668, 'CER': 0.05045327551} | [malay, singlish] |
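The WER numbers in the table above are word error rates: the word-level edit distance between the reference and the hypothesis, divided by the reference length (CER is the same computed over characters). A minimal sketch of that metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # single-row dynamic-programming edit distance
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev holds the diagonal (previous row, previous column)
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,             # deletion
                d[j - 1] + 1,         # insertion
                prev + (r != h),      # substitution (free if words match)
            )
    return d[-1] / len(ref)

print(wer('saya suka makan nasi', 'saya suka makan mee'))  # 0.25
```

Lower is better; a WER of 0.09 on the singlish test set means roughly 9 words in 100 are wrong.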
Load Whisper model#
def whisper(
model: str = 'mesolitica/finetune-whisper-base-ms-singlish-v2',
force_check: bool = True,
**kwargs,
):
"""
Load Finetuned models from HuggingFace.
Parameters
----------
model : str, optional (default='mesolitica/finetune-whisper-base-ms-singlish-v2')
Check available models at `malaya_speech.stt.seq2seq.available_whisper()`.
force_check: bool, optional (default=True)
Force check the model is one of the Malaya models.
Set to False if you want to use your own HuggingFace model.
Returns
-------
result : whisper.model.Whisper class
"""
[9]:
model = malaya_speech.stt.seq2seq.whisper(model = 'mesolitica/finetune-whisper-base-ms-singlish-v2')
Generate#
You can read more at official repository, https://github.com/openai/whisper
[11]:
model = model.to('cpu')
[13]:
import whisper
[14]:
audio = whisper.load_audio('speech/khutbah/wadi-annuar.wav')
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = whisper.DecodingOptions(fp16 = False)
result = whisper.decode(model, mel, options)
result.text
[14]:
'dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni alah maha'
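In the cell above, `whisper.pad_or_trim` normalises the clip to Whisper's fixed 30-second input window at 16 kHz before the mel spectrogram is computed. A numpy sketch of that behaviour (the real helper also handles torch tensors and arbitrary axes):

```python
import numpy as np

# Whisper consumes fixed 30-second windows at 16 kHz,
# i.e. 30 * 16000 = 480000 samples per clip.
SAMPLE_RATE = 16000
N_SAMPLES = 30 * SAMPLE_RATE

def pad_or_trim_sketch(audio: np.ndarray, length: int = N_SAMPLES) -> np.ndarray:
    """Trim long clips, zero-pad short ones, to exactly `length` samples."""
    if audio.shape[-1] > length:
        return audio[..., :length]
    return np.pad(audio, (0, length - audio.shape[-1]))

short = np.ones(SAMPLE_RATE, dtype=np.float32)       # 1 second of audio
long = np.ones(40 * SAMPLE_RATE, dtype=np.float32)   # 40 seconds of audio
print(pad_or_trim_sketch(short).shape, pad_or_trim_sketch(long).shape)
```

This is why anything past 30 seconds is silently dropped in the single-window decode shown here; for longer audio, use `model.transcribe`, which slides the window over the full clip.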
[16]:
audio = whisper.load_audio('speech/singlish/singlish0.wav')
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = whisper.DecodingOptions(fp16 = False)
result = whisper.decode(model, mel, options)
result.text
[16]:
'how they roll it in film okay actually'
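The two cells above repeat the same load → pad → mel → decode steps, so they can be wrapped in a small helper. This is just a convenience sketch; `transcribe_clip` is our own name, not part of the whisper or malaya-speech API:

```python
def transcribe_clip(model, path: str, fp16: bool = False) -> str:
    """Decode a single (up to 30-second) clip with an OpenAI Whisper model,
    following the exact steps used in the cells above."""
    import whisper  # lazy import so the helper is importable without the package

    audio = whisper.load_audio(path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    options = whisper.DecodingOptions(fp16=fp16)
    return whisper.decode(model, mel, options).text
```

With the model loaded earlier, `transcribe_clip(model, 'speech/singlish/singlish0.wav')` reproduces the last result in one call.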