Speech-to-Text RNNT PyTorch Multilanguage#

Encoder model + RNNT loss using PyTorch

This tutorial is available as an IPython notebook at malaya-speech/example/stt-transducer-model-pt-multilanguage.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
`pyaudio` is not available, `malaya_speech.streaming.pyaudio` is not able to use.
[3]:
import logging

logging.basicConfig(level=logging.INFO)

List available RNNT model#

[4]:
malaya_speech.stt.transducer.available_pt_transformer()
INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `whisper-mixed` language, tested on semisupervised Whisper Large V2 test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
[4]:
| | Size (MB) | malay-malaya | malay-fleur102 | Language | singlish | whisper-mixed |
|---|---|---|---|---|---|---|
| mesolitica/conformer-tiny | 38.5 | {'WER': 0.17341180814, 'CER': 0.05957485024} | {'WER': 0.19524478979, 'CER': 0.0830808938} | [malay] | NaN | NaN |
| mesolitica/conformer-base | 121 | {'WER': 0.122076123261, 'CER': 0.03879606324} | {'WER': 0.1326737206665, 'CER': 0.05032914857} | [malay] | NaN | NaN |
| mesolitica/conformer-medium | 243 | {'WER': 0.1054817492564, 'CER': 0.0313518992842} | {'WER': 0.1172708897486, 'CER': 0.0431050488} | [malay] | NaN | NaN |
| mesolitica/emformer-base | 162 | {'WER': 0.175762423786, 'CER': 0.06233919000537} | {'WER': 0.18303839134, 'CER': 0.0773853362} | [malay] | NaN | NaN |
| mesolitica/conformer-base-singlish | 121 | NaN | NaN | [singlish] | {'WER': 0.06517537334361, 'CER': 0.03265430876} | NaN |
| mesolitica/conformer-medium-mixed | 243 | {'WER': 0.111166517935, 'CER': 0.03410958328} | {'WER': 0.108354748, 'CER': 0.037785722} | [malay, singlish] | {'WER': 0.091969755225, 'CER': 0.044627194623} | NaN |
| mesolitica/conformer-medium-malay-whisper | 243 | {'WER': 0.092561502, 'CER': 0.0245421736} | {'WER': 0.097128574, 'CER': 0.03392603} | [malay, mixed] | NaN | {'WER': 0.1705298134, 'CER': 0.10580679153} |
| mesolitica/conformer-large-malay-whisper | 413 | {'WER': 0.10028492039, 'CER': 0.0310868406} | {'WER': 0.09544850396, 'CER': 0.03258454692} | [malay, mixed] | NaN | {'WER': 0.20429079189, 'CER': 0.12111372327} |
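
One way to read the table is to pick the lowest-WER model that fits a size budget. The sketch below copies the size and `malay-malaya` WER values from the table into a plain dict. The helper `best_model_under` is hypothetical and not part of malaya-speech, and `mesolitica/conformer-base-singlish` is left out because it has no `malay-malaya` score.

```python
# Size (MB) and malay-malaya WER copied from the table above.
models = {
    'mesolitica/conformer-tiny': {'size_mb': 38.5, 'wer': 0.17341180814},
    'mesolitica/conformer-base': {'size_mb': 121, 'wer': 0.122076123261},
    'mesolitica/conformer-medium': {'size_mb': 243, 'wer': 0.1054817492564},
    'mesolitica/emformer-base': {'size_mb': 162, 'wer': 0.175762423786},
    'mesolitica/conformer-medium-mixed': {'size_mb': 243, 'wer': 0.111166517935},
    'mesolitica/conformer-medium-malay-whisper': {'size_mb': 243, 'wer': 0.092561502},
    'mesolitica/conformer-large-malay-whisper': {'size_mb': 413, 'wer': 0.10028492039},
}


def best_model_under(models, max_size_mb):
    """Hypothetical helper: lowest-WER model whose size fits the budget, or None."""
    candidates = {
        name: info for name, info in models.items()
        if info['size_mb'] <= max_size_mb
    }
    if not candidates:
        return None
    return min(candidates, key=lambda name: candidates[name]['wer'])


print(best_model_under(models, 150))  # mesolitica/conformer-base
print(best_model_under(models, 500))  # mesolitica/conformer-medium-malay-whisper
```

The same pattern works for any other column: swap in the WER values of the test set that matches your target language.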
[5]:
malaya_speech.stt.google_accuracy
[5]:
{'malay-malaya': {'WER': 0.16477548774, 'CER': 0.05973209121},
 'malay-fleur102': {'WER': 0.109588779, 'CER': 0.047891527},
 'singlish': {'WER': 0.4941349, 'CER': 0.3026296}}
[6]:
malaya_speech.stt.whisper_accuracy
[6]:
{'tiny': {'Size (MB)': 72.1,
  'malay-malaya': {'WER': 0.7897730947, 'CER': 0.341671582346},
  'malay-fleur102': {'WER': 0.640224185, 'CER': 0.2869274323},
  'singlish': {'WER': 0.4751720563, 'CER': 0.35132630877}},
 'base': {'Size (MB)': 139,
  'malay-malaya': {'WER': 0.5138481614, 'CER': 0.19487665487},
  'malay-fleur102': {'WER': 0.4268323797, 'CER': 0.1545261803},
  'singlish': {'WER': 0.5354453439, 'CER': 0.4287910359}},
 'small': {'Size (MB)': 461,
  'malay-malaya': {'WER': 0.2818371132, 'CER': 0.09588120693},
  'malay-fleur102': {'WER': 0.2436472703, 'CER': 0.0913692568},
  'singlish': {'WER': 0.5971608337, 'CER': 0.5003890601}},
 'medium': {'Size (MB)': 1400,
  'malay-malaya': {'WER': 0.18945585961, 'CER': 0.0658303076},
  'malay-fleur102': {'WER': 0.1647166507, 'CER': 0.065537127},
  'singlish': {'WER': 0.68563087121, 'CER': 0.601676254253}},
 'large-v2': {'Size (MB)': 2900,
  'malay-malaya': {'WER': 0.1585939185, 'CER': 0.054978161091},
  'malay-fleur102': {'WER': 0.127483122485, 'CER': 0.05648688907},
  'singlish': {'WER': 0.6174993839, 'CER': 0.54582068858}}}

Be skeptical of the Google and Whisper accuracies: the test sets were postprocessed with malaya-speech, which can inflate WER and CER for these systems.
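
To make the effect of postprocessing concrete, here is a minimal sketch of WER and CER based on Levenshtein edit distance. It is not the evaluation script behind the numbers above. The functions `edit_distance`, `wer` and `cer` are illustrative, the reference string is an assumed transcript, and CER here counts spaces as characters, which is also an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate; assumes a non-empty reference."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate over raw characters; assumes a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)


# Assumed lowercase reference versus a hypothesis that differs only in casing.
ref = 'testing nama saya hussein bin zulkifli'
hyp = 'testing Nama saya Hussein bin Zulkifli'

print(wer(ref, hyp))          # 0.5, three of six words differ by case
print(wer(ref, hyp.lower()))  # 0.0 once both sides are normalised
print(cer(ref, hyp))
```

A system that emits capitalised or punctuated text is penalised against a lowercased reference unless its output is normalised the same way, which is why the Google and Whisper numbers deserve caution.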

Load RNNT model#

def pt_transformer(
    model: str = 'mesolitica/conformer-base',
    **kwargs,
):
    """
    Load Encoder-Transducer ASR model using Pytorch.

    Parameters
    ----------
    model : str, optional (default='mesolitica/conformer-base')
        Check available models at `malaya_speech.stt.transducer.available_pt_transformer()`.

    Returns
    -------
    result : malaya_speech.torch_model.torchaudio.Conformer class
    """
[7]:
model_mixed = malaya_speech.stt.transducer.pt_transformer(model = 'mesolitica/conformer-medium-mixed')
INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-mixed/model.pt
INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-mixed/malay-stt.model
INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-mixed/malay-stats.json
[8]:
medium_whisper = malaya_speech.stt.transducer.pt_transformer(model = 'mesolitica/conformer-medium-malay-whisper')
INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-malay-whisper/model.pt
INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-malay-whisper/malay-stt.model
INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-malay-whisper/malay-stats.json
[9]:
_ = model_mixed.eval()
_ = medium_whisper.eval()

Load sample#

[10]:
from datasets import Audio

sr = 16000
audio = Audio(sampling_rate=sr)
[11]:
y, _ = malaya_speech.load('speech/example-speaker/husein-zolkepli.wav')
y1 = audio.decode_example(audio.encode_example('speech/example-speaker/husein-zolkepli-mixed-1.mp3'))['array']
y2 = audio.decode_example(audio.encode_example('speech/example-speaker/husein-zolkepli-mixed-2.mp3'))['array']
[12]:
import IPython.display as ipd

ipd.Audio(y, rate = sr)
[13]:
ipd.Audio(y1, rate = sr)
[14]:
ipd.Audio(y2, rate = sr)

Predict using beam decoder#

def beam_decoder(self, inputs, beam_width: int = 20):
    """
    Transcribe inputs using beam decoder.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    beam_width: int, optional (default=20)
        beam size for beam decoder.

    Returns
    -------
    result: List[str]
    """
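
To illustrate what `beam_width` controls, the toy sketch below runs a generic beam search over a hand-written conditional model. It is not the RNNT beam search used by malaya-speech. `TOY_MODEL` and `beam_search` are hypothetical, and the probabilities are made up so that a greedy search (width 1) misses the best sequence.

```python
import math

# Toy conditional model: next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {'a': 0.6, 'b': 0.4},
    ('a',): {'a': 0.55, 'b': 0.45},
    ('b',): {'a': 0.9, 'b': 0.1},
}


def beam_search(model, num_steps, beam_width):
    """Keep the `beam_width` highest-scoring prefixes at every step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(num_steps):
        expanded = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                expanded.append((prefix + (token,), score + math.log(prob)))
        # Prune to the best `beam_width` hypotheses before the next step.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


greedy_tokens, greedy_score = beam_search(TOY_MODEL, num_steps=2, beam_width=1)
beam_tokens, beam_score = beam_search(TOY_MODEL, num_steps=2, beam_width=2)

print(greedy_tokens, round(math.exp(greedy_score), 2))  # ('a', 'a') 0.33
print(beam_tokens, round(math.exp(beam_score), 2))      # ('b', 'a') 0.36
```

A wider beam keeps more partial hypotheses alive, so it can recover sequences that start with a less likely token. The cost is more computation per step.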
[15]:
%%time

model_mixed.beam_decoder([y, y1, y2])
CPU times: user 26.5 s, sys: 758 ms, total: 27.2 s
Wall time: 2.38 s
[15]:
['testing nama saya hussein bin zulkifli',
 'hello nama saya mesin i hate fish but like three chicken thank you',
 'oh hari ini saya nak cakap tentang harian saya sampai is good something is bad but most of the day is good markets avanition sister mainan di ruang']
[16]:
%%time

medium_whisper.beam_decoder([y, y1, y2])
CPU times: user 48.1 s, sys: 550 ms, total: 48.7 s
Wall time: 4.5 s
[16]:
['testing nama saya hussein bin zulcaply',
 'hello nama saya hussein i hate fish but lighty chicken thank you',
 'hari ini saya nak cakap tentang harian saya something is good something is bad but most of the day is good markets affanny electoral dan saya suka main dengan di ruang']
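
The two models mostly agree, and a word-level diff makes the disagreements easier to spot. The sketch below uses Python's standard `difflib` on the second transcript from each model, copied from the outputs above. The helper `word_diff` is illustrative and not part of malaya-speech.

```python
import difflib

# Second transcript from each model, copied from the outputs above.
mixed = 'hello nama saya mesin i hate fish but like three chicken thank you'
whisper = 'hello nama saya hussein i hate fish but lighty chicken thank you'


def word_diff(a, b):
    """Return (words_in_a, words_in_b) pairs for every span where a and b differ."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    return [
        (' '.join(a_words[i1:i2]), ' '.join(b_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]


for left, right in word_diff(mixed, whisper):
    print(f'{left!r} -> {right!r}')
```

Without a ground-truth transcript this only shows where the models disagree, not which one is right.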

Compare with Google STT#

[17]:
# Note: this rebinds `sr`, which held the sampling rate (16000) in the cells above.
import speech_recognition as sr

r = sr.Recognizer()
[24]:
import soundfile as sf

sf.write('test-mixed1.wav', y1, 16000)
sf.write('test-mixed2.wav', y2, 16000)
[26]:
with sr.AudioFile('speech/example-speaker/husein-zolkepli.wav') as source:
    a = r.record(source)

text = r.recognize_google(a, language = 'ms')
text
[26]:
'testing Nama saya Hussein bin Zulkifli'
[22]:
with sr.AudioFile('test-mixed1.wav') as source:
    a = r.record(source)

text = r.recognize_google(a, language = 'ms')
text
[22]:
'Helo nama saya Hussein Aidil Hafiz lagi pun Thank you'
[25]:
with sr.AudioFile('test-mixed2.wav') as source:
    a = r.record(source)

text = r.recognize_google(a, language = 'ms')
text
[25]:
'sains nak cakap dengan angah harian saya macam saya juga nak tengok cepat sebab musuh boleh diskaun semakin Zam pahala kita hujan Saya suka main dia orang'

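Google STT handles the pure Malay sample well, but its output on the two mixed Malay-English samples is straight bad.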