Language Detection

This tutorial is available as an IPython notebook at malaya-speech/example/language-detection.

This module is not language independent, so it not save to use on different languages. Pretrained models trained on hyperlocal languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

Dataset

Handpicked Youtube videos, gathered at https://github.com/huseinzol05/malaya-speech/tree/master/data/podcast

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
[2]:
y, sr = malaya_speech.load('speech/video/The-Singaporean-White-Boy.wav')
len(y), sr
[2]:
(1634237, 16000)
[3]:
# just going to take 30 seconds
y = y[:sr * 30]
[4]:
import IPython.display as ipd
ipd.Audio(y, rate = sr)
[4]:

This audio extracted from https://www.youtube.com/watch?v=HylaY5e1awo&t=2s

Supported languages

[5]:
malaya_speech.language_detection.labels
[5]:
['english',
 'indonesian',
 'malay',
 'mandarin',
 'manglish',
 'others',
 'not a language']

Here we are not trying to tackle all possible languages, just towards hyperlocal languages in Malaysia.

List available deep model

[6]:
malaya_speech.language_detection.available_model()
INFO:root:last accuracy during training session before early stopping.
[6]:
Size (MB) Quantized Size (MB) Accuracy
vggvox-v2 30.9 7.92 0.90204
deep-speaker 96.9 24.40 0.89450

Load deep model

def deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs):
    """
    Load language detection deep model.

    Parameters
    ----------
    model : str, optional (default='vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'vggvox-v2'`` - finetuned VGGVox V2.
        * ``'deep-speaker'`` - finetuned Deep Speaker.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.classification.load function
    """
[7]:
vggvox_v2 = malaya_speech.language_detection.deep_model(model = 'vggvox-v2')
deep_speaker = malaya_speech.language_detection.deep_model(model = 'deep-speaker')
/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
  warnings.warn('An interactive session is already active. This can '

Load Quantized deep model

To load 8-bit quantized model, simply pass quantized = True, default is False.

We can expect slightly accuracy drop from quantized model, and not necessary faster than normal 32-bit float model, totally depends on machine.

[8]:
quantized_vggvox_v2 = malaya_speech.language_detection.deep_model(model = 'vggvox-v2', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

How to classify languages in an audio sample

So we are going to use VAD to help us. Instead we are going to classify as a whole sample, we chunk it into multiple small samples and classify it.

[9]:
vad = malaya_speech.vad.deep_model(model = 'vggvox-v2')
[10]:
%%time

frames = list(malaya_speech.utils.generator.frames(y, 30, sr))
CPU times: user 1.67 ms, sys: 127 µs, total: 1.8 ms
Wall time: 1.88 ms
[11]:
p = Pipeline()
pipeline = (
    p.batching(5)
    .foreach_map(vad.predict)
    .flatten()
)
p.visualize()
[11]:
_images/load-language-detection_23_0.png
[12]:
%%time

result = p.emit(frames)
result.keys()
/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=512 is too small for input signal of length=480
  n_fft, y.shape[-1]
CPU times: user 31.8 s, sys: 6.18 s, total: 38 s
Wall time: 7.79 s
[12]:
dict_keys(['batching', 'predict', 'flatten'])
[13]:
frames_vad = [(frame, result['flatten'][no]) for no, frame in enumerate(frames)]
grouped_vad = malaya_speech.utils.group.group_frames(frames_vad)
grouped_vad = malaya_speech.utils.group.group_frames_threshold(grouped_vad, threshold_to_stop = 0.3)
[14]:
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, figsize = (15, 3))
_images/load-language-detection_26_0.png
[15]:
p_vggvox_v2 = Pipeline()
pipeline = (
    p_vggvox_v2.foreach_map(vggvox_v2)
    .flatten()
)
p_vggvox_v2.visualize()
[15]:
_images/load-language-detection_27_0.png
[16]:
p_deep_speaker = Pipeline()
pipeline = (
    p_deep_speaker.foreach_map(deep_speaker)
    .flatten()
)
p_deep_speaker.visualize()
[16]:
_images/load-language-detection_28_0.png
[17]:
%%time

samples_vad = [g[0] for g in grouped_vad]
result_vggvox_v2 = p_vggvox_v2.emit(samples_vad)
result_vggvox_v2.keys()
CPU times: user 4.84 s, sys: 988 ms, total: 5.83 s
Wall time: 1.36 s
[17]:
dict_keys(['language-detection', 'flatten'])
[18]:
%%time

samples_vad = [g[0] for g in grouped_vad]
result_deep_speaker = p_deep_speaker.emit(samples_vad)
result_deep_speaker.keys()
CPU times: user 4.98 s, sys: 778 ms, total: 5.76 s
Wall time: 1.53 s
[18]:
dict_keys(['language-detection', 'flatten'])
[19]:
samples_vad_vggvox_v2 = [(frame, result_vggvox_v2['flatten'][no]) for no, frame in enumerate(samples_vad)]
samples_vad_vggvox_v2
[19]:
[(<malaya_speech.model.frame.Frame at 0x13edc9f50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13d63df50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x1425c5d50>, 'manglish'),
 (<malaya_speech.model.frame.Frame at 0x1425c5890>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x1084da510>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x1425c5c90>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8505d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8508d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861350>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13d63dd10>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861290>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8612d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e861390>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8613d0>, 'manglish'),
 (<malaya_speech.model.frame.Frame at 0x13e861410>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861490>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861450>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861510>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861550>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8614d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e8615d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861590>, 'malay')]
[20]:
samples_vad_deep_speaker = [(frame, result_deep_speaker['flatten'][no]) for no, frame in enumerate(samples_vad)]
samples_vad_deep_speaker
[20]:
[(<malaya_speech.model.frame.Frame at 0x13edc9f50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13d63df50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x1425c5d50>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x1425c5890>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x1084da510>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x1425c5c90>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8505d0>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8508d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861350>, 'others'),
 (<malaya_speech.model.frame.Frame at 0x13d63dd10>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861290>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8612d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e861390>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8613d0>, 'others'),
 (<malaya_speech.model.frame.Frame at 0x13e861410>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e861490>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861450>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861510>, 'others'),
 (<malaya_speech.model.frame.Frame at 0x13e861550>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8614d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e8615d0>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861590>, 'others')]
[21]:
import matplotlib.pyplot as plt
[22]:
nrows = 3
fig, ax = plt.subplots(nrows = nrows, ncols = 1)
fig.set_figwidth(20)
fig.set_figheight(nrows * 3)
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, ax = ax[0])
malaya_speech.extra.visualization.plot_classification(samples_vad_vggvox_v2,
                                                      'language detection vggvox v2', ax = ax[1])
malaya_speech.extra.visualization.plot_classification(samples_vad_deep_speaker,
                                                      'language detection deep speaker', ax = ax[2])
fig.tight_layout()
plt.show()
_images/load-language-detection_34_0.png
[23]:
p_quantized_vggvox_v2 = Pipeline()
pipeline = (
    p_quantized_vggvox_v2.foreach_map(quantized_vggvox_v2)
    .flatten()
)
p_quantized_vggvox_v2.visualize()
[23]:
_images/load-language-detection_35_0.png
[24]:
%%time

samples_vad = [g[0] for g in grouped_vad]
result_quantized_vggvox_v2 = p_quantized_vggvox_v2.emit(samples_vad)
result_quantized_vggvox_v2.keys()
CPU times: user 4.78 s, sys: 870 ms, total: 5.65 s
Wall time: 1.33 s
[24]:
dict_keys(['language-detection', 'flatten'])
[25]:
samples_vad_quantized_vggvox_v2 = [(frame, result_quantized_vggvox_v2['flatten'][no]) for no, frame in enumerate(samples_vad)]
samples_vad_quantized_vggvox_v2
[25]:
[(<malaya_speech.model.frame.Frame at 0x13edc9f50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13d63df50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x1425c5d50>, 'manglish'),
 (<malaya_speech.model.frame.Frame at 0x1425c5890>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x1084da510>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x1425c5c90>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8505d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8508d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861350>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13d63dd10>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861290>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8612d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e861390>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8613d0>, 'others'),
 (<malaya_speech.model.frame.Frame at 0x13e861410>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861490>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861450>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861510>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861550>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8614d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e8615d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861590>, 'malay')]
[26]:
nrows = 3
fig, ax = plt.subplots(nrows = nrows, ncols = 1)
fig.set_figwidth(20)
fig.set_figheight(nrows * 3)
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, ax = ax[0])
malaya_speech.extra.visualization.plot_classification(samples_vad_vggvox_v2,
                                                      'language detection vggvox v2', ax = ax[1])
malaya_speech.extra.visualization.plot_classification(samples_vad_quantized_vggvox_v2,
                                                      'language detection quantized vggvox v2', ax = ax[2])
fig.tight_layout()
plt.show()
_images/load-language-detection_38_0.png

Reference

  1. The Singaporean White Boy - The Shan and Rozz Show: EP7, https://www.youtube.com/watch?v=HylaY5e1awo&t=2s&ab_channel=Clicknetwork

[ ]: