Language Detection#

This tutorial is available as an IPython notebook at malaya-speech/example/language-detection.

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.

This is an application of the malaya-speech Pipeline. Read more about the malaya-speech Pipeline at malaya-speech/example/pipeline.

Dataset#

The dataset consists of handpicked YouTube videos, gathered at https://github.com/huseinzol05/malaya-speech/tree/master/data/podcast

[1]:

import malaya_speech
import numpy as np
from malaya_speech import Pipeline

[2]:

y, sr = malaya_speech.load('speech/video/The-Singaporean-White-Boy.wav')
len(y), sr

[2]:

(1634237, 16000)

[3]:

# keep only the first 30 seconds (sr samples per second)
y = y[:sr * 30]
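
The slice above works because a duration in seconds maps to a sample count by multiplying by the sample rate. A minimal sketch of that arithmetic in plain Python, using the values printed in cell [2]; the helper name `seconds_to_samples` is hypothetical and not part of malaya-speech:

```python
# Values printed by cell [2] in this tutorial.
total_samples = 1634237
sr = 16000

def seconds_to_samples(seconds, sample_rate):
    """Number of samples covering `seconds` of audio (hypothetical helper)."""
    return int(seconds * sample_rate)

full_duration = total_samples / sr          # length of the whole clip, in seconds
clip_samples = seconds_to_samples(30, sr)   # same value as sr * 30 in cell [3]

print(round(full_duration, 2))  # 102.14
print(clip_samples)             # 480000
```

So the tutorial keeps roughly the first 30 of about 102 seconds of audio.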

[4]:

import IPython.display as ipd
ipd.Audio(y, rate = sr)

[4]:

This audio was extracted from https://www.youtube.com/watch?v=HylaY5e1awo&t=2s

Supported languages#

[5]:

malaya_speech.language_detection.labels

[5]:

['english',
 'indonesian',
 'malay',
 'mandarin',
 'manglish',
 'others',
 'not a language']

We are not trying to cover every possible language here, only the hyperlocal languages spoken in Malaysia.

List available deep model#

[6]:

malaya_speech.language_detection.available_model()

INFO:root:last accuracy during training session before early stopping.

[6]:

Model          Size (MB)   Quantized Size (MB)   Accuracy
vggvox-v2      30.9        7.92                  0.90204
deep-speaker   96.9        24.40                 0.89450
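
To make the size columns easier to compare, the short sketch below computes how much smaller each quantized model is. The numbers are copied from the table above; the variable names are only for illustration.

```python
# Sizes in MB copied from the table above: (float size, quantized size).
sizes_mb = {
    'vggvox-v2': (30.9, 7.92),
    'deep-speaker': (96.9, 24.40),
}

# Ratio of the normal model size to the quantized model size.
ratios = {name: full / quantized for name, (full, quantized) in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f'{name}: about {ratio:.2f}x smaller when quantized')
```

Both ratios come out close to 4, which is what one would expect when 32-bit floats are stored as 8-bit values.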

Load deep model#

def deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs):
    """
    Load language detection deep model.

    Parameters
    ----------
    model : str, optional (default='vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'vggvox-v2'`` - finetuned VGGVox V2.
        * ``'deep-speaker'`` - finetuned Deep Speaker.
    quantized : bool, optional (default=False)
        If True, load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.classification.load function
    """

[7]:

vggvox_v2 = malaya_speech.language_detection.deep_model(model = 'vggvox-v2')
deep_speaker = malaya_speech.language_detection.deep_model(model = 'deep-speaker')

/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
  warnings.warn('An interactive session is already active. This can '

Load Quantized deep model#

To load the 8-bit quantized model, pass quantized = True. The default is False.

Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

[8]:

quantized_vggvox_v2 = malaya_speech.language_detection.deep_model(model = 'vggvox-v2', quantized = True)

WARNING:root:Load quantized model will cause accuracy drop.
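
To give some intuition for why an 8-bit model can lose a little accuracy, here is a generic sketch of affine 8-bit quantization in plain Python. It is an illustration of the general idea only, not how malaya-speech or TensorFlow quantizes these models; the functions `quantize` and `dequantize` and the example weights are made up for this sketch.

```python
def quantize(values, num_bits=8):
    """Map floats onto integer codes in [0, 2**num_bits - 1] (illustrative only)."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]

weights = [-0.82, -0.1, 0.0, 0.33, 0.91]   # made-up example weights
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)

# Rounding to the nearest code loses at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes)
print(max_error, scale / 2)
```

Each restored weight differs from the original by at most half a quantization step, and those small differences are what can shift a prediction near a decision boundary.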

How to classify languages in an audio sample#

We use voice activity detection (VAD) to help here. Instead of classifying the whole sample at once, we split it into many small samples and classify each one.

[9]:

vad = malaya_speech.vad.deep_model(model = 'vggvox-v2')

[10]:

%%time

frames = list(malaya_speech.utils.generator.frames(y, 30, sr))

CPU times: user 1.67 ms, sys: 127 µs, total: 1.8 ms
Wall time: 1.88 ms
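
The framing step can be pictured with the plain-Python sketch below. It assumes the second argument of `malaya_speech.utils.generator.frames` is the frame duration in milliseconds, which is consistent with the length=480 reported by the librosa warning later in this tutorial. The helper `split_frames` is hypothetical, and the library may treat a trailing partial frame differently.

```python
def split_frames(samples, frame_ms, sample_rate):
    """Split `samples` into consecutive chunks of `frame_ms` milliseconds.

    Hypothetical helper; a trailing chunk shorter than one frame is dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_full = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_full)]

sr = 16000
samples = [0.0] * (sr * 30)            # stand-in for the 30-second clip `y`
chunks = split_frames(samples, 30, sr)

print(len(chunks[0]))  # 480
print(len(chunks))     # 1000
```

Under these assumptions, 30 seconds of 16 kHz audio yields 1000 frames of 480 samples each.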

[11]:

p = Pipeline()
pipeline = (
    p.batching(5)
    .foreach_map(vad.predict)
    .flatten()
)
p.visualize()

[11]:

_images/load-language-detection_23_0.png

[12]:

%%time

result = p.emit(frames)
result.keys()

/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=512 is too small for input signal of length=480
  n_fft, y.shape[-1]

CPU times: user 31.8 s, sys: 6.18 s, total: 38 s
Wall time: 7.79 s

[12]:

dict_keys(['batching', 'predict', 'flatten'])
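
The three pipeline stages can be mimicked with ordinary Python functions. This is only an analogue to show the data flow, not the Pipeline implementation; `fake_predict` is a made-up stand-in for `vad.predict` that returns one boolean per input item.

```python
def batching(items, size):
    """Group items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def foreach_map(batches, fn):
    """Apply `fn` to every batch."""
    return [fn(batch) for batch in batches]

def flatten(nested):
    """Concatenate per-batch results back into a single list."""
    return [item for group in nested for item in group]

def fake_predict(batch):
    # Stand-in for vad.predict: one boolean per input item.
    return [value > 0 for value in batch]

frames = [0, 3, 0, 5, 1, 0, 2]
batches = batching(frames, 5)
flat = flatten(foreach_map(batches, fake_predict))

print(batches)  # [[0, 3, 0, 5, 1], [0, 2]]
print(flat)     # [False, True, False, True, True, False, True]
```

Because flattening restores one result per frame in the original order, the flattened output can be paired with the frames by index, as the next cell does.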

[13]:

frames_vad = [(frame, result['flatten'][no]) for no, frame in enumerate(frames)]
grouped_vad = malaya_speech.utils.group.group_frames(frames_vad)
grouped_vad = malaya_speech.utils.group.group_frames_threshold(grouped_vad, threshold_to_stop = 0.3)

[14]:

malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, figsize = (15, 3))

_images/load-language-detection_26_0.png
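
The function names suggest that `group_frames` merges neighbouring frames that share the same VAD label into longer segments. The sketch below shows that idea with `itertools.groupby` on made-up data. It is an assumption about the behaviour rather than the library's code: the real function presumably returns combined Frame objects, and the extra `group_frames_threshold` step is not reproduced here.

```python
from itertools import groupby

def group_by_label(frame_label_pairs):
    """Merge runs of neighbouring frames that share a label (illustrative only)."""
    grouped = []
    for label, run in groupby(frame_label_pairs, key=lambda pair: pair[1]):
        frames_in_run = [frame for frame, _ in run]
        grouped.append((frames_in_run, label))
    return grouped

# Made-up (frame, is_speech) pairs standing in for frames_vad.
pairs = [('f0', False), ('f1', True), ('f2', True), ('f3', False), ('f4', False)]
grouped = group_by_label(pairs)

print(grouped)
# [(['f0'], False), (['f1', 'f2'], True), (['f3', 'f4'], False)]
```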

[15]:

p_vggvox_v2 = Pipeline()
pipeline = (
    p_vggvox_v2.foreach_map(vggvox_v2)
    .flatten()
)
p_vggvox_v2.visualize()

[15]:

_images/load-language-detection_27_0.png

[16]:

p_deep_speaker = Pipeline()
pipeline = (
    p_deep_speaker.foreach_map(deep_speaker)
    .flatten()
)
p_deep_speaker.visualize()

[16]:

_images/load-language-detection_28_0.png

[17]:

%%time

samples_vad = [g[0] for g in grouped_vad]
result_vggvox_v2 = p_vggvox_v2.emit(samples_vad)
result_vggvox_v2.keys()

CPU times: user 4.84 s, sys: 988 ms, total: 5.83 s
Wall time: 1.36 s

[17]:

dict_keys(['language-detection', 'flatten'])

[18]:

%%time

samples_vad = [g[0] for g in grouped_vad]
result_deep_speaker = p_deep_speaker.emit(samples_vad)
result_deep_speaker.keys()

CPU times: user 4.98 s, sys: 778 ms, total: 5.76 s
Wall time: 1.53 s

[18]:

dict_keys(['language-detection', 'flatten'])

[19]:

samples_vad_vggvox_v2 = [(frame, result_vggvox_v2['flatten'][no]) for no, frame in enumerate(samples_vad)]
samples_vad_vggvox_v2

[19]:

[(<malaya_speech.model.frame.Frame at 0x13edc9f50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13d63df50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x1425c5d50>, 'manglish'),
 (<malaya_speech.model.frame.Frame at 0x1425c5890>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x1084da510>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x1425c5c90>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8505d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8508d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861350>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13d63dd10>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861290>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8612d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e861390>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8613d0>, 'manglish'),
 (<malaya_speech.model.frame.Frame at 0x13e861410>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861490>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861450>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861510>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861550>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8614d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e8615d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861590>, 'malay')]
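
A quick way to summarise this output is to tally the labels. The sketch below copies the 22 labels from the vggvox-v2 result above, in order, and counts them with `collections.Counter`. Note that it counts segments, not seconds, so longer segments are not weighted more heavily.

```python
from collections import Counter

NAL = 'not a language'

# Labels copied in order from the vggvox-v2 output above.
vggvox_v2_labels = [
    NAL, NAL, 'manglish', 'malay', 'english', 'english', 'malay', 'malay',
    'malay', 'malay', 'malay', NAL, 'malay', 'manglish', 'english', 'english',
    'english', 'english', 'english', NAL, 'malay', 'malay',
]

counts = Counter(vggvox_v2_labels)

# Ignore non-speech segments when looking for the dominant language.
speech_only = Counter(label for label in vggvox_v2_labels if label != NAL)
dominant, n_segments = speech_only.most_common(1)[0]

print(dict(counts))
print(dominant, n_segments)  # malay 9
```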

[20]:

samples_vad_deep_speaker = [(frame, result_deep_speaker['flatten'][no]) for no, frame in enumerate(samples_vad)]
samples_vad_deep_speaker

[20]:

[(<malaya_speech.model.frame.Frame at 0x13edc9f50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13d63df50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x1425c5d50>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x1425c5890>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x1084da510>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x1425c5c90>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8505d0>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8508d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861350>, 'others'),
 (<malaya_speech.model.frame.Frame at 0x13d63dd10>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861290>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8612d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e861390>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8613d0>, 'others'),
 (<malaya_speech.model.frame.Frame at 0x13e861410>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e861490>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861450>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861510>, 'others'),
 (<malaya_speech.model.frame.Frame at 0x13e861550>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8614d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e8615d0>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861590>, 'others')]
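
Since both models classified the same 22 segments, their outputs can be compared position by position. The sketch below copies the labels from the two outputs above and computes how often they agree; the variable names are for illustration only.

```python
NAL = 'not a language'

# Labels copied in order from the vggvox-v2 and deep-speaker outputs above.
vggvox_v2_labels = [
    NAL, NAL, 'manglish', 'malay', 'english', 'english', 'malay', 'malay',
    'malay', 'malay', 'malay', NAL, 'malay', 'manglish', 'english', 'english',
    'english', 'english', 'english', NAL, 'malay', 'malay',
]
deep_speaker_labels = [
    NAL, NAL, 'malay', 'malay', 'english', 'english', 'english', 'malay',
    'others', 'malay', 'english', NAL, 'malay', 'others', NAL, 'english',
    'english', 'others', 'english', NAL, 'english', 'others',
]

# Count the positions where both models predict the same label.
agree = sum(a == b for a, b in zip(vggvox_v2_labels, deep_speaker_labels))
agreement_rate = agree / len(vggvox_v2_labels)

print(agree, len(vggvox_v2_labels))  # 13 22
print(round(agreement_rate, 2))      # 0.59
```

On this clip the two models agree on 13 of 22 segments, so their predictions are worth inspecting side by side, as the plot further down does.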

[21]:

import matplotlib.pyplot as plt

[22]:

nrows = 3
fig, ax = plt.subplots(nrows = nrows, ncols = 1)
fig.set_figwidth(20)
fig.set_figheight(nrows * 3)
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, ax = ax[0])
malaya_speech.extra.visualization.plot_classification(samples_vad_vggvox_v2,
                                                      'language detection vggvox v2', ax = ax[1])
malaya_speech.extra.visualization.plot_classification(samples_vad_deep_speaker,
                                                      'language detection deep speaker', ax = ax[2])
fig.tight_layout()
plt.show()

_images/load-language-detection_34_0.png

[23]:

p_quantized_vggvox_v2 = Pipeline()
pipeline = (
    p_quantized_vggvox_v2.foreach_map(quantized_vggvox_v2)
    .flatten()
)
p_quantized_vggvox_v2.visualize()

[23]:

_images/load-language-detection_35_0.png

[24]:

%%time

samples_vad = [g[0] for g in grouped_vad]
result_quantized_vggvox_v2 = p_quantized_vggvox_v2.emit(samples_vad)
result_quantized_vggvox_v2.keys()

CPU times: user 4.78 s, sys: 870 ms, total: 5.65 s
Wall time: 1.33 s

[24]:

dict_keys(['language-detection', 'flatten'])

[25]:

samples_vad_quantized_vggvox_v2 = [(frame, result_quantized_vggvox_v2['flatten'][no]) for no, frame in enumerate(samples_vad)]
samples_vad_quantized_vggvox_v2

[25]:

[(<malaya_speech.model.frame.Frame at 0x13edc9f50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13d63df50>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x1425c5d50>, 'manglish'),
 (<malaya_speech.model.frame.Frame at 0x1425c5890>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x1084da510>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x1425c5c90>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8505d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8508d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861350>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13d63dd10>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861290>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8612d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e861390>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e8613d0>, 'others'),
 (<malaya_speech.model.frame.Frame at 0x13e861410>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861490>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861450>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861510>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e861550>, 'english'),
 (<malaya_speech.model.frame.Frame at 0x13e8614d0>, 'not a language'),
 (<malaya_speech.model.frame.Frame at 0x13e8615d0>, 'malay'),
 (<malaya_speech.model.frame.Frame at 0x13e861590>, 'malay')]
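
To see what quantization changed on this clip, the sketch below lists the positions where the quantized vggvox-v2 output differs from the normal vggvox-v2 output shown earlier. The labels are copied from the two outputs in this tutorial.

```python
NAL = 'not a language'

# Labels copied in order from the normal and quantized vggvox-v2 outputs.
float_labels = [
    NAL, NAL, 'manglish', 'malay', 'english', 'english', 'malay', 'malay',
    'malay', 'malay', 'malay', NAL, 'malay', 'manglish', 'english', 'english',
    'english', 'english', 'english', NAL, 'malay', 'malay',
]
quantized_labels = [
    NAL, NAL, 'manglish', 'malay', 'english', 'english', 'malay', 'malay',
    'malay', 'malay', 'malay', NAL, 'malay', 'others', 'english', 'english',
    'english', 'english', 'english', NAL, 'malay', 'malay',
]

# (segment index, normal label, quantized label) for every mismatch.
mismatches = [
    (i, a, b)
    for i, (a, b) in enumerate(zip(float_labels, quantized_labels))
    if a != b
]

print(mismatches)  # [(13, 'manglish', 'others')]
```

Only one of the 22 segments changes label here, which fits the expectation of a slight accuracy drop from quantization.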

[26]:

nrows = 3
fig, ax = plt.subplots(nrows = nrows, ncols = 1)
fig.set_figwidth(20)
fig.set_figheight(nrows * 3)
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, ax = ax[0])
malaya_speech.extra.visualization.plot_classification(samples_vad_vggvox_v2,
                                                      'language detection vggvox v2', ax = ax[1])
malaya_speech.extra.visualization.plot_classification(samples_vad_quantized_vggvox_v2,
                                                      'language detection quantized vggvox v2', ax = ax[2])
fig.tight_layout()
plt.show()

_images/load-language-detection_38_0.png

Reference#

The Singaporean White Boy - The Shan and Rozz Show: EP7, https://www.youtube.com/watch?v=HylaY5e1awo&t=2s&ab_channel=Clicknetwork
