Age Detection#

This tutorial is available as an IPython notebook at malaya-speech/example/age-detection.

This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

Dataset#

Trained on the age column from the Mozilla Common Voice dataset with augmented noises, https://commonvoice.mozilla.org/

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
[2]:
y, sr = malaya_speech.load('speech/video/The-Singaporean-White-Boy.wav')
len(y), sr
[2]:
(1634237, 16000)
[3]:
# just going to take 30 seconds
y = y[:sr * 30]
[4]:
import IPython.display as ipd
ipd.Audio(y, rate = sr)
[4]:

This audio was extracted from https://www.youtube.com/watch?v=HylaY5e1awo&t=2s

Supported age#

[5]:
malaya_speech.age_detection.labels
[5]:
['teens',
 'twenties',
 'thirties',
 'fourties',
 'fifties',
 'sixties',
 'seventies',
 'eighties',
 'nineties',
 'not an age']

List available deep model#

[6]:
malaya_speech.age_detection.available_model()
INFO:root:last accuracy during training session before early stopping.
[6]:
              Size (MB)  Quantized Size (MB)  Accuracy
vggvox-v2          30.9                 7.92   0.57523
deep-speaker       96.9                24.40   0.58584

Load deep model#

def deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs):
    """
    Load age detection deep model.

    Parameters
    ----------
    model : str, optional (default='vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'vggvox-v2'`` - finetuned VGGVox V2.
        * ``'deep-speaker'`` - finetuned Deep Speaker.
    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        A quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.classification.load function
    """
[7]:
vggvox_v2 = malaya_speech.age_detection.deep_model(model = 'vggvox-v2')
deep_speaker = malaya_speech.age_detection.deep_model(model = 'deep-speaker')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:66: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:68: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:61: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
  warnings.warn('An interactive session is already active. This can '

Load Quantized deep model#

To load an 8-bit quantized model, simply pass quantized = True; the default is False.

We can expect a slight accuracy drop from a quantized model, and it is not necessarily faster than the normal 32-bit float model; it totally depends on the machine.

[8]:
quantized_vggvox_v2 = malaya_speech.age_detection.deep_model(model = 'vggvox-v2', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

How to classify age in an audio sample#

We are going to use VAD to help us. Instead of classifying the whole sample at once, we chunk it into multiple small samples and classify each chunk, as the sketch below outlines.
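At a high level the recipe is: frame the audio, run VAD on each frame, group contiguous voiced frames, then classify each voiced segment. For orientation, here is a minimal sketch of the same flow without the Pipeline abstraction; it assumes the vad model loaded in the next cell, and that vad.predict accepts a list of frames, as the batched pipeline below suggests.

# a minimal sketch of the chunk-and-classify recipe, without the Pipeline;
# `vad` is loaded in the next cell; `vad.predict` on a list of frames is
# an assumption based on the batched pipeline shown below.
frames = list(malaya_speech.utils.generator.frames(y, 30, sr))  # 30 ms frames
labels = vad.predict(frames)
grouped = malaya_speech.utils.group.group_frames(list(zip(frames, labels)))
grouped = malaya_speech.utils.group.group_frames_threshold(grouped, threshold_to_stop = 0.3)
ages = [vggvox_v2(g[0]) for g in grouped]  # classify each voiced segment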

[9]:
vad = malaya_speech.vad.deep_model(model = 'vggvox-v2')
[10]:
%%time

frames = list(malaya_speech.utils.generator.frames(y, 30, sr))
CPU times: user 2.47 ms, sys: 123 µs, total: 2.59 ms
Wall time: 2.63 ms
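Each frame is 30 ms, so 30 seconds of audio should yield roughly 1000 frames:

# sanity check: 30 s / 30 ms ≈ 1000 frames (the exact count depends on how
# the generator handles the final partial frame)
len(frames)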
[11]:
p = Pipeline()
pipeline = (
    p.batching(5)
    .foreach_map(vad.predict)
    .flatten()
)
p.visualize()
[11]:
_images/load-age-detection_22_0.png
[12]:
%%time

result = p.emit(frames)
result.keys()
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=512 is too small for input signal of length=480
  n_fft, y.shape[-1]
CPU times: user 32.7 s, sys: 5.6 s, total: 38.3 s
Wall time: 7.55 s
[12]:
dict_keys(['batching', 'predict', 'flatten'])
[13]:
frames_vad = [(frame, result['flatten'][no]) for no, frame in enumerate(frames)]
grouped_vad = malaya_speech.utils.group.group_frames(frames_vad)
grouped_vad = malaya_speech.utils.group.group_frames_threshold(grouped_vad, threshold_to_stop = 0.3)
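To sanity-check the grouping before classification, we can peek at the segment boundaries. A minimal sketch, assuming each Frame exposes timestamp and duration attributes in seconds:

# inspect the first few grouped segments; `timestamp` and `duration`
# attributes on Frame are assumptions, not confirmed by this tutorial.
for frame, is_speech in grouped_vad[:5]:
    print(round(frame.timestamp, 2), round(frame.duration, 2), is_speech)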
[14]:
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, figsize = (15, 3))
<Figure size 1500x300 with 1 Axes>
[15]:
p_vggvox_v2 = Pipeline()
pipeline = (
    p_vggvox_v2.foreach_map(vggvox_v2)
    .flatten()
)
p_vggvox_v2.visualize()
[15]:
_images/load-age-detection_26_0.png
[16]:
p_deep_speaker = Pipeline()
pipeline = (
    p_deep_speaker.foreach_map(deep_speaker)
    .flatten()
)
p_deep_speaker.visualize()
[16]:
_images/load-age-detection_27_0.png
[17]:
%%time

samples_vad = [g[0] for g in grouped_vad]
result_vggvox_v2 = p_vggvox_v2.emit(samples_vad)
result_vggvox_v2.keys()
CPU times: user 4.64 s, sys: 841 ms, total: 5.48 s
Wall time: 1.4 s
[17]:
dict_keys(['age-detection', 'flatten'])
[18]:
%%time

samples_vad = [g[0] for g in grouped_vad]
result_deep_speaker = p_deep_speaker.emit(samples_vad)
result_deep_speaker.keys()
CPU times: user 5.07 s, sys: 797 ms, total: 5.86 s
Wall time: 1.53 s
[18]:
dict_keys(['age-detection', 'flatten'])
[19]:
samples_vad_vggvox_v2 = [(frame, result_vggvox_v2['flatten'][no]) for no, frame in enumerate(samples_vad)]
samples_vad_vggvox_v2
[19]:
[(<malaya_speech.model.frame.Frame at 0x147039490>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x146fe2b90>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x146fe2a90>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x146fe2850>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x146fe2950>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045a50>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045a90>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045ad0>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045b50>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045b10>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045a10>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045b90>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x147045bd0>, 'sixties'),
 (<malaya_speech.model.frame.Frame at 0x147045c10>, 'seventies'),
 (<malaya_speech.model.frame.Frame at 0x147045c50>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x147045cd0>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045c90>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045d50>, 'thirties'),
 (<malaya_speech.model.frame.Frame at 0x147045d90>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045d10>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x147045e10>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045dd0>, 'teens')]
[20]:
samples_vad_deep_speaker = [(frame, result_deep_speaker['flatten'][no]) for no, frame in enumerate(samples_vad)]
samples_vad_deep_speaker
[20]:
[(<malaya_speech.model.frame.Frame at 0x147039490>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x146fe2b90>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x146fe2a90>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x146fe2850>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x146fe2950>, 'fifties'),
 (<malaya_speech.model.frame.Frame at 0x147045a50>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x147045a90>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045ad0>, 'sixties'),
 (<malaya_speech.model.frame.Frame at 0x147045b50>, 'sixties'),
 (<malaya_speech.model.frame.Frame at 0x147045b10>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045a10>, 'thirties'),
 (<malaya_speech.model.frame.Frame at 0x147045b90>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x147045bd0>, 'sixties'),
 (<malaya_speech.model.frame.Frame at 0x147045c10>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045c50>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x147045cd0>, 'sixties'),
 (<malaya_speech.model.frame.Frame at 0x147045c90>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045d50>, 'thirties'),
 (<malaya_speech.model.frame.Frame at 0x147045d90>, 'sixties'),
 (<malaya_speech.model.frame.Frame at 0x147045d10>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x147045e10>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045dd0>, 'teens')]
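Chunk-level predictions like these can be reduced to a single label per clip. A minimal sketch using a simple majority vote that skips 'not an age' chunks; this aggregation is our own illustration, not part of the malaya-speech API:

from collections import Counter

def majority_age(samples):
    # count predicted labels across voiced chunks, skipping 'not an age'
    votes = Counter(label for _, label in samples if label != 'not an age')
    return votes.most_common(1)[0][0] if votes else 'not an age'

majority_age(samples_vad_vggvox_v2), majority_age(samples_vad_deep_speaker)

On this clip, vggvox-v2 would vote 'teens' (10 of 22 chunks), while deep-speaker leans 'sixties' (5 of 22), so the two models clearly disagree.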
[21]:
import matplotlib.pyplot as plt
[22]:
nrows = 3
fig, ax = plt.subplots(nrows = nrows, ncols = 1)
fig.set_figwidth(20)
fig.set_figheight(nrows * 3)
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, ax = ax[0])
malaya_speech.extra.visualization.plot_classification(samples_vad_vggvox_v2,
                                                      'age detection vggvox v2', ax = ax[1])
malaya_speech.extra.visualization.plot_classification(samples_vad_deep_speaker,
                                                      'age detection deep speaker', ax = ax[2])
fig.tight_layout()
plt.show()
_images/load-age-detection_33_0.png
[23]:
p_quantized_vggvox_v2 = Pipeline()
pipeline = (
    p_quantized_vggvox_v2.foreach_map(quantized_vggvox_v2)
    .flatten()
)
p_quantized_vggvox_v2.visualize()
[23]:
_images/load-age-detection_34_0.png
[24]:
%%time

samples_vad = [g[0] for g in grouped_vad]
result_quantized_vggvox_v2 = p_quantized_vggvox_v2.emit(samples_vad)
result_quantized_vggvox_v2.keys()
CPU times: user 4.75 s, sys: 935 ms, total: 5.68 s
Wall time: 1.33 s
[24]:
dict_keys(['age-detection', 'flatten'])
[25]:
samples_vad_quantized_vggvox_v2 = [(frame, result_quantized_vggvox_v2['flatten'][no]) for no, frame in enumerate(samples_vad)]
samples_vad_quantized_vggvox_v2
[25]:
[(<malaya_speech.model.frame.Frame at 0x147039490>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x146fe2b90>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x146fe2a90>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x146fe2850>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x146fe2950>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045a50>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045a90>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045ad0>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045b50>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045b10>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045a10>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045b90>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x147045bd0>, 'sixties'),
 (<malaya_speech.model.frame.Frame at 0x147045c10>, 'seventies'),
 (<malaya_speech.model.frame.Frame at 0x147045c50>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x147045cd0>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045c90>, 'teens'),
 (<malaya_speech.model.frame.Frame at 0x147045d50>, 'thirties'),
 (<malaya_speech.model.frame.Frame at 0x147045d90>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045d10>, 'not an age'),
 (<malaya_speech.model.frame.Frame at 0x147045e10>, 'fourties'),
 (<malaya_speech.model.frame.Frame at 0x147045dd0>, 'teens')]
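To quantify what quantization changed, we can compare the two label sequences directly. A minimal sketch:

# fraction of chunks where the quantized model agrees with the float32
# model; 1.0 on this clip, since the two outputs above are identical.
agree = sum(
    a[1] == b[1]
    for a, b in zip(samples_vad_vggvox_v2, samples_vad_quantized_vggvox_v2)
)
agree / len(samples_vad_vggvox_v2)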
[26]:
nrows = 3
fig, ax = plt.subplots(nrows = nrows, ncols = 1)
fig.set_figwidth(20)
fig.set_figheight(nrows * 3)
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, ax = ax[0])
malaya_speech.extra.visualization.plot_classification(samples_vad_vggvox_v2,
                                                      'age detection vggvox v2', ax = ax[1])
malaya_speech.extra.visualization.plot_classification(samples_vad_quantized_vggvox_v2,
                                                      'age detection quantized vggvox v2', ax = ax[2])
fig.tight_layout()
plt.show()
_images/load-age-detection_37_0.png

Reference#

  1. The Singaporean White Boy - The Shan and Rozz Show: EP7, https://www.youtube.com/watch?v=HylaY5e1awo&t=2s&ab_channel=Clicknetwork

  2. Common Voice dataset, https://commonvoice.mozilla.org/
