Speaker Overlap Detection
This tutorial is available as an IPython notebook at malaya-speech/example/speaker-overlap.
This module is language independent, so it is safe to use on different languages.
This is an application of the malaya-speech Pipeline; read more about it at malaya-speech/example/pipeline.
Dataset#
The models were trained on the VoxCeleb V1 and LibriSpeech datasets with augmented noise.
[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
[2]:
y, sr = malaya_speech.load('speech/video/The-Singaporean-White-Boy.wav')
len(y), sr
[2]:
(1634237, 16000)
[3]:
# keep only the first 30 seconds
y = y[:sr * 30]
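As a quick sanity check on the numbers above, the clip length in seconds follows from the sample count and the sample rate. This is a minimal sketch that only reuses the values printed by the earlier cell; the variable names are illustrative.

```python
# Values reported by `len(y), sr` in the cell above.
n_samples = 1634237
sr = 16000

# Duration of the full clip in seconds.
duration = n_samples / sr

# Number of samples kept by y[:sr * 30], i.e. the first 30 seconds.
n_kept = sr * 30

print(f'full clip: {duration:.2f} s, kept: {n_kept} samples')
```

So the full recording is a little over 100 seconds long, and the slice keeps 480000 samples of it.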
[4]:
import IPython.display as ipd
ipd.Audio(y, rate = sr)
This audio was extracted from https://www.youtube.com/watch?v=HylaY5e1awo&t=2s
List available deep model#
[5]:
malaya_speech.speaker_overlap.available_model()
[5]:
Model | Size (MB) | Quantized Size (MB) | Accuracy |
---|---|---|---|
vggvox-v2 | 31.1 | 7.92 | 0.82861 |
speakernet | 20.3 | 5.18 | 0.80145 |
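To compare the two size columns, the ratio between them can be computed directly. This is a small sketch using only the figures from the table; the dictionary layout is just for illustration.

```python
# Sizes in MB copied from the table above: (normal size, quantized size).
sizes = {
    'vggvox-v2': (31.1, 7.92),
    'speakernet': (20.3, 5.18),
}

# Ratio of the normal model size to the quantized model size.
ratios = {name: full / quantized for name, (full, quantized) in sizes.items()}

for name, ratio in ratios.items():
    print(f'{name}: {ratio:.2f}x smaller when quantized')
```

Both ratios come out just under 4, which is roughly what one would expect if most of the file consists of weights stored in 8 bits instead of 32.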
Load deep model#
def deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs):
    """
    Load speaker overlap deep model.

    Parameters
    ----------
    model : str, optional (default='vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'vggvox-v2'`` - finetuned VGGVox V2.
        * ``'speakernet'`` - finetuned SpeakerNet.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.classification.load function
    """
[9]:
vggvox_v2 = malaya_speech.speaker_overlap.deep_model(model = 'vggvox-v2')
/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
warnings.warn('An interactive session is already active. This can '
Load Quantized deep model#
To load the 8-bit quantized model, simply pass quantized = True; the default is False.
Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.
[11]:
quantized_vggvox_v2 = malaya_speech.speaker_overlap.deep_model(model = 'vggvox-v2', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
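To give some intuition for why 8-bit quantization costs a little accuracy, here is a generic illustration of affine quantization on made-up weights. It is an assumption-laden sketch of the general idea, not the procedure malaya-speech uses to produce its quantized models.

```python
import random

# Hypothetical float weights standing in for one layer of a model.
random.seed(0)
weights = [random.gauss(0.0, 0.1) for _ in range(1000)]

# Affine 8-bit quantization: map [min, max] onto the integers 0..255.
w_min, w_max = min(weights), max(weights)
scale = (w_max - w_min) / 255.0
quantized = [round((w - w_min) / scale) for w in weights]

# Dequantize to see which values the quantized layer effectively uses.
restored = [q * scale + w_min for q in quantized]

# Rounding to the nearest level bounds the error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f'step = {scale:.6f}, max error = {max_error:.6f}')
```

Each weight moves by at most half a quantization step. These small perturbations accumulate through the network, which is one reason a quantized model can score slightly lower.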
How to use Speaker Overlap detection#
The speaker overlap detection models were trained on 100 ms frames, so split a sample into chunks of 100 ms each. Speaker overlap has only 2 classes, False and True.
[12]:
frames = list(malaya_speech.utils.generator.frames(y, 100, sr))
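The framing arithmetic behind the call above can be sketched in plain Python. `split_frames` is a hypothetical helper written for this illustration. The real `malaya_speech.utils.generator.frames` may differ, for example in what objects it yields or how it treats a trailing partial frame.

```python
def split_frames(samples, frame_ms, sample_rate):
    """Split a sequence into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]


sr = 16000
y = [0.0] * (sr * 30)  # 30 seconds of silence as a stand-in for real audio

frames = split_frames(y, 100, sr)
print(len(frames), len(frames[0]))
```

At 16 kHz a 100 ms frame holds 1600 samples, so a 30 second clip yields 300 frames. Each frame gets one prediction in the cells that follow.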
[13]:
%%time
probs_vggvox_v2 = [(frame, vggvox_v2.predict_proba([frame])[0, 1]) for frame in frames]
CPU times: user 25.4 s, sys: 5.78 s, total: 31.1 s
Wall time: 6.79 s
[14]:
%%time
probs_quantized_vggvox_v2 = [(frame, quantized_vggvox_v2.predict_proba([frame])[0, 1]) for frame in frames]
CPU times: user 26.5 s, sys: 6.04 s, total: 32.6 s
Wall time: 7.21 s
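Each entry collected above pairs a frame with the probability of the True (overlap) class. One simple way to turn such probabilities into decisions is to apply a threshold. The sketch below uses made-up probabilities and an assumed cut-off of 0.5, both purely for illustration.

```python
frame_ms = 100

# Hypothetical probabilities of the True (overlap) class, one per 100 ms frame.
probs = [0.05, 0.10, 0.80, 0.90, 0.70, 0.20, 0.10, 0.60, 0.95, 0.30]

threshold = 0.5  # assumed cut-off, tune it for your own data
is_overlap = [p > threshold for p in probs]

# Total time flagged as overlapped speech, in seconds.
overlap_seconds = sum(is_overlap) * frame_ms / 1000
print(is_overlap, overlap_seconds)
```

A higher threshold flags fewer frames as overlapped, and a lower threshold flags more. Which one is appropriate depends on the recording and on how costly false alarms are.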
[15]:
vad = malaya_speech.vad.deep_model(model = 'vggvox-v2')
frames = list(malaya_speech.utils.generator.frames(y, 30, sr))
p = Pipeline()
pipeline = (
    p.batching(5)
    .foreach_map(vad.predict)
    .flatten()
)
p.visualize()
[16]:
result = p.emit(frames)
result.keys()
/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=512 is too small for input signal of length=480
n_fft, y.shape[-1]
[16]:
dict_keys(['batching', 'predict', 'flatten'])
[17]:
frames_vad = [(frame, result['flatten'][no]) for no, frame in enumerate(frames)]
grouped_vad = malaya_speech.utils.group.group_frames(frames_vad)
grouped_vad = malaya_speech.utils.group.group_frames_threshold(grouped_vad, threshold_to_stop = 0.3)
grouped_vad = malaya_speech.utils.group.group_frames(grouped_vad)
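Conceptually, grouping merges consecutive frames that share the same label into longer segments. The following is a sketch of that idea on made-up labels, not the library's implementation. The threshold-based merging step is left out, and the tuple layout is an assumption made for readability.

```python
from itertools import groupby

frame_ms = 30  # frame length used for VAD above

# Hypothetical per-frame VAD labels (True = speech).
labels = [False, False, True, True, True, False, True, True]

segments = []
start = 0
for label, run in groupby(labels):
    n = len(list(run))
    # (start time in seconds, duration in seconds, label)
    segments.append((start * frame_ms / 1000, n * frame_ms / 1000, label))
    start += n

for segment in segments:
    print(segment)
```

Eight 30 ms frames collapse into four segments here, which is far easier to plot and inspect than the individual frames.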
[18]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
[19]:
nrows = 3
fig, ax = plt.subplots(nrows = nrows, ncols = 1)
fig.set_figwidth(20)
fig.set_figheight(nrows * 3)
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, ax = ax[0])
malaya_speech.extra.visualization.plot_classification(probs_vggvox_v2, 'speaker-overlap vggvox-v2',
                                                      yaxis = True, ax = ax[1])
malaya_speech.extra.visualization.plot_classification(probs_quantized_vggvox_v2,
                                                      'speaker-overlap quantized vggvox-v2',
                                                      yaxis = True, ax = ax[2])
fig.tight_layout()
plt.show()
Reference#
The Singaporean White Boy - The Shan and Rozz Show: EP7, https://www.youtube.com/watch?v=HylaY5e1awo&t=2s&ab_channel=Clicknetwork