Speaker Diarization using Features#

This tutorial is available as an IPython notebook at malaya-speech/example/diarization-features.

This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.

This is an application of the malaya-speech Pipeline. Read more about the malaya-speech Pipeline at malaya-speech/example/pipeline.

What is the difference from Speaker Diarization#

The current speaker diarization module is documented at https://malaya-speech.readthedocs.io/en/latest/load-diarization.html

It requires a pipeline, VAD -> Group positive VADs -> Speaker models -> Clustering, and that pipeline depends on a really good VAD and good speaker models. What if we could cluster directly on STFT / features and then arrange the timestamps?

Inspired by khursani8, the flow becomes,

Wave -> STFT / Features -> Clustering -> arrange timestamps.

The features can be anything, such as,

  • MFCC

  • Melspectrogram

  • Conv
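The first step of the flow above, Wave -> STFT / Features, can be sketched with NumPy alone. This is only an illustration of turning a waveform into one feature vector per frame; `log_power_spectrogram` is a hypothetical helper (not part of malaya-speech), it uses no padding, and the frame sizes are assumptions chosen to match a 25 ms window with a 10 ms stride at 16 kHz.

```python
import numpy as np

def log_power_spectrogram(y, frame_length=400, frame_step=160, eps=1e-10):
    """Frame a 1-D waveform and return log power spectra, one row per frame.

    Hypothetical helper for illustration; not part of malaya_speech.
    """
    n_frames = 1 + (len(y) - frame_length) // frame_step
    # Index matrix: row i selects samples [i * step, i * step + frame_length)
    idx = np.arange(frame_length)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(frame_length)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000).astype(np.float32)  # 1 second of noise at 16 kHz
features = log_power_spectrogram(wave)
print(features.shape)  # (98, 201)
```

Each row of `features` is what gets clustered; MFCC or mel features would simply replace the per-frame spectrum with a more compact vector.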

from malaya_speech import Pipeline
import malaya_speech
import numpy as np
import matplotlib.pyplot as plt
Load audio sample#

y, sr = malaya_speech.load('speech/video/The-Singaporean-White-Boy.wav')
len(y), sr
(1634237, 16000)
# just going to take 60 seconds
y = y[:sr * 60]

This audio was extracted from https://www.youtube.com/watch?v=HylaY5e1awo&t=2s

Generate Log Melspectrogram#

You can use the interface malaya_speech.utils.featurization.STTFeaturizer,

class STTFeaturizer:
    def __init__(
        sample_rate: int, optional (default=16000)
        frame_ms: int, optional (default=25)
            To calculate `frame_length` for librosa STFT, `frame_length = int(sample_rate * (frame_ms / 1000))`
        stride_ms: int, optional (default=10)
            To calculate `frame_step` for librosa STFT, `frame_step = int(sample_rate * (stride_ms / 1000))`
        nfft: int, optional (default=None)
            If None, will calculate by `math.ceil(math.log2((frame_ms / 1000) * sample_rate))`
        num_feature_bins: int, optional (default=80)
            Size of output features.
        feature_type: str, optional (default='log_mel_spectrogram')
            Features type, allowed values:

            * ``'spectrogram'`` - np.square(np.abs(librosa.core.stft))
            * ``'mfcc'`` - librosa.feature.mfcc(np.square(np.abs(librosa.core.stft)))
            * ``'log_mel_spectrogram'`` - log(mel(np.square(np.abs(librosa.core.stft))))
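
The `frame_length` and `frame_step` formulas in the docstring above can be checked by hand. The sketch below evaluates them for the defaults and for the settings used later in this tutorial (`frame_ms = 50`, `stride_ms = 30`). The frame-count formula is an assumption (a centre-padded, librosa-style STFT), and both function names are hypothetical; under that assumption 60 seconds of audio gives 2001 frames, which agrees with the `(2001, 80)` shape reported further down.

```python
def stft_frame_sizes(sample_rate=16000, frame_ms=25, stride_ms=10):
    """Return (frame_length, frame_step) in samples, using the documented formulas."""
    frame_length = int(sample_rate * (frame_ms / 1000))
    frame_step = int(sample_rate * (stride_ms / 1000))
    return frame_length, frame_step

def centered_stft_frames(n_samples, frame_step):
    """Frame count of a centre-padded STFT (assumption: librosa-style center=True)."""
    return 1 + n_samples // frame_step

default_sizes = stft_frame_sizes()                              # (400, 160)
tutorial_sizes = stft_frame_sizes(frame_ms=50, stride_ms=30)    # (800, 480)
n_frames = centered_stft_frames(16000 * 60, tutorial_sizes[1])  # 2001
print(default_sizes, tutorial_sizes, n_frames)
```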

Spectral Clustering#

The spectralcluster package is a Python re-implementation of the spectral clustering algorithm in the paper Speaker Diarization with LSTM.
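
To give a feel for what spectral clustering does with per-frame features, here is a heavily simplified NumPy sketch: cosine affinity, eigen-decomposition, a maximum eigen-gap estimate of the number of clusters, then k-means on the top eigenvectors. It is not the spectralcluster implementation and it skips the affinity refinement steps described in the paper; all function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def cosine_affinity(x):
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

def estimate_n_clusters(eigvals_desc, max_clusters):
    """Pick k that maximises the ratio between eigenvalue k and eigenvalue k + 1."""
    ratios = eigvals_desc[:max_clusters] / eigvals_desc[1:max_clusters + 1]
    return int(np.argmax(ratios)) + 1

def kmeans(points, k, n_iter=20):
    """Tiny k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # Keep the previous centre if a cluster happens to be empty
        centers = np.stack([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels

# Synthetic "features": two groups of 20 frames around two orthogonal directions
rng = np.random.default_rng(0)
directions = np.eye(8)[:2]
x = np.concatenate([d + 0.05 * rng.standard_normal((20, 8)) for d in directions])

affinity = cosine_affinity(x)
eigvals, eigvecs = np.linalg.eigh(affinity)  # eigenvalues in ascending order
n_clusters = estimate_n_clusters(eigvals[::-1], max_clusters=5)
labels = kmeans(eigvecs[:, -n_clusters:], n_clusters)
print(n_clusters, labels)
```

On this toy data the eigen-gap step finds two clusters and the first 20 frames receive a different label from the last 20. Real audio features are far noisier, which is why the library exposes refinement and thresholding options.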

Make sure spectralcluster is already installed,

pip install spectralcluster
from spectralcluster import SpectralClusterer

clusterer = SpectralClusterer(

Clustering on log MelSpectrogram#


featurizer = malaya_speech.featurization.STTFeaturizer(feature_type = 'log_mel_spectrogram',
                                                      frame_ms = 50, stride_ms = 30)
features = featurizer(y)
CPU times: user 110 ms, sys: 31.2 ms, total: 141 ms
Wall time: 111 ms
(2001, 80)
from malaya_speech.utils.dist import l2_normalize

cluster_labels = clusterer.predict(l2_normalize(features))
frames = malaya_speech.arange.arange_frames(features, y, sr)
results = []
for no, result in enumerate(cluster_labels):
    results.append((frames[no], result))
grouped = malaya_speech.group.group_frames(results)
CPU times: user 15.3 s, sys: 1.56 s, total: 16.8 s
Wall time: 5.2 s
[(<malaya_speech.model.frame.Frame at 0x16b7e5250>, 1),
 (<malaya_speech.model.frame.Frame at 0x17b77d790>, 2),
 (<malaya_speech.model.frame.Frame at 0x17b77db10>, 1),
 (<malaya_speech.model.frame.Frame at 0x17b77d090>, 2),
 (<malaya_speech.model.frame.Frame at 0x17b77da50>, 1),
 (<malaya_speech.model.frame.Frame at 0x17b77d650>, 2),
 (<malaya_speech.model.frame.Frame at 0x16cece210>, 0),
 (<malaya_speech.model.frame.Frame at 0x16ceced90>, 1),
 (<malaya_speech.model.frame.Frame at 0x16cece450>, 0),
 (<malaya_speech.model.frame.Frame at 0x17b77de90>, 1),
 (<malaya_speech.model.frame.Frame at 0x16cece150>, 0),
 (<malaya_speech.model.frame.Frame at 0x16cece950>, 2),
 (<malaya_speech.model.frame.Frame at 0x16cecea90>, 0),
 (<malaya_speech.model.frame.Frame at 0x1856b8210>, 1)]
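
The output above alternates between labels, which is what you would expect if grouping merges consecutive frames that share a label into one longer segment. A minimal sketch of that idea in plain Python follows. It assumes this is roughly what `malaya_speech.group.group_frames` does; `group_consecutive` and the tuple layout are hypothetical, with times in integer milliseconds to keep the arithmetic exact.

```python
from itertools import groupby

def group_consecutive(frames):
    """Merge runs of consecutive frames that share a label.

    frames: list of (start_ms, duration_ms, label) tuples, ordered by time.
    Returns a list of (start_ms, duration_ms, label) segments.
    """
    segments = []
    for label, run in groupby(frames, key=lambda f: f[2]):
        run = list(run)
        start = run[0][0]
        end = run[-1][0] + run[-1][1]
        segments.append((start, end - start, label))
    return segments

frames_ms = [(0, 30, 1), (30, 30, 1), (60, 30, 2), (90, 30, 2), (120, 30, 1)]
segments = group_consecutive(frames_ms)
print(segments)  # [(0, 60, 1), (60, 60, 2), (120, 30, 1)]
```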

Clustering on TRILL#

The TRILL model was presented in “Towards Learning a Universal Non-Semantic Representation of Speech”. According to the paper, it exceeds state-of-the-art performance on a number of transfer learning tasks drawn from the non-semantic speech domain (speech emotion recognition, language identification, etc). It is trained on the publicly available AudioSet, and the model is hosted at https://tfhub.dev/google/nonsemantic-speech-benchmark/trill/3

import tensorflow_hub as hub
module = hub.load('https://tfhub.dev/google/nonsemantic-speech-benchmark/trill/3')
# 30 ms frames are used here; 60 seconds of audio gives 2000 frames
frames = malaya_speech.generator.frames(y, frame_duration_ms = 30)
from tqdm import tqdm

arrays = [f.array for f in frames]
embeddings = []
for i in tqdm(range(len(arrays))):
    e = module(arrays[i], sample_rate=16000)['embedding']
    # collect each frame's embedding so it can be concatenated afterwards
    embeddings.append(e)
100%|██████████| 2000/2000 [01:02<00:00, 32.23it/s]
concat = np.concatenate(embeddings, axis = 0)
(2000, 512)
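
The 2000 rows follow directly from the framing: 30 ms at 16 kHz is 480 samples, and 60 seconds of audio holds exactly 2000 such non-overlapping frames. The NumPy sketch below reproduces that count. It is an illustration only, not the `malaya_speech.generator.frames` implementation, and it drops any trailing partial frame, which is an assumption that does not matter here because the length divides evenly.

```python
import numpy as np

sr = 16000
frame_duration_ms = 30
samples_per_frame = sr * frame_duration_ms // 1000  # 480 samples per frame

y_demo = np.zeros(sr * 60, dtype=np.float32)  # stand-in for 60 seconds of audio
n_frames = len(y_demo) // samples_per_frame
# Drop any trailing partial frame, then reshape to (frames, samples)
frame_matrix = y_demo[: n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)
print(frame_matrix.shape)  # (2000, 480)
```

The STFT-based features earlier have one extra row (2001) because a centre-padded STFT also produces a frame at the very start of the signal.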
clusterer = SpectralClusterer(
    thresholding_soft_multiplier = 1.0,

cluster_labels = clusterer.predict(l2_normalize(concat))
frames = malaya_speech.arange.arange_frames(concat, y, sr)
results_trill = []
for no, result in enumerate(cluster_labels):
    results_trill.append((frames[no], result))
grouped_trill = malaya_speech.group.group_frames(results_trill)
CPU times: user 16.1 s, sys: 1.43 s, total: 17.5 s
Wall time: 4.28 s
[(<malaya_speech.model.frame.Frame at 0x16dacc210>, 0)]
nrows = 3
fig, ax = plt.subplots(nrows = nrows, ncols = 1)
fig.set_figheight(nrows * 3)
min_timestamp = min([i[0].timestamp for i in grouped])
max_timestamp = max([i[0].timestamp + i[0].duration for i in grouped])
ax[0].set_xlim((min_timestamp, max_timestamp))
ax[0].plot([i / sr for i in range(len(y))], y)
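# the call heads are missing from the source; plot_classification and its first argument are assumed here
malaya_speech.extra.visualization.plot_classification(grouped,
                                                      'diarization using spectral cluster', ax = ax[1],
                                                      x_text = 0.01)
malaya_speech.extra.visualization.plot_classification(grouped_trill,
                                                      'diarization using spectral cluster TRILL', ax = ax[2],
                                                      x_text = 0.01)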
/Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/extra/visualization.py:168: RuntimeWarning: invalid value encountered in true_divide
  std = (a - np.min(a)) / (np.max(a) - np.min(a))
import IPython.display as ipd

ipd.Audio(grouped[0][0].array, rate = sr)
ipd.Audio(grouped[1][0].array, rate = sr)
ipd.Audio(grouped[2][0].array, rate = sr)
ipd.Audio(grouped[3][0].array, rate = sr)