Unsupervised clustering using HMM#

This tutorial is available as an IPython notebook at malaya-speech/example/diarization-clustering-hmm.

This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.

This is an application of malaya-speech Pipeline. Read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:
from malaya_speech import Pipeline
import malaya_speech
import numpy as np
import matplotlib.pyplot as plt

Load Speaker Vector#

To measure how similar two speakers are, we can use speaker vectors, loaded with malaya_speech.speaker_vector.deep_model. Read more about malaya-speech Speaker Vector at https://malaya-speech.readthedocs.io/en/latest/load-speaker-vector.html

We are going to compare conformer-base and vggvox-v2.

[2]:
model_conformer = malaya_speech.speaker_vector.deep_model('conformer-base')
model_vggvox2 = malaya_speech.speaker_vector.deep_model('vggvox-v2')
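
To make the idea of speaker similarity concrete, here is a minimal sketch of cosine similarity between two vectors, written in plain Python. The three-dimensional vectors are made-up toy values, not output from either model, and `cosine_similarity` is a helper defined here for illustration rather than a malaya-speech function.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "speaker vectors" (made-up values; real vectors are much longer).
speaker_a_utt1 = [0.9, 0.1, 0.0]
speaker_a_utt2 = [0.8, 0.2, 0.1]
speaker_b_utt1 = [0.0, 0.2, 0.9]

same = cosine_similarity(speaker_a_utt1, speaker_a_utt2)
different = cosine_similarity(speaker_a_utt1, speaker_b_utt1)
print(same > different)
```

Vectors that point in nearly the same direction score close to 1, which is why two utterances from the same speaker are expected to score higher than utterances from different speakers.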

Load audio sample#

[3]:
y, sr = malaya_speech.load('speech/video/The-Singaporean-White-Boy.wav')
len(y), sr
[3]:
(1634237, 16000)
[4]:
# just going to take 60 seconds
y = y[:sr * 60]

This audio was extracted from https://www.youtube.com/watch?v=HylaY5e1awo&t=2s
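
The slice `y[:sr * 60]` works because the sample rate is the number of samples per second. The sketch below spells out that arithmetic with the sample count and sample rate reported in the cell output above; `duration_seconds` and `first_seconds` are hypothetical helpers, not part of malaya-speech.

```python
def duration_seconds(num_samples, sample_rate):
    """Length of a signal in seconds."""
    return num_samples / sample_rate


def first_seconds(samples, sample_rate, seconds):
    """Keep only the first `seconds` seconds of a sample sequence."""
    return samples[: int(sample_rate * seconds)]


# Sample count and sample rate taken from the cell output above.
total = duration_seconds(1634237, 16000)

# Toy signal: 100 samples at 10 samples per second, keep the first 6 seconds.
kept = first_seconds(list(range(100)), sample_rate=10, seconds=6)
print(total, len(kept))
```

The full recording is a little over 102 seconds long, so keeping 60 seconds retains 60 * 16000 = 960000 samples.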

Load VAD#

We need a VAD (voice activity detection) module to find which parts of the audio sample contain speech. Read more about VAD at https://malaya-speech.readthedocs.io/en/latest/load-vad.html

[5]:
vad = malaya_speech.vad.deep_model(model = 'vggvox-v2')
[6]:
frames = list(malaya_speech.utils.generator.frames(y, 30, sr))
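
As a rough illustration of what cutting the signal into 30 ms frames means, the sketch below splits a list of samples into fixed-size chunks. It is not the malaya-speech implementation: `samples_per_frame` and `split_frames` are hypothetical names, and this version drops a trailing partial frame, which may differ from the library's behaviour.

```python
def samples_per_frame(sample_rate, frame_duration_ms):
    """Number of samples in one frame of the given duration."""
    return int(sample_rate * frame_duration_ms / 1000)


def split_frames(samples, sample_rate, frame_duration_ms):
    """Yield (start_time_in_seconds, chunk) for every full frame."""
    size = samples_per_frame(sample_rate, frame_duration_ms)
    if size <= 0:
        raise ValueError("frame duration is too short for this sample rate")
    for start in range(0, len(samples) - size + 1, size):
        yield start / sample_rate, samples[start:start + size]


# 30 ms at 16 kHz is 480 samples per frame.
size = samples_per_frame(16000, 30)

# Toy signal of 4800 zero samples gives 10 full frames.
toy_frames = list(split_frames([0.0] * 4800, 16000, 30))
print(size, len(toy_frames))
```

With 60 seconds of audio at 16 kHz, the same arithmetic gives 960000 / 480 = 2000 frames.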
[7]:
p = Pipeline()
pipeline = (
    p.batching(5)
    .foreach_map(vad.predict)
    .flatten()
)
p.visualize()
[7]:
_images/load-diarization-clustering-hmm_14_0.png
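
The three pipeline stages can be pictured with ordinary list operations. The sketch below is an assumption about what `batching(5)`, `foreach_map` and `flatten` do conceptually, not the Pipeline implementation. `toy_predict` is a hypothetical stand-in for `vad.predict` that simply thresholds the mean absolute amplitude of each frame.

```python
def batching(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def foreach_map(batches, fn):
    """Apply `fn` to every batch."""
    return [fn(batch) for batch in batches]


def flatten(nested):
    """Concatenate per-batch results back into one flat list."""
    return [item for batch in nested for item in batch]


def toy_predict(batch):
    # Hypothetical stand-in for a VAD model: a frame counts as speech
    # when its mean absolute amplitude exceeds 0.1.
    return [sum(abs(s) for s in frame) / len(frame) > 0.1 for frame in batch]


toy_frames = [
    [0.0, 0.0], [0.5, -0.4], [0.3, 0.2], [0.0, 0.01],
    [0.9, 0.8], [0.0, 0.0], [0.2, 0.3],
]
batches = batching(toy_frames, 5)
labels = flatten(foreach_map(batches, toy_predict))
print(len(batches), labels)
```

Batching lets the model process several frames per call, and flattening restores one prediction per input frame, in the original order.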
[8]:
%%time

result = p(frames)
result.keys()
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=512 is too small for input signal of length=480
  n_fft, y.shape[-1]
CPU times: user 1min 2s, sys: 38.7 s, total: 1min 41s
Wall time: 21.8 s
[8]:
dict_keys(['batching', 'predict', 'flatten'])
[9]:
frames_vad = [(frame, result['flatten'][no]) for no, frame in enumerate(frames)]
grouped_vad = malaya_speech.utils.group.group_frames(frames_vad)
grouped_vad = malaya_speech.utils.group.group_frames_threshold(grouped_vad, threshold_to_stop = 0.3)
[10]:
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, figsize = (15, 3))
_images/load-diarization-clustering-hmm_17_0.png
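
The grouping step can be sketched as merging runs of consecutive frames that share a VAD label. This is a simplified illustration that assumes this is what `group_frames` does; the effect of `threshold_to_stop` is not modelled here. `group_consecutive` and the `(start, duration, label)` tuples are hypothetical and stand in for the library's frame objects.

```python
from itertools import groupby


def group_consecutive(labelled_frames):
    """Merge runs of consecutive frames with the same label.

    `labelled_frames` is a list of (start_seconds, duration_seconds, label).
    Returns one (start_seconds, total_duration_seconds, label) per run.
    """
    grouped = []
    for label, run in groupby(labelled_frames, key=lambda item: item[2]):
        run = list(run)
        start = run[0][0]
        duration = sum(item[1] for item in run)
        grouped.append((start, duration, label))
    return grouped


# Six toy 30 ms frames: silence, silence, speech, speech, speech, silence.
toy = [
    (0.00, 0.03, False), (0.03, 0.03, False), (0.06, 0.03, True),
    (0.09, 0.03, True), (0.12, 0.03, True), (0.15, 0.03, False),
]
grouped = group_consecutive(toy)
print(grouped)
```

Grouping turns thousands of short frames into a handful of longer segments, which gives the speaker vector model more audio per segment to work with.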

Load Hidden Markov model#

Make sure hmmlearn is installed. If it is not, install it with:

pip3 install hmmlearn
class HiddenMarkovModelClustering:
    def __init__(
        self,
        min_clusters: int,
        max_clusters: int,
        metric: str = 'cosine',
        covariance_type: str = 'diag',
        threshold: float = 0.35,
        single_cluster_detection_quantile: float = 0.05,
        single_cluster_detection_threshold: float = 1.15,
    ):
        """
        Load malaya-speech HiddenMarkovModel, originally from pyannote, https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/pipelines/clustering.py

        Parameters
        ----------
        min_clusters: int
            minimum number of clusters, must be bigger than 0.
        max_clusters: int
            maximum number of clusters, must be equal to or bigger than `min_clusters`.
            if equal to `min_clusters`, the HMM is fitted directly without searching for the best number of clusters.
        metric: str, optional (default='cosine')
            Only support `cosine` and `euclidean`.
        covariance_type: str, optional (default='diag')
            Covariance type passed to hmmlearn GaussianHMM, see https://hmmlearn.readthedocs.io/en/latest/api.html#gaussianhmm
        threshold: float, optional (default=0.35)
            minimum threshold to assume current iteration of cluster is the best fit.
        """
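
The constraints stated in the docstring can be written down as a small check. This is only a sketch of those documented rules: `validate_clustering_args` is a hypothetical helper and not part of malaya-speech, which may validate its arguments differently.

```python
def validate_clustering_args(min_clusters, max_clusters, metric="cosine"):
    """Check the documented constraints.

    Returns True when a search over the number of clusters is needed,
    and False when both bounds are equal and the count is already fixed.
    """
    if min_clusters < 1:
        raise ValueError("min_clusters must be bigger than 0")
    if max_clusters < min_clusters:
        raise ValueError("max_clusters must be equal to or bigger than min_clusters")
    if metric not in ("cosine", "euclidean"):
        raise ValueError("metric must be 'cosine' or 'euclidean'")
    return max_clusters != min_clusters


needs_search = validate_clustering_args(1, 20)
fixed_count = validate_clustering_args(3, 3)
print(needs_search, fixed_count)
```

Setting both bounds to the same value is the way to request a fixed number of speakers when it is known in advance.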

To get better results using HMM, set norm_function=None and log_distance_metric=None.

[11]:
from malaya_speech.model.clustering import HiddenMarkovModelClustering
[12]:
hmm_model = HiddenMarkovModelClustering(min_clusters = 1, max_clusters = 20)
[13]:
result_diarization_hmm_conformer = malaya_speech.diarization.clustering(
    vad_results = grouped_vad,
    speaker_vector = model_conformer,
    model = hmm_model,
    norm_function = None,
)
result_diarization_hmm_conformer[:5]
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/hmmlearn/hmm.py:273: RuntimeWarning: invalid value encountered in true_divide
  / (means_weight + denom))
Some rows of transmat_ have zero sum because no transition from the state was ever observed.
[13]:
[(<malaya_speech.model.frame.Frame at 0x184c72150>, 'not a speaker'),
 (<malaya_speech.model.frame.Frame at 0x184c863d0>, 'not a speaker'),
 (<malaya_speech.model.frame.Frame at 0x184c86390>, 'speaker 1'),
 (<malaya_speech.model.frame.Frame at 0x184c86450>, 'speaker 3'),
 (<malaya_speech.model.frame.Frame at 0x184c86490>, 'speaker 3')]
[14]:
result_diarization_hmm_vggvox2 = malaya_speech.diarization.clustering(
    vad_results = grouped_vad,
    speaker_vector = model_vggvox2,
    model = hmm_model,
    norm_function = None,
)
result_diarization_hmm_vggvox2[:5]
[14]:
[(<malaya_speech.model.frame.Frame at 0x184c72150>, 'not a speaker'),
 (<malaya_speech.model.frame.Frame at 0x184c863d0>, 'not a speaker'),
 (<malaya_speech.model.frame.Frame at 0x184c86390>, 'speaker 0'),
 (<malaya_speech.model.frame.Frame at 0x184c86450>, 'speaker 0'),
 (<malaya_speech.model.frame.Frame at 0x184c86490>, 'speaker 0')]
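
Before plotting, it can help to summarise a diarization result numerically. The sketch below counts segments per label and sums speaking time per speaker. It assumes each segment object exposes a `duration` attribute in seconds; `ToySegment` is a hypothetical stand-in for the library's frame objects, and the durations are made-up values.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a frame object with a start time and a duration.
ToySegment = namedtuple("ToySegment", ["timestamp", "duration"])


def label_counts(result):
    """Number of segments assigned to each label."""
    return Counter(label for _, label in result)


def speaking_time(result, ignore=("not a speaker",)):
    """Total duration in seconds per label, skipping ignored labels."""
    totals = {}
    for segment, label in result:
        if label in ignore:
            continue
        totals[label] = totals.get(label, 0.0) + segment.duration
    return totals


# Labels mirror the first five rows of the conformer output above.
toy_result = [
    (ToySegment(0.0, 0.5), "not a speaker"),
    (ToySegment(0.5, 0.25), "not a speaker"),
    (ToySegment(0.75, 1.0), "speaker 1"),
    (ToySegment(1.75, 0.5), "speaker 3"),
    (ToySegment(2.25, 0.75), "speaker 3"),
]
counts = label_counts(toy_result)
totals = speaking_time(toy_result)
print(counts, totals)
```

Running the same summary on both results gives a quick way to see how many speakers each speaker vector model ends up with.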
[15]:
nrows = 3
fig, ax = plt.subplots(nrows = nrows, ncols = 1)
fig.set_figwidth(20)
fig.set_figheight(nrows * 3)
malaya_speech.extra.visualization.visualize_vad(y, grouped_vad, sr, ax = ax[0])
malaya_speech.extra.visualization.plot_classification(result_diarization_hmm_conformer,
                                                      'conformer + hmm', ax = ax[1],
                                                     x_text = 0.01)
malaya_speech.extra.visualization.plot_classification(result_diarization_hmm_vggvox2,
                                                      'vggvox2 + hmm', ax = ax[2],
                                                     x_text = 0.01)
fig.tight_layout()
plt.show()
_images/load-diarization-clustering-hmm_23_0.png