{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Speaker Diarization using Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/diarization-features](https://github.com/huseinzol05/malaya-speech/tree/master/example/diarization-features).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This is an application of the malaya-speech Pipeline; read more about it at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the difference from Speaker Diarization\n", "\n", "The current speaker diarization module, https://malaya-speech.readthedocs.io/en/latest/load-diarization.html, requires a pipeline, VAD -> Group positive VADs -> Speaker models -> Clustering, and that pipeline needs a really good VAD and really good speaker models. What if we could cluster directly on STFT / features and then arrange the timestamps?\n", "\n", "Inspired by [khursani8](https://github.com/khursani8),\n", "\n", "Wave -> STFT / Features -> Clustering -> arrange timestamps.\n", "\n", "The features can be anything, such as,\n", "\n", "- MFCC\n", "- Melspectrogram\n", "- Conv" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_addons/utils/ensure_tf_install.py:67: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.3.0 and strictly below 2.5.0 (nightly versions are not supported). \n", " The versions of TensorFlow you are currently using is 2.5.0 and is not supported. \n", "Some things might work, some things might not.\n", "If you were to encounter a bug, do not file an issue.\n", "If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
\n", "You can find the compatibility matrix in TensorFlow Addon's readme:\n", "https://github.com/tensorflow/addons\n", " UserWarning,\n", "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_addons/utils/resource_loader.py:103: UserWarning: You are currently using TensorFlow 2.5.0 and trying to load a custom op (custom_ops/seq2seq/_beam_search_ops.so).\n", "TensorFlow Addons has compiled its custom ops against TensorFlow 2.4.0, and there are no compatibility guarantees between the two versions. \n", "This means that you might get segfaults when loading the custom op, or other kind of low-level errors.\n", " If you do, do not file an issue on Github. This is a known limitation.\n", "\n", "It might help you to fallback to pure Python ops with TF_ADDONS_PY_OPS . To do that, see https://github.com/tensorflow/addons#gpucpu-custom-ops \n", "\n", "You can also change the TensorFlow version installed on your system. You would need a TensorFlow version equal to or above 2.4.0 and strictly below 2.5.0.\n", " Note that nightly versions of TensorFlow, as well as non-pip TensorFlow like `conda install tensorflow` or compiled from source are not supported.\n", "\n", "The last solution is to find the TensorFlow Addons version that has custom ops compatible with the TensorFlow installed on your system. 
To do that, refer to the readme: https://github.com/tensorflow/addons\n", " UserWarning,\n", "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/malaya_boilerplate-0.0.10-py3.7.egg/malaya_boilerplate/frozen_graph.py:28: UserWarning: Cannot import beam_search_ops from Tensorflow Addons, `deep_model` for stemmer will not available to use, make sure Tensorflow Addons version >= 0.12.0\n" ] } ], "source": [ "from malaya_speech import Pipeline\n", "import malaya_speech\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load audio sample" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1634237, 16000)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y, sr = malaya_speech.load('speech/video/The-Singaporean-White-Boy.wav')\n", "len(y), sr" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# keep only the first 60 seconds\n", "y = y[:sr * 60]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This audio was extracted from https://www.youtube.com/watch?v=HylaY5e1awo&t=2s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate Log Melspectrogram\n", "\n", "You can use the interface `malaya_speech.utils.featurization.STTFeaturizer`,\n", "\n", "```python\n", "class STTFeaturizer:\n", "    def __init__(\n", "        self,\n", "        sample_rate=16000,\n", "        frame_ms=25,\n", "        stride_ms=10,\n", "        nfft=None,\n", "        num_feature_bins=80,\n", "        feature_type='log_mel_spectrogram',\n", "        preemphasis=0.97,\n", "        dither=1e-5,\n", "        normalize_signal=True,\n", "        normalize_feature=True,\n", "        norm_per_feature=True,\n", "        **kwargs,\n", "    ):\n", "        \"\"\"\n", "        sample_rate: int, optional (default=16000)\n", "        frame_ms: int, optional (default=25)\n", "            To calculate `frame_length` for librosa STFT, `frame_length = int(sample_rate * (frame_ms / 1000))`\n", "        stride_ms: int, optional (default=10)\n", "            To calculate `frame_step` for librosa STFT, `frame_step = int(sample_rate * (stride_ms / 1000))`\n", "        nfft: int, optional (default=None)\n", "            If None, will calculate by `math.ceil(math.log2((frame_ms / 1000) * sample_rate))`\n", "        num_feature_bins: int, optional (default=80)\n", "            Size of output features.\n", "        feature_type: str, optional (default='log_mel_spectrogram')\n", "            Features type, allowed values:\n", "\n", "            * ``'spectrogram'`` - np.square(np.abs(librosa.core.stft))\n", "            * ``'mfcc'`` - librosa.feature.mfcc(np.square(np.abs(librosa.core.stft)))\n", "            * ``'log_mel_spectrogram'`` - log(mel(np.square(np.abs(librosa.core.stft))))\n", "        \"\"\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Spectral Clustering\n", "\n", "The `spectralcluster` package is a Python re-implementation of the spectral clustering algorithm in the paper [Speaker Diarization with LSTM](https://google.github.io/speaker-id/publications/LstmDiarization/).\n", "\n", "So, make sure you have already installed [spectralcluster](https://pypi.org/project/spectralcluster/),\n", "\n", "```bash\n", "pip install spectralcluster\n", "```" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "from spectralcluster import SpectralClusterer\n", "\n", "clusterer = SpectralClusterer(\n", "    min_clusters=1,\n", "    max_clusters=100,\n", "    p_percentile=0.95,\n", "    gaussian_blur_sigma=30.0,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Clustering on log MelSpectrogram" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 110 ms, sys: 31.2 ms, total: 141 ms\n", "Wall time: 111 ms\n" ] }, { "data": { "text/plain": [ "(2001, 80)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", "featurizer = 
malaya_speech.featurization.STTFeaturizer(feature_type = 'log_mel_spectrogram',\n", "    frame_ms = 50, stride_ms = 30)\n", "features = featurizer(y)\n", "features.shape" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "from malaya_speech.utils.dist import l2_normalize" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 15.3 s, sys: 1.56 s, total: 16.8 s\n", "Wall time: 5.2 s\n" ] } ], "source": [ "%%time\n", "\n", "cluster_labels = clusterer.predict(l2_normalize(features))\n", "frames = malaya_speech.arange.arange_frames(features, y, sr)\n", "results = []\n", "for no, result in enumerate(cluster_labels):\n", "    results.append((frames[no], result))\n", "grouped = malaya_speech.group.group_frames(results)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(, 1),\n", " (, 2),\n", " (, 1),\n", " (, 2),\n", " (, 1),\n", " (, 2),\n", " (, 0),\n", " (, 1),\n", " (, 0),\n", " (, 1),\n", " (, 0),\n", " (, 2),\n", " (, 0),\n", " (, 1)]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Clustering on TRILL\n", "\n", "The TRILL model was presented in \"Towards Learning a Universal Non-Semantic Representation of Speech\". According to the paper, it exceeds state-of-the-art performance on a number of transfer learning tasks drawn from the non-semantic speech domain (speech emotion recognition, language identification, etc). 
It is trained on the publicly available AudioSet and is hosted at https://tfhub.dev/google/nonsemantic-speech-benchmark/trill/3" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import tensorflow_hub as hub\n", "module = hub.load('https://tfhub.dev/google/nonsemantic-speech-benchmark/trill/3')" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 30 ms frames seem good enough; 60 seconds of audio gives 2000 frames\n", "frames = malaya_speech.generator.frames(y, frame_duration_ms = 30)\n", "len(frames)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 2000/2000 [01:02<00:00, 32.23it/s]\n" ] } ], "source": [ "from tqdm import tqdm\n", "\n", "arrays = [f.array for f in frames]\n", "embeddings = []\n", "for i in tqdm(range(len(arrays))):\n", "    e = module(arrays[i], sample_rate=16000)['embedding']\n", "    embeddings.append(e)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2000, 512)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "concat = np.concatenate(embeddings, axis = 0)\n", "concat.shape" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "clusterer = SpectralClusterer(\n", "    min_clusters=1,\n", "    max_clusters=100,\n", "    p_percentile=0.95,\n", "    gaussian_blur_sigma=1.0,\n", "    thresholding_soft_multiplier = 1.0,\n", ")" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 16.1 s, sys: 1.43 s, total: 17.5 s\n", "Wall time: 4.28 s\n" ] }, { "data": { "text/plain": [ "[(, 0)]" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "\n", 
"cluster_labels = clusterer.predict(l2_normalize(concat))\n", "frames = malaya_speech.arange.arange_frames(concat, y, sr)\n", "results_trill = []\n", "for no, result in enumerate(cluster_labels):\n", " results_trill.append((frames[no], result))\n", "grouped_trill = malaya_speech.group.group_frames(results_trill)\n", "grouped_trill" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/extra/visualization.py:168: RuntimeWarning: invalid value encountered in true_divide\n", " std = (a - np.min(a)) / (np.max(a) - np.min(a))\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "nrows = 3\n", "fig, ax = plt.subplots(nrows = nrows, ncols = 1)\n", "fig.set_figwidth(20)\n", "fig.set_figheight(nrows * 3)\n", "min_timestamp = min([i[0].timestamp for i in grouped])\n", "max_timestamp = max([i[0].timestamp + i[0].duration for i in grouped])\n", "ax[0].set_xlim((min_timestamp, max_timestamp))\n", "ax[0].plot([i / sr for i in range(len(y))], y)\n", "malaya_speech.extra.visualization.plot_classification(grouped, \n", " 'diarization using spectral cluster', ax = ax[1],\n", " x_text = 0.01)\n", "malaya_speech.extra.visualization.plot_classification(grouped_trill, \n", " 'diarization using spectral cluster TRILL', ax = ax[2],\n", " x_text = 0.01)\n", "fig.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import IPython.display as ipd\n", "\n", "ipd.Audio(grouped[0][0].array, rate = sr)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(grouped[1][0].array, rate = sr)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(grouped[2][0].array, rate = sr)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ipd.Audio(grouped[3][0].array, rate = sr)" ] } ], 
"metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }