Speaker Vector#

This tutorial is available as an IPython notebook at malaya-speech/example/speaker-vector.

This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''
[2]:
from malaya_speech import Pipeline
import malaya_speech
import numpy as np
`pyaudio` is not available, `malaya_speech.streaming.stream` is not able to use.
[3]:
import logging

logging.basicConfig(level=logging.INFO)

List available deep model#

[4]:
malaya_speech.speaker_vector.available_model()
INFO:malaya_speech.speaker_vector:tested on VoxCeleb2 test set. Lower EER is better.
INFO:malaya_speech.speaker_vector:download the test set at https://github.com/huseinzol05/malaya-speech/tree/master/data/voxceleb
[4]:
                 Size (MB)  Quantized Size (MB)  Embedding Size  EER
deep-speaker     96.7       24.40                512.0           0.21870
vggvox-v1        70.8       17.70                1024.0          0.13944
vggvox-v2        43.2       7.92                 512.0           0.04460
conformer-base   99.4       27.20                512.0           0.06938
conformer-tiny   20.3       6.21                 512.0           0.08687

The smaller the EER, the better the model.
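EER (Equal Error Rate) is the operating point where the false-accept rate (FAR) equals the false-reject rate (FRR). A minimal NumPy sketch of how EER could be computed from raw verification scores; the scores and labels below are made-up illustrations, not taken from the VoxCeleb2 test set:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER: sweep thresholds over the observed scores and
    return the average of FAR and FRR at the threshold where the two
    rates are closest."""
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for t in thresholds:
        accepts = scores >= t
        fars.append(np.mean(accepts[labels == 0]))   # false accepts among impostor pairs
        frrs.append(np.mean(~accepts[labels == 1]))  # false rejects among genuine pairs
    fars, frrs = np.array(fars), np.array(frrs)
    i = np.argmin(np.abs(fars - frrs))
    return (fars[i] + frrs[i]) / 2

# Toy example: genuine pairs score high, impostor pairs score low.
scores = np.array([0.92, 0.85, 0.15, 0.22])
labels = np.array([1, 1, 0, 0])
print(equal_error_rate(scores, labels))  # 0.0 for perfectly separable scores
```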

Load deep model#

def deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs):
    """
    Load Speaker2Vec model.

    Parameters
    ----------
    model : str, optional (default='vggvox-v2')
        Check available models at `malaya_speech.speaker_vector.available_model()`.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.supervised.classification.load function
    """
[4]:
model = malaya_speech.speaker_vector.deep_model('conformer-base')
2023-01-27 23:13:07.304958: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-27 23:13:07.314286: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-01-27 23:13:07.314317: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: husein-MS-7D31
2023-01-27 23:13:07.314322: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: husein-MS-7D31
2023-01-27 23:13:07.314421: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2023-01-27 23:13:07.314615: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 470.161.3
[16]:
from glob import glob

speakers = [
    'speech/example-speaker/khalil-nooh.wav',
    'speech/example-speaker/mas-aisyah.wav',
    'speech/example-speaker/shafiqah-idayu.wav',
    'speech/example-speaker/husein-zolkepli.wav',
]

Pipeline#

[17]:
def load_wav(file):
    return malaya_speech.load(file)[0]

p = Pipeline()
frame = p.foreach_map(load_wav).map(model)
[18]:
p.visualize()
[18]:
_images/load-speaker-vector_15_0.png
[19]:
r = p(speakers)
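Conceptually, the pipeline above is just two mapped stages: load each file into a waveform, then embed each waveform with the model. A plain-Python equivalent of that data flow (the function names here are illustrative stand-ins, not part of the malaya-speech API):

```python
def run_pipeline(files, load_fn, embed_fn):
    # Stage 1: foreach_map(load_wav) -> one waveform per file.
    waveforms = [load_fn(f) for f in files]
    # Stage 2: map(model) -> one speaker vector per waveform.
    return {'speaker-vector': [embed_fn(w) for w in waveforms]}

# Toy stand-ins to show the shape of the data flow.
fake_load = lambda f: [1, 2, 3]       # pretend waveform
fake_embed = lambda w: [sum(w)]       # pretend embedding
out = run_pipeline(['a.wav', 'b.wav'], fake_load, fake_embed)
print(out)  # {'speaker-vector': [[6], [6]]}
```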

Calculate similarity#

[20]:
from scipy.spatial.distance import cdist

1 - cdist(r['speaker-vector'], r['speaker-vector'], metric = 'cosine')
[20]:
array([[ 1.        , -0.40519101, -0.35340283,  0.41143028],
       [-0.40519101,  1.        ,  0.49214888, -0.43672686],
       [-0.35340283,  0.49214888,  1.        , -0.27411077],
       [ 0.41143028, -0.43672686, -0.27411077,  1.        ]])

As a reminder, our input files are:

['speech/example-speaker/khalil-nooh.wav',
 'speech/example-speaker/mas-aisyah.wav',
 'speech/example-speaker/shafiqah-idayu.wav',
 'speech/example-speaker/husein-zolkepli.wav']
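The matrix above is pairwise cosine similarity between the embeddings: entry (i, j) compares the i-th and j-th files in the list, and the diagonal compares each file with itself, so it is always 1. A NumPy-only sketch of the same computation, using random vectors in place of real speaker embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((4, 512))  # stand-in for r['speaker-vector']

# Cosine similarity = dot product of L2-normalized vectors,
# equivalent to 1 - cdist(..., metric='cosine').
normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = normed @ normed.T

print(np.round(np.diag(similarity), 6))  # each vector matches itself: all 1.0
```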