Speaker Vector#

This tutorial is available as an IPython notebook at malaya-speech/example/speaker-vector.

This module is language independent, so it is safe to use across different languages. The pretrained models were trained on multiple languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:
from malaya_speech import Pipeline
import malaya_speech
import numpy as np

List available deep models#

[2]:
malaya_speech.speaker_vector.available_model()
INFO:root:tested on VoxCeleb2 test set. Lower EER is better.
[2]:
              Size (MB)  Quantized Size (MB)  Embedding Size      EER
deep-speaker       96.7                24.40           512.0  0.21870
vggvox-v1          70.8                17.70          1024.0  0.14070
vggvox-v2          43.2                 7.92           512.0  0.04450
speakernet         35.0                 8.88          7205.0  0.02122

The smaller the EER, the better the model.
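If available_model() returns a pandas DataFrame indexed by model name, as the printed output suggests, you can also pick the best architecture programmatically. A minimal sketch under that assumption:

df = malaya_speech.speaker_vector.available_model()

# pick the architecture with the lowest EER on the VoxCeleb2 test set
best = df['EER'].astype(float).idxmin()
print(best)  # expected: 'speakernet'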

Load deep model#

def deep_model(model: str = 'speakernet', quantized: bool = False, **kwargs):
    """
    Load Speaker2Vec model.

    Parameters
    ----------
    model : str, optional (default='speakernet')
        Model architecture supported. Allowed values:

        * ``'vggvox-v1'`` - VGGVox V1, embedding size 1024
        * ``'vggvox-v2'`` - VGGVox V2, embedding size 512
        * ``'deep-speaker'`` - Deep Speaker, embedding size 512
        * ``'speakernet'`` - SpeakerNet, embedding size 7205

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        A quantized model is not necessarily faster; it depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.classification.load function
    """
[3]:
model = malaya_speech.speaker_vector.deep_model('speakernet')

Load Quantized deep model#

To load the 8-bit quantized model, simply pass quantized = True; the default is False.

Expect a slight accuracy drop from the quantized model, and it is not necessarily faster than the normal 32-bit float model; this depends entirely on the machine.

[4]:
quantized_model = malaya_speech.speaker_vector.deep_model('speakernet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
  warnings.warn('An interactive session is already active. This can '
[5]:
from glob import glob

speakers = ['speech/example-speaker/khalil-nooh.wav',
'speech/example-speaker/mas-aisyah.wav',
'speech/example-speaker/shafiqah-idayu.wav',
'speech/example-speaker/husein-zolkepli.wav']
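Since a quantized model is not necessarily faster, it is worth timing both on your own machine. A minimal sketch, assuming both models are callable on a list of loaded audio the same way the Pipeline below invokes them:

from timeit import default_timer as timer

y = malaya_speech.load(speakers[0])[0]

start = timer()
model([y])
print('float32 seconds:', timer() - start)

start = timer()
quantized_model([y])
print('int8 seconds:', timer() - start)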

Pipeline#

[6]:
def load_wav(file):
    # malaya_speech.load returns (audio, sample_rate); keep the audio only
    return malaya_speech.load(file)[0]

p = Pipeline()
# load every input file, then run the speaker vector model on the whole batch
frame = p.foreach_map(load_wav).map(model)
[7]:
p.visualize()
[7]:
(graph visualization of the pipeline: load_wav mapped over each input, then the speaker vector model)
[8]:
r = p.emit(speakers)
[9]:
quantized_p = Pipeline()
quantized_frame = quantized_p.foreach_map(load_wav).map(quantized_model)
[10]:
quantized_r = quantized_p.emit(speakers)
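The Pipeline is a convenience; since map(model) simply calls the model on the list of loaded audio, the vectors can also be computed directly. A sketch, assuming the model object is callable on a list of float32 arrays:

wavs = [malaya_speech.load(f)[0] for f in speakers]
vectors = model(wavs)
vectors.shape  # expected (4, 7205) for speakernet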

Calculate similarity#

[11]:
from scipy.spatial.distance import cdist

1 - cdist(r['speaker-vector'], r['speaker-vector'], metric = 'cosine')
[11]:
array([[1.        , 0.85895569, 0.85036787, 0.919863  ],
       [0.85895569, 1.        , 0.88895719, 0.85086463],
       [0.85036787, 0.88895719, 1.        , 0.86070389],
       [0.919863  , 0.85086463, 0.86070389, 1.        ]])
[12]:
from scipy.spatial.distance import cdist

1 - cdist(quantized_r['speaker-vector'], quantized_r['speaker-vector'], metric = 'cosine')
[12]:
array([[1.        , 0.86325292, 0.8574443 , 0.92556189],
       [0.86325292, 1.        , 0.88897938, 0.85685812],
       [0.8574443 , 0.88897938, 1.        , 0.86453416],
       [0.92556189, 0.85685812, 0.86453416, 1.        ]])
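The quantized similarity matrix tracks the float32 one closely. To quantify the drift directly, a short sketch reusing the results above:

float32_sim = 1 - cdist(r['speaker-vector'], r['speaker-vector'], metric = 'cosine')
int8_sim = 1 - cdist(quantized_r['speaker-vector'], quantized_r['speaker-vector'], metric = 'cosine')

# largest element-wise difference, consistent with a slight accuracy drop
np.abs(float32_sim - int8_sim).max()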

Recall that our files are:

['speech/example-speaker/khalil-nooh.wav',
 'speech/example-speaker/mas-aisyah.wav',
 'speech/example-speaker/shafiqah-idayu.wav',
 'speech/example-speaker/husein-zolkepli.wav']

If we check the first row of the quantized result,

[1.        , 0.86325292, 0.8574443 , 0.92556189]

the second largest value is 0.92556189, in the fourth column, which corresponds to husein-zolkepli.wav. So the speaker vectors place khalil-nooh.wav closest to husein-zolkepli.wav, likely because both speakers are male.
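As a sanity check, the same cosine similarity for a single pair can be computed directly with numpy; a minimal sketch:

def cosine_similarity(a, b):
    # cosine similarity = dot(a, b) / (|a| * |b|); 1.0 means identical direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = r['speaker-vector']
cosine_similarity(v[0], v[3])  # ~0.9256, matching the first row above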

Reference#

  1. deep-speaker, https://github.com/philipperemy/deep-speaker, exported from Keras to TF checkpoint.

  2. vggvox-v1, https://github.com/linhdvu14/vggvox-speaker-identification, exported from Keras to TF checkpoint.

  3. vggvox-v2, https://github.com/WeidiXie/VGG-Speaker-Recognition, exported from Keras to TF checkpoint.

  4. speakernet, NVIDIA NeMo, https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_recognition, exported from PyTorch to TF checkpoint.

  5. VoxCeleb2, speaker verification dataset, http://www.robots.ox.ac.uk/~vgg/data/voxceleb/index.html#about