Speaker Vector HuggingFace#

This tutorial is available as an IPython notebook at malaya-speech/example/speaker-vector-huggingface.

This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.

This is an application of the malaya-speech Pipeline. Read more about the malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:

import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

[2]:

from malaya_speech import Pipeline
import malaya_speech
import numpy as np

`pyaudio` is not available, `malaya_speech.streaming.pyaudio_vad.stream` is not able to use.

[3]:

import logging

logging.basicConfig(level=logging.INFO)

List available HuggingFace models#

[4]:

malaya_speech.speaker_vector.available_huggingface()

INFO:malaya_speech.speaker_vector:tested on VoxCeleb2 test set. Lower EER is better.
INFO:malaya_speech.speaker_vector:download the test set at https://github.com/huseinzol05/malaya-speech/tree/master/data/voxceleb

[4]:

Model	Size (MB)	Embedding Size	EER
microsoft/wavlm-base-sv	405.0	512.0	0.078274
microsoft/wavlm-base-plus-sv	405.0	512.0	0.066884
microsoft/unispeech-sat-large-sv	1290.0	512.0	0.203277
microsoft/unispeech-sat-base-sv	404.0	512.0	0.078282
microsoft/unispeech-sat-base-plus-sv	404.0	512.0	0.076128

The lower the EER (equal error rate), the better the model.
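EER is the operating point at which the false acceptance rate equals the false rejection rate as the decision threshold is varied. The sketch below approximates it in plain Python on made-up verification trials. The function name `equal_error_rate` and the toy scores are illustrative assumptions, not the evaluation code behind the table above.

```python
def equal_error_rate(scores, labels):
    """Approximate EER from similarity scores and 0/1 same-speaker labels."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError('need both same-speaker and different-speaker trials')

    best_gap, best_eer = None, None
    for threshold in sorted(set(scores)):
        # A trial is accepted as "same speaker" when score >= threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy trials (invented for illustration): label 1 = same speaker.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

On a real test set the scores would typically be cosine similarities between pairs of speaker vectors, and a finer threshold sweep or interpolation would give a smoother estimate.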

Load HuggingFace model#

def huggingface(
    model: str = 'microsoft/wavlm-base-plus-sv',
    force_check: bool = True,
    **kwargs,
):
    """
    Load fine-tuned models from HuggingFace.

    Parameters
    ----------
    model : str, optional (default='microsoft/wavlm-base-plus-sv')
        Check available models at `malaya_speech.speaker_vector.available_huggingface()`.
    force_check : bool, optional (default=True)
        Check that the model is one of the malaya-speech supported models.
        Set to False if you want to load your own HuggingFace model.

    Returns
    -------
    result : malaya_speech.torch_model.huggingface.XVector class
    """

[5]:

model = malaya_speech.speaker_vector.huggingface()
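The default model, `microsoft/wavlm-base-plus-sv`, has the lowest EER in the table above. If you would rather choose a model name programmatically, for example under a download-size budget, something like the following sketch could work. The `MODELS` dictionary is copied by hand from the table, and `pick_model` is a hypothetical helper, not part of malaya-speech.

```python
# Values copied from the table printed by available_huggingface() above.
MODELS = {
    'microsoft/wavlm-base-sv': {'size_mb': 405.0, 'eer': 0.078274},
    'microsoft/wavlm-base-plus-sv': {'size_mb': 405.0, 'eer': 0.066884},
    'microsoft/unispeech-sat-large-sv': {'size_mb': 1290.0, 'eer': 0.203277},
    'microsoft/unispeech-sat-base-sv': {'size_mb': 404.0, 'eer': 0.078282},
    'microsoft/unispeech-sat-base-plus-sv': {'size_mb': 404.0, 'eer': 0.076128},
}


def pick_model(models, max_size_mb=None):
    """Return the model name with the lowest EER, optionally under a size budget."""
    candidates = {
        name: row for name, row in models.items()
        if max_size_mb is None or row['size_mb'] <= max_size_mb
    }
    if not candidates:
        raise ValueError('no model fits the size budget')
    return min(candidates, key=lambda name: candidates[name]['eer'])


best = pick_model(MODELS)
print(best)
# The chosen name could then be passed to the loader shown above, e.g.
# model = malaya_speech.speaker_vector.huggingface(model=best)
```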

[6]:

from glob import glob

speakers = [
    'speech/example-speaker/khalil-nooh.wav',
    'speech/example-speaker/mas-aisyah.wav',
    'speech/example-speaker/shafiqah-idayu.wav',
    'speech/example-speaker/husein-zolkepli.wav'
]

Pipeline#

[7]:

def load_wav(file):
    # malaya_speech.load returns (waveform, sample_rate); keep only the waveform.
    return malaya_speech.load(file)[0]

p = Pipeline()
frame = p.foreach_map(load_wav).map(model)
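As a rough mental model, and assuming `foreach_map` applies a function to each element of the input list while `map` applies a function once to the whole result, the pipeline above behaves like the plain-Python sketch below. `run_like_pipeline`, `fake_load` and `fake_model` are stand-ins invented for illustration so the sketch runs without audio files or a downloaded model. The real Pipeline additionally stores results under node names.

```python
def run_like_pipeline(files, load_fn, model_fn):
    """Plain-Python reading of p.foreach_map(load_fn).map(model_fn)."""
    # foreach_map: apply load_fn to every element of the input list.
    waveforms = [load_fn(f) for f in files]
    # map: apply model_fn once to the whole list of waveforms.
    return model_fn(waveforms)


def fake_load(path):
    # Stand-in for load_wav: returns a one-sample "waveform".
    return [float(len(path))]


def fake_model(batch):
    # Stand-in for the speaker-vector model: one small vector per waveform.
    return [[waveform[0], 1.0] for waveform in batch]


vectors = run_like_pipeline(['a.wav', 'bb.wav'], fake_load, fake_model)
print(vectors)
```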

[8]:

p.visualize()


[8]:

_images/load-speaker-vector-huggingface_15_1.png

[9]:

r = p(speakers)


Calculate similarity#

[10]:

from sklearn.metrics.pairwise import cosine_similarity
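For reference, the cosine similarity between two vectors is their dot product divided by the product of their norms, and calling `cosine_similarity` with a single matrix compares every row with every other row. A minimal pure-Python sketch of the same idea follows. The helper names `cosine` and `pairwise_cosine` are illustrative, and the toy vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(vectors):
    """Square matrix where entry [i][j] compares vectors[i] with vectors[j]."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
toy_matrix = pairwise_cosine(toy_vectors)
for row in toy_matrix:
    print(row)
```

The result is symmetric and its diagonal is 1 up to floating-point rounding, which is why float32 output can show diagonal values such as 1.0000001.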

[15]:

cosine_similarity(r['speaker-vector-huggingface'])

[15]:

array([[1.0000001 , 0.4988824 , 0.36581534, 0.92033345],
       [0.4988824 , 1.0000001 , 0.6961168 , 0.48723692],
       [0.36581534, 0.6961168 , 1.0000002 , 0.38700205],
       [0.92033345, 0.48723692, 0.38700205, 1.0000001 ]], dtype=float32)

The rows and columns of the matrix follow the order of our input files:

['speech/example-speaker/khalil-nooh.wav',
 'speech/example-speaker/mas-aisyah.wav',
 'speech/example-speaker/shafiqah-idayu.wav',
 'speech/example-speaker/husein-zolkepli.wav']
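To read the matrix against these file names, it can help to list every distinct pair with its score, highest first. The sketch below does this with the values copied from the output above. `rank_pairs` is a hypothetical helper, and the short names are just the file names without directory and extension.

```python
from itertools import combinations

names = ['khalil-nooh', 'mas-aisyah', 'shafiqah-idayu', 'husein-zolkepli']

# Similarity matrix copied from the cosine_similarity output above.
similarity = [
    [1.0000001, 0.4988824, 0.36581534, 0.92033345],
    [0.4988824, 1.0000001, 0.6961168, 0.48723692],
    [0.36581534, 0.6961168, 1.0000002, 0.38700205],
    [0.92033345, 0.48723692, 0.38700205, 1.0000001],
]


def rank_pairs(names, matrix):
    """Return (name_a, name_b, score) for each distinct pair, most similar first."""
    pairs = [
        (names[i], names[j], matrix[i][j])
        for i, j in combinations(range(len(names)), 2)
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


ranked = rank_pairs(names, similarity)
for name_a, name_b, score in ranked:
    print(f'{name_a} vs {name_b}: {score:.3f}')
```

With these values, the most similar pair is khalil-nooh and husein-zolkepli (about 0.920) and the least similar pair is khalil-nooh and shafiqah-idayu (about 0.366).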