Speech Split PySPTK¶

Detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch, using PySPTK for F0 extraction.

This tutorial is available as an IPython notebook at malaya-speech/example/speechsplit-conversion-pysptk.

This module is language independent, so it is safe to use on different languages.

Explanation¶

We created a fast Speech Split Conversion model called FastSpeechSplit (Faster and Accurate Speech Split Conversion using Transformer). No paper has been published for it.

Steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/speechsplit-conversion

F0 Conversion¶

Make sure pysptk is already installed:

pip install pysptk

[1]:

import malaya_speech
import numpy as np

List available Speech Split models¶

[2]:

malaya_speech.speechsplit_conversion.available_deep_conversion(f0_mode = 'pysptk')

[2]:

Model	Size (MB)	Quantized Size (MB)
fastspeechsplit-vggvox-v2	232.0	59.2
fastspeechsplit-v2-vggvox-v2	411.0	105.0

Load Deep Conversion¶

def deep_conversion(
    model: str = 'fastspeechsplit-v2-vggvox-v2',
    f0_mode = 'pysptk',
    quantized: bool = False,
    **kwargs,
):
    """
    Load Voice Conversion model.

    Parameters
    ----------
    model : str, optional (default='fastspeechsplit-v2-vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'fastspeechsplit-vggvox-v2'`` - FastSpeechSplit with VGGVox-v2 Speaker Vector.
        * ``'fastspeechsplit-v2-vggvox-v2'`` - FastSpeechSplit V2 with VGGVox-v2 Speaker Vector.

    f0_mode : str, optional (default='pysptk')
        F0 conversion supported. Allowed values:

        * ``'pysptk'`` - https://github.com/r9y9/pysptk, sensitive towards gender.
        * ``'pyworld'`` - https://pypi.org/project/pyworld/

    quantized : bool, optional (default=False)
        If True, load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.supervised.speechsplit_conversion.load function
    """

[23]:

model = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-vggvox-v2')
model_v2 = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-v2-vggvox-v2')
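
To give some intuition for the `quantized` option, here is a minimal sketch of generic 8-bit affine quantization in NumPy. It is not the procedure malaya-speech uses to produce its quantized checkpoints; the helper names `quantize_uint8` and `dequantize` are hypothetical. It only illustrates why storing weights in 8 bits instead of 32 shrinks them by about 4x, which is consistent with the 232.0 MB versus 59.2 MB listed for `fastspeechsplit-vggvox-v2`.

```python
import numpy as np

def quantize_uint8(weights):
    """Affine-quantize a float32 array to uint8; returns (q, scale, w_min)."""
    w_min = float(weights.min())
    w_max = float(weights.max())
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant array: any positive scale works
    q = np.round((weights - w_min) / scale)
    q = np.clip(q, 0, 255).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min):
    """Map uint8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale + w_min

# Random stand-in for a weight matrix (illustrative data only).
rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)

q, scale, w_min = quantize_uint8(weights)
restored = dequantize(q, scale, w_min)

size_ratio = weights.nbytes / q.nbytes           # float32 bytes vs uint8 bytes
max_error = float(np.abs(weights - restored).max())
print(size_ratio, max_error)
```

The rounding error per weight is bounded by about half of `scale`, which is the accuracy cost paid for the smaller file. Whether inference is also faster depends on the hardware, as the docstring notes.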

Predict¶

def predict(
    self,
    original_audio,
    target_audio,
    modes = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'],
):
    """
    Change original voice audio to follow targeted voice.

    Parameters
    ----------
    original_audio: np.array or malaya_speech.model.frame.Frame
    target_audio: np.array or malaya_speech.model.frame.Frame
    modes: List[str], optional (default = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
        R denotes rhythm, F denotes pitch target, U denotes speaker target (vector).

        * ``'R'`` - maintain `original_audio` F and U on `target_audio` R.
        * ``'F'`` - maintain `original_audio` R and U on `target_audio` F.
        * ``'U'`` - maintain `original_audio` R and F on `target_audio` U.
        * ``'RF'`` - maintain `original_audio` U on `target_audio` R and F.
        * ``'RU'`` - maintain `original_audio` F on `target_audio` R and U.
        * ``'FU'`` - maintain `original_audio` R on `target_audio` F and U.
        * ``'RFU'`` - no conversion happens; `target_audio` is only passed through the encoder-decoder.

    Returns
    -------
    result: Dict[modes]
    """

``original_audio`` and ``target_audio`` must have a sample rate of 22050 Hz.
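
In this tutorial the audio is loaded with `malaya_speech.load(..., sr = sr)`, which is passed the required rate directly. If a waveform comes from elsewhere at a different rate, it has to be resampled first. The following is a rough sketch using linear interpolation; `resample_linear` is a hypothetical helper, it applies no anti-aliasing filter, and a proper resampling library is preferable for real use.

```python
import numpy as np

def resample_linear(audio, orig_sr, target_sr=22050):
    """Resample a mono 1-D signal by linear interpolation (crude sketch)."""
    audio = np.asarray(audio, dtype=np.float32)
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Output sample positions expressed in input-sample units.
    positions = np.arange(n_out) * (orig_sr / target_sr)
    resampled = np.interp(positions, np.arange(len(audio)), audio)
    return resampled.astype(np.float32)

# One second of a 220 Hz tone sampled at 16 kHz (synthetic test signal).
one_second_16k = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
resampled = resample_linear(one_second_16k, orig_sr=16000)
print(len(one_second_16k), len(resampled))
```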

[4]:

sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]

[5]:

import IPython.display as ipd

ipd.Audio(original_audio, rate = sr)

[5]:

[6]:

ipd.Audio(target_audio[:sr * 2], rate = sr)

[6]:

[7]:

%%time
r = model.predict(original_audio, target_audio)
r.keys()

CPU times: user 32 s, sys: 6.65 s, total: 38.7 s
Wall time: 17.7 s

[7]:

dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])

[8]:

%%time
r_v2 = model_v2.predict(original_audio, target_audio)
r_v2.keys()

CPU times: user 58.7 s, sys: 12.5 s, total: 1min 11s
Wall time: 30.5 s

[8]:

dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
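
The mode keys can be read letter by letter: each letter present in the key is taken from `target_audio`, and each letter absent is kept from `original_audio`. The sketch below spells this out; `describe_mode` and `ASPECTS` are hypothetical names used only for illustration and are not part of malaya-speech.

```python
# Letters used by the `predict` modes, as described in its docstring.
ASPECTS = {'R': 'rhythm', 'F': 'pitch', 'U': 'speaker'}

def describe_mode(mode):
    """Return which audio supplies each aspect for a given mode string."""
    unknown = set(mode) - set(ASPECTS)
    if not mode or unknown:
        raise ValueError(f'unsupported mode: {mode!r}')
    return {
        name: 'target_audio' if letter in mode else 'original_audio'
        for letter, name in ASPECTS.items()
    }

for mode in ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU']:
    print(mode, describe_mode(mode))
```

For `'RFU'` all three aspects come from `target_audio`, which matches the docstring's note that this mode performs no conversion.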

Speech Split output¶

The model returns mel features of size 80.
These mel features can only be synthesized with a universal vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
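
Before passing a result to a vocoder, it can help to confirm that it has the expected layout. The sketch below assumes each output is a 2-D array of shape `(frames, 80)`; that axis order is an assumption, not something this page states, and `check_mel` is a hypothetical helper.

```python
import numpy as np

def check_mel(mel, n_mels=80):
    """Validate an assumed (frames, n_mels) layout and return the frame count."""
    mel = np.asarray(mel)
    if mel.ndim != 2 or mel.shape[-1] != n_mels:
        raise ValueError(f'expected shape (frames, {n_mels}), got {mel.shape}')
    return mel.shape[0]

# Synthetic stand-in for one entry of the `predict` output.
example_mel = np.zeros((120, 80), dtype=np.float32)
n_frames = check_mel(example_mel)
print(n_frames)
```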

Load Universal MelGAN¶

Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

[10]:

melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
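
The cells that follow call the vocoder once per mode. As a compact alternative, here is a sketch of a loop over every key in the result dictionary. `synthesize_all` is a hypothetical helper, and `fake_vocode` is a stand-in (with a made-up 256 samples per frame) used only so the snippet runs without downloading the model.

```python
def synthesize_all(results, vocode):
    """Run a vocoder over every mode in `results`, one mel at a time."""
    # `vocode` takes a list of mels and returns a list of waveforms,
    # mirroring how `melgan.predict([mel])[0]` is used in this tutorial.
    return {mode: vocode([mel])[0] for mode, mel in results.items()}

def fake_vocode(mels):
    """Stand-in vocoder: zeros, 256 samples per mel frame (illustrative only)."""
    return [[0.0] * (len(mel) * 256) for mel in mels]

# Tiny fake results: 3 and 5 frames of 80-dimensional mel features.
fake_results = {'R': [[0.0] * 80] * 3, 'F': [[0.0] * 80] * 5}
audio = synthesize_all(fake_results, fake_vocode)
print({mode: len(wave) for mode, wave in audio.items()})
```

With the real objects from this notebook, the equivalent call would be `audio = synthesize_all(r, melgan.predict)`, followed by `ipd.Audio(audio['RF'], rate = sr)` to listen to a given mode.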

[11]:

%%time

y_ = melgan.predict([r['R']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 17.3 s, sys: 3.62 s, total: 20.9 s
Wall time: 6.16 s

[11]:

[12]:

%%time

y_ = melgan.predict([r_v2['R']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 16.1 s, sys: 2.42 s, total: 18.5 s
Wall time: 3.41 s

[12]:

[13]:

%%time

y_ = melgan.predict([r['F']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 13.6 s, sys: 2.18 s, total: 15.8 s
Wall time: 2.82 s

[13]:

[14]:

%%time

y_ = melgan.predict([r_v2['F']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 13.7 s, sys: 2.25 s, total: 15.9 s
Wall time: 3.09 s

[14]:

[15]:

%%time

y_ = melgan.predict([r['U']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 14.8 s, sys: 2.35 s, total: 17.2 s
Wall time: 3.46 s

[15]:

[16]:

%%time

y_ = melgan.predict([r_v2['U']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 14 s, sys: 2.37 s, total: 16.4 s
Wall time: 2.97 s

[16]:

[17]:

%%time

y_ = melgan.predict([r['RF']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 16.2 s, sys: 2.48 s, total: 18.7 s
Wall time: 3.46 s

[17]:

[18]:

%%time

y_ = melgan.predict([r_v2['RF']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 16.5 s, sys: 2.62 s, total: 19.1 s
Wall time: 3.64 s

[18]:

[19]:

%%time

y_ = melgan.predict([r['RU']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 16.9 s, sys: 2.59 s, total: 19.5 s
Wall time: 3.85 s

[19]:

[20]:

%%time

y_ = melgan.predict([r_v2['RU']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 16.8 s, sys: 2.65 s, total: 19.4 s
Wall time: 3.5 s

[20]:

[21]:

%%time

y_ = melgan.predict([r['FU']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 14.1 s, sys: 2.04 s, total: 16.1 s
Wall time: 2.99 s

[21]:

[22]:

%%time

y_ = melgan.predict([r_v2['FU']])
ipd.Audio(y_[0], rate = sr)

CPU times: user 14.2 s, sys: 2.25 s, total: 16.5 s
Wall time: 2.93 s

[22]: