Speech Split PySPTK#

Detailed speaking style conversion by disentangling speech into content, timbre, rhythm, and pitch using PySPTK.

This tutorial is available as an IPython notebook at malaya-speech/example/speechsplit-conversion-pysptk.

This module is language independent, so it is safe to use across different languages.

Explanation#

We created a super fast Speech Split Conversion model called FastSpeechSplit: faster and more accurate Speech Split Conversion using a Transformer. No paper has been published for it.

Steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/speechsplit-conversion

F0 Conversion#

Make sure pysptk is already installed,

pip install pysptk
[1]:
import malaya_speech
import numpy as np

List available Speech Split models#

[2]:
malaya_speech.speechsplit_conversion.available_deep_conversion(f0_mode = 'pysptk')
[2]:
                              Size (MB)  Quantized Size (MB)
fastspeechsplit-vggvox-v2         232.0                 59.2
fastspeechsplit-v2-vggvox-v2      411.0                105.0

Load Deep Conversion#

def deep_conversion(
    model: str = 'fastspeechsplit-v2-vggvox-v2',
    f0_mode = 'pysptk',
    quantized: bool = False,
    **kwargs,
):
    """
    Load Speech Split Conversion model.

    Parameters
    ----------
    model : str, optional (default='fastspeechsplit-v2-vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'fastspeechsplit-vggvox-v2'`` - FastSpeechSplit with VGGVox-v2 Speaker Vector.
        * ``'fastspeechsplit-v2-vggvox-v2'`` - FastSpeechSplit V2 with VGGVox-v2 Speaker Vector.

    f0_mode : str, optional (default='pysptk')
        F0 conversion supported. Allowed values:

        * ``'pysptk'`` - https://github.com/r9y9/pysptk, sensitive towards gender.
        * ``'pyworld'`` - https://pypi.org/project/pyworld/

    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.supervised.speechsplit_conversion.load function
    """
[23]:
model = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-vggvox-v2')
model_v2 = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-v2-vggvox-v2')
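
If you want to try the 8-bit quantized variant described by the ``quantized`` parameter above, loading follows the same pattern; a minimal sketch:

# quantized weights are optional and not necessarily faster, depends on the machine
quantized_model = malaya_speech.speechsplit_conversion.deep_conversion(
    model = 'fastspeechsplit-vggvox-v2', quantized = True
)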

Predict#

def predict(
    self,
    original_audio,
    target_audio,
    modes = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'],
):
    """
    Change the original voice audio to follow the targeted voice.

    Parameters
    ----------
    original_audio: np.array or malaya_speech.model.frame.Frame
    target_audio: np.array or malaya_speech.model.frame.Frame
    modes: List[str], optional (default = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
        R denotes rhythm, F denotes pitch, U denotes speaker (vector).

        * ``'R'`` - keep F and U from `original_audio`, take R from `target_audio`.
        * ``'F'`` - keep R and U from `original_audio`, take F from `target_audio`.
        * ``'U'`` - keep R and F from `original_audio`, take U from `target_audio`.
        * ``'RF'`` - keep U from `original_audio`, take R and F from `target_audio`.
        * ``'RU'`` - keep F from `original_audio`, take R and U from `target_audio`.
        * ``'FU'`` - keep R from `original_audio`, take F and U from `target_audio`.
        * ``'RFU'`` - no conversion happens, just run the encoder-decoder on `target_audio`.

    Returns
    -------
    result: Dict[str, np.array]
    """

``original_audio`` and ``target_audio`` must be at a 22050 sample rate.
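
If your audio is at a different sample rate, resample it before calling ``predict``; a minimal sketch using ``scipy.signal.resample_poly`` (``audio_16000`` is a hypothetical 16000 Hz array, any resampler works):

from scipy.signal import resample_poly

# hypothetical input: upsample a 16000 Hz signal to the required 22050 Hz
audio_22050 = resample_poly(audio_16000, up = 22050, down = 16000)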

[4]:
sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
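
If you only need some of the conversions, pass a subset of ``modes``; a minimal sketch using the audio just loaded:

# compute only the rhythm and pitch conversions, skipping the other combinations
r_subset = model.predict(original_audio, target_audio, modes = ['R', 'F'])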
[5]:
import IPython.display as ipd

ipd.Audio(original_audio, rate = sr)
[5]:
[6]:
ipd.Audio(target_audio[:sr * 2], rate = sr)
[6]:
[7]:
%%time
r = model.predict(original_audio, target_audio)
r.keys()
CPU times: user 32 s, sys: 6.65 s, total: 38.7 s
Wall time: 17.7 s
[7]:
dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
[8]:
%%time
r_v2 = model_v2.predict(original_audio, target_audio)
r_v2.keys()
CPU times: user 58.7 s, sys: 12.5 s, total: 1min 11s
Wall time: 30.5 s
[8]:
dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])

Speech Split output#

  1. Returns mel features of size 80 (a quick sanity check follows this list).

  2. These mel features can only be synthesized using a Universal Vocoder, e.g., Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
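
Each value in the returned dict should be a mel feature matrix with 80 features on the last axis; a quick sanity check (the exact number of frames depends on the audio length):

# assumed output layout: (time frames, 80) mel features per mode
print(r['R'].shape)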

Load Universal MelGAN#

Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

[10]:
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
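
``melgan.predict`` takes a list of mel features, so several conversions can be vocoded in one call; a minimal sketch (batching assumed from the list-based API, the cells below vocode one at a time):

# vocode the rhythm conversion from both models in a single call
y_batch = melgan.predict([r['R'], r_v2['R']])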
[11]:
%%time

y_ = melgan.predict([r['R']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 17.3 s, sys: 3.62 s, total: 20.9 s
Wall time: 6.16 s
[11]:
[12]:
%%time

y_ = melgan.predict([r_v2['R']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.1 s, sys: 2.42 s, total: 18.5 s
Wall time: 3.41 s
[12]:
[13]:
%%time

y_ = melgan.predict([r['F']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.6 s, sys: 2.18 s, total: 15.8 s
Wall time: 2.82 s
[13]:
[14]:
%%time

y_ = melgan.predict([r_v2['F']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.7 s, sys: 2.25 s, total: 15.9 s
Wall time: 3.09 s
[14]:
[15]:
%%time

y_ = melgan.predict([r['U']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.8 s, sys: 2.35 s, total: 17.2 s
Wall time: 3.46 s
[15]:
[16]:
%%time

y_ = melgan.predict([r_v2['U']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14 s, sys: 2.37 s, total: 16.4 s
Wall time: 2.97 s
[16]:
[17]:
%%time

y_ = melgan.predict([r['RF']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.2 s, sys: 2.48 s, total: 18.7 s
Wall time: 3.46 s
[17]:
[18]:
%%time

y_ = melgan.predict([r_v2['RF']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.5 s, sys: 2.62 s, total: 19.1 s
Wall time: 3.64 s
[18]:
[19]:
%%time

y_ = melgan.predict([r['RU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.9 s, sys: 2.59 s, total: 19.5 s
Wall time: 3.85 s
[19]:
[20]:
%%time

y_ = melgan.predict([r_v2['RU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.8 s, sys: 2.65 s, total: 19.4 s
Wall time: 3.5 s
[20]:
[21]:
%%time

y_ = melgan.predict([r['FU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.1 s, sys: 2.04 s, total: 16.1 s
Wall time: 2.99 s
[21]:
[22]:
%%time

y_ = melgan.predict([r_v2['FU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.2 s, sys: 2.25 s, total: 16.5 s
Wall time: 2.93 s
[22]:
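
To keep any of these conversions, write the vocoder output to disk; a minimal sketch using ``soundfile`` (install it separately if needed):

import soundfile as sf

# save the last synthesized waveform at the model's 22050 sample rate
sf.write('converted-FU.wav', y_[0], sr)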