Speech Split PySPTK#

Detailed speaking style conversion by disentangling speech into content, timbre, rhythm, and pitch using PySPTK.

This tutorial is available as an IPython notebook at malaya-speech/example/speechsplit-conversion-pysptk.

This module is language independent, so it is safe to use across different languages.

Explanation#

We created a super fast Speech Split Conversion model called FastSpeechSplit: faster and more accurate Speech Split Conversion using a Transformer. No paper has been published for it.

Steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/speechsplit-conversion

F0 Conversion#

Make sure pysptk is already installed,

pip install pysptk
[1]:
import malaya_speech
import numpy as np

List available Speech Split models#

[2]:
malaya_speech.speechsplit_conversion.available_deep_conversion(f0_mode = 'pysptk')
[2]:
                              Size (MB)  Quantized Size (MB)
fastspeechsplit-vggvox-v2         232.0                 59.2
fastspeechsplit-v2-vggvox-v2      411.0                105.0

Load Deep Conversion#

def deep_conversion(
    model: str = 'fastspeechsplit-v2-vggvox-v2',
    f0_mode = 'pysptk',
    quantized: bool = False,
    **kwargs,
):
    """
    Load Speech Split Conversion model.

    Parameters
    ----------
    model : str, optional (default='fastspeechsplit-v2-vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'fastspeechsplit-vggvox-v2'`` - FastSpeechSplit with VGGVox-v2 Speaker Vector.
        * ``'fastspeechsplit-v2-vggvox-v2'`` - FastSpeechSplit V2 with VGGVox-v2 Speaker Vector.

    f0_mode : str, optional (default='pysptk')
        F0 conversion supported. Allowed values:

        * ``'pysptk'`` - https://github.com/r9y9/pysptk, sensitive towards gender.
        * ``'pyworld'`` - https://pypi.org/project/pyworld/

    quantized : bool, optional (default=False)
        if True, will load an 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.supervised.speechsplit_conversion.load function
    """
[23]:
model = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-vggvox-v2')
model_v2 = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-v2-vggvox-v2')
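
If you want to try the 8-bit quantized variant described by the ``quantized`` parameter above, loading follows the same pattern; a minimal sketch:

# quantized weights are optional and not necessarily faster, depends on the machine
quantized_model = malaya_speech.speechsplit_conversion.deep_conversion(
    model = 'fastspeechsplit-vggvox-v2', quantized = True
)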

Predict#

def predict(
    self,
    original_audio,
    target_audio,
    modes = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'],
):
    """
    Change the original voice audio to follow the targeted voice.

    Parameters
    ----------
    original_audio: np.array or malaya_speech.model.frame.Frame
    target_audio: np.array or malaya_speech.model.frame.Frame
    modes: List[str], optional (default = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
        R denotes rhythm, F denotes pitch, U denotes speaker (vector).

        * ``'R'`` - keep F and U from `original_audio`, take R from `target_audio`.
        * ``'F'`` - keep R and U from `original_audio`, take F from `target_audio`.
        * ``'U'`` - keep R and F from `original_audio`, take U from `target_audio`.
        * ``'RF'`` - keep U from `original_audio`, take R and F from `target_audio`.
        * ``'RU'`` - keep F from `original_audio`, take R and U from `target_audio`.
        * ``'FU'`` - keep R from `original_audio`, take F and U from `target_audio`.
        * ``'RFU'`` - no conversion happens, just run the encoder-decoder on `target_audio`.

    Returns
    -------
    result: Dict[str, np.array]
    """

``original_audio`` and ``target_audio`` must be at a 22050 sample rate.
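
If your audio is at a different sample rate, resample it before calling ``predict``; a minimal sketch using ``scipy.signal.resample_poly`` (``audio_16000`` is a hypothetical 16000 Hz array, any resampler works):

from scipy.signal import resample_poly

# hypothetical input: upsample a 16000 Hz signal to the required 22050 Hz
audio_22050 = resample_poly(audio_16000, up = 22050, down = 16000)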

[4]:
sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
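
If you only need some of the conversions, pass a subset of ``modes``; a minimal sketch using the audio just loaded:

# compute only the rhythm and pitch conversions, skipping the other combinations
r_subset = model.predict(original_audio, target_audio, modes = ['R', 'F'])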
[5]:
import IPython.display as ipd

ipd.Audio(original_audio, rate = sr)
[5]:
[6]:
ipd.Audio(target_audio[:sr * 2], rate = sr)
[6]:
[7]:
%%time
r = model.predict(original_audio, target_audio)
r.keys()
CPU times: user 32 s, sys: 6.65 s, total: 38.7 s
Wall time: 17.7 s
[7]:
dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
[8]:
%%time
r_v2 = model_v2.predict(original_audio, target_audio)
r_v2.keys()
CPU times: user 58.7 s, sys: 12.5 s, total: 1min 11s
Wall time: 30.5 s
[8]:
dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])

Speech Split output#

  1. Returns mel features of size 80 (a quick sanity check follows this list).

  2. These mel features can only be synthesized using a Universal Vocoder, e.g., Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
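
Each value in the returned dict should be a mel feature matrix with 80 features on the last axis; a quick sanity check (the exact number of frames depends on the audio length):

# assumed output layout: (time frames, 80) mel features per mode
print(r['R'].shape)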

Load Universal MelGAN#

Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

[10]:
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
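
``melgan.predict`` takes a list of mel features, so several conversions can be vocoded in one call; a minimal sketch (batching assumed from the list-based API, the cells below vocode one at a time):

# vocode the rhythm conversion from both models in a single call
y_batch = melgan.predict([r['R'], r_v2['R']])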
[11]:
%%time

y_ = melgan.predict([r['R']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 17.3 s, sys: 3.62 s, total: 20.9 s
Wall time: 6.16 s
[11]:
[12]:
%%time

y_ = melgan.predict([r_v2['R']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.1 s, sys: 2.42 s, total: 18.5 s
Wall time: 3.41 s
[12]:
[13]:
%%time

y_ = melgan.predict([r['F']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.6 s, sys: 2.18 s, total: 15.8 s
Wall time: 2.82 s
[13]:
[14]:
%%time

y_ = melgan.predict([r_v2['F']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.7 s, sys: 2.25 s, total: 15.9 s
Wall time: 3.09 s
[14]:
[15]:
%%time

y_ = melgan.predict([r['U']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.8 s, sys: 2.35 s, total: 17.2 s
Wall time: 3.46 s
[15]:
[16]:
%%time

y_ = melgan.predict([r_v2['U']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14 s, sys: 2.37 s, total: 16.4 s
Wall time: 2.97 s
[16]:
[17]:
%%time

y_ = melgan.predict([r['RF']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.2 s, sys: 2.48 s, total: 18.7 s
Wall time: 3.46 s
[17]:
[18]:
%%time

y_ = melgan.predict([r_v2['RF']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.5 s, sys: 2.62 s, total: 19.1 s
Wall time: 3.64 s
[18]:
[19]:
%%time

y_ = melgan.predict([r['RU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.9 s, sys: 2.59 s, total: 19.5 s
Wall time: 3.85 s
[19]:
[20]:
%%time

y_ = melgan.predict([r_v2['RU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.8 s, sys: 2.65 s, total: 19.4 s
Wall time: 3.5 s
[20]:
[21]:
%%time

y_ = melgan.predict([r['FU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.1 s, sys: 2.04 s, total: 16.1 s
Wall time: 2.99 s
[21]:
[22]:
%%time

y_ = melgan.predict([r_v2['FU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.2 s, sys: 2.25 s, total: 16.5 s
Wall time: 2.93 s
[22]:
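
To keep any of these conversions, write the vocoder output to disk; a minimal sketch using ``soundfile`` (install it separately if needed):

import soundfile as sf

# save the last synthesized waveform at the model's 22050 sample rate
sf.write('converted-FU.wav', y_[0], sr)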