Speech Split PyWorld#

Detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch using PyWorld.

This tutorial is available as an IPython notebook at malaya-speech/example/speechsplit-conversion-pyworld.

This module is language independent, so it is safe to use on different languages.

Explanation#

We created a super fast Speech Split Conversion model called FastSpeechSplit, a faster and more accurate Speech Split Conversion using Transformer. No paper was produced.

Steps to reproduce can be found at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/speechsplit-conversion

F0 Conversion#

Make sure pyworld is already installed,

pip install pyworld
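pyworld is used here for the F0 analysis and conversion. As a quick check that the installation works, below is a minimal, illustrative sketch of extracting an F0 contour with pyworld; the exact analysis performed inside the model may differ.

import numpy as np
import pyworld as pw
import malaya_speech

# load any speech sample; pyworld expects float64 audio
y, sr = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = 22050)
y = y.astype(np.float64)

# coarse F0 estimation followed by refinement
f0, t = pw.dio(y, sr)            # raw F0 contour and its time axis
f0 = pw.stonemask(y, f0, t, sr)  # refined F0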
[1]:
import malaya_speech
import numpy as np

List available Speech Split models#

[2]:
malaya_speech.speechsplit_conversion.available_deep_conversion(f0_mode = 'pyworld')
[2]:
                               Size (MB)  Quantized Size (MB)
fastspeechsplit-vggvox-v2          232.0                 59.2
fastspeechsplit-v2-vggvox-v2       411.0                105.0

Load Deep Conversion#

def deep_conversion(
    model: str = 'fastspeechsplit-v2-vggvox-v2',
    f0_mode = 'pysptk',
    quantized: bool = False,
    **kwargs,
):
    """
    Load Speech Split Conversion model.

    Parameters
    ----------
    model : str, optional (default='fastspeechsplit-v2-vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'fastspeechsplit-vggvox-v2'`` - FastSpeechSplit with VGGVox-v2 Speaker Vector.
        * ``'fastspeechsplit-v2-vggvox-v2'`` - FastSpeechSplit V2 with VGGVox-v2 Speaker Vector.

    f0_mode : str, optional (default='pysptk')
        F0 conversion supported. Allowed values:

        * ``'pysptk'`` - https://github.com/r9y9/pysptk, sensitive towards gender.
        * ``'pyworld'`` - https://pypi.org/project/pyworld/

    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster; it totally depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.speechsplit_conversion.load function
    """
[3]:
model = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-vggvox-v2',
                                                            f0_mode = 'pyworld')
model_v2 = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-v2-vggvox-v2',
                                                               f0_mode = 'pyworld')
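If memory or load time is a concern, the 8-bit quantized variants listed above can be loaded the same way by passing ``quantized = True``, as documented in the function above; a minimal sketch, keeping in mind that a quantized model is not necessarily faster:

quantized_model = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-vggvox-v2',
                                                                       f0_mode = 'pyworld',
                                                                       quantized = True)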

Predict#

def predict(
    self,
    original_audio,
    target_audio,
    modes = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'],
):
    """
    Change original voice audio to follow targeted voice.

    Parameters
    ----------
    original_audio: np.array or malaya_speech.model.frame.Frame
    target_audio: np.array or malaya_speech.model.frame.Frame
    modes: List[str], optional (default = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
        R denotes rhythm, F denotes pitch target, U denotes speaker target (vector).

        * ``'R'`` - maintain `original_audio` F and U on `target_audio` R.
        * ``'F'`` - maintain `original_audio` R and U on `target_audio` F.
        * ``'U'`` - maintain `original_audio` R and F on `target_audio` U.
        * ``'RF'`` - maintain `original_audio` U on `target_audio` R and F.
        * ``'RU'`` - maintain `original_audio` F on `target_audio` R and U.
        * ``'FU'`` - maintain `original_audio` R on `target_audio` F and U.
        * ``'RFU'`` - no conversion happens, just encoder-decoder on `target_audio`

    Returns
    -------
    result: Dict[modes]
    """

``original_audio`` and ``target_audio`` must be at a 22050 sample rate.
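If your recordings are at a different sample rate, resample them before calling ``predict``; a minimal sketch using librosa, where ``y`` and ``orig_sr`` are assumed to be your already-loaded waveform and its native rate:

import librosa

# resample an in-memory waveform to the 22050 Hz the model expects
y_22050 = librosa.resample(y, orig_sr = orig_sr, target_sr = 22050)

Loading directly with ``malaya_speech.load(..., sr = 22050)`` as below also produces audio at that rate.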

[4]:
sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
[5]:
import IPython.display as ipd

ipd.Audio(original_audio, rate = sr)
[5]:
[6]:
ipd.Audio(target_audio[:sr * 2], rate = sr)
[6]:
[7]:
%%time
r = model.predict(original_audio, target_audio)
r.keys()
CPU times: user 29.4 s, sys: 5.26 s, total: 34.6 s
Wall time: 13.3 s
[7]:
dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
[8]:
%%time
r_v2 = model_v2.predict(original_audio, target_audio)
r_v2.keys()
CPU times: user 52.7 s, sys: 9.96 s, total: 1min 2s
Wall time: 23.6 s
[8]:
dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])

Speech Split output#

  1. The model will return mel features of size 80 (see the quick shape check after this list).

  2. These mel features can only be synthesized using a Universal Vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
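A quick sanity check on the outputs above, assuming each entry of the returned dict is a mel spectrogram shaped ``(time, 80)``:

for mode, mel in r.items():
    # each mode maps to a mel feature matrix with 80 bins
    print(mode, mel.shape)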

Load Universal MelGAN#

Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

[9]:
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
[11]:
%%time

y_ = melgan.predict([r['R']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 17.8 s, sys: 3.61 s, total: 21.4 s
Wall time: 5.9 s
[11]:
[12]:
%%time

y_ = melgan.predict([r_v2['R']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.2 s, sys: 2.58 s, total: 18.8 s
Wall time: 3.28 s
[12]:
[13]:
%%time

y_ = melgan.predict([r['F']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.7 s, sys: 1.91 s, total: 15.7 s
Wall time: 2.81 s
[13]:
[14]:
%%time

y_ = melgan.predict([r_v2['F']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.9 s, sys: 2.14 s, total: 16 s
Wall time: 2.9 s
[14]:
[15]:
%%time

y_ = melgan.predict([r['U']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.5 s, sys: 2.05 s, total: 15.5 s
Wall time: 2.78 s
[15]:
[16]:
%%time

y_ = melgan.predict([r_v2['U']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.4 s, sys: 2.05 s, total: 15.5 s
Wall time: 2.82 s
[16]:
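To keep any of the converted results, the vocoder output can be written back to disk; a minimal sketch using soundfile, with an illustrative output filename:

import soundfile as sf

# y_[0] is the waveform synthesized by the vocoder at 22050 Hz
sf.write('speechsplit-R.wav', y_[0], sr)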