Speech Split PyWorld#

Detailed speaking-style conversion by disentangling speech into content, timbre, rhythm and pitch, using PyWorld.

This tutorial is available as an IPython notebook at malaya-speech/example/speechsplit-conversion-pyworld.

This module is language independent, so it is safe to use on different languages.


We created a fast Speech Split Conversion model called FastSpeechSplit (Faster and Accurate Speech Split Conversion using Transformer). No paper was produced.

Steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/speechsplit-conversion

F0 Conversion#

Make sure pyworld is installed:

pip install pyworld
import malaya_speech
import numpy as np

List available Speech Split models#

malaya_speech.speechsplit_conversion.available_deep_conversion(f0_mode = 'pyworld')
                              Size (MB)  Quantized Size (MB)
fastspeechsplit-vggvox-v2         232.0                 59.2
fastspeechsplit-v2-vggvox-v2      411.0                105.0

Load Deep Conversion#

def deep_conversion(
    model: str = 'fastspeechsplit-v2-vggvox-v2',
    f0_mode = 'pysptk',
    quantized: bool = False,
):
    """
    Load Voice Conversion model.

    Parameters
    ----------
    model : str, optional (default='fastspeechsplit-v2-vggvox-v2')
        Model architecture supported. Allowed values:

        * ``'fastspeechsplit-vggvox-v2'`` - FastSpeechSplit with VGGVox-v2 Speaker Vector.
        * ``'fastspeechsplit-v2-vggvox-v2'`` - FastSpeechSplit V2 with VGGVox-v2 Speaker Vector.

    f0_mode : str, optional (default='pysptk')
        F0 conversion supported. Allowed values:

        * ``'pysptk'`` - https://github.com/r9y9/pysptk, sensitive towards gender.
        * ``'pyworld'`` - https://pypi.org/project/pyworld/

    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends on the machine.

    Returns
    -------
    result : malaya_speech.supervised.speechsplit_conversion.load function
    """

model = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-vggvox-v2',
                                                            f0_mode = 'pyworld')
model_v2 = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-v2-vggvox-v2',
                                                               f0_mode = 'pyworld')


def predict(
    self,
    original_audio,
    target_audio,
    modes = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'],
):
    """
    Change original voice audio to follow targeted voice.

    Parameters
    ----------
    original_audio: np.array or malaya_speech.model.frame.Frame
    target_audio: np.array or malaya_speech.model.frame.Frame
    modes: List[str], optional (default = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
        R denotes rhythm, F denotes pitch target, U denotes speaker target (vector).

        * ``'R'`` - maintain `original_audio` F and U on `target_audio` R.
        * ``'F'`` - maintain `original_audio` R and U on `target_audio` F.
        * ``'U'`` - maintain `original_audio` R and F on `target_audio` U.
        * ``'RF'`` - maintain `original_audio` U on `target_audio` R and F.
        * ``'RU'`` - maintain `original_audio` F on `target_audio` R and U.
        * ``'FU'`` - maintain `original_audio` R on `target_audio` F and U.
        * ``'RFU'`` - no conversion happens; this only runs the encoder-decoder on `target_audio`.

    Returns
    -------
    result: Dict[modes]
    """
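
The mode strings listed above follow one rule: every letter in the mode is taken from `target_audio`, and every letter left out is kept from `original_audio`. The sketch below only illustrates that naming rule. `describe_mode` and `COMPONENTS` are hypothetical helpers, not part of malaya_speech, and the sketch does not model the content component.

```python
# Hypothetical helper, not part of malaya_speech: it only decodes mode names.
COMPONENTS = {'R': 'rhythm', 'F': 'pitch', 'U': 'speaker'}

def describe_mode(mode):
    """Split a mode string into components taken from target vs. original audio."""
    unknown = set(mode) - set(COMPONENTS)
    if unknown:
        raise ValueError(f'unknown component(s): {sorted(unknown)}')
    # Letters present in the mode come from the target audio...
    from_target = [COMPONENTS[c] for c in 'RFU' if c in mode]
    # ...and the remaining letters are kept from the original audio.
    from_original = [COMPONENTS[c] for c in 'RFU' if c not in mode]
    return {'target_audio': from_target, 'original_audio': from_original}

for mode in ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU']:
    print(mode, describe_mode(mode))
```

For `'RFU'` the `original_audio` list is empty, which matches the note that no conversion happens in that mode.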

``original_audio`` and ``target_audio`` must be sampled at 22050 Hz.
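
When audio is loaded with `malaya_speech.load(..., sr = sr)` as in this tutorial, it is already at the requested rate. For arrays that come from elsewhere, a rough sketch of bringing a signal to 22050 Hz is shown below. `resample_linear` is a hypothetical helper, not a malaya_speech function, and it uses plain linear interpolation without anti-aliasing, so a dedicated resampler is likely a better choice for real data.

```python
import numpy as np

REQUIRED_SR = 22050  # sample rate required by the conversion models

def resample_linear(audio, sr, target_sr=REQUIRED_SR):
    """Resample a 1-D signal with linear interpolation (illustrative only)."""
    audio = np.asarray(audio, dtype=np.float32)
    if sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / sr))
    # Positions of the output samples, expressed on the input sample grid.
    positions = np.arange(n_out) * (sr / target_sr)
    return np.interp(positions, np.arange(len(audio)), audio).astype(np.float32)

# One second of a 440 Hz tone recorded at 16 kHz, brought to 22050 Hz.
tone_16k = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
tone_22k = resample_linear(tone_16k, sr=16000)
print(len(tone_16k), len(tone_22k))
```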

sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
import IPython.display as ipd

ipd.Audio(original_audio, rate = sr)
ipd.Audio(target_audio[:sr * 2], rate = sr)
r = model.predict(original_audio, target_audio)
CPU times: user 29.4 s, sys: 5.26 s, total: 34.6 s
Wall time: 13.3 s
dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
r_v2 = model_v2.predict(original_audio, target_audio)
CPU times: user 52.7 s, sys: 9.96 s, total: 1min 2s
Wall time: 23.6 s
dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])

Speech Split output#

  1. The model returns mel features of size 80.

  2. These mel features can only be synthesized with a universal vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

Load Universal MelGAN#

Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html

melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')

y_ = melgan.predict([r['R']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 17.8 s, sys: 3.61 s, total: 21.4 s
Wall time: 5.9 s

y_ = melgan.predict([r_v2['R']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.2 s, sys: 2.58 s, total: 18.8 s
Wall time: 3.28 s

y_ = melgan.predict([r['F']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.7 s, sys: 1.91 s, total: 15.7 s
Wall time: 2.81 s

y_ = melgan.predict([r_v2['F']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.9 s, sys: 2.14 s, total: 16 s
Wall time: 2.9 s

y_ = melgan.predict([r['U']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.5 s, sys: 2.05 s, total: 15.5 s
Wall time: 2.78 s

y_ = melgan.predict([r_v2['U']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.4 s, sys: 2.05 s, total: 15.5 s
Wall time: 2.82 s
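
The cells above repeat the same vocoder call once per mode. A minimal sketch of doing this in a loop follows. `synthesize_modes` is a hypothetical helper, and it assumes only what the calls above show: the vocoder's predict function takes a list of mel features and returns a list of waveforms.

```python
def synthesize_modes(outputs, vocoder_predict, modes=('R', 'F', 'U')):
    """Return {mode: waveform} for each requested mode found in `outputs`.

    `outputs` is the dict returned by the conversion model, and
    `vocoder_predict` is a callable such as `melgan.predict`.
    """
    waveforms = {}
    for mode in modes:
        if mode not in outputs:
            raise KeyError(f'mode {mode!r} missing from conversion output')
        # One-item batch in, first item of the returned batch out.
        waveforms[mode] = vocoder_predict([outputs[mode]])[0]
    return waveforms
```

With the objects from this tutorial, the call would be `synthesize_modes(r, melgan.predict)`, and each waveform can then be played with `ipd.Audio(waveform, rate = sr)`.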