Speech Split PySPTK
Contents
Speech Split PySPTK#
Detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch using PySPTK.
This tutorial is available as an IPython notebook at malaya-speech/example/speechsplit-conversion-pysptk.
This module is language independent, so it is safe to use on different languages.
Explanation#
We created a super fast Speech Split Conversion model called FastSpeechSplit, a faster and more accurate Speech Split Conversion using Transformer. No paper was produced.
Steps to reproduce are available at https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/speechsplit-conversion
F0 Conversion#
Make sure pysptk is already installed,
pip install pysptk
[1]:
import malaya_speech
import numpy as np
List available Speech Split models#
[2]:
malaya_speech.speechsplit_conversion.available_deep_conversion(f0_mode = 'pysptk')
[2]:
Model | Size (MB) | Quantized Size (MB)
---|---|---
fastspeechsplit-vggvox-v2 | 232.0 | 59.2
fastspeechsplit-v2-vggvox-v2 | 411.0 | 105.0
Load Deep Conversion#
def deep_conversion(
model: str = 'fastspeechsplit-v2-vggvox-v2',
f0_mode = 'pysptk',
quantized: bool = False,
**kwargs,
):
"""
Load Voice Conversion model.
Parameters
----------
model : str, optional (default='fastspeechsplit-v2-vggvox-v2')
Model architecture supported. Allowed values:
* ``'fastspeechsplit-vggvox-v2'`` - FastSpeechSplit with VGGVox-v2 Speaker Vector.
* ``'fastspeechsplit-v2-vggvox-v2'`` - FastSpeechSplit V2 with VGGVox-v2 Speaker Vector.
f0_mode : str, optional (default='pysptk')
F0 conversion supported. Allowed values:
* ``'pysptk'`` - https://github.com/r9y9/pysptk, sensitive towards gender.
* ``'pyworld'`` - https://pypi.org/project/pyworld/
quantized : bool, optional (default=False)
if True, will load 8-bit quantized model.
Quantized model not necessary faster, totally depends on the machine.
Returns
-------
result : malaya_speech.supervised.speechsplit_conversion.load function
"""
[23]:
model = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-vggvox-v2')
model_v2 = malaya_speech.speechsplit_conversion.deep_conversion(model = 'fastspeechsplit-v2-vggvox-v2')
Predict#
def predict(
self,
original_audio,
target_audio,
modes = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'],
):
"""
Change original voice audio to follow targeted voice.
Parameters
----------
original_audio: np.array or malaya_speech.model.frame.Frame
target_audio: np.array or malaya_speech.model.frame.Frame
modes: List[str], optional (default = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
R denotes rhythm, F denotes pitch target, U denotes speaker target (vector).
* ``'R'`` - maintain `original_audio` F and U on `target_audio` R.
* ``'F'`` - maintain `original_audio` R and U on `target_audio` F.
* ``'U'`` - maintain `original_audio` R and F on `target_audio` U.
* ``'RF'`` - maintain `original_audio` U on `target_audio` R and F.
* ``'RU'`` - maintain `original_audio` F on `target_audio` R and U.
* ``'FU'`` - maintain `original_audio` R on `target_audio` F and U.
* ``'RFU'`` - no conversion happens, just run encoder-decoder on `target_audio`.
Returns
-------
result: Dict[modes]
"""
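To make the mode semantics concrete: each letter in a mode names a component taken from ``target_audio``, while the remaining components are kept from ``original_audio``. A small hypothetical helper (not part of malaya-speech) that spells this out:

```python
# Hypothetical helper illustrating the mode semantics above:
# R = rhythm, F = pitch, U = speaker vector (timbre).
# A mode string lists the components that follow `target_audio`;
# all other components stay with `original_audio`.

def component_sources(mode):
    """Map each component to the audio it is taken from under `mode`."""
    return {c: ('target' if c in mode else 'original') for c in 'RFU'}

print(component_sources('RF'))
# rhythm and pitch follow the target, speaker identity stays original
```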
``original_audio`` and ``target_audio`` must be at a 22050 Hz sample rate.
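If your audio is at a different sample rate, resample it before calling ``predict``. A minimal pure-NumPy sketch using linear interpolation (an illustration only; a proper band-limited resampler such as ``scipy.signal.resample_poly``, or passing ``sr = 22050`` to ``malaya_speech.load`` as done below, is preferable for quality):

```python
import numpy as np

def resample_linear(audio, sr_in, sr_out):
    """Resample `audio` from `sr_in` to `sr_out` by linear interpolation."""
    n_out = int(round(len(audio) * sr_out / sr_in))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio)

audio_16k = np.random.randn(16000 * 2)              # 2 seconds at 16 kHz
audio_22k = resample_linear(audio_16k, 16000, 22050)  # 2 seconds at 22050 Hz
```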
[4]:
sr = 22050
original_audio = malaya_speech.load('speech/example-speaker/haqkiem.wav', sr = sr)[0]
target_audio = malaya_speech.load('speech/example-speaker/female.wav', sr = sr)[0]
[5]:
import IPython.display as ipd
ipd.Audio(original_audio, rate = sr)
[5]:
[6]:
ipd.Audio(target_audio[:sr * 2], rate = sr)
[6]:
[7]:
%%time
r = model.predict(original_audio, target_audio)
r.keys()
CPU times: user 32 s, sys: 6.65 s, total: 38.7 s
Wall time: 17.7 s
[7]:
dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
[8]:
%%time
r_v2 = model_v2.predict(original_audio, target_audio)
r_v2.keys()
CPU times: user 58.7 s, sys: 12.5 s, total: 1min 11s
Wall time: 30.5 s
[8]:
dict_keys(['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])
Speech Split output#
The model returns mel features of size 80.
These mel features can only be synthesized using a Universal Vocoder, e.g. Universal MelGAN, https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
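As a sanity check on shapes (an assumption based on the description above, not verified against the library internals): each entry in the result dict is expected to be an ``(n_frames, 80)`` mel spectrogram, and the vocoder's ``predict`` takes a list of such arrays:

```python
import numpy as np

# A stand-in mel spectrogram with the documented feature size of 80;
# 200 frames is an arbitrary choice for illustration.
fake_mel = np.zeros((200, 80), dtype=np.float32)

n_frames, n_mels = fake_mel.shape  # n_mels should be 80
# y_ = melgan.predict([fake_mel])  # vocoder expects a list of mel arrays
```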
Load Universal MelGAN#
Read more about Universal MelGAN at https://malaya-speech.readthedocs.io/en/latest/load-universal-melgan.html
[10]:
melgan = malaya_speech.vocoder.melgan(model = 'universal-1024')
[11]:
%%time
y_ = melgan.predict([r['R']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 17.3 s, sys: 3.62 s, total: 20.9 s
Wall time: 6.16 s
[11]:
[12]:
%%time
y_ = melgan.predict([r_v2['R']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.1 s, sys: 2.42 s, total: 18.5 s
Wall time: 3.41 s
[12]:
[13]:
%%time
y_ = melgan.predict([r['F']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.6 s, sys: 2.18 s, total: 15.8 s
Wall time: 2.82 s
[13]:
[14]:
%%time
y_ = melgan.predict([r_v2['F']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 13.7 s, sys: 2.25 s, total: 15.9 s
Wall time: 3.09 s
[14]:
[15]:
%%time
y_ = melgan.predict([r['U']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.8 s, sys: 2.35 s, total: 17.2 s
Wall time: 3.46 s
[15]:
[16]:
%%time
y_ = melgan.predict([r_v2['U']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14 s, sys: 2.37 s, total: 16.4 s
Wall time: 2.97 s
[16]:
[17]:
%%time
y_ = melgan.predict([r['RF']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.2 s, sys: 2.48 s, total: 18.7 s
Wall time: 3.46 s
[17]:
[18]:
%%time
y_ = melgan.predict([r_v2['RF']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.5 s, sys: 2.62 s, total: 19.1 s
Wall time: 3.64 s
[18]:
[19]:
%%time
y_ = melgan.predict([r['RU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.9 s, sys: 2.59 s, total: 19.5 s
Wall time: 3.85 s
[19]:
[20]:
%%time
y_ = melgan.predict([r_v2['RU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 16.8 s, sys: 2.65 s, total: 19.4 s
Wall time: 3.5 s
[20]:
[21]:
%%time
y_ = melgan.predict([r['FU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.1 s, sys: 2.04 s, total: 16.1 s
Wall time: 2.99 s
[21]:
[22]:
%%time
y_ = melgan.predict([r_v2['FU']])
ipd.Audio(y_[0], rate = sr)
CPU times: user 14.2 s, sys: 2.25 s, total: 16.5 s
Wall time: 2.93 s
[22]: