Speech enhancement#
This tutorial is available as an IPython notebook at malaya-speech/example/speech-enhancement.
This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.
This is an application of the malaya-speech Pipeline; read more about it at malaya-speech/example/pipeline.
Dataset#
The models were trained on English, Manglish and Bahasa podcasts with augmented noise, gathered at https://github.com/huseinzol05/malaya-speech/tree/master/data/podcast
The purpose of this module is to enhance voice activity and to reduce reverberation, loudness and broken voices.
voice -> malaya-speech noise reduction -> malaya-speech speech enhancement.
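The chain above can be read as plain function composition. The sketch below is not the malaya-speech Pipeline API; `compose`, `reduce_noise` and `enhance_speech` are hypothetical names with placeholder arithmetic, used only to show the order in which the two stages are applied.

```python
from functools import reduce

def compose(*stages):
    """Return a function that applies each stage from left to right."""
    def run(signal):
        return reduce(lambda acc, stage: stage(acc), stages, signal)
    return run

# Hypothetical stand-ins; real stages would wrap the loaded models.
def reduce_noise(signal):
    return [s * 0.5 for s in signal]   # placeholder transform

def enhance_speech(signal):
    return [s + 1.0 for s in signal]   # placeholder transform

# Noise reduction runs first, speech enhancement second.
chain = compose(reduce_noise, enhance_speech)
result = chain([2.0, 4.0])
```

Swapping the two arguments of `compose` would change the result, which is why the order in the chain matters.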
[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
[2]:
sr = 22050
y, _ = malaya_speech.load('speech/khutbah/wadi-annuar.wav', sr = sr)
len(y), sr, len(y) / sr
[2]:
(220500, 22050, 10.0)
So the total length is 10 seconds.
[3]:
import IPython.display as ipd
ipd.Audio(y[:10 * sr], rate = sr)
[3]:
The speech has room echo and a slightly broken high pitch because it was recorded in a mosque.
[5]:
from glob import glob
wavs = glob('speech/enhance/*.wav')
wavs
[5]:
['speech/enhance/461-y_.wav',
'speech/enhance/125-y_.wav',
'speech/enhance/371-y_.wav',
'speech/enhance/328-y_.wav']
[6]:
wavs = [malaya_speech.load(f, sr = sr)[0] for f in wavs]
List available deep enhance#
[7]:
malaya_speech.speech_enhancement.available_deep_enhance()
INFO:root:Only calculate SDR, ISR, SAR on voice sample. Higher is better.
[7]:
Model | Size (MB) | Quantized Size (MB) | SDR | ISR | SAR
---|---|---|---|---|---
unet | 40.7 | 10.30 | 9.877178 | 15.916217 | 13.70913
resnet-unet | 36.4 | 9.29 | 9.436170 | 16.861030 | 12.32157
resnext-unet | 36.1 | 9.26 | 9.685578 | 16.421370 | 12.45115
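To make the table easier to compare, the short sketch below computes how much smaller each quantized checkpoint is and which model has the highest SDR. The numbers are copied by hand from the table above; nothing here calls malaya-speech.

```python
# Values copied from the table above: (size MB, quantized size MB, SDR).
models = {
    "unet": (40.7, 10.30, 9.877178),
    "resnet-unet": (36.4, 9.29, 9.436170),
    "resnext-unet": (36.1, 9.26, 9.685578),
}

# How many times smaller each quantized checkpoint is.
ratios = {name: size / qsize for name, (size, qsize, _) in models.items()}

# Model with the highest SDR (higher is better).
best_sdr = max(models, key=lambda name: models[name][2])

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x smaller when quantized")
print("highest SDR:", best_sdr)
```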
Load deep enhance#
def deep_enhance(model: str = 'unet', quantized: bool = False, **kwargs):
    """
    Load Speech Enhancement UNET Waveform sampling deep learning model.

    Parameters
    ----------
    model : str, optional (default='unet')
        Model architecture supported. Allowed values:

        * ``'unet'`` - pretrained UNET Speech Enhancement.
        * ``'resnet-unet'`` - pretrained resnet-UNET Speech Enhancement.
        * ``'resnext-unet'`` - pretrained resnext-UNET Speech Enhancement.
    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.UNET1D class
    """
[8]:
model = malaya_speech.speech_enhancement.deep_enhance(model = 'unet')
quantized_model = malaya_speech.speech_enhancement.deep_enhance(model = 'unet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
warnings.warn('An interactive session is already active. This can '
[9]:
resnet = malaya_speech.speech_enhancement.deep_enhance(model = 'resnet-unet')
quantized_resnet = malaya_speech.speech_enhancement.deep_enhance(model = 'resnet-unet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
Important factor for deep enhance#
The Deep Enhance models were trained at a 22,050 Hz sample rate, so make sure to load the audio at 22,050 Hz:
malaya_speech.load(audio_file, sr = 22050)
librosa.load(audio_file, sr = 22050)
You can feed audio of any length without capping it; the model pads the input itself. However, the longer the audio, the longer the computation takes, unless you have a GPU to speed it up.
The model operates at the waveform level, so no STFT or inverse STFT is involved.
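Since the sample rate requirement is easy to forget, a small guard can catch mismatches before inference. This is a minimal sketch: `enhance` and `EXPECTED_SR` are hypothetical names, not part of malaya-speech, and the only library behaviour assumed is that the loaded model exposes `predict`, as shown in the Predict section.

```python
EXPECTED_SR = 22050  # sample rate stated above for the deep enhance models

def enhance(model, waveform, sample_rate):
    """Call model.predict only if the audio is at the expected sample rate."""
    if sample_rate != EXPECTED_SR:
        raise ValueError(
            f"expected {EXPECTED_SR} Hz audio, got {sample_rate} Hz; "
            "reload or resample the audio first"
        )
    return model.predict(waveform)
```

Failing loudly here is a design choice; silently resampling inside the helper would hide the mismatch from the caller.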
Predict#
The Speech Enhancement model accepts only one audio input per feed-forward pass:
def predict(self, input):
    """
    Enhance inputs, will return waveform.

    Parameters
    ----------
    input: np.array
        np.array or malaya_speech.model.frame.Frame.

    Returns
    -------
    result: np.array
    """
[16]:
%%time
logits = model.predict(y)
ipd.Audio(logits, rate = 22050)
CPU times: user 2.1 s, sys: 489 ms, total: 2.59 s
Wall time: 508 ms
[16]:
[17]:
%%time
quantized_logits = quantized_model.predict(y)
ipd.Audio(quantized_logits, rate = 22050)
CPU times: user 2.28 s, sys: 481 ms, total: 2.76 s
Wall time: 515 ms
[17]:
[18]:
%%time
logits = resnet.predict(y)
ipd.Audio(logits, rate = 22050)
CPU times: user 2.62 s, sys: 770 ms, total: 3.39 s
Wall time: 907 ms
[18]:
[19]:
%%time
logits = quantized_resnet.predict(y)
ipd.Audio(logits, rate = 22050)
CPU times: user 2.59 s, sys: 728 ms, total: 3.32 s
Wall time: 874 ms
[19]:
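One way to compare the timings above is the real-time factor: wall time divided by audio duration. The sketch below uses the wall times reported in the cells above for the 10-second clip; `real_time_factor` is a hypothetical helper, and the figures will differ on other machines.

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds

# Wall times reported above for the 10-second clip.
timings = {
    "unet": 0.508,
    "unet (quantized)": 0.515,
    "resnet-unet": 0.907,
    "resnet-unet (quantized)": 0.874,
}

for name, wall in timings.items():
    print(f"{name}: RTF = {real_time_factor(wall, 10.0):.4f}")
```

On this machine the quantized UNET was slightly slower than the full model and the quantized resnet-UNET slightly faster, which matches the note that a quantized model is not necessarily faster.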
Try more examples#
[20]:
%%time
logits = [model.predict(w) for w in wavs]
CPU times: user 4.44 s, sys: 1.16 s, total: 5.6 s
Wall time: 1.01 s
[32]:
%%time
resnet_logits = [resnet.predict(w) for w in wavs]
CPU times: user 4.68 s, sys: 1.23 s, total: 5.91 s
Wall time: 1.12 s
[21]:
%%time
quantized_logits = [quantized_model.predict(w) for w in wavs]
CPU times: user 4.15 s, sys: 1.08 s, total: 5.23 s
Wall time: 952 ms
[23]:
%%time
quantized_resnet_logits = [quantized_resnet.predict(w) for w in wavs]
CPU times: user 4.66 s, sys: 1.18 s, total: 5.83 s
Wall time: 1.14 s
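The loops above call `predict` once per file because the model takes a single waveform per call. The same pattern could be applied to one long recording by splitting it into fixed-length pieces, which keeps each call short. This is only a sketch: `split_chunks` and `predict_in_chunks` are hypothetical helpers, and it is an untested assumption that enhancing chunks independently sounds as good as enhancing the whole recording, since chunk boundaries may be audible.

```python
def split_chunks(waveform, chunk_seconds, sample_rate=22050):
    """Split a waveform into consecutive chunks of at most chunk_seconds."""
    chunk_samples = int(chunk_seconds * sample_rate)
    if chunk_samples <= 0:
        raise ValueError("chunk_seconds is too small for this sample rate")
    return [
        waveform[start:start + chunk_samples]
        for start in range(0, len(waveform), chunk_samples)
    ]

def predict_in_chunks(model, waveform, chunk_seconds=10.0, sample_rate=22050):
    """Run model.predict on each chunk and return the list of enhanced chunks."""
    chunks = split_chunks(waveform, chunk_seconds, sample_rate)
    return [model.predict(chunk) for chunk in chunks]
```

Slicing works the same way on lists and NumPy arrays, so with array inputs the returned chunks can be joined again with `np.concatenate`.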
[24]:
ipd.Audio(wavs[0], rate = 22050)
[24]:
[27]:
ipd.Audio(logits[0], rate = 22050)
[27]:
[33]:
ipd.Audio(resnet_logits[0], rate = 22050)
[33]:
[29]:
ipd.Audio(wavs[1], rate = 22050)
[29]:
[30]:
ipd.Audio(logits[1], rate = 22050)
[30]:
[34]:
ipd.Audio(resnet_logits[1], rate = 22050)
[34]:
[35]:
ipd.Audio(wavs[2], rate = 22050)
[35]:
[36]:
ipd.Audio(logits[2], rate = 22050)
[36]:
[37]:
ipd.Audio(resnet_logits[2], rate = 22050)
[37]:
[38]:
ipd.Audio(wavs[3], rate = 22050)
[38]:
[39]:
ipd.Audio(logits[3], rate = 22050)
[39]:
[40]:
ipd.Audio(resnet_logits[3], rate = 22050)
[40]:
List available masking model#
The masking model simply masks the STFT of the input to reduce echo, reverberation and broken pitch. It cannot generate a new waveform; for example, if the input was filtered with a low-pass or high-pass filter, the model cannot restore what was removed. For that reason, we prefer malaya_speech.speech_enhancement.deep_enhance.
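The limitation described above follows from how a multiplicative mask works. The toy sketch below is not the library's implementation: `apply_mask` is a hypothetical helper operating on one hand-made frame of magnitudes, and it assumes mask values lie in the range 0 to 1. It shows that a bin that is already empty stays empty, whatever the mask says.

```python
def apply_mask(magnitudes, mask):
    """Scale each spectrogram magnitude by a mask value clipped to [0, 1]."""
    if len(magnitudes) != len(mask):
        raise ValueError("magnitudes and mask must have the same length")
    return [m * min(max(w, 0.0), 1.0) for m, w in zip(magnitudes, mask)]

# One toy frame: the last two bins were removed by an earlier low-pass filter.
frame = [0.8, 0.5, 0.3, 0.0, 0.0]
mask = [1.0, 0.9, 0.2, 1.0, 1.0]

masked = apply_mask(frame, mask)

# A mask can only keep or attenuate energy; it never adds any.
print(all(out <= inp for out, inp in zip(masked, frame)))
```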
[41]:
malaya_speech.speech_enhancement.available_deep_masking()
INFO:root:Only calculate SDR, ISR, SAR on voice sample. Higher is better.
[41]:
Model | Size (MB) | Quantized Size (MB) | SUM MAE | MAE_SPEAKER | MAE_NOISE | SDR | ISR | SAR
---|---|---|---|---|---|---|---|---
unet | 78.9 | 20.0 | 0.85896 | 0.468490 | 0.390460 | 12.128050 | 14.67067 | 15.019682
resnet-unet | 91.4 | 23.0 | 0.81540 | 0.447958 | 0.367441 | 12.349259 | 14.85418 | 15.217510
Load masking model#
def deep_masking(model: str = 'resnet-unet', quantized: bool = False, **kwargs):
    """
    Load Speech Enhancement STFT UNET masking deep learning model.

    Parameters
    ----------
    model : str, optional (default='resnet-unet')
        Model architecture supported. Allowed values:

        * ``'unet'`` - pretrained UNET.
        * ``'resnet-unet'`` - pretrained resnet-UNET.
    quantized : bool, optional (default=False)
        If True, will load the 8-bit quantized model.
        A quantized model is not necessarily faster; it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.UNETSTFT class
    """
[ ]:
model = malaya_speech.speech_enhancement.deep_masking(model = 'resnet-unet')
Important factor for deep masking#
The speech enhancement masking models were trained at a 44,100 Hz sample rate, so make sure to load the audio at 44,100 Hz:
malaya_speech.load(audio_file, sr = 44100)
librosa.load(audio_file, sr = 44100)
You can feed audio of any length without capping it; the model pads the input itself. However, the longer the audio, the longer the computation takes, unless you have a GPU to speed it up.
STFT and inverse STFT can run on the GPU, so the model is very fast on a GPU.
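Moving audio between the 22,050 Hz and 44,100 Hz models changes the number of samples but not the duration. The arithmetic can be sketched as follows; `resampled_length` is a hypothetical helper, and a real resampler may differ by a sample depending on how it rounds.

```python
def resampled_length(num_samples: int, source_sr: int, target_sr: int) -> int:
    """Approximate number of samples after resampling to target_sr."""
    if source_sr <= 0 or target_sr <= 0:
        raise ValueError("sample rates must be positive")
    return round(num_samples * target_sr / source_sr)

# The 10-second clip has 220500 samples at 22050 Hz.
print(resampled_length(220500, 22050, 44100))  # 441000
```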
[43]:
sr = 44100
y, _ = malaya_speech.load('speech/khutbah/wadi-annuar.wav', sr = sr)
len(y), sr, len(y) / sr
[43]:
(441000, 44100, 10.0)
[44]:
%%time
output = model(y)
CPU times: user 4.82 s, sys: 919 ms, total: 5.73 s
Wall time: 1.8 s
[45]:
output
[45]:
{'voice': array([1.9417714e-08, 2.0993056e-08, 2.4434440e-08, ..., 2.1756661e-01,
1.9999057e-01, 1.4723262e-01], dtype=float32),
'noise': array([-1.9704540e-08, -2.3319327e-08, -2.4154849e-08, ...,
1.5757367e-01, 1.5660551e-01, 9.5091663e-02], dtype=float32)}
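The output is a dictionary with one waveform per component, so a quick numeric check is to compare their levels. The sketch below is an assumption-laden illustration: `rms` and `summarize` are hypothetical helpers, and `toy_output` is a hand-made stand-in with the same keys, not real model output.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    n = len(samples)
    if n == 0:
        return 0.0
    return math.sqrt(sum(float(s) * float(s) for s in samples) / n)

def summarize(output):
    """Map each key of a {'voice': ..., 'noise': ...} style dict to its RMS level."""
    return {name: rms(wave) for name, wave in output.items()}

# Toy stand-in for the dictionary returned by the masking model.
toy_output = {
    "voice": [0.3, -0.3, 0.3, -0.3],
    "noise": [0.0, 0.1, 0.0, -0.1],
}
levels = summarize(toy_output)
for name, level in levels.items():
    print(f"{name}: RMS = {level:.4f}")
```

The helpers iterate sample by sample, so they also accept NumPy arrays, although a vectorized NumPy expression would be much faster on full-length audio.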
[46]:
ipd.Audio(output['voice'], rate = sr)
[46]:
[47]:
%%time
output = model(malaya_speech.resample(wavs[0], 22050, sr))
CPU times: user 4.22 s, sys: 647 ms, total: 4.86 s
Wall time: 795 ms
[48]:
ipd.Audio(output['voice'], rate = sr)
[48]: