Speech enhancement#

This tutorial is available as an IPython notebook at malaya-speech/example/speech-enhancement.

This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

Dataset#

Trained on English, Manglish and Bahasa podcasts with augmented noises, gathered at https://github.com/huseinzol05/malaya-speech/tree/master/data/podcast

The purpose of this module is to enhance voice activity and to reduce reverberation, excessive loudness and broken voices.

The recommended flow is: voice -> malaya-speech noise reduction -> malaya-speech speech enhancement.
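Below is a rough sketch of that chain. It assumes a noise reduction model loaded via malaya_speech.noise_reduction (see malaya-speech/example/noise-reduction for the exact loader and its output format); the enhancement model is the same one loaded later in this tutorial.

import malaya_speech

y, _ = malaya_speech.load('speech/khutbah/wadi-annuar.wav', sr = 22050)

# assumption: noise_reduction exposes a deep model loader; check the
# noise-reduction example for the real name, defaults and output format
nr_model = malaya_speech.noise_reduction.deep_model()
enhance_model = malaya_speech.speech_enhancement.deep_enhance(model = 'unet')

cleaned = nr_model.predict(y)              # step 1: suppress background noise
# if the noise reduction output is a dict (e.g. {'voice': ..., 'noise': ...}),
# take the voice component before enhancing
if isinstance(cleaned, dict):
    cleaned = cleaned['voice']
enhanced = enhance_model.predict(cleaned)  # step 2: enhance the cleaned voice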

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
[2]:
sr = 22050
y, _ = malaya_speech.load('speech/khutbah/wadi-annuar.wav', sr = sr)
len(y), sr, len(y) / sr
[2]:
(220500, 22050, 10.0)

So the total length is 10 seconds.

[3]:
import IPython.display as ipd
ipd.Audio(y[:10 * sr], rate = sr)
[3]:

The speech has room echo and slightly broken high pitches, because it was recorded in a mosque.

[5]:
from glob import glob

wavs = glob('speech/enhance/*.wav')
wavs
[5]:
['speech/enhance/461-y_.wav',
 'speech/enhance/125-y_.wav',
 'speech/enhance/371-y_.wav',
 'speech/enhance/328-y_.wav']
[6]:
wavs = [malaya_speech.load(f, sr = sr)[0] for f in wavs]

List available deep enhance#

[7]:
malaya_speech.speech_enhancement.available_deep_enhance()
INFO:root:Only calculate SDR, ISR, SAR on voice sample. Higher is better.
[7]:
              Size (MB)  Quantized Size (MB)  SDR       ISR        SAR
unet          40.7       10.30                9.877178  15.916217  13.70913
resnet-unet   36.4       9.29                 9.436170  16.861030  12.32157
resnext-unet  36.1       9.26                 9.685578  16.421370  12.45115

Load deep enhance#

def deep_enhance(model: str = 'unet', quantized: bool = False, **kwargs):
    """
    Load Speech Enhancement UNET Waveform sampling deep learning model.

    Parameters
    ----------
    model : str, optional (default='unet')
        Model architecture supported. Allowed values:

        * ``'unet'`` - pretrained UNET Speech Enhancement.
        * ``'resnet-unet'`` - pretrained resnet-UNET Speech Enhancement.
        * ``'resnext-unet'`` - pretrained resnext-UNET Speech Enhancement.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.UNET1D class
    """
[8]:
model = malaya_speech.speech_enhancement.deep_enhance(model = 'unet')
quantized_model = malaya_speech.speech_enhancement.deep_enhance(model = 'unet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
  warnings.warn('An interactive session is already active. This can '
[9]:
resnet = malaya_speech.speech_enhancement.deep_enhance(model = 'resnet-unet')
quantized_resnet = malaya_speech.speech_enhancement.deep_enhance(model = 'resnet-unet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Important factor for deep enhance#

  1. The Deep Enhance models were trained on a 22k sample rate, so make sure to load the audio at 22050 Hz (see the sketch after this list).

malaya_speech.load(audio_file, sr = 22050)
librosa.load(audio_file, sr = 22050)

  2. You can feed audio of any length, there is no need to cap it; the model does the padding by itself. But the longer the audio, the longer the computation takes, unless you have a GPU to speed it up.

  3. The model works at the waveform level, so no STFT or inverse STFT is involved.
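A minimal sketch of point 1, using malaya_speech.load and malaya_speech.resample (both used elsewhere in this tutorial) to bring a waveform to 22050 Hz before prediction; 'some_audio.wav' is just a placeholder path.

sr_enhance = 22050

# load directly at the expected sample rate
audio, _ = malaya_speech.load('some_audio.wav', sr = sr_enhance)

# or resample a waveform that was loaded at another rate, e.g. 16000 Hz
audio_16k, _ = malaya_speech.load('some_audio.wav', sr = 16000)
audio_22k = malaya_speech.resample(audio_16k, 16000, sr_enhance)

# dynamic length is fine, the model pads internally
enhanced = model.predict(audio_22k)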

Predict#

The Speech Enhancement model only accepts one audio input per feed-forward,

def predict(self, input):
    """
    Enhance inputs, will return waveform.

    Parameters
    ----------
    input: np.array
        np.array or malaya_speech.model.frame.Frame.

    Returns
    -------
    result: np.array
    """
[16]:
%%time

logits = model.predict(y)
ipd.Audio(logits, rate = 22050)
CPU times: user 2.1 s, sys: 489 ms, total: 2.59 s
Wall time: 508 ms
[16]:
[17]:
%%time

quantized_logits = quantized_model.predict(y)
ipd.Audio(quantized_logits, rate = 22050)
CPU times: user 2.28 s, sys: 481 ms, total: 2.76 s
Wall time: 515 ms
[17]:
[18]:
%%time

logits = resnet.predict(y)
ipd.Audio(logits, rate = 22050)
CPU times: user 2.62 s, sys: 770 ms, total: 3.39 s
Wall time: 907 ms
[18]:
[19]:
%%time

logits = quantized_resnet.predict(y)
ipd.Audio(logits, rate = 22050)
CPU times: user 2.59 s, sys: 728 ms, total: 3.32 s
Wall time: 874 ms
[19]:

Try more examples#

[20]:
%%time

logits = [model.predict(w) for w in wavs]
CPU times: user 4.44 s, sys: 1.16 s, total: 5.6 s
Wall time: 1.01 s
[32]:
%%time

resnet_logits = [resnet.predict(w) for w in wavs]
CPU times: user 4.68 s, sys: 1.23 s, total: 5.91 s
Wall time: 1.12 s
[21]:
%%time

quantized_logits = [quantized_model.predict(w) for w in wavs]
CPU times: user 4.15 s, sys: 1.08 s, total: 5.23 s
Wall time: 952 ms
[23]:
%%time

quantized_resnet_logits = [quantized_resnet.predict(w) for w in wavs]
CPU times: user 4.66 s, sys: 1.18 s, total: 5.83 s
Wall time: 1.14 s
[24]:
ipd.Audio(wavs[0], rate = 22050)
[24]:
[27]:
ipd.Audio(logits[0], rate = 22050)
[27]:
[33]:
ipd.Audio(resnet_logits[0], rate = 22050)
[33]:
[29]:
ipd.Audio(wavs[1], rate = 22050)
[29]:
[30]:
ipd.Audio(logits[1], rate = 22050)
[30]:
[34]:
ipd.Audio(resnet_logits[1], rate = 22050)
[34]:
[35]:
ipd.Audio(wavs[2], rate = 22050)
[35]:
[36]:
ipd.Audio(logits[2], rate = 22050)
[36]:
[37]:
ipd.Audio(resnet_logits[2], rate = 22050)
[37]:
[38]:
ipd.Audio(wavs[3], rate = 22050)
[38]:
[39]:
ipd.Audio(logits[3], rate = 22050)
[39]:
[40]:
ipd.Audio(resnet_logits[3], rate = 22050)
[40]:

List available masking model#

The masking model simply masks the STFT of the input to reduce echo, reverberation and broken pitch. It cannot generate new waveform content: for example, if the input waveform was low- or high-pass filtered, this model cannot restore the missing frequencies. So we prefer malaya_speech.speech_enhancement.deep_enhance. A conceptual sketch of STFT masking is shown below.
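To make the idea concrete, here is a conceptual illustration of STFT masking using librosa and numpy; it is not the library's implementation, just a sketch of why a mask cannot create frequencies that are missing from the input.

import librosa
import numpy as np

# conceptual sketch only, not how malaya-speech implements deep_masking
stft = librosa.stft(y)              # complex spectrogram of the input
mask = np.ones_like(np.abs(stft))   # a real model predicts a mask in [0, 1]
voice_stft = stft * mask            # every output bin is input bin * mask
voice = librosa.istft(voice_stft)   # back to a waveform

# since output bins are scaled input bins, frequencies removed from the
# input (e.g. by low / high pass filtering) can never be re-created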

[41]:
malaya_speech.speech_enhancement.available_deep_masking()
INFO:root:Only calculate SDR, ISR, SAR on voice sample. Higher is better.
[41]:
             Size (MB)  Quantized Size (MB)  SUM MAE  MAE_SPEAKER  MAE_NOISE  SDR        ISR       SAR
unet         78.9       20.0                 0.85896  0.468490     0.390460   12.128050  14.67067  15.019682
resnet-unet  91.4       23.0                 0.81540  0.447958     0.367441   12.349259  14.85418  15.217510

Load masking model#

def deep_masking(model: str = 'resnet-unet', quantized: bool = False, **kwargs):
    """
    Load Speech Enhancement STFT UNET masking deep learning model.

    Parameters
    ----------
    model : str, optional (default='resnet-unet')
        Model architecture supported. Allowed values:

        * ``'unet'`` - pretrained UNET.
        * ``'resnet-unet'`` - pretrained resnet-UNET.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.UNETSTFT class
    """
[ ]:
model = malaya_speech.speech_enhancement.deep_masking(model = 'resnet-unet')

Important factor for deep masking#

  1. The speech enhancement masking models were trained on a 44k sample rate, so make sure to load the audio at 44100 Hz.

malaya_speech.load(audio_file, sr = 44100)
librosa.load(audio_file, sr = 44100)

  2. You can feed audio of any length, there is no need to cap it; the model does the padding by itself. But the longer the audio, the longer the computation takes, unless you have a GPU to speed it up.

  3. STFT and inverse STFT can be done on the GPU, so the model is really fast on a GPU.

[43]:
sr = 44100
y, _ = malaya_speech.load('speech/khutbah/wadi-annuar.wav', sr = sr)
len(y), sr, len(y) / sr
[43]:
(441000, 44100, 10.0)
[44]:
%%time

output = model(y)
CPU times: user 4.82 s, sys: 919 ms, total: 5.73 s
Wall time: 1.8 s
[45]:
output
[45]:
{'voice': array([1.9417714e-08, 2.0993056e-08, 2.4434440e-08, ..., 2.1756661e-01,
        1.9999057e-01, 1.4723262e-01], dtype=float32),
 'noise': array([-1.9704540e-08, -2.3319327e-08, -2.4154849e-08, ...,
         1.5757367e-01,  1.5660551e-01,  9.5091663e-02], dtype=float32)}
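Since the model returns both voice and noise estimates, a quick sanity check (an assumption on our part, not something the library guarantees) is to see how close their sum comes back to the original waveform.

import numpy as np

# hedged sanity check: compare voice + noise against the input waveform
recon = output['voice'] + output['noise']
n = min(len(y), len(recon))
residual = np.mean(np.abs(y[:n] - recon[:n]))
print('mean absolute residual:', residual)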
[46]:
ipd.Audio(output['voice'], rate = sr)
[46]:
[47]:
%%time

output = model(malaya_speech.resample(wavs[0], 22050, sr))
CPU times: user 4.22 s, sys: 647 ms, total: 4.86 s
Wall time: 795 ms
[48]:
ipd.Audio(output['voice'], rate = sr)
[48]:
[ ]: