Speech enhancement#

This tutorial is available as an IPython notebook at malaya-speech/example/speech-enhancement.

This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at malaya-speech/example/pipeline.

Dataset#

Trained on English, Manglish and Bahasa podcasts with augmented noises, gathered at https://github.com/huseinzol05/malaya-speech/tree/master/data/podcast

The purpose of this module is to enhance voice activity and to reduce reverberation, excessive loudness and broken voices.

The recommended flow is: voice -> malaya-speech noise reduction -> malaya-speech speech enhancement.
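Below is a rough sketch of that chain. It assumes a noise reduction model loaded via malaya_speech.noise_reduction (see malaya-speech/example/noise-reduction for the exact loader and its output format); the enhancement model is the same one loaded later in this tutorial.

import malaya_speech

y, _ = malaya_speech.load('speech/khutbah/wadi-annuar.wav', sr = 22050)

# assumption: noise_reduction exposes a deep model loader; check the
# noise-reduction example for the real name, defaults and output format
nr_model = malaya_speech.noise_reduction.deep_model()
enhance_model = malaya_speech.speech_enhancement.deep_enhance(model = 'unet')

cleaned = nr_model.predict(y)              # step 1: suppress background noise
# if the noise reduction output is a dict (e.g. {'voice': ..., 'noise': ...}),
# take the voice component before enhancing
if isinstance(cleaned, dict):
    cleaned = cleaned['voice']
enhanced = enhance_model.predict(cleaned)  # step 2: enhance the cleaned voice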

[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
[2]:
sr = 22050
y, _ = malaya_speech.load('speech/khutbah/wadi-annuar.wav', sr = sr)
len(y), sr, len(y) / sr
[2]:
(220500, 22050, 10.0)

So the total length is 10 seconds.

[3]:
import IPython.display as ipd
ipd.Audio(y[:10 * sr], rate = sr)
[3]:

The speech has room echo and slightly broken high pitches, because it was recorded in a mosque.

[5]:
from glob import glob

wavs = glob('speech/enhance/*.wav')
wavs
[5]:
['speech/enhance/461-y_.wav',
 'speech/enhance/125-y_.wav',
 'speech/enhance/371-y_.wav',
 'speech/enhance/328-y_.wav']
[6]:
wavs = [malaya_speech.load(f, sr = sr)[0] for f in wavs]

List available deep enhance#

[7]:
malaya_speech.speech_enhancement.available_deep_enhance()
INFO:root:Only calculate SDR, ISR, SAR on voice sample. Higher is better.
[7]:
              Size (MB)  Quantized Size (MB)  SDR       ISR        SAR
unet          40.7       10.30                9.877178  15.916217  13.70913
resnet-unet   36.4       9.29                 9.436170  16.861030  12.32157
resnext-unet  36.1       9.26                 9.685578  16.421370  12.45115

Load deep enhance#

def deep_enhance(model: str = 'unet', quantized: bool = False, **kwargs):
    """
    Load Speech Enhancement UNET Waveform sampling deep learning model.

    Parameters
    ----------
    model : str, optional (default='unet')
        Model architecture supported. Allowed values:

        * ``'unet'`` - pretrained UNET Speech Enhancement.
        * ``'resnet-unet'`` - pretrained resnet-UNET Speech Enhancement.
        * ``'resnext-unet'`` - pretrained resnext-UNET Speech Enhancement.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.UNET1D class
    """
[8]:
model = malaya_speech.speech_enhancement.deep_enhance(model = 'unet')
quantized_model = malaya_speech.speech_enhancement.deep_enhance(model = 'unet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
/Users/huseinzolkepli/Documents/tf-1.15/env/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
  warnings.warn('An interactive session is already active. This can '
[9]:
resnet = malaya_speech.speech_enhancement.deep_enhance(model = 'resnet-unet')
quantized_resnet = malaya_speech.speech_enhancement.deep_enhance(model = 'resnet-unet', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.

Important factor for deep enhance#

  1. The Deep Enhance models were trained on a 22k sample rate, so make sure to load the audio at 22050 Hz (see the sketch after this list).

malaya_speech.load(audio_file, sr = 22050)
librosa.load(audio_file, sr = 22050)

  2. You can feed audio of any length, there is no need to cap it; the model does the padding by itself. But the longer the audio, the longer the computation takes, unless you have a GPU to speed it up.

  3. The model works at the waveform level, so no STFT or inverse STFT is involved.
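A minimal sketch of point 1, using malaya_speech.load and malaya_speech.resample (both used elsewhere in this tutorial) to bring a waveform to 22050 Hz before prediction; 'some_audio.wav' is just a placeholder path.

sr_enhance = 22050

# load directly at the expected sample rate
audio, _ = malaya_speech.load('some_audio.wav', sr = sr_enhance)

# or resample a waveform that was loaded at another rate, e.g. 16000 Hz
audio_16k, _ = malaya_speech.load('some_audio.wav', sr = 16000)
audio_22k = malaya_speech.resample(audio_16k, 16000, sr_enhance)

# dynamic length is fine, the model pads internally
enhanced = model.predict(audio_22k)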

Predict#

The Speech Enhancement model only accepts one audio input per feed-forward,

def predict(self, input):
    """
    Enhance inputs, will return waveform.

    Parameters
    ----------
    input: np.array
        np.array or malaya_speech.model.frame.Frame.

    Returns
    -------
    result: np.array
    """
[16]:
%%time

logits = model.predict(y)
ipd.Audio(logits, rate = 22050)
CPU times: user 2.1 s, sys: 489 ms, total: 2.59 s
Wall time: 508 ms
[16]:
[17]:
%%time

quantized_logits = quantized_model.predict(y)
ipd.Audio(quantized_logits, rate = 22050)
CPU times: user 2.28 s, sys: 481 ms, total: 2.76 s
Wall time: 515 ms
[17]:
[18]:
%%time

logits = resnet.predict(y)
ipd.Audio(logits, rate = 22050)
CPU times: user 2.62 s, sys: 770 ms, total: 3.39 s
Wall time: 907 ms
[18]:
[19]:
%%time

logits = quantized_resnet.predict(y)
ipd.Audio(logits, rate = 22050)
CPU times: user 2.59 s, sys: 728 ms, total: 3.32 s
Wall time: 874 ms
[19]:

Try more examples#

[20]:
%%time

logits = [model.predict(w) for w in wavs]
CPU times: user 4.44 s, sys: 1.16 s, total: 5.6 s
Wall time: 1.01 s
[32]:
%%time

resnet_logits = [resnet.predict(w) for w in wavs]
CPU times: user 4.68 s, sys: 1.23 s, total: 5.91 s
Wall time: 1.12 s
[21]:
%%time

quantized_logits = [quantized_model.predict(w) for w in wavs]
CPU times: user 4.15 s, sys: 1.08 s, total: 5.23 s
Wall time: 952 ms
[23]:
%%time

quantized_resnet_logits = [quantized_resnet.predict(w) for w in wavs]
CPU times: user 4.66 s, sys: 1.18 s, total: 5.83 s
Wall time: 1.14 s
[24]:
ipd.Audio(wavs[0], rate = 22050)
[24]:
[27]:
ipd.Audio(logits[0], rate = 22050)
[27]:
[33]:
ipd.Audio(resnet_logits[0], rate = 22050)
[33]:
[29]:
ipd.Audio(wavs[1], rate = 22050)
[29]:
[30]:
ipd.Audio(logits[1], rate = 22050)
[30]:
[34]:
ipd.Audio(resnet_logits[1], rate = 22050)
[34]:
[35]:
ipd.Audio(wavs[2], rate = 22050)
[35]:
[36]:
ipd.Audio(logits[2], rate = 22050)
[36]:
[37]:
ipd.Audio(resnet_logits[2], rate = 22050)
[37]:
[38]:
ipd.Audio(wavs[3], rate = 22050)
[38]:
[39]:
ipd.Audio(logits[3], rate = 22050)
[39]:
[40]:
ipd.Audio(resnet_logits[3], rate = 22050)
[40]:

List available masking model#

The masking model simply masks the STFT of the input to reduce echo, reverberation and broken pitch. It cannot generate new waveform content: for example, if the input waveform was low- or high-pass filtered, this model cannot restore the missing frequencies. So we prefer malaya_speech.speech_enhancement.deep_enhance. A conceptual sketch of STFT masking is shown below.
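To make the idea concrete, here is a conceptual illustration of STFT masking using librosa and numpy; it is not the library's implementation, just a sketch of why a mask cannot create frequencies that are missing from the input.

import librosa
import numpy as np

# conceptual sketch only, not how malaya-speech implements deep_masking
stft = librosa.stft(y)              # complex spectrogram of the input
mask = np.ones_like(np.abs(stft))   # a real model predicts a mask in [0, 1]
voice_stft = stft * mask            # every output bin is input bin * mask
voice = librosa.istft(voice_stft)   # back to a waveform

# since output bins are scaled input bins, frequencies removed from the
# input (e.g. by low / high pass filtering) can never be re-created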

[41]:
malaya_speech.speech_enhancement.available_deep_masking()
INFO:root:Only calculate SDR, ISR, SAR on voice sample. Higher is better.
[41]:
             Size (MB)  Quantized Size (MB)  SUM MAE  MAE_SPEAKER  MAE_NOISE  SDR        ISR       SAR
unet         78.9       20.0                 0.85896  0.468490     0.390460   12.128050  14.67067  15.019682
resnet-unet  91.4       23.0                 0.81540  0.447958     0.367441   12.349259  14.85418  15.217510

Load masking model#

def deep_masking(model: str = 'resnet-unet', quantized: bool = False, **kwargs):
    """
    Load Speech Enhancement STFT UNET masking deep learning model.

    Parameters
    ----------
    model : str, optional (default='resnet-unet')
        Model architecture supported. Allowed values:

        * ``'unet'`` - pretrained UNET.
        * ``'resnet-unet'`` - pretrained resnet-UNET.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model not necessary faster, totally depends on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.UNETSTFT class
    """
[ ]:
model = malaya_speech.speech_enhancement.deep_masking(model = 'resnet-unet')

Important factor for deep masking#

  1. The speech enhancement masking models were trained on a 44k sample rate, so make sure to load the audio at 44100 Hz.

malaya_speech.load(audio_file, sr = 44100)
librosa.load(audio_file, sr = 44100)

  2. You can feed audio of any length, there is no need to cap it; the model does the padding by itself. But the longer the audio, the longer the computation takes, unless you have a GPU to speed it up.

  3. STFT and inverse STFT can be done on the GPU, so the model is really fast on a GPU.

[43]:
sr = 44100
y, _ = malaya_speech.load('speech/khutbah/wadi-annuar.wav', sr = sr)
len(y), sr, len(y) / sr
[43]:
(441000, 44100, 10.0)
[44]:
%%time

output = model(y)
CPU times: user 4.82 s, sys: 919 ms, total: 5.73 s
Wall time: 1.8 s
[45]:
output
[45]:
{'voice': array([1.9417714e-08, 2.0993056e-08, 2.4434440e-08, ..., 2.1756661e-01,
        1.9999057e-01, 1.4723262e-01], dtype=float32),
 'noise': array([-1.9704540e-08, -2.3319327e-08, -2.4154849e-08, ...,
         1.5757367e-01,  1.5660551e-01,  9.5091663e-02], dtype=float32)}
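Since the model returns both voice and noise estimates, a quick sanity check (an assumption on our part, not something the library guarantees) is to see how close their sum comes back to the original waveform.

import numpy as np

# hedged sanity check: compare voice + noise against the input waveform
recon = output['voice'] + output['noise']
n = min(len(y), len(recon))
residual = np.mean(np.abs(y[:n] - recon[:n]))
print('mean absolute residual:', residual)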
[46]:
ipd.Audio(output['voice'], rate = sr)
[46]:
[47]:
%%time

output = model(malaya_speech.resample(wavs[0], 22050, sr))
CPU times: user 4.22 s, sys: 647 ms, total: 4.86 s
Wall time: 795 ms
[48]:
ipd.Audio(output['voice'], rate = sr)
[48]:
[ ]: