Remove silence using VAD#

Removing silence is actually pretty hard. Traditionally, people pick a fixed dB threshold: if the level stays below it over a certain window size, that window is assumed to be silent. The problem is that a threshold tuned for one audio sample, say -20 dB, will not necessarily work for other samples.
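To make the threshold problem concrete, here is a minimal sketch of a frame-based dB detector written with plain NumPy. The helper names `frame_db` and `silent_frames`, the frame sizes, and the synthetic tone are all assumptions made for illustration; they are not part of malaya-speech.

```python
import numpy as np

def frame_db(y, frame_length=400, hop_length=200):
    """RMS level of each frame in dB relative to full scale (1.0)."""
    n_frames = 1 + max(0, (len(y) - frame_length) // hop_length)
    rms = np.array([
        np.sqrt(np.mean(y[i * hop_length: i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])
    # Floor the RMS so that digital silence does not produce log10(0).
    return 20 * np.log10(np.maximum(rms, 1e-10))

def silent_frames(y, threshold_db=-20.0, **kwargs):
    """Boolean mask, True where a frame is below the fixed threshold."""
    return frame_db(y, **kwargs) < threshold_db

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)

# Same content at two recording levels: silence, tone, silence.
loud = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
quiet = 0.1 * loud

loud_ratio = silent_frames(loud).mean()    # only the padding is flagged
quiet_ratio = silent_frames(quiet).mean()  # every frame is flagged
print(loud_ratio, quiet_ratio)
```

With the same -20 dB threshold, roughly half of the loud signal is marked silent (the padding), while the quieter copy is marked silent in its entirety, even though it contains the same tone.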

This tutorial is available as an IPython notebook at malaya-speech/example/remove-silents-vad.

This module is language independent, so it is safe to use on different languages.

This is an application of the malaya-speech Pipeline. Read more about the malaya-speech Pipeline at malaya-speech/example/pipeline.

[1]:
import malaya_speech
import numpy as np
import librosa
from malaya_speech import Pipeline
[2]:
def norm_mel(y, sr):
    mel = librosa.feature.melspectrogram(y = y, sr = sr, n_mels = 80)
    return np.log10(np.maximum(mel, 1e-10)).T

def plot(y, sr):
    mel = norm_mel(y, sr)
    fig, axs = plt.subplots(2, figsize=(10, 8))
    axs[0].plot(y)
    im = axs[1].imshow(np.rot90(mel), aspect='auto', interpolation='none')
    fig.colorbar(mappable=im, shrink=0.65, orientation='horizontal', ax=axs[1])
    plt.show()

Load an easy example#

[3]:
y, sr = malaya_speech.load('speech/podcast/nusantara.wav')
len(y) / sr
[3]:
12.27
[4]:
import matplotlib.pyplot as plt
import IPython.display as ipd
[5]:
ipd.Audio(y, rate = sr)
[6]:
plot(y, sr)
_images/remove-silent-vad_11_0.png

Looking at the waveform or the mel spectrogram, we can see silent periods at the start, middle, and end.

Use librosa.effects.trim#

[7]:
y_ = librosa.effects.trim(y, top_db = 20)[0]
[8]:
ipd.Audio(y_, rate = sr)
[9]:
plot(y_, sr)
_images/remove-silent-vad_16_0.png

Looks good, but it missed the silence in the middle. librosa.effects.trim only removes leading and trailing silence.
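The difference between trimming the edges and removing every silent stretch can be sketched with NumPy alone. The helper `nonsilent_intervals` below is hypothetical and only imitates the idea on a synthetic signal, using non-overlapping frames and a level measured relative to the loudest frame. librosa offers `librosa.effects.split` for finding non-silent intervals on real audio.

```python
import numpy as np

def nonsilent_intervals(y, top_db=20.0, frame_length=512):
    """Return (start, end) sample ranges whose frame level is within
    `top_db` of the loudest frame. Frames do not overlap, for simplicity."""
    n_frames = len(y) // frame_length
    frames = y[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    db = 20 * np.log10(np.maximum(rms, 1e-10))
    active = db > db.max() - top_db
    # Pad with False so every run of active frames has both edges.
    padded = np.concatenate([[False], active, [False]])
    edges = np.flatnonzero(np.diff(padded.astype(int)))
    starts, ends = edges[::2], edges[1::2]
    return [(s * frame_length, e * frame_length) for s, e in zip(starts, ends)]

sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
gap = np.zeros(sr)
# Silence at the start, in the middle, and at the end.
y = np.concatenate([gap, tone, gap, tone, gap])

intervals = nonsilent_intervals(y)
# Edge trimming keeps everything between the first start and the last end.
edge_trimmed = y[intervals[0][0]: intervals[-1][1]]
# Concatenating the intervals also drops the silence in the middle.
all_removed = np.concatenate([y[s:e] for s, e in intervals])
print(len(y), len(edge_trimmed), len(all_removed))
```

The edge-trimmed signal is still longer than the concatenated intervals because it keeps the one-second gap between the two tones.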

Use pydub.silence.split_on_silence#

[10]:
from pydub import AudioSegment
from pydub.silence import split_on_silence

Before converting the float np.array into an AudioSegment, we need to cast it to int.

[11]:
y_int = malaya_speech.astype.float_to_int(y)
audio = AudioSegment(
    y_int.tobytes(),
    frame_rate = sr,
    sample_width = y_int.dtype.itemsize,
    channels = 1
)
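As a rough illustration of what such a cast involves, the sketch below scales float samples in [-1, 1] to 16-bit integers and back. The helpers `float_to_int16` and `int16_to_float` are hypothetical stand-ins; the actual scaling used by `malaya_speech.astype.float_to_int` may differ.

```python
import numpy as np

def float_to_int16(y):
    """Scale float samples in [-1, 1] to 16-bit signed integers."""
    y = np.clip(y, -1.0, 1.0)  # guard against overflow on out-of-range samples
    return (y * 32767).astype(np.int16)

def int16_to_float(y_int):
    """Inverse of float_to_int16, up to quantization error."""
    return y_int.astype(np.float32) / 32767

y = np.array([0.0, 0.5, -0.5, 1.0, -1.0], dtype=np.float32)
y_int = float_to_int16(y)
y_back = int16_to_float(y_int)

# itemsize is the number of bytes per sample, which is what
# AudioSegment expects as sample_width.
print(y_int.dtype, y_int.dtype.itemsize, len(y_int.tobytes()))
```

For int16 the item size is 2 bytes, so the raw byte string passed to AudioSegment is twice as long as the number of samples.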
[12]:
audio_chunks = split_on_silence(
    audio,
    min_silence_len = 200,
    silence_thresh = -30,
    keep_silence = 100,
)
audio_chunks
[12]:
[<pydub.audio_segment.AudioSegment at 0x14fb01810>,
 <pydub.audio_segment.AudioSegment at 0x14fb01950>,
 <pydub.audio_segment.AudioSegment at 0x14fb01990>,
 <pydub.audio_segment.AudioSegment at 0x14fb01dd0>,
 <pydub.audio_segment.AudioSegment at 0x14fb07490>]
[13]:
y_ = sum(audio_chunks)
y_ = np.array(y_.get_array_of_samples())
y_ = malaya_speech.astype.int_to_float(y_)
[14]:
ipd.Audio(y_, rate = sr)