Super Resolution¶
This tutorial is available as an IPython notebook at malaya-speech/example/super-resolution.
This module is language independent, so it is safe to use on different languages. The pretrained models were trained on multiple languages.
This is an application of malaya-speech Pipeline; read more about malaya-speech Pipeline at malaya-speech/example/pipeline.
Dataset¶
Trained on English, Manglish and Bahasa podcasts with augmented noises, gathered at https://github.com/huseinzol05/malaya-speech/tree/master/data/podcast
The purpose of this module is to increase the sample rate of audio.
[1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline
import IPython.display as ipd
List available deep models¶
[2]:
malaya_speech.super_resolution.available_model()
INFO:root:Only calculate SDR, ISR, SAR on voice sample. Higher is better.
[2]:
| Model | Size (MB) | Quantized Size (MB) | SDR | ISR | SAR |
|---|---|---|---|---|---|
| srgan-128 | 7.37 | 2.04 | 17.03345 | 22.330260 | 17.454372 |
| srgan-256 | 29.50 | 7.55 | 16.34558 | 22.067493 | 17.024390 |
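For intuition about the SDR column, the basic signal-to-distortion ratio can be sketched as below. This is only the textbook definition, assuming a time-aligned reference and estimate of equal length. The helper name `simple_sdr` is hypothetical and is not part of malaya-speech, and the values in the table may have been computed with a different evaluation procedure.

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-10):
    """Basic signal-to-distortion ratio in dB (hypothetical helper, not a malaya-speech API)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_power = np.sum(reference ** 2)
    distortion_power = np.sum((reference - estimate) ** 2)
    # eps avoids division by zero when the estimate is perfect.
    return 10.0 * np.log10((signal_power + eps) / (distortion_power + eps))

# Synthetic check: a 440 Hz tone and two estimates with different noise levels.
rng = np.random.default_rng(0)
t = np.arange(44100) / 44100
clean = np.sin(2 * np.pi * 440 * t)
slightly_noisy = clean + 0.01 * rng.standard_normal(clean.shape)
very_noisy = clean + 0.1 * rng.standard_normal(clean.shape)

sdr_good = simple_sdr(clean, slightly_noisy)
sdr_bad = simple_sdr(clean, very_noisy)
print(sdr_good > sdr_bad)  # True: the cleaner estimate scores higher
```

A higher value means the estimate is closer to the reference, which matches the "Higher is better" note in the log line above.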
We modified SRGAN to use 1D convolutions and to use output distance as the generator loss; the original SRGAN uses content loss.
Load deep model¶
def deep_model(model: str = 'srgan-256', quantized: bool = False, **kwargs):
    """
    Load Super Resolution 4x deep learning model.

    Parameters
    ----------
    model : str, optional (default='srgan-256')
        Model architecture supported. Allowed values:

        * ``'srgan-128'`` - srgan with 128 filter size and 16 residual blocks.
        * ``'srgan-256'`` - srgan with 256 filter size and 16 residual blocks.
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, totally depends on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.UNET1D class
    """
[3]:
model = malaya_speech.super_resolution.deep_model(model = 'srgan-256')
model_128 = malaya_speech.super_resolution.deep_model(model = 'srgan-128')
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:66: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:68: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
WARNING:tensorflow:From /Users/huseinzolkepli/Documents/malaya-speech/malaya_speech/utils/__init__.py:61: The name tf.InteractiveSession is deprecated. Please use tf.compat.v1.InteractiveSession instead.
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/client/session.py:1750: UserWarning: An interactive session is already active. This can cause out-of-memory errors in some cases. You must explicitly call `InteractiveSession.close()` to release resources held by the other session(s).
warnings.warn('An interactive session is already active. This can '
Load Quantized deep model¶
To load an 8-bit quantized model, simply pass quantized = True; the default is False.
Expect a slight accuracy drop from a quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.
[4]:
quantized_model = malaya_speech.super_resolution.deep_model(model = 'srgan-256', quantized = True)
quantized_model_128 = malaya_speech.super_resolution.deep_model(model = 'srgan-128', quantized = True)
WARNING:root:Load quantized model will cause accuracy drop.
WARNING:root:Load quantized model will cause accuracy drop.
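To illustrate why an 8-bit model is smaller and slightly less accurate, here is a minimal sketch of generic affine quantization in NumPy. It is an assumption for illustration only: how malaya-speech actually quantizes its models is not described here, and the helpers `quantize_uint8` and `dequantize_uint8` are hypothetical names.

```python
import numpy as np

def quantize_uint8(weights):
    """Affine-quantize a float32 array to uint8 (hypothetical helper)."""
    w_min = float(weights.min())
    w_max = float(weights.max())
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant array: any scale works
    q = np.clip(np.round((weights - w_min) / scale), 0, 255).astype(np.uint8)
    return q, scale, w_min

def dequantize_uint8(q, scale, w_min):
    """Map uint8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale + w_min

# Random stand-in for a weight tensor.
rng = np.random.default_rng(0)
weights = rng.standard_normal(100_000).astype(np.float32)

q, scale, w_min = quantize_uint8(weights)
restored = dequantize_uint8(q, scale, w_min)

size_ratio = weights.nbytes / q.nbytes               # float32 vs uint8 storage
max_error = float(np.abs(weights - restored).max())  # rounding error, about scale / 2 at most
print(size_ratio)  # 4.0
print(max_error)
```

Storing one byte per weight instead of four gives the 4x reduction, roughly in line with the sizes listed in the model table, and the rounding error is the source of the accuracy drop.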
Important factor¶
Currently only 4x super resolution is supported: if the input sample rate is 16k, the output sample rate will be 16k * 4 = 64k.
The models were trained with an input sample rate of 11025 Hz and an output sample rate of 44100 Hz.
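The arithmetic above can be sketched in a few lines. The helper names are hypothetical, and the assumption that the output has exactly four times as many samples as the input is ours, not something stated by the library.

```python
UPSCALE_FACTOR = 4  # the only factor currently supported, per the text above

def output_sample_rate(input_sample_rate, factor=UPSCALE_FACTOR):
    """Sample rate after super resolution (hypothetical helper)."""
    return input_sample_rate * factor

def output_length(num_input_samples, factor=UPSCALE_FACTOR):
    """Assumed output length: exactly `factor` times the input length."""
    return num_input_samples * factor

print(output_sample_rate(11025))  # 44100, the rates used in training
print(output_sample_rate(16000))  # 64000

# Under the length assumption, the duration in seconds is unchanged.
n_in = 11025 * 3  # 3 seconds at 11025 Hz
duration_in = n_in / 11025
duration_out = output_length(n_in) / output_sample_rate(11025)
print(duration_in, duration_out)  # 3.0 3.0
```

Since the models were trained on 11025 Hz input, loading audio at `sr // reduction_factor` with `sr = 44100`, as the cells below do, matches the training condition.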
Predict¶
def predict(self, input):
    """
    Enhance inputs, will return waveform.

    Parameters
    ----------
    input: np.array
        np.array or malaya_speech.model.frame.Frame.

    Returns
    -------
    result: np.array
    """
Let's say we have audio with a low sample rate:
[5]:
sr = 44100
reduction_factor = 4
[6]:
y, sr_ = malaya_speech.load('speech/44k/test-0.wav', sr = sr // reduction_factor)
ipd.Audio(y[:sr_ * 4], rate = sr_)
[6]:
[7]:
%%time
output = model(y)
ipd.Audio(output[:sr * 4], rate = sr)
CPU times: user 1min 2s, sys: 3.49 s, total: 1min 5s
Wall time: 10.9 s
[7]:
[8]:
%%time
output = model_128(y)
ipd.Audio(output[:sr * 4], rate = sr)
CPU times: user 18.1 s, sys: 1.74 s, total: 19.8 s
Wall time: 3.54 s
[8]:
[9]:
%%time
output = quantized_model_128(y)
ipd.Audio(output[:sr * 4], rate = sr)
CPU times: user 18.2 s, sys: 2.05 s, total: 20.2 s
Wall time: 3.89 s
[9]:
Below is the common technique for upsampling, using interpolation:
[10]:
y_ = malaya_speech.resample(y, sr // reduction_factor, sr)
ipd.Audio(y_[:sr * 4], rate = sr)
[10]:
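As a rough picture of what interpolation-based upsampling does, here is a minimal NumPy sketch using linear interpolation. It is not the implementation of `malaya_speech.resample`, which may use a different resampling method internally; `upsample_linear` is a hypothetical helper.

```python
import numpy as np

def upsample_linear(y, factor):
    """Upsample a 1-D signal by an integer factor with linear interpolation."""
    y = np.asarray(y, dtype=np.float32)
    n_out = len(y) * factor
    # Positions of the output samples, expressed in input-sample units.
    x_out = np.arange(n_out) / factor
    x_in = np.arange(len(y))
    # np.interp holds the last value for positions past the final input sample.
    return np.interp(x_out, x_in, y).astype(np.float32)

sr_low, factor = 11025, 4
t = np.arange(sr_low) / sr_low
y_low = np.sin(2 * np.pi * 440 * t).astype(np.float32)  # 1 second, 440 Hz tone

y_up = upsample_linear(y_low, factor)
print(len(y_low), len(y_up))  # 11025 44100
```

Interpolation only fills in samples between existing ones. It cannot restore frequency content above half the input sample rate, which is the part a super resolution model tries to reconstruct.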
Try more examples¶
[11]:
y, sr_ = malaya_speech.load('speech/example-speaker/husein-generated.wav', sr = sr // reduction_factor)
sr_
[11]:
11025
[12]:
ipd.Audio(y[:sr_ * 4], rate = sr_)
[12]:
[13]:
%%time
output = model(y)
ipd.Audio(output[:sr * 4], rate = sr)
CPU times: user 58.1 s, sys: 3.38 s, total: 1min 1s
Wall time: 10.5 s
[13]:
[14]:
%%time
output = model_128(y)
ipd.Audio(output[:sr * 4], rate = sr)
CPU times: user 17.2 s, sys: 1.6 s, total: 18.8 s
Wall time: 3.14 s
[14]:
[15]:
y_ = malaya_speech.resample(y, sr_, sr)
ipd.Audio(y_[:sr * 4], rate = sr)
[15]:
[16]:
y, sr_ = malaya_speech.load('speech/44k/test-2.wav', sr = sr // reduction_factor)
ipd.Audio(y[:sr_ * 4], rate = sr_)
[16]:
[17]:
%%time
output = model(y)
ipd.Audio(output[:sr * 4], rate = sr)
CPU times: user 1min 1s, sys: 3.64 s, total: 1min 4s
Wall time: 11.5 s
[17]:
[18]:
%%time
output = model_128(y)
ipd.Audio(output[:sr * 4], rate = sr)
CPU times: user 18.5 s, sys: 1.83 s, total: 20.3 s
Wall time: 3.72 s
[18]:
[19]:
y_ = malaya_speech.resample(y, sr_, sr)
ipd.Audio(y_[:sr * 4], rate = sr)
[19]:
Use Pipeline¶
If your audio is too long and you do not want to burden your machine, you can use malaya-speech Pipeline to split the audio into 3-second chunks, predict on each chunk one by one, and concatenate the results afterwards.
[20]:
p = Pipeline()
pipeline = (
    p.map(malaya_speech.load, sr = sr // reduction_factor)
    .map(lambda x: x[0])
    .map(malaya_speech.generator.frames, frame_duration_ms = 3000, sample_rate = sr // reduction_factor)
    .foreach_map(model_128)
    .map(np.concatenate)
)
p.visualize()
[20]:
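The split, predict, and concatenate idea behind this pipeline can be sketched in plain NumPy. This is a sketch under stated assumptions: `split_frames` and `fake_upsampler` are hypothetical stand-ins for `malaya_speech.generator.frames` and the model, and keeping the final short frame is our choice here, not documented library behavior.

```python
import numpy as np

def split_frames(y, frame_duration_ms, sample_rate):
    """Split a 1-D signal into consecutive chunks (the last one may be shorter)."""
    frame_len = int(sample_rate * frame_duration_ms / 1000)
    return [y[i:i + frame_len] for i in range(0, len(y), frame_len)]

def fake_upsampler(chunk, factor=4):
    """Stand-in for the model: repeats each sample `factor` times."""
    return np.repeat(chunk, factor)

sr_low = 11025
rng = np.random.default_rng(0)
y = rng.standard_normal(sr_low * 10).astype(np.float32)  # 10 seconds of noise

frames = split_frames(y, frame_duration_ms=3000, sample_rate=sr_low)
outputs = [fake_upsampler(frame) for frame in frames]  # one chunk at a time
result = np.concatenate(outputs)

print(len(frames), len(result))  # 4 441000
```

Processing one chunk at a time bounds peak memory by the chunk size instead of the full recording. With a real model, the chunk boundaries may introduce small discontinuities that this stand-in does not show.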
[21]:
%%time
results = p('speech/podcast/nusantara.wav')
CPU times: user 20.2 s, sys: 2.52 s, total: 22.7 s
Wall time: 4.19 s
[22]:
results.keys()
[22]:
dict_keys(['load', '<lambda>', 'frames', 'super-resolution', 'concatenate'])
[23]:
ipd.Audio(results['concatenate'], rate = sr)
[23]:
[24]:
ipd.Audio(results['<lambda>'], rate = sr // reduction_factor)
[24]:
[25]:
y_ = malaya_speech.resample(results['<lambda>'], sr // reduction_factor, sr)
ipd.Audio(y_, rate = sr)
[25]: