Quantization

This tutorial is available as an IPython notebook at malaya-speech/example/quantization.

We provide quantized versions of all Malaya-Speech models. For example, here are the gender detection models:

[3]:
import malaya_speech

malaya_speech.gender.available_model()
INFO:root:last accuracy during training session before early stopping.
[3]:
              Size (MB)  Quantized Size (MB)  Accuracy
vggvox-v2          31.1                 7.92    0.9756
deep-speaker       96.9                24.40    0.9455

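A quantized model is usually about 4x smaller than the original; in the table above, vggvox-v2 shrinks from 31.1 MB to 7.92 MB and deep-speaker from 96.9 MB to 24.40 MB. Quantization converts every floating-point constant it can into a quantized constant, and stores only the quantized constants together with the mean and standard deviation of the original floating-point constants.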
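To make the idea concrete, here is a minimal NumPy sketch of 8-bit quantization that keeps only the quantized values plus the mean and standard deviation of the original weights. It is an illustration under stated assumptions, not the routine Malaya-Speech or TensorFlow actually uses: the function names `quantize` and `dequantize`, the `num_std` clipping range, and the random weights are all hypothetical.

```python
import numpy as np

def quantize(weights, num_std=4.0):
    """Map float32 weights to uint8, keeping mean and std as the only FP metadata."""
    mean = float(weights.mean())
    std = float(weights.std()) or 1.0  # avoid a zero range for constant arrays
    # Assumption: clip to mean +/- num_std * std, then rescale to 0..255.
    lo, hi = mean - num_std * std, mean + num_std * std
    scaled = (np.clip(weights, lo, hi) - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8), mean, std

def dequantize(q, mean, std, num_std=4.0):
    """Approximately recover float32 weights from the uint8 values."""
    lo, hi = mean - num_std * std, mean + num_std * std
    return (q.astype(np.float32) / 255.0) * (hi - lo) + lo

# Hypothetical weight matrix standing in for one constant of a model.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)

q, mean, std = quantize(weights)
restored = dequantize(q, mean, std)

compression = weights.nbytes / q.nbytes  # 4 bytes per float32 vs 1 per uint8
mean_error = float(np.abs(weights - restored).mean())
print(f"compression: {compression:.1f}x, mean abs error: {mean_error:.6f}")
```

Going from 4 bytes per value to 1 byte per value is where the roughly 4x size reduction comes from. The small reconstruction error in the restored weights is also why a quantized model can lose some accuracy.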

Note that a quantized model is not necessarily faster, because TensorFlow casts values back to FP32 during the forward pass for certain operations.

Use quantized model

Simply set the quantized parameter to True; the default is False.

[4]:
quantized_vggvox_v2 = malaya_speech.gender.deep_model(model = 'vggvox-v2', quantized = True)
vggvox_v2 = malaya_speech.gender.deep_model(model = 'vggvox-v2')
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running gender/vggvox-v2-quantized using device /device:CPU:0
INFO:root:running gender/vggvox-v2 using device /device:CPU:0
[5]:
y, sr = malaya_speech.load('speech/video/The-Singaporean-White-Boy.wav')
y = y[:int(sr * 0.5)]
len(y), sr
[5]:
(8000, 16000)
[7]:
%%time

vggvox_v2.predict([y])
CPU times: user 171 ms, sys: 32.5 ms, total: 203 ms
Wall time: 51.5 ms
[7]:
['not a gender']
[9]:
%%time

quantized_vggvox_v2.predict([y])
CPU times: user 147 ms, sys: 34.7 ms, total: 182 ms
Wall time: 48.1 ms
[9]:
['not a gender']
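The two wall times above (51.5 ms and 48.1 ms) each come from a single run, so the small gap is hard to tell apart from noise. One way to compare the models more reliably is to repeat each call and look at the median. The helper below is a sketch only: `benchmark` is a hypothetical name, not part of Malaya-Speech, and the workload is a stand-in so the snippet runs on its own.

```python
import statistics
import time

def benchmark(fn, repeats=20, warmup=3):
    """Call fn() repeatedly and summarize wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are discarded (one-time setup, caches)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": repeats,
        "min_ms": min(timings),
        "median_ms": statistics.median(timings),
    }

# Stand-in workload so the sketch is self-contained.
result = benchmark(lambda: sum(range(10_000)))
print(result)
```

In the notebook, the same helper could be applied to the models loaded earlier, for example `benchmark(lambda: vggvox_v2.predict([y]))` and `benchmark(lambda: quantized_vggvox_v2.predict([y]))`, and the two medians compared.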