Quantization#
This tutorial is available as an IPython notebook at malaya-speech/example/quantization.
We provide quantized versions of all Malaya-Speech models. For example, the gender detection models:
[3]:
import malaya_speech
malaya_speech.gender.available_model()
INFO:root:last accuracy during training session before early stopping.
[3]:
Model | Size (MB) | Quantized Size (MB) | Accuracy |
---|---|---|---|
vggvox-v2 | 31.1 | 7.92 | 0.9756 |
deep-speaker | 96.9 | 24.40 | 0.9455 |
A quantized model is usually about 4x smaller than the original. Quantization converts all possible floating-point constants to quantized constants, and stores only the quantized constants together with the mean and standard deviation of the floating-point constants.
Note that a quantized model is not necessarily faster, because TensorFlow casts values back to FP32 during the forward pass for certain operations.
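To illustrate where the roughly 4x figure comes from, here is a minimal NumPy sketch of 8-bit quantization of a float32 weight array. It is not Malaya-Speech's actual implementation: the helper names `quantize` and `dequantize` are hypothetical, and it assumes a simple min/max affine mapping rather than the mean and standard deviation statistics described above. It also shows the cast back to float32 that has to happen before ordinary floating-point arithmetic.

```python
import numpy as np

def quantize(weights):
    """Map a float32 array onto uint8 with a min/max affine scale (illustrative only)."""
    w_min = float(weights.min())
    w_max = float(weights.max())
    # Guard against a constant array, where the range would be zero.
    scale = (w_max - w_min) / 255.0 or 1.0
    q = np.round((weights - w_min) / scale).astype(np.uint8)
    return q, w_min, scale

def dequantize(q, w_min, scale):
    """Cast back to float32, as needed before floating-point arithmetic."""
    return q.astype(np.float32) * scale + w_min

# Random stand-in for a layer's weights (assumption, not a real model tensor).
rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)

q, w_min, scale = quantize(weights)
restored = dequantize(q, w_min, scale)

# One byte per value instead of four.
print(weights.nbytes / q.nbytes)  # 4.0
# The rounding error stays within one quantization step.
print(float(np.abs(weights - restored).max()) <= scale)
```

The size ratio is exact here because every value shrinks from four bytes to one. A real model file also holds non-quantized parts, which is consistent with the ratios in the table being slightly under 4x.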
Use quantized model#
Simply set the `quantized` parameter to `True`; the default is `False`.
[4]:
quantized_vggvox_v2 = malaya_speech.gender.deep_model(model = 'vggvox-v2', quantized = True)
vggvox_v2 = malaya_speech.gender.deep_model(model = 'vggvox-v2')
WARNING:root:Load quantized model will cause accuracy drop.
INFO:root:running gender/vggvox-v2-quantized using device /device:CPU:0
INFO:root:running gender/vggvox-v2 using device /device:CPU:0
[5]:
y, sr = malaya_speech.load('speech/video/The-Singaporean-White-Boy.wav')
y = y[:int(sr * 0.5)]
len(y), sr
[5]:
(8000, 16000)
[7]:
%%time
vggvox_v2.predict([y])
CPU times: user 171 ms, sys: 32.5 ms, total: 203 ms
Wall time: 51.5 ms
[7]:
['not a gender']
[9]:
%%time
quantized_vggvox_v2.predict([y])
CPU times: user 147 ms, sys: 34.7 ms, total: 182 ms
Wall time: 48.1 ms
[9]:
['not a gender']
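A single `%%time` run is noisy, and the two wall times above differ by only a few milliseconds. A rough way to compare the two models more reliably is to average over repeated calls. The `benchmark` helper below is a hypothetical utility, not part of Malaya-Speech; it accepts any zero-argument callable and uses a stand-in workload so that the sketch runs on its own.

```python
import time

def benchmark(fn, repeat=20, warmup=3):
    """Return the mean wall time in seconds of calling fn(), after a few warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - start) / repeat

# Stand-in workload (assumption); replace with a model call to compare models.
mean_seconds = benchmark(lambda: sum(range(10_000)))
print(f"{mean_seconds * 1000:.3f} ms per call")
```

With the models loaded earlier, this could be called as `benchmark(lambda: vggvox_v2.predict([y]))` and `benchmark(lambda: quantized_vggvox_v2.predict([y]))`.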