Welcome to Malaya-Speech’s documentation!
Malaya-Speech is a speech toolkit library for the Malaysian language, powered by TensorFlow and PyTorch.
Documentation#
Documentation for the stable release is available at https://malaya-speech.readthedocs.io/en/stable/
Installing from the PyPI#
$ pip install malaya-speech
This installs all dependencies automatically except TensorFlow and PyTorch, so you can choose your own CPU or GPU build of each.
Only Python >= 3.6.0, TensorFlow >= 1.15.0, and PyTorch >= 1.10 are supported.
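Because TensorFlow and PyTorch are installed separately, it can help to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch, not part of Malaya-Speech: the helper names `parse_version` and `check_environment` are hypothetical, the list of distribution names to probe is an assumption, and it relies on `importlib.metadata`, which requires Python 3.8 or newer.

```python
import re
import sys
from importlib import metadata

# Minimum versions stated in the installation notes.
MINIMUMS = {"tensorflow": (1, 15, 0), "torch": (1, 10)}

# Assumption: TensorFlow may be installed under one of several distribution names.
DISTRIBUTIONS = {
    "tensorflow": ["tensorflow", "tensorflow-gpu", "tensorflow-cpu"],
    "torch": ["torch"],
}


def parse_version(text):
    """Turn a version string such as '2.1.0+cu118' into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def check_environment():
    """Return {name: (installed_version_or_None, meets_minimum)}."""
    report = {"python": (sys.version.split()[0], sys.version_info >= (3, 6, 0))}
    for name, minimum in MINIMUMS.items():
        installed = None
        for dist in DISTRIBUTIONS[name]:
            try:
                installed = metadata.version(dist)
                break
            except metadata.PackageNotFoundError:
                continue
        ok = installed is not None and parse_version(installed) >= minimum
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        status = "ok" if ok else "needs attention"
        print(f"{name}: {installed or 'not installed'} ({status})")
```

A package reported as "not installed" only means that none of the probed distribution names was found; a custom build registered under a different name would need to be added to `DISTRIBUTIONS`.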
Development Release#
Install from the master branch:
$ pip install git+https://github.com/huseinzol05/malaya-speech.git
We recommend using virtualenv for development.
Documentation for the development release is available at https://malaya-speech.readthedocs.io/en/latest/
Features#
Age Detection, detect age in speech using Finetuned Speaker Vector.
Speaker Diarization, diarizing speakers using Pretrained Speaker Vector.
Emotion Detection, detect emotions in speech using Finetuned Speaker Vector.
Force Alignment, generate a time-aligned transcription of an audio file using RNNT, Wav2Vec2 CTC and Whisper Seq2Seq.
Gender Detection, detect genders in speech using Finetuned Speaker Vector.
Clean Speech Detection, detect clean speech using Finetuned Speaker Vector.
Language Detection, detect hyperlocal languages in speech using Finetuned Speaker Vector.
Language Model, ASR decoder scoring using KenLM, masked language models (BERT and RoBERTa) and GPT2.
Multispeaker Separation, separate multiple speakers using FastSep on 8k Wav.
Noise Reduction, reduce multilevel noises using STFT UNET.
Speaker Change Detection, detect changing speakers using Finetuned Speaker Vector.
Speaker Count Detection, detect number of speakers using Finetuned Speaker Vector.
Speaker Overlap Detection, detect overlapping speakers using Finetuned Speaker Vector.
Speaker Vector, calculate similarity between speakers using Pretrained Speaker Vector.
Speech Enhancement, enhance voice activities using Waveform UNET.
SpeechSplit Conversion, detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch using PyWorld and PySPTK.
Speech-to-Text, End-to-End Speech to Text for Malay, Mixed (Malay, Singlish) and Singlish using RNNT, Wav2Vec2 CTC and Whisper Seq2Seq.
Super Resolution, Super Resolution 4x for Waveform using ResNet UNET and Neural Vocoder.
Text-to-Speech, Text to Speech for Malay and Singlish using Tacotron2, FastSpeech2, FastPitch, GlowTTS, LightSpeech and VITS.
Vocoder, convert Mel to Waveform using MelGAN, Multiband MelGAN and Universal MelGAN Vocoder.
Voice Activity Detection, detect voice activities using Finetuned Speaker Vector.
Voice Conversion, Many-to-One and Zero-shot Voice Conversion.
Real-time Interface, provides PyAudio and TorchAudio streaming interfaces for real-time inference.
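The Speaker Vector feature in the list above compares speakers by the similarity of their embedding vectors. As a rough illustration of the idea, here is a minimal sketch that assumes cosine similarity as the metric; the source does not state which metric Malaya-Speech uses. The function name `cosine_similarity` and the toy vectors are hypothetical and are not produced by any Malaya-Speech model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "speaker vectors" made up for illustration;
# real embeddings typically have hundreds of dimensions.
speaker_a_first = [0.9, 0.1, 0.3]
speaker_a_second = [0.8, 0.2, 0.35]
speaker_b = [-0.2, 0.9, -0.4]

same = cosine_similarity(speaker_a_first, speaker_a_second)
different = cosine_similarity(speaker_a_first, speaker_b)
print(f"same speaker: {same:.3f}, different speaker: {different:.3f}")
```

Under this assumption, two utterances from the same speaker should score close to 1, while utterances from different speakers should score noticeably lower.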
Pretrained Models#
Malaya-Speech also releases pretrained models; see malaya-speech/pretrained-model.
Wave UNET, Multi-Scale Neural Network for End-to-End Audio Source Separation, https://arxiv.org/abs/1806.03185
Wave ResNet UNET, added ResNet style into Wave UNET, no paper produced.
Wave ResNext UNET, added ResNext style into Wave UNET, no paper produced.
Deep Speaker, An End-to-End Neural Speaker Embedding System, https://arxiv.org/pdf/1705.02304.pdf
SpeakerNet, 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification, https://arxiv.org/abs/2010.12653
VGGVox, VoxCeleb: a large-scale speaker identification dataset, https://arxiv.org/pdf/1706.08612.pdf
GhostVLAD, Utterance-level Aggregation For Speaker Recognition In The Wild, https://arxiv.org/abs/1902.10107
Conformer, Convolution-augmented Transformer for Speech Recognition, https://arxiv.org/abs/2005.08100
ALConformer, A lite Conformer, no paper produced.
Jasper, An End-to-End Convolutional Neural Acoustic Model, https://arxiv.org/abs/1904.03288
Tacotron2, Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, https://arxiv.org/abs/1712.05884
FastSpeech2, Fast and High-Quality End-to-End Text to Speech, https://arxiv.org/abs/2006.04558
MelGAN, Generative Adversarial Networks for Conditional Waveform Synthesis, https://arxiv.org/abs/1910.06711
Multi-band MelGAN, Faster Waveform Generation for High-Quality Text-to-Speech, https://arxiv.org/abs/2005.05106
SRGAN, Modified version of SRGAN to do 1D Convolution, Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, https://arxiv.org/abs/1609.04802
Speech Enhancement UNET, https://github.com/haoxiangsnr/Wave-U-Net-for-Speech-Enhancement
Speech Enhancement ResNet UNET, Added ResNet style into Speech Enhancement UNET, no paper produced.
Speech Enhancement ResNext UNET, Added ResNext style into Speech Enhancement UNET, no paper produced.
Universal MelGAN, Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains, https://arxiv.org/abs/2011.09631
FastVC, Faster and Accurate Voice Conversion using Transformer, no paper produced.
FastSep, Faster and Accurate Speech Separation using Transformer, no paper produced.
wav2vec 2.0, A Framework for Self-Supervised Learning of Speech Representations, https://arxiv.org/abs/2006.11477
FastSpeechSplit, Unsupervised Speech Decomposition Via Triple Information Bottleneck using Transformer, no paper produced.
Sepformer, Attention is All You Need in Speech Separation, https://arxiv.org/abs/2010.13154
FastSpeechSplit, Faster and Accurate Speech Split Conversion using Transformer, no paper produced.
HuBERT, Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, https://arxiv.org/pdf/2106.07447v1.pdf
FastPitch, Parallel Text-to-speech with Pitch Prediction, https://arxiv.org/abs/2006.06873
GlowTTS, A Generative Flow for Text-to-Speech via Monotonic Alignment Search, https://arxiv.org/abs/2005.11129
BEST-RQ, Self-supervised learning with random-projection quantizer for speech recognition, https://arxiv.org/pdf/2202.01855.pdf
LightSpeech, Lightweight and Fast Text to Speech with Neural Architecture Search, https://arxiv.org/abs/2102.04040
VITS, Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech, https://arxiv.org/abs/2106.06103
Squeezeformer, An Efficient Transformer for Automatic Speech Recognition, https://arxiv.org/abs/2206.00888
Whisper, Robust Speech Recognition via Large-Scale Weak Supervision, https://cdn.openai.com/papers/whisper.pdf
Emformer, Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition, https://arxiv.org/abs/2010.10759
References#
If you use our software for research, please cite:
@misc{Malaya, Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow,
author = {Husein, Zolkepli},
title = {Malaya-Speech},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech}}
}
Acknowledgement#
Thanks to KeyReply for the private V100 cloud and Mesolitica for the private RTX cloud used to train Malaya-Speech models.


Contents:#
Getting Started
GPU Environment
Pipeline Module
Language Model Module
ASR RNNT Module
ASR CTC Module
ASR Seq2Seq Module
Force Alignment Module
Vocoder Module
Conversion Module
TTS Module
- Text-to-Speech Tacotron2
- Text-to-Speech FastSpeech2
- More examples FastSpeech2
- FastSpeech2 long text
- Text-to-Speech End-to-End FastSpeech2
- Text-to-Speech FastPitch
- Text-to-Speech GlowTTS
- Text-to-Speech GlowTTS Multispeakers
- Text-to-Speech LightSpeech
- Text-to-Speech VITS
- Text-to-Speech VITS Multispeaker
- Text-to-Speech VITS Multispeaker Noisy
- VITS long text
- Text-to-Speech Singlish
- Tacotron2 description
- FastSpeech2 description
- GlowTTS description
- VITS description
- List available Tacotron2
- List available FastSpeech2
- List available GlowTTS
- List available VITS
- Load Tacotron2 model
- Load FastSpeech2 model
- Load GlowTTS model
- Load VITS model
- Predict Tacotron2
- Predict FastSpeech2
- Predict GlowTTS
- Predict VITS
- Load Vocoder model
- Predict Bahasa text
- Text-to-Speech web inference using Gradio
Classification Module
Enhancement Module
Voice Activity Module
Speaker Vector Module
Speaker Diarization Module
- Speaker Change Detection
- Speaker Similarity
- Unsupervised clustering
- Unsupervised clustering using Agglomerative
- Unsupervised clustering using HMM
- Unsupervised streaming clustering
- Diarization with Speaker Change
- Diarization timestamp
- Speaker Diarization using Features
- Combine longer speaker diarization
PyAudio streaming Module
TorchAudio streaming Module
Multispeaker Module
Extra Module
Misc
- API
- malaya_speech
- malaya_speech.augmentation.spectrogram
- malaya_speech.extra.rttm
- malaya_speech.extra.visualization
- malaya_speech.model.classification.Speakernet
- malaya_speech.model.classification.Speaker2Vec
- malaya_speech.model.classification.SpeakernetClassification
- malaya_speech.model.classification.Classification
- malaya_speech.model.clustering.AgglomerativeClustering
- malaya_speech.model.clustering.HiddenMarkovModelClustering
- malaya_speech.model.clustering.StreamingKMeansMaxCluster
- malaya_speech.model.clustering.StreamingKMeans
- malaya_speech.model.clustering.StreamingSpeakerSimilarity
- malaya_speech.model.splitter.Split_Wav
- malaya_speech.model.splitter.Split_Mel
- malaya_speech.model.splitter.FastSpeechSplit
- malaya_speech.model.synthesis.TTS
- malaya_speech.model.synthesis.Vocoder
- malaya_speech.model.synthesis.Tacotron
- malaya_speech.model.synthesis.Fastspeech
- malaya_speech.model.synthesis.FastspeechSDP
- malaya_speech.model.synthesis.E2E_FastSpeech
- malaya_speech.model.synthesis.FastVC
- malaya_speech.model.synthesis.Fastpitch
- malaya_speech.model.transducer.Transducer
- malaya_speech.model.transducer.TransducerAligner
- malaya_speech.model.unet.UNET
- malaya_speech.model.unet.UNETSTFT
- malaya_speech.model.unet.UNET1D
- malaya_speech.model.wav2vec.Wav2Vec2_CTC
- malaya_speech.model.wav2vec.Wav2Vec2_Aligner
- malaya_speech.model.webrtc.WebRTC
- malaya_speech.torch_model.huggingface.CTC
- malaya_speech.torch_model.huggingface.Aligner
- malaya_speech.torch_model.huggingface.Seq2Seq
- malaya_speech.torch_model.huggingface.Seq2SeqAligner
- malaya_speech.torch_model.huggingface.XVector
- malaya_speech.torch_model.nemo.SpeakerVector
- malaya_speech.torch_model.nemo.Classification
- malaya_speech.torch_model.super_resolution.VoiceFixer
- malaya_speech.torch_model.super_resolution.NVSR
- malaya_speech.torch_model.synthesis.VITS
- malaya_speech.torch_model.torchaudio.Conformer
- malaya_speech.torch_model.torchaudio.ForceAlignment
- malaya_speech.pipeline
- malaya_speech.pipeline.map
- malaya_speech.pipeline.batching
- malaya_speech.pipeline.partition
- malaya_speech.pipeline.sliding_window
- malaya_speech.pipeline.foreach_map
- malaya_speech.pipeline.flatten
- malaya_speech.pipeline.zip
- malaya_speech.streaming.pyaudio
- malaya_speech.streaming.torchaudio
- malaya_speech.utils.aligner
- malaya_speech.utils.astype
- malaya_speech.utils.char
- malaya_speech.utils.combine
- malaya_speech.utils.featurization
- malaya_speech.utils.generator
- malaya_speech.utils.griffin_lim
- malaya_speech.utils.group
- malaya_speech.utils.io
- malaya_speech.utils.padding
- malaya_speech.utils.read
- malaya_speech.utils.split
- malaya_speech.utils.subword
- malaya_speech.utils.tf_featurization
- malaya_speech.utils.torch_featurization
- malaya_speech.age_detection
- malaya_speech.diarization
- malaya_speech.emotion
- malaya_speech.force_alignment.ctc
- malaya_speech.force_alignment.seq2seq
- malaya_speech.force_alignment.transducer
- malaya_speech.gender
- malaya_speech.is_clean
- malaya_speech.language_detection
- malaya_speech.language_model
- malaya_speech.multispeaker_separation
- malaya_speech.noise_reduction
- malaya_speech.speaker_change
- malaya_speech.speaker_count
- malaya_speech.speaker_overlap
- malaya_speech.speaker_vector
- malaya_speech.speech_enhancement
- malaya_speech.speechsplit_conversion
- malaya_speech.stack
- malaya_speech.model.stack.Stack
- malaya_speech.stt.ctc
- malaya_speech.stt.seq2seq
- malaya_speech.stt.transducer
- malaya_speech.super_resolution
- malaya_speech.tts
- malaya_speech.vad
- malaya_speech.vocoder
- malaya_speech.voice_conversion
- Donation