Speech Dataset¶
How we gather dataset?¶
For semisupervised transcript, we use Google Speech to Text, after that verified / corrected by human.
We recorded using our own microphones.
License¶
Malay-Speech dataset is available to download for research purposes under a Creative Commons Attribution 4.0 International License.

This work is licensed under a Creative Commons Attribution 4.0 International License.
Dataset¶
Ambient¶
Simple ambients gathered from Youtube.
Audiobook¶
Gather Youtube urls for indonesian, english and low quality english audiobooks only.
Azure-TTS¶
Semisupervised Malay TTS dataset from Azure TTS cloud.
Emotion¶
Speech emotion dataset used by Malaya-Speech for speech emotion detection.
IIUM¶
Read random sentences from IIUM Confession.
voice by Husein Zolkepli and Shafiqah Idayu.
Heavily speaking in Selangor dialect.
Recorded using low-end tech microphone.
44100 sample rate, split by end of sentences.
approximate 2.4 hours.
Still on going recording.
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from IIUM Confession texts,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/iium}}
}
IIUM-Clear¶
Read random sentences from IIUM Confession, cleaner version.
voice by Husein Zolkepli.
Heavily speaking in Selangor dialect.
Recorded using mid-end tech microphone.
44100 sample rate, random 7 - 11 words window.
approximate 0.1 hours.
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from IIUM Confession texts,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/iium}}
}
IMDA¶
Mirror link for IMDA dataset, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus, only downloaded PART 3 and SST dataset.
16000 sample rate.
supervised approximate 2024 hours.
language¶
Gather youtube urls for hyperlocal language detection from speech {malay, indonesian, manglish, english, mandarin}.
Check hyperlocal language detection models at https://malaya-speech.readthedocs.io/en/latest/load-language-detection.html
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Hyperlocal languages for speech dataset,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/language}}
}
mixed-stt¶
Malay, Singlish and Mandarin STT dataset in TFRecord format. Included scripts how to load using torch.dataset.
news¶
Read random sentences from bahasa news.
voice by Husein Zolkepli.
Heavily speaking in Selangor dialect.
Recorded using mid-end tech microphone, suitable for text to speech.
44100 sample rate, random 7 - 11 words window.
approximate 3.01 hours.
Still on going recording.
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from local news texts,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/news}}
}
noise¶
Simple noises gathered from Youtube.
Sebut perkataan¶
Read random words from malay dictionary started with ‘tolong sebut
sebut-perkataan-manvoice by Husein Zolkeplitolong-sebutvoice by Khalil Noohsebut-perkataan-womanvoice by Mas Aisyah AhmadRecorded using low-end tech microphones.
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Short Speech Dataset,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/sebut-perkataan}}
}
Semisupervised audiobook¶
Semisupervised malay audiobooks from Nusantara Audiobook using Google Speech to Text.
44100 sample rate, super clean.
semisupervised approximate 45.29 hours.
windowed using Malaya-Speech VAD, each atleast 5 negative voice activities.
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Semisupervised Speech Recognition from Audiobook,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-audiobook}}
}
Semisupervised malay¶
Semisupervised malay youtube videos using Google Speech to Text, after that corrected by human.
16000 sample rate.
semisupervised approximate 1804 hours.
random length between 2 - 20 seconds, windowed using google VAD.
supervised 768 samples, approximate 1.3 hours.
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Semisupervised Speech Recognition from Malay Youtube Videos,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-malay}}
}
Semisupervised manglish¶
Semisupervised manglish youtube videos using Google Speech to Text.
16000 sample rate.
semisupervised approximate 107 hours.
random length between 2 - 20 seconds, windowed using google VAD.
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Semisupervised Speech Recognition from Manglish Youtube Videos,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-manglish}}
}
wattpad¶
Read random sentences from bahasa wattpad.
voice by Husein Zolkepli.
Heavily speaking in Selangor dialect.
Recorded using mid-end tech microphone, suitable for text to speech.
44100 sample rate, random 7 - 11 words window.
approximate 0.15 hours.
Still on going recording.
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from Wattpad texts,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/wattpad}}
}
Wikipedia¶
Read random sentences from Bahasa Wikipedia.
voice by Husein Zolkepli.
Heavily speaking in Selangor dialect.
Recorded using low-end tech microphone.
44100 sample rate, 4 words window.
approximate 3.4 hours.
Still on going recording.
@misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from Wikipedia texts,
author = {Husein, Zolkepli},
title = {Malay-Dataset},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/wikipedia}}
}
youtube¶
Gathered Jeorogan, malay, malaysian, the thirsty sisters, richroll podcasts.
Contribution¶
Contact us at husein.zol05@gmail.com or husein@mesolitica.com if want to contribute to speech bahasa dataset.