.. role:: raw-html-m2r(raw)
   :format: html

Speech Dataset
==============

How do we gather the dataset?
-----------------------------

#. For semisupervised transcripts, we use Google Speech to Text; the output is then verified and corrected by humans.
#. We record using our own microphones.

License
-------

The Malay-Speech dataset is available to download for research purposes under a Creative Commons Attribution 4.0 International License.

Dataset
-------

Ambient
^^^^^^^

Simple ambient sounds gathered from YouTube.

Audiobook
^^^^^^^^^

YouTube URLs gathered for Indonesian, English and low-quality English audiobooks only.

Azure-TTS
^^^^^^^^^

Semisupervised Malay TTS dataset from Azure TTS cloud.

GCP-TTS
^^^^^^^

Semisupervised Malay TTS dataset from GCP TTS cloud.

Emotion
^^^^^^^

Speech emotion dataset used by Malaya-Speech for speech emotion detection.

IIUM
^^^^

Random sentences from IIUM Confession, read aloud.

* Voiced by Husein Zolkepli and Shafiqah Idayu.
* Spoken in a heavy Selangor dialect.
* Recorded using a low-end microphone.
* 44100 Hz sample rate, split at the end of each sentence.
* Approximately 2.4 hours.
* Recording is still ongoing.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from IIUM Confession texts,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/iium}}
   }

IIUM-Clear
^^^^^^^^^^

Random sentences from IIUM Confession, read aloud; a cleaner version of the IIUM recordings.

* Voiced by Husein Zolkepli.
* Spoken in a heavy Selangor dialect.
* Recorded using a mid-range microphone.
* 44100 Hz sample rate, random windows of 7 - 11 words.
* Approximately 0.1 hours.

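The "random 7 - 11 words window" mentioned above can be pictured with a short sketch. This is only an illustration of the idea, not the script used to prepare the recordings: the function name ``random_word_windows`` and its parameters are hypothetical, and it assumes the source text is simply split on whitespace.

.. code-block:: python

   import random


   def random_word_windows(text, min_words=7, max_words=11, seed=None):
       """Split text into consecutive windows of a random number of words.

       Hypothetical helper for illustration only. Every window except
       possibly the last one has between min_words and max_words words.
       """
       rng = random.Random(seed)
       words = text.split()
       windows = []
       start = 0
       while start < len(words):
           # Draw a window size in the inclusive range [min_words, max_words].
           size = rng.randint(min_words, max_words)
           windows.append(" ".join(words[start:start + size]))
           start += size
       return windows


   if __name__ == "__main__":
       sample = " ".join(f"perkataan{i}" for i in range(30))
       for window in random_word_windows(sample, seed=0):
           print(window)

Each printed line would then serve as one prompt to read aloud; the last window may be shorter than 7 words when the text runs out.
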
.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from IIUM Confession texts,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/iium}}
   }

IMDA
^^^^

Mirror link for the IMDA dataset, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus, of which only PART 3 and the SST dataset were downloaded.

* 16000 Hz sample rate.
* Supervised, approximately 2024 hours.

language
^^^^^^^^

YouTube URLs gathered for hyperlocal language detection from speech {malay, indonesian, manglish, english, mandarin}.

See the hyperlocal language detection models at https://malaya-speech.readthedocs.io/en/latest/load-language-detection.html

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Hyperlocal languages for speech dataset,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/language}}
   }

mixed-stt
^^^^^^^^^

Malay, Singlish and Mandarin STT dataset in TFRecord format. Includes scripts showing how to load it using ``torch.dataset``.

news
^^^^

Random sentences from Bahasa news, read aloud.

* Voiced by Husein Zolkepli.
* Spoken in a heavy Selangor dialect.
* Recorded using a mid-range microphone, suitable for text to speech.
* 44100 Hz sample rate, random windows of 7 - 11 words.
* Approximately 3.01 hours.
* Recording is still ongoing.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from local news texts,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/news}}
   }

noise
^^^^^

Simple noises gathered from YouTube.

Sebut perkataan
^^^^^^^^^^^^^^^

Random words from a Malay dictionary, read aloud; each utterance starts with 'tolong sebut' followed by the word.

* ``sebut-perkataan-man`` voiced by Husein Zolkepli.
* ``tolong-sebut`` voiced by Khalil Nooh.
* ``sebut-perkataan-woman`` voiced by Mas Aisyah Ahmad.
* Recorded using low-end microphones.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Short Speech Dataset,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/sebut-perkataan}}
   }

Semisupervised audiobook
^^^^^^^^^^^^^^^^^^^^^^^^

Semisupervised Malay audiobooks from Nusantara Audiobook, transcribed using Google Speech to Text.

* 44100 Hz sample rate, very clean audio.
* Semisupervised, approximately 45.29 hours.
* Windowed using Malaya-Speech VAD, each with at least 5 negative voice activities.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Semisupervised Speech Recognition from Audiobook,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-audiobook}}
   }

Semisupervised malay
^^^^^^^^^^^^^^^^^^^^

Semisupervised Malay YouTube videos, transcribed using Google Speech to Text and then corrected by humans.

* 16000 Hz sample rate.
* Semisupervised, approximately 1804 hours.
* Random lengths between 2 - 20 seconds, windowed using Google VAD.
* Supervised, 768 samples, approximately 1.3 hours.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Semisupervised Speech Recognition from Malay Youtube Videos,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-malay}}
   }

Semisupervised manglish
^^^^^^^^^^^^^^^^^^^^^^^

Semisupervised Manglish YouTube videos, transcribed using Google Speech to Text.

* 16000 Hz sample rate.
* Semisupervised, approximately 107 hours.
* Random lengths between 2 - 20 seconds, windowed using Google VAD.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Semisupervised Speech Recognition from Manglish Youtube Videos,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-manglish}}
   }

wattpad
^^^^^^^

Random sentences from Bahasa Wattpad, read aloud.

* Voiced by Husein Zolkepli.
* Spoken in a heavy Selangor dialect.
* Recorded using a mid-range microphone, suitable for text to speech.
* 44100 Hz sample rate, random windows of 7 - 11 words.
* Approximately 0.15 hours.
* Recording is still ongoing.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from Wattpad texts,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/wattpad}}
   }

Wikipedia
^^^^^^^^^

Random sentences from Bahasa Wikipedia, read aloud.

* Voiced by Husein Zolkepli.
* Spoken in a heavy Selangor dialect.
* Recorded using a low-end microphone.
* 44100 Hz sample rate, windows of 4 words.
* Approximately 3.4 hours.
* Recording is still ongoing.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from Wikipedia texts,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/wikipedia}}
   }

youtube
^^^^^^^

Semisupervised transcription and unsupervised speaker diarization on 5k Malay-speaker YouTube videos.

Contribution
------------

Contact us at husein.zol05@gmail.com or husein@mesolitica.com if you want to contribute to the Bahasa speech dataset.