.. role:: raw-html-m2r(raw)
:format: html
Speech Dataset
==============
How do we gather the dataset?
-----------------------------

#. For semisupervised transcripts, we use Google Speech to Text, and the output is then verified and corrected by a human.
#. We record using our own microphones.
License
-------
The Malay-Speech dataset is available for download for research purposes under a Creative Commons Attribution 4.0 International License.

This work is licensed under a Creative Commons Attribution 4.0 International License.
Dataset
-------
`Ambient `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Simple ambient sounds gathered from YouTube.
`Audiobook `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
YouTube URLs gathered for Indonesian, English, and low-quality English audiobooks only.
`Azure-TTS `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Semisupervised Malay TTS dataset from Azure TTS cloud.
`GCP-TTS `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Semisupervised Malay TTS dataset from GCP TTS cloud.
`Emotion `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Speech emotion dataset used by Malaya-Speech for speech emotion detection.
`IIUM `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Random sentences from IIUM Confession, read aloud.

* Voiced by `Husein Zolkepli `_ and `Shafiqah Idayu `_.
* Spoken with a heavy Selangor dialect.
* Recorded using a low-end microphone.
* 44100 Hz sample rate, split at the end of each sentence.
* Approximately 2.4 hours.
* Recording is still ongoing.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from IIUM Confession texts,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/iium}}
   }

`IIUM-Clear `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Random sentences from IIUM Confession, read aloud; a cleaner version of the IIUM recordings.

* Voiced by `Husein Zolkepli `_.
* Spoken with a heavy Selangor dialect.
* Recorded using a mid-range microphone.
* 44100 Hz sample rate, random windows of 7 to 11 words.
* Approximately 0.1 hours.
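
The random 7 to 11 word windowing can be pictured with a minimal Python sketch. This is an assumption about the procedure, not the script actually used to prepare the recordings; the function name ``random_word_windows`` and its parameters are hypothetical.

.. code-block:: python

   import random


   def random_word_windows(text, min_words=7, max_words=11, seed=None):
       """Split text into consecutive windows of min_words..max_words words.

       Assumption: windows are consecutive and non-overlapping. The final
       window may be shorter than min_words if the text runs out.
       """
       rng = random.Random(seed)
       words = text.split()
       windows = []
       start = 0
       while start < len(words):
           # Draw a fresh window size for every chunk.
           size = rng.randint(min_words, max_words)
           windows.append(" ".join(words[start:start + size]))
           start += size
       return windows

Each returned window would be one prompt to read aloud. Passing a fixed ``seed`` makes the split reproducible.
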

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from IIUM Confession texts,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/iium}}
   }

`IMDA `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Mirror of the IMDA dataset, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus, containing only the PART 3 and SST datasets.

* 16000 Hz sample rate.
* Supervised, approximately 2024 hours.
`language `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
YouTube URLs gathered for hyperlocal language detection from speech, covering {malay, indonesian, manglish, english, mandarin}.

The hyperlocal language detection models are documented at https://malaya-speech.readthedocs.io/en/latest/load-language-detection.html

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Hyperlocal languages for speech dataset,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/language}}
   }

`mixed-stt `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Malay, Singlish and Mandarin STT dataset in TFRecord format. Scripts showing how to load it using ``torch.dataset`` are included.
`news `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Random sentences from Bahasa news, read aloud.

* Voiced by `Husein Zolkepli `_.
* Spoken with a heavy Selangor dialect.
* Recorded using a mid-range microphone, suitable for text to speech.
* 44100 Hz sample rate, random windows of 7 to 11 words.
* Approximately 3.01 hours.
* Recording is still ongoing.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from local news texts,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/news}}
   }

`noise `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Simple noise samples gathered from YouTube.
`Sebut perkataan `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Random words from a Malay dictionary, read aloud, each prefixed with 'tolong sebut' followed by the word.

* ``sebut-perkataan-man`` voiced by `Husein Zolkepli `_
* ``tolong-sebut`` voiced by `Khalil Nooh `_
* ``sebut-perkataan-woman`` voiced by `Mas Aisyah Ahmad `_
* Recorded using low-end microphones.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Short Speech Dataset,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/sebut-perkataan}}
   }

`Semisupervised audiobook `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Semisupervised Malay audiobooks from Nusantara Audiobook, transcribed using Google Speech to Text.

* 44100 Hz sample rate, very clean audio.
* Semisupervised, approximately 45.29 hours.
* Windowed using Malaya-Speech VAD, each with at least 5 negative voice activities.
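
One way to read the VAD windowing rule is that a window is closed whenever at least 5 consecutive frames are classified as non-speech. The Python sketch below illustrates that interpretation only; it is an assumption, it does not call Malaya-Speech, and the function name ``split_on_negative_runs`` and its inputs are hypothetical. It takes per-frame VAD decisions (``True`` for speech) from any detector.

.. code-block:: python

   def split_on_negative_runs(vad_flags, min_negative=5):
       """Return (start, end) frame index pairs, end exclusive, per window.

       Assumption: a window is closed once `min_negative` consecutive
       non-speech frames are seen. Shorter pauses stay inside the window,
       and trailing silence is not included in any window.
       """
       windows = []
       start = None        # first speech frame of the current window
       last_speech = None  # most recent speech frame seen
       negatives = 0       # length of the current non-speech run
       for i, is_speech in enumerate(vad_flags):
           if is_speech:
               if start is None:
                   start = i
               last_speech = i
               negatives = 0
           elif start is not None:
               negatives += 1
               if negatives >= min_negative:
                   windows.append((start, last_speech + 1))
                   start = None
                   negatives = 0
       # Flush a window that is still open at the end of the audio.
       if start is not None:
           windows.append((start, last_speech + 1))
       return windows

Frame indices can be converted to sample offsets by multiplying by the frame length in samples, which depends on the VAD configuration.
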

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Semisupervised Speech Recognition from Audiobook,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-audiobook}}
   }

`Semisupervised malay `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Semisupervised Malay YouTube videos transcribed using Google Speech to Text, then corrected by a human.

* 16000 Hz sample rate.
* Semisupervised, approximately 1804 hours.
* Random lengths between 2 and 20 seconds, windowed using Google VAD.
* Supervised: 768 samples, approximately 1.3 hours.
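
At a 16000 Hz sample rate, lengths of 2 to 20 seconds correspond to 32000 to 320000 samples per window. The Python sketch below draws window lengths in that range; it is a minimal illustration that assumes purely random cut points and does not model the Google VAD step used for the real data. The function name ``random_length_windows`` and its parameters are hypothetical.

.. code-block:: python

   import random


   def random_length_windows(total_samples, sample_rate=16000,
                             min_seconds=2.0, max_seconds=20.0, seed=None):
       """Return consecutive (start, end) sample ranges, end exclusive.

       Every window lasts between min_seconds and max_seconds. A leftover
       tail shorter than min_seconds is dropped.
       """
       rng = random.Random(seed)
       min_len = int(min_seconds * sample_rate)
       max_len = int(max_seconds * sample_rate)
       windows = []
       start = 0
       while total_samples - start >= min_len:
           length = rng.randint(min_len, max_len)
           # Clip the last window to the end of the recording.
           end = min(start + length, total_samples)
           windows.append((start, end))
           start = end
       return windows

Each ``(start, end)`` pair can be used to slice a one-dimensional array of audio samples.
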

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Semisupervised Speech Recognition from Malay Youtube Videos,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-malay}}
   }

`Semisupervised manglish `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Semisupervised Manglish YouTube videos transcribed using Google Speech to Text.

* 16000 Hz sample rate.
* Semisupervised, approximately 107 hours.
* Random lengths between 2 and 20 seconds, windowed using Google VAD.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Semisupervised Speech Recognition from Manglish Youtube Videos,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/semisupervised-manglish}}
   }

`wattpad `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Random sentences from Bahasa Wattpad, read aloud.

* Voiced by `Husein Zolkepli `_.
* Spoken with a heavy Selangor dialect.
* Recorded using a mid-range microphone, suitable for text to speech.
* 44100 Hz sample rate, random windows of 7 to 11 words.
* Approximately 0.15 hours.
* Recording is still ongoing.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from Wattpad texts,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/wattpad}}
   }

`Wikipedia `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Random sentences from Bahasa Wikipedia, read aloud.

* Voiced by `Husein Zolkepli `_.
* Spoken with a heavy Selangor dialect.
* Recorded using a low-end microphone.
* 44100 Hz sample rate, windows of 4 words.
* Approximately 3.4 hours.
* Recording is still ongoing.

.. code-block:: bibtex

   @misc{Malay-Dataset, We gather Bahasa Malaysia corpus!, Speech Dataset from Wikipedia texts,
     author = {Husein, Zolkepli},
     title = {Malay-Dataset},
     year = {2018},
     publisher = {GitHub},
     journal = {GitHub repository},
     howpublished = {\url{https://github.com/huseinzol05/malaya-speech/tree/master/data/wikipedia}}
   }

`youtube `_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Semisupervised transcription and unsupervised speaker diarization on 5k YouTube videos of Malay speakers.
Contribution
------------
Contact us at husein.zol05@gmail.com or husein@mesolitica.com if you want to contribute to the Bahasa speech dataset.