{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Prepare Sebut Perkataan Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset layout is very simple: the transcript is derived from the file name.\n", "\n", "If the file name is `sebut-perkataan/ayam.wav`, the text is `sebut perkataan ayam`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [malaya-speech/example/prepare-sst-data](https://github.com/huseinzol05/malaya-speech/tree/master/example/prepare-sst-data).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download dataset\n", "\n", "Uncomment the code below to download the dataset." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# https://github.com/huseinzol05/Malay-Dataset/tree/master/speech/sebut-perkataan\n", "# !wget https://f000.backblazeb2.com/file/malay-dataset/speech-bahasa.zip\n", "# !mkdir sebut-perkataan\n", "# !unzip speech-bahasa.zip -d sebut-perkataan" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import malaya_speech.train as train\n", "import malaya_speech" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Character encoding\n", "\n", "Strings are encoded into integers based on the ASCII table; just pass a string to `malaya_speech.char.encode`. In the output below, each value is the character's ASCII code plus 2 (for example, `h` is 104 in ASCII and is encoded as 106), followed by a trailing `1`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[106,\n", " 103,\n", " 110,\n", " 110,\n", " 113,\n", " 34,\n", " 109,\n", " 103,\n", " 118,\n", " 107,\n", " 99,\n", " 109,\n", " 34,\n", " 117,\n", " 99,\n", " 123,\n", " 99,\n", " 34,\n", " 111,\n", " 99,\n", " 117,\n", " 106,\n", " 99,\n", " 111,\n", " 1]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoded = malaya_speech.char.encode('hello ketiak saya masham')\n", "encoded" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'hello ketiak saya masham'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_speech.char.decode(encoded)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building the dataset" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1463" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from glob import glob\n", "import 
os\n", "\n", "files = glob('sebut-perkataan/*/*.wav', recursive=True)\n", "len(files)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'sebut perkataan amko'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def get_text(file):\n", "    # Drop the extension and the top-level folder, then strip the\n", "    # `-woman` / `-man` suffix from the remaining folder name.\n", "    file = file.replace('.wav', '')\n", "    splitted = file.split('/')[1:]\n", "    splitted[0] = splitted[0].replace('-woman', '').replace('-man', '').replace('-', ' ')\n", "    return ' '.join(splitted).lower().strip()\n", "\n", "get_text(files[0])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1463, 1463)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "audios, texts = [], []\n", "\n", "for file in files:\n", "    text = get_text(file)\n", "    audios.append(file)\n", "    texts.append(text)\n", "\n", "len(audios), len(texts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert to TFRecord\n", "\n", "This step is optional. We recommend training the model directly from a generator (yield iterator), but we can also save the data as TFRecord files to speed up the data pipeline. To do that, we first need to create a generator." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm\n", "\n", "def generator():\n", "    # Yield one example per audio file: raw waveform, its length,\n", "    # the encoded transcript and the raw transcript.\n", "    for i in tqdm(range(len(audios))):\n", "        wav_data, sr = malaya_speech.load(audios[i])\n", "\n", "        yield {\n", "            'waveforms': wav_data.tolist(),\n", "            'waveform_lens': [len(wav_data)],\n", "            'targets': malaya_speech.char.encode(texts[i]),\n", "            'raw_transcript': [texts[i]],\n", "        }\n", "\n", "generator = generator()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "import os\n", "import tensorflow as tf\n", "\n", "# Clear any previous output, then make sure the output directory exists.\n", "os.system('rm tolong-sebut/data/*')\n", "DATA_DIR = os.path.expanduser('tolong-sebut/data')\n", "tf.gfile.MakeDirs(DATA_DIR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define shards\n", "\n", "With the shards defined as below,\n", "\n", "```python\n", "shards = [{'split': 'train', 'shards': 99}, {'split': 'dev', 'shards': 1}]\n", "```\n", "\n", "if we have 100 samples, 99% of them will be used for train and 1% for dev." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "shards = [{'split': 'train', 'shards': 99}, {'split': 'dev', 'shards': 1}]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Save to TFRecord\n", "\n", "Just pass the generator to `malaya_speech.train.prepare_dataset`,\n", "\n", "```python\n", "def prepare_dataset(\n", "    generator,\n", "    data_dir: str,\n", "    shards: List[Dict],\n", "    prefix: str = 'dataset',\n", "    shuffle: bool = True,\n", "    already_shuffled: bool = False,\n", "):\n", "```" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /home/husein/malaya-speech/malaya_speech/train/prepare_data.py:89: The name tf.gfile.Exists is deprecated. 
Please use tf.io.gfile.exists instead.\n", "\n", "WARNING:tensorflow:From /home/husein/malaya-speech/malaya_speech/train/prepare_data.py:199: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 0%| | 0/1463 [00:00