Prepare Sebut Perkataan Dataset#

The dataset is very simple: the transcript of each recording is derived from its file path.

For example, if the file name is sebut-perkataan/ayam.wav, the text is sebut perkataan ayam.

This tutorial is available as an IPython notebook at malaya-speech/example/prepare-sst-data.

Download dataset#

Uncomment the code below to download the dataset.

[8]:
# https://github.com/huseinzol05/Malay-Dataset/tree/master/speech/sebut-perkataan
# !wget https://f000.backblazeb2.com/file/malay-dataset/speech-bahasa.zip
# !mkdir sebut-perkataan
# !unzip speech-bahasa.zip -d sebut-perkataan
[2]:
import malaya_speech.train as train
import malaya_speech

Character encoding#

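We use the ASCII table to encode a string into its integer representation; just pass the string to malaya_speech.char.encode. In the output below, each id is the character's ASCII code plus 2 (for example, 'h' is 104 in ASCII and is encoded as 106), and the trailing 1 decodes to <EOS>.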

[3]:
encoded = malaya_speech.char.encode('hello ketiak saya masham')
encoded
[3]:
[106,
 103,
 110,
 110,
 113,
 34,
 109,
 103,
 118,
 107,
 99,
 109,
 34,
 117,
 99,
 123,
 99,
 34,
 111,
 99,
 117,
 106,
 99,
 111,
 1]
[4]:
malaya_speech.char.decode(encoded)
[4]:
'hello ketiak saya masham<EOS>'
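
The output above suggests a simple mapping: each id equals the character's ASCII code plus 2, and the final id 1 decodes to `<EOS>`. The following is a minimal sketch of that mapping, inferred only from this output. The names `sketch_encode`, `sketch_decode`, `OFFSET` and `EOS_ID` are hypothetical, and this is not the actual implementation of malaya_speech.char.

```python
# Hypothetical re-implementation inferred from the notebook output above;
# not the actual malaya_speech.char code.
OFFSET = 2   # assumption: ids below 2 are reserved for special symbols
EOS_ID = 1   # assumption: id 1 marks the end of the sentence


def sketch_encode(text):
    # Shift every ASCII code by OFFSET, then append the end-of-sentence id.
    return [ord(c) + OFFSET for c in text] + [EOS_ID]


def sketch_decode(ids):
    # Map EOS_ID to the literal '<EOS>' and shift all other ids back.
    pieces = []
    for i in ids:
        if i == EOS_ID:
            pieces.append('<EOS>')
        elif i >= OFFSET:
            pieces.append(chr(i - OFFSET))
    return ''.join(pieces)


ids = sketch_encode('hello')
print(ids)                 # [106, 103, 110, 110, 113, 1]
print(sketch_decode(ids))  # hello<EOS>
```

Under these assumptions, the sketch reproduces the ids shown above for 'hello ketiak saya masham'.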

Building the dataset#

[5]:
from glob import glob
import os

files = glob('sebut-perkataan/*/*.wav')
len(files)
[5]:
1463
[6]:
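def get_text(file):
    # Drop the extension, then split the path into its components.
    file = file.replace('.wav', '')
    # Skip the top-level directory; keep the sub-directory and the file name.
    splitted = file.split('/')[1:]
    # Remove the '-woman' / '-man' suffix and turn hyphens into spaces.
    splitted[0] = splitted[0].replace('-woman', '').replace('-man', '').replace('-', ' ')
    return ' '.join(splitted).lower().strip()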

get_text(files[0])
[6]:
'sebut perkataan amko'
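
To make the path-to-transcript rule concrete, here is a small self-contained sketch that applies the same steps as get_text to a few example paths. The paths are hypothetical: the sub-directory names are guessed from the `-man` / `-woman` replacements and from the `tolong-sebut` prefix used later, and only the word `amko` appears in the real output above. `path_to_text` is just a local copy of the logic so that the block runs on its own.

```python
def path_to_text(path):
    # Same steps as get_text above, reproduced so this sketch is self-contained.
    parts = path.replace('.wav', '').split('/')[1:]
    parts[0] = parts[0].replace('-woman', '').replace('-man', '').replace('-', ' ')
    return ' '.join(parts).lower().strip()


# Hypothetical paths, used only for illustration.
examples = [
    'sebut-perkataan/sebut-perkataan-man/amko.wav',
    'sebut-perkataan/sebut-perkataan-woman/ayam.wav',
    'sebut-perkataan/tolong-sebut/Ayam.wav',
]

for p in examples:
    print(p, '->', path_to_text(p))
```

The sub-directory supplies the spoken prefix and the file name supplies the word, so recordings of the same word in different sub-directories with the same prefix share one transcript.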
[7]:
audios, texts = [], []

for file in files:
    text = get_text(file)
    audios.append(file)
    texts.append(text)

len(audios), len(texts)
[7]:
(1463, 1463)

Change into TFRecord#

This step is optional. We recommend training the model directly from a yield iterator (a Python generator), but the data can also be saved as TFRecord to speed up the data pipeline. To do that, we first need to create a yield iterator.

[34]:
from tqdm import tqdm

def generator():
    for i in tqdm(range(len(audios))):
        wav_data, sr = malaya_speech.load(audios[i])

        yield {
            'waveforms': wav_data.tolist(),
            'waveform_lens': [len(wav_data)],
            'targets': malaya_speech.char.encode(texts[i]),
            'raw_transcript': [texts[i]],
        }

generator = generator()
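
Note that `generator = generator()` rebinds the name to a generator object, and a generator object can be iterated only once. The sketch below uses a hypothetical `make_examples` function with dummy strings in place of audio to show the consequence: a second pass over the same object yields nothing.

```python
def make_examples(items):
    # Hypothetical stand-in for the audio generator above, with dummy data.
    for item in items:
        yield {'raw_transcript': [item]}


gen = make_examples(['ayam', 'itik', 'kucing'])

first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # already exhausted, so nothing is produced

print(len(first_pass))    # 3
print(len(second_pass))   # 0
```

In this notebook that means the whole cell above, including the `def`, has to be re-run before the dataset can be written a second time.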
[35]:
import os
import tensorflow as tf

os.system('rm tolong-sebut/data/*')
DATA_DIR = os.path.expanduser('tolong-sebut/data')
tf.gfile.MakeDirs(DATA_DIR)

Define shards#

The shards are defined as follows,

shards = [{'split': 'train', 'shards': 99}, {'split': 'dev', 'shards': 1}]

With this definition, 99% of the samples are used for train and 1% for dev; if we have 100 samples, that is 99 for train and 1 for dev.

[36]:
shards = [{'split': 'train', 'shards': 99}, {'split': 'dev', 'shards': 1}]
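
To see how shard counts translate into split sizes, the sketch below distributes examples over the shard files in round-robin order and counts how many land in each split. Round-robin assignment is an assumption made for illustration, and `split_counts` is a hypothetical helper; the actual logic inside malaya_speech.train.prepare_dataset may differ.

```python
def split_counts(n_examples, shards):
    # Expand the shard spec into one slot per output file,
    # e.g. 99 'train' slots followed by 1 'dev' slot.
    slots = [s['split'] for s in shards for _ in range(s['shards'])]
    counts = {s['split']: 0 for s in shards}
    # Assumption: example i goes to file i modulo the total number of files.
    for i in range(n_examples):
        counts[slots[i % len(slots)]] += 1
    return counts


shards = [{'split': 'train', 'shards': 99}, {'split': 'dev', 'shards': 1}]

print(split_counts(100, shards))   # {'train': 99, 'dev': 1}
print(split_counts(1463, shards))  # {'train': 1449, 'dev': 14}
```

Under this assumption, the 1463 recordings in this dataset would give roughly 14 dev examples, which is close to the stated 1%.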

Save to TFRecord#

Just pass the yield iterator to malaya_speech.train.prepare_dataset, which has the following signature,

def prepare_dataset(
    generator,
    data_dir: str,
    shards: List[Dict],
    prefix: str = 'dataset',
    shuffle: bool = True,
    already_shuffled: bool = False,
):
[37]:
train.prepare_dataset(generator, DATA_DIR, shards, prefix = 'tolong-sebut')
WARNING:tensorflow:From /home/husein/malaya-speech/malaya_speech/train/prepare_data.py:89: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.

WARNING:tensorflow:From /home/husein/malaya-speech/malaya_speech/train/prepare_data.py:199: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

  0%|          | 0/1463 [00:00<?, ?it/s]
WARNING:tensorflow:From /home/husein/malaya-speech/malaya_speech/train/prepare_data.py:205: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Generating case 0.
100%|██████████| 1463/1463 [00:24<00:00, 60.74it/s]
WARNING:tensorflow:From /home/husein/malaya-speech/malaya_speech/train/prepare_data.py:218: The name tf.gfile.Rename is deprecated. Please use tf.io.gfile.rename instead.

INFO:tensorflow:Generated 1463 Examples
INFO:tensorflow:Shuffling data...
WARNING:tensorflow:From /home/husein/malaya-speech/malaya_speech/train/prepare_data.py:26: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
WARNING:tensorflow:From /home/husein/malaya-speech/malaya_speech/train/prepare_data.py:57: The name tf.gfile.Remove is deprecated. Please use tf.io.gfile.remove instead.


INFO:tensorflow:Data shuffled.