API¶

malaya_speech¶

malaya_speech.augmentation.spectrogram¶

malaya_speech.augmentation.spectrogram.mask_frequency(features, n_freq_mask: int = 2, width_freq_mask: int = 8, random_band=True)[source]¶

Mask frequency.

Parameters

features (np.array) –
n_freq_mask (int, optional (default=2)) – loop size for masking.
width_freq_mask (int, optional (default=8)) – masking size.

Returns

result

Return type

np.array

malaya_speech.augmentation.spectrogram.mask_time(features, n_time_mask=2, width_time_mask=8, random_band=True)[source]¶

Time frequency.

Parameters

features (np.array) –
n_time_mask (int, optional (default=2)) – loop size for masking.
width_time_mask (int, optional (default=8)) – masking size.

Returns

result

Return type

np.array

malaya_speech.augmentation.spectrogram.tf_mask_frequency(features, n_freq_mask=2, F=27)[source]¶

Mask frequency using Tensorflow.

Parameters

features (np.array) –
F (size of mask for frequency) –

malaya_speech.augmentation.spectrogram.tf_mask_time(features, n_time_mask=2, T=80)[source]¶

Mask time using Tensorflow.

Parameters

features (np.array) –
T (size of mask for time) –

malaya_speech.extra.rttm¶

malaya_speech.extra.rttm.load(file: str)[source]¶

Load RTTM file.

Parameters: file (str) –
Returns: result
Return type: Dict[str, malaya_speech.model.annotation.Annotation]

malaya_speech.extra.visualization¶

malaya_speech.extra.visualization.visualize_vad(signal, preds: List[Tuple[malaya_speech.model.frame.Frame, bool]], sample_rate: int = 16000, figsize: Tuple[int, int] = (15, 3), ax=None, **kwargs)[source]¶

Visualize signal given VAD labels. Green means got voice activity, while Red is not.

Parameters

signal (list / np.array) –
preds (List[Tuple[Frame, bool]]) –
sample_rate (int, optional (default=16000)) –
figsize (Tuple[int, int], optional (default=(15, 7))) – matplotlib figure size.

malaya_speech.extra.visualization.plot_classification(preds, description, ax=None, fontsize_text=14, x_text=0.05, y_text=0.2, ylim=(0.1, 0.9), figsize: Tuple[int, int] = (15, 3), **kwargs)[source]¶

Visualize probability / boolean.

Parameters

preds (List[Tuple[Frame, label]]) –
description (str) –
ax (ax, optional (default = None)) –
fontsize_text (int, optional (default = 14)) –
x_text (float, optional (default = 0.05)) –
y_text (float, optional (default = 0.2)) –

malaya_speech.model.classification.Speakernet¶

class malaya_speech.model.classification.Speakernet[source]¶

vectorize(inputs)[source]¶

Vectorize inputs.

Parameters: inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result – returned [B, D].
Return type: np.array

malaya_speech.model.classification.Speaker2Vec¶

class malaya_speech.model.classification.Speaker2Vec[source]¶

vectorize(inputs)[source]¶

Vectorize inputs.

Parameters: inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result – returned [B, D].
Return type: np.array

malaya_speech.model.classification.SpeakernetClassification¶

class malaya_speech.model.classification.SpeakernetClassification[source]¶

predict_proba(inputs)[source]¶

Predict inputs, will return probability.

Parameters: inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result – returned [B, D].
Return type: np.array

predict(inputs)[source]¶

Predict inputs, will return labels.

Parameters: inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result – returned [B].
Return type: List[str]

malaya_speech.model.classification.Classification¶

class malaya_speech.model.classification.Classification[source]¶

predict_proba(inputs)[source]¶

Predict inputs, will return probability.

Parameters: inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result – returned [B, D].
Return type: np.array

predict(inputs)[source]¶

Predict inputs, will return labels.

Parameters: inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result – returned [B].
Return type: List[str]

malaya_speech.model.huggingface.HuggingFace_CTC¶

class malaya_speech.model.huggingface.HuggingFace_CTC[source]¶

greedy_decoder(inputs)[source]¶

Transcribe inputs using greedy decoder.

Parameters: input (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result
Return type: List[str]

predict(inputs)[source]¶

Predict logits from inputs using greedy decoder.

Parameters: input (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result
Return type: List[str]

predict_logits(inputs, norm_func=<function softmax>)[source]¶

Predict logits from inputs.

Parameters

input (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
norm_func (Callable, optional (default=malaya.utils.activation.softmax)) –

Returns

result

Return type

List[np.array]

gradio(record_mode: bool = True, lm_func: Callable = None, **kwargs)[source]¶

Transcribe an input using beam decoder on Gradio interface.

Parameters

record_mode (bool, optional (default=True)) – if True, Gradio will use record mode, else, file upload mode.
lm_func (Callable, optional (default=None)) – if not None, will pass a logits with shape [T, D].
**kwargs (keyword arguments for iface.launch.) –

malaya_speech.model.huggingface.HuggingFace_Aligner¶

class malaya_speech.model.huggingface.HuggingFace_Aligner[source]¶

predict(input, transcription: str, sample_rate: int = 16000)[source]¶

Transcribe input, will return a string.

Parameters

input (np.array) – np.array or malaya_speech.model.frame.Frame.
transcription (str) – transcription of input audio.
sample_rate (int, optional (default=16000)) – sample rate for input.

Returns

result

Return type

Dict[chars_alignment, words_alignment, alignment]

malaya_speech.model.splitter.Split_Wav¶

class malaya_speech.model.splitter.Split_Wav[source]¶

predict(input)[source]¶

Split an audio into 4 different speakers.

Parameters: input (np.array or malaya_speech.model.frame.Frame) –
Returns: result
Return type: np.array

malaya_speech.model.splitter.Split_Mel¶

class malaya_speech.model.splitter.Split_Mel[source]¶

predict(input)[source]¶

Split an audio into 4 different speakers.

Parameters: input (np.array or malaya_speech.model.frame.Frame) –
Returns: result
Return type: np.array

malaya_speech.model.splitter.FastSpeechSplit¶

class malaya_speech.model.splitter.FastSpeechSplit[source]¶

predict(original_audio, target_audio, modes=['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])[source]¶

Change original voice audio to follow targeted voice.

Parameters

original_audio (np.array or malaya_speech.model.frame.Frame) –
target_audio (np.array or malaya_speech.model.frame.Frame) –
modes (List[str], optional (default = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])) –
R denotes rhythm, F denotes pitch target, U denotes speaker target (vector).
- 'R' - maintain original_audio F and U on target_audio R.
- 'F' - maintain original_audio R and U on target_audio F.
- 'U' - maintain original_audio R and F on target_audio U.
- 'RF' - maintain original_audio U on target_audio R and F.
- 'RU' - maintain original_audio F on target_audio R and U.
- 'FU' - maintain original_audio R on target_audio F and U.
- 'RFU' - no conversion happened, just do encoder-decoder on target_audio

Returns

result

Return type

Dict[modes]

malaya_speech.model.synthesis.TTS¶

class malaya_speech.model.synthesis.TTS[source]¶

gradio(vocoder: Callable, **kwargs)[source]¶

Text-to-Speech on Gradio interface.

Parameters

vocoder (bool, Callable) – vocoder object that has predict method, prefer from malaya_speech itself.
**kwargs (keyword arguments for predict and iface.launch.) –

malaya_speech.model.synthesis.Vocoder¶

class malaya_speech.model.synthesis.Vocoder[source]¶

predict(inputs)[source]¶

Change Mel to Waveform.

Parameters: inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result
Return type: List

malaya_speech.model.synthesis.Tacotron¶

class malaya_speech.model.synthesis.Tacotron[source]¶

predict(string, **kwargs)[source]¶

Change string to Mel.

Parameters: string (str) –
Returns: result
Return type: Dict[string, decoder-output, mel-output, universal-output, alignment]

malaya_speech.model.synthesis.Fastspeech¶

class malaya_speech.model.synthesis.Fastspeech[source]¶

predict(string, speed_ratio: float = 1.0, f0_ratio: float = 1.0, energy_ratio: float = 1.0, **kwargs)[source]¶

Change string to Mel.

Parameters

string (str) –
speed_ratio (float, optional (default=1.0)) – Increase this variable will increase time voice generated.
f0_ratio (float, optional (default=1.0)) – Increase this variable will increase frequency, low frequency will generate more deeper voice.
energy_ratio (float, optional (default=1.0)) – Increase this variable will increase loudness.

Returns

result

Return type

Dict[string, decoder-output, mel-output, universal-output]

malaya_speech.model.synthesis.FastVC¶

class malaya_speech.model.synthesis.FastVC[source]¶

predict(original_audio, target_audio)[source]¶

Change original voice audio to follow targeted voice.

Parameters

original_audio (np.array or malaya_speech.model.frame.Frame) –
target_audio (np.array or malaya_speech.model.frame.Frame) –

Returns

result

Return type

Dict[decoder-output, mel-output]

malaya_speech.model.synthesis.Fastpitch¶

class malaya_speech.model.synthesis.Fastpitch[source]¶

predict(string, speed_ratio: float = 1.0, pitch_ratio: float = 1.0, pitch_addition: float = 0.0, **kwargs)[source]¶

Change string to Mel.

Parameters

string (str) –
speed_ratio (float, optional (default=1.0)) – Increase this variable will increase time voice generated.
pitch_ratio (float, optional (default=1.0)) – pitch = pitch * pitch_ratio, amplify existing pitch contour.
pitch_addition (float, optional (default=0.0)) – pitch = pitch + pitch_addition, change pitch contour.

Returns

result

Return type

Dict[string, decoder-output, mel-output, pitch-output, universal-output]

malaya_speech.model.transducer.Transducer¶

class malaya_speech.model.transducer.Transducer[source]¶

predict_alignment(input, combined=True)[source]¶

Transcribe input and get timestamp, only support greedy decoder.

Parameters

input (np.array) – np.array or malaya_speech.model.frame.Frame.
combined (bool, optional (default=True)) – If True, will combined subwords to become a word.

Returns

result

Return type

List[Dict[text, start, end]]

greedy_decoder(inputs)[source]¶

Transcribe inputs using greedy decoder.

Parameters: inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result
Return type: List[str]

beam_decoder(inputs, beam_width: int = 5, temperature: float = 0.0, score_norm: bool = True)[source]¶

Transcribe inputs using beam decoder.

Parameters

inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
beam_width (int, optional (default=5)) – beam size for beam decoder.
temperature (float, optional (default=0.0)) – apply temperature function for logits, can help for certain case, logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
score_norm (bool, optional (default=True)) – descending sort beam based on score / length of decoded.

Returns

result

Return type

List[str]

beam_decoder_lm(inputs, language_model, beam_width: int = 5, token_min_logp: float = - 20.0, beam_prune_logp: float = - 5.0, temperature: float = 0.0, score_norm: bool = True)[source]¶

Transcribe inputs using beam decoder + KenLM.

Parameters

inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
language_model (pyctcdecode.language_model.LanguageModel) – pyctcdecode language model, load from LanguageModel(kenlm_model, alpha = alpha, beta = beta).
beam_width (int, optional (default=5)) – beam size for beam decoder.
token_min_logp (float, optional (default=-20.0)) – minimum log probability to select a token.
beam_prune_logp (float, optional (default=-5.0)) – filter candidates >= max score lm + beam_prune_logp.
temperature (float, optional (default=0.0)) – apply temperature function for logits, can help for certain case, logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
score_norm (bool, optional (default=True)) – descending sort beam based on score / length of decoded.

Returns

result

Return type

List[str]

predict(inputs)[source]¶

Transcribe inputs using greedy decoder, will return list of strings.

Parameters: inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result
Return type: List[str]

gradio(record_mode: bool = True, **kwargs)[source]¶

Transcribe an input using beam decoder on Gradio interface.

Parameters

record_mode (bool, optional (default=True)) – if True, Gradio will use record mode, else, file upload mode.
**kwargs (keyword arguments for beam decoder and iface.launch.) –

malaya_speech.model.transducer.TransducerAligner¶

class malaya_speech.model.transducer.TransducerAligner[source]¶

predict(input, transcription: str, sample_rate: int = 16000)[source]¶

Transcribe input, will return a string. :param input: np.array or malaya_speech.model.frame.Frame. :type input: np.array :param transcription: transcription of input audio :type transcription: str :param sample_rate: sample rate for input. :type sample_rate: int, optional (default=16000)

Returns: result
Return type: Dict[words_alignment, subwords_alignment, subwords, alignment]

malaya_speech.model.unet.UNET¶

class malaya_speech.model.unet.UNET[source]¶

predict(inputs)[source]¶

Enhance inputs, will return melspectrogram.

Parameters: inputs (List[np.array]) –
Returns: result
Return type: List

malaya_speech.model.unet.UNETSTFT¶

class malaya_speech.model.unet.UNETSTFT[source]¶

predict(input)[source]¶

Enhance inputs, will return waveform.

Parameters: input (np.array) – np.array or malaya_speech.model.frame.Frame.
Returns: result
Return type: Dict

malaya_speech.model.unet.UNET1D¶

class malaya_speech.model.unet.UNET1D[source]¶

predict(input)[source]¶

Enhance inputs, will return waveform.

Parameters: input (np.array) – np.array or malaya_speech.model.frame.Frame.
Returns: result
Return type: np.array

malaya_speech.model.wav2vec.Wav2Vec2_CTC¶

class malaya_speech.model.wav2vec.Wav2Vec2_CTC[source]¶

greedy_decoder(inputs)[source]¶

Transcribe inputs using greedy decoder.

Parameters: input (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result
Return type: List[str]

beam_decoder(inputs, beam_width: int = 100, **kwargs)[source]¶

Transcribe inputs using beam decoder.

Parameters

input (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
beam_width (int, optional (default=100)) – beam size for beam decoder.

Returns

result

Return type

List[str]

predict(inputs)[source]¶

Predict logits from inputs using greedy decoder.

Parameters: input (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
Returns: result
Return type: List[str]

predict_logits(inputs, norm_func=<function softmax>)[source]¶

Predict logits from inputs.

Parameters

input (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
norm_func (Callable, optional (default=malaya.utils.activation.softmax)) –

Returns

result

Return type

List[np.array]

gradio(record_mode: bool = True, lm_func: Callable = None, **kwargs)[source]¶

Transcribe an input using beam decoder on Gradio interface.

Parameters

record_mode (bool, optional (default=True)) – if True, Gradio will use record mode, else, file upload mode.
lm_func (Callable, optional (default=None)) – if not None, will pass a logits with shape [T, D].
**kwargs (keyword arguments for beam decoder and iface.launch.) –

malaya_speech.model.wav2vec.Wav2Vec2_Aligner¶

class malaya_speech.model.wav2vec.Wav2Vec2_Aligner[source]¶

predict(input, transcription: str, sample_rate: int = 16000)[source]¶

Transcribe input, will return a string.

Parameters

input (np.array) – np.array or malaya_speech.model.frame.Frame.
transcription (str) – transcription of input audio.
sample_rate (int, optional (default=16000)) – sample rate for input.

Returns

result

Return type

Dict[chars_alignment, words_alignment, alignment]

malaya_speech.model.webrtc.WebRTC¶

class malaya_speech.model.webrtc.WebRTC[source]¶

malaya_speech.pipeline¶

class malaya_speech.pipeline.Pipeline[source]¶

visualize(filename='pipeline.png', **kwargs)[source]¶

Render the computation of this object’s task graph using graphviz.

Requires graphviz to be installed.

Parameters

filename (str, optional) – The name of the file to write to disk.
kwargs – Graph attributes to pass to graphviz like rankdir="LR"

batching = <function batching>¶

flatten = <function flatten>¶

foreach_map = <function foreach_map>¶

map = <function map>¶

partition = <function partition>¶

sliding_window = <function sliding_window>¶

zip = <function zip>¶

malaya_speech.pipeline.map¶

class malaya_speech.pipeline.map[source]¶

apply a function / method to the pipeline

Examples

>>> source = Pipeline()
>>> source.map(lambda x: x + 1).map(print)
>>> source.emit(1)
2

malaya_speech.pipeline.batching¶

class malaya_speech.pipeline.batching[source]¶

Batching stream into tuples

Examples

>>> source = Pipeline()
>>> source.batching(2).map(print)
>>> source.emit([1,2,3,4,5])
([1, 2], [3, 4], [5])

malaya_speech.pipeline.partition¶

class malaya_speech.pipeline.partition[source]¶

Partition stream into tuples of equal size

Examples

>>> source = Pipeline()
>>> source.partition(3).map(print)
>>> for i in range(10):
...     source.emit(i)
(0, 1, 2)
(3, 4, 5)
(6, 7, 8)

malaya_speech.pipeline.sliding_window¶

class malaya_speech.pipeline.sliding_window[source]¶

Produce overlapping tuples of size n

Parameters: return_partial (bool) – If True, yield tuples as soon as any events come in, each tuple being smaller or equal to the window size. If False, only start yielding tuples once a full window has accrued.

Examples

>>> source = Pipeline()
>>> source.sliding_window(3, return_partial=False).map(print)
>>> for i in range(8):
...     source.emit(i)
(0, 1, 2)
(1, 2, 3)
(2, 3, 4)
(3, 4, 5)
(4, 5, 6)
(5, 6, 7)

malaya_speech.pipeline.foreach_map¶

class malaya_speech.pipeline.foreach_map[source]¶

Apply a function to every element in a tuple in the stream.

Parameters

func (callable) –
method (str, optional (default='sync')) –
method to process each elements.
- 'sync' - loop one-by-one to process.
- 'async' - async process all elements at the same time.
- 'thread' - multithreading level to process all elements at the same time.
  Default is 1 worker. Override worker_size=n to increase.
- 'process' - multiprocessing level to process all elements at the same time.
  Default is 1 worker. Override worker_size=n to increase.
*args – The arguments to pass to the function.
**kwargs – Keyword arguments to pass to func.

Examples

>>> source = Pipeline()
>>> source.foreach_map(lambda x: 2*x).map(print)
>>> for i in range(3):
...     source.emit((i, i))
(0, 0)
(2, 2)
(4, 4)

malaya_speech.pipeline.flatten¶

class malaya_speech.pipeline.flatten[source]¶

Flatten streams of lists or iterables into a stream of elements

Examples

>>> source = Pipeline()
>>> source.flatten().map(print)
>>> source.emit([[1, 2, 3], [4, 5], [6, 7, 7]])
[1, 2, 3, 4, 5, 6, 7, 7]

malaya_speech.pipeline.zip¶

class malaya_speech.pipeline.zip[source]¶

Combine 2 branches into 1 branch.

Examples

>>> source = Pipeline()
>>> left = source.map(lambda x: x + 1, name = 'left')
>>> right = source.map(lambda x: x + 10, name = 'right')
>>> left.zip(right).map(sum).map(print)
>>> source.emit(2)
15

pack_literals(tup)[source]¶: Fill buffers for literals whenever we empty them

malaya_speech.streaming¶

malaya_speech.streaming.record(vad, asr_model=None, classification_model=None, device=None, input_rate: int = 16000, sample_rate: int = 16000, blocks_per_second: int = 50, padding_ms: int = 300, ratio: float = 0.75, min_length: float = 0.1, filename: str = None, spinner: bool = False)[source]¶

Record an audio using pyaudio library. This record interface required a VAD model.

Parameters

vad (object) – vad model / pipeline.
asr_model (object) – ASR model / pipeline, will transcribe each subsamples realtime.
classification_model (object) – classification pipeline, will classify each subsamples realtime.
device (None) – device parameter for pyaudio, check available devices from sounddevice.query_devices().
input_rate (int, optional (default = 16000)) – sample rate from input device, this will auto resampling.
sample_rate (int, optional (default = 16000)) – output sample rate.
blocks_per_second (int, optional (default = 50)) – size of frame returned from pyaudio, frame size = sample rate / (blocks_per_second / 2). 50 is good for WebRTC, 30 or less is good for Malaya Speech VAD.
padding_ms (int, optional (default = 300)) – size of queue to store frames, size = padding_ms // (1000 * blocks_per_second // sample_rate)
ratio (float, optional (default = 0.75)) – if 75% of the queue is positive, assumed it is a voice activity.
min_length (float, optional (default=0.1)) – minimum length (s) to accept a subsample.
filename (str, optional (default=None)) – if None, will auto generate name based on timestamp.
spinner (bool, optional (default=False)) – if True, will use spinner object from halo library.

Returns

result

Return type

[filename, samples]

malaya_speech.utils.aligner¶

class malaya_speech.utils.aligner.Point(token_index: int, time_index: int, score: float)[source]¶

class malaya_speech.utils.aligner.Segment(label: str, start: int, end: int, score: float)[source]¶

malaya_speech.utils.aligner.put_comma(alignment, min_threshold: float = 0.5)[source]¶

Put comma in alignment from force alignment model.

Parameters

alignment (List[Dict[text, start, end]]) –
min_threshold (float, optional (default=0.5)) – minimum threshold in term of seconds to assume a comma.

Returns

result

Return type

List[str]

malaya_speech.utils.aligner.plot_alignments(alignment, subs_alignment, words_alignment, waveform, separator: str = ' ', sample_rate: int = 16000, figsize: tuple = (16, 9), plot_score_char: bool = False, plot_score_word: bool = True)[source]¶

plot alignment.

Parameters

alignment (np.array) – usually alignment output.
subs_alignment (list) – usually chars_alignment or subwords_alignment output.
words_alignment (list) – usually words_alignment output.
waveform (np.array) – input audio.
separator (str, optional (default=' ')) – separator between words, only useful if subs_alignment is character based.
sample_rate (int, optional (default=16000)) –
figsize (tuple, optional (default=(16, 9))) – figure size for matplotlib figsize.
plot_score_char (bool, optional (default=False)) – plot score on top of character plots.
plot_score_word (bool, optional (default=True)) – plot score on top of word plots.

malaya_speech.utils.astype¶

malaya_speech.utils.astype.to_ndarray(array)[source]¶

Change list / tuple / bytes into np.array

Parameters: array (list / tuple / bytes) –
Returns: result
Return type: np.array

malaya_speech.utils.astype.to_byte(array)[source]¶

Change list / tuple / np.array into bytes

Parameters: array (list / tuple / np.array) –
Returns: result
Return type: bytes

malaya_speech.utils.astype.float_to_int(array, type=<class 'numpy.int16'>)[source]¶

Change np.array float32 / float64 into np.int16

Parameters

array (np.array) –
type (np.int16) –

Returns

result

Return type

np.array

malaya_speech.utils.astype.int_to_float(array, type=<class 'numpy.float32'>)[source]¶

Change np.array int16 into np.float32

Parameters

array (np.array) –
type (np.float32) –

Returns

result

Return type

np.array

malaya_speech.utils.char¶

malaya_speech.utils.char.strip_ids(ids, ids_to_strip)[source]¶: Strip ids_to_strip from the end ids.

malaya_speech.utils.char.generate_vocab(strings: List[str])[source]¶

Generate character vocab sorted based on frequency.

Parameters: strings (List[str]) –
Returns: result
Return type: List[str]

malaya_speech.utils.char.encode(string: str, add_eos: bool = True, add_blank: bool = False, lookup: List[str] = None)[source]¶

Encode string to integer representation based on ascii table or lookup variable.

Parameters

string (str) –
add_eos (bool, optional (default=True)) – add EOS token at the end of encoded.
add_blank (bool, optional (default=False)) – add BLANK token at the starting of encoded, this is for transducer / transformer based.
lookup (List[str], optional (default=None)) – list of unique strings.

Returns

result

Return type

List[int]

malaya_speech.utils.char.decode(ids, lookup: List[str] = None)[source]¶

Decode integer representation to string based on ascii table or lookup variable.

Parameters

ids (List[int]) –
lookup (List[str], optional (default=None)) – list of unique strings.

Returns

result

Return type

str

malaya_speech.utils.combine¶

malaya_speech.utils.combine.without_silent(frames, threshold_to_stop: float = 0.1, silent_trail: int = 500)[source]¶

Group multiple frames based on label and threshold to stop.

Parameters

frames (List[Tuple[Frame, label]]) – Output from VAD.
threshold_to_stop (float, optional (default = 0.1)) – If threshold_to_stop is 0.1, means that, length same label samples must at least 0.1 second.
silent_trail (int, optional (default = 500)) – if detected a silent, will append first N frames and last N frames.

Returns

result

Return type

np.array

malaya_speech.utils.featurization¶

malaya_speech.utils.featurization.normalize_signal(signal, gain=None)[source]¶: Normalize float32 signal to [-1, 1] range

malaya_speech.utils.generator¶

malaya_speech.utils.generator.frames(audio, frame_duration_ms: int = 30, sample_rate: int = 16000, append_ending_trail: bool = True)[source]¶

Generates audio frames from audio. Takes the desired frame duration in milliseconds, the audio, and the sample rate.

Parameters

audio (np.array) –
frame_duration_ms (int, optional (default=30)) –
sample_rate (int, optional (default=16000)) –
append_ending_trail (bool, optional (default=True)) – if True, will append last trail and this last trail might not same length as frame_duration_ms.

Returns

result

Return type

List[malaya_speech.model.frame.Frame]

malaya_speech.utils.generator.mel_sampling(audio, frame_duration_ms=1200, overlap_ms=200, sample_rate=16000)[source]¶

Generates audio frames from audio. This is for melspectrogram generative model. Takes the desired frame duration in milliseconds, the audio, and the sample rate.

Parameters

audio (np.array) –
frame_duration_ms (int, optional (default=1200)) –
overlap_ms (int, optional (default=200)) –
sample_rate (int, optional (default=16000)) –

Returns

result

Return type

List[np.array]

malaya_speech.utils.generator.combine_mel_sampling(samples, overlap_ms=200, sample_rate=16000, padded_ms=50)[source]¶

To combine results from mel_sampling, output from melspectrogram generative model.

Parameters

samples (List[np.array]) –
overlap_ms (int, optional (default=200)) –
sample_rate (int, optional (default=16000)) –

Returns

result

Return type

List[np.array]

malaya_speech.utils.griffin_lim¶

malaya_speech.utils.griffin_lim.from_mel(mel_, sr=16000, n_fft=2048, n_iter=32, win_length=1000, hop_length=100)[source]¶

Change melspectrogram into waveform using Librosa.

Parameters: spectrogram (np.array) –
Returns: result
Return type: np.array

malaya_speech.utils.griffin_lim.from_mel_vocoder(mel, sr=22050, n_fft=1024, n_mels=256, fmin=80, fmax=7600, n_iter=32, win_length=None, hop_length=256)[source]¶

Change melspectrogram into waveform using Librosa.

Parameters: spectrogram (np.array) –
Returns: result
Return type: np.array

malaya_speech.utils.group¶

malaya_speech.utils.group.combine_frames(frames: List[malaya_speech.model.frame.Frame])[source]¶

Combine multiple frames into one frame.

Parameters: frames (List[Frame]) –
Returns: result
Return type: Frame

malaya_speech.utils.group.group_frames(frames)[source]¶

Group multiple frames based on label.

Parameters: frames (List[Tuple[Frame, label]]) –
Returns: result
Return type: List[Tuple[Frame, label]]

malaya_speech.utils.group.group_frames_threshold(frames, threshold_to_stop: float = 0.3)[source]¶

Group multiple frames based on label and threshold to stop.

Parameters

frames (List[Tuple[Frame, label]]) –
threshold_to_stop (float, optional (default = 0.3)) – If threshold_to_stop is 0.3, means that, length same label samples must at least 0.3 second.

Returns

result

Return type

List[Tuple[Frame, label]]

malaya_speech.utils.padding¶

malaya_speech.utils.padding.sequence_1d(seq, maxlen=None, padding: str = 'post', pad_int=0, return_len=False)[source]¶

padding sequence of 1d to become 2d array.

Parameters

seq (List[List[int]]) –
maxlen (int, optional (default=None)) – If None, will calculate max length in the function.
padding (str, optional (default='post')) – If pre, will add 0 on the starting side, else add 0 on the end side.
pad_int – padding value.
int – padding value.
(default=0) (optional) – padding value.

Returns

result

Return type

np.array

malaya_speech.utils.padding.sequence_nd(seq, maxlen=None, padding: str = 'post', pad_val=0.0, dim: int = 1, return_len=False)[source]¶

padding sequence of nd to become (n+1)d array.

Parameters

seq (list of nd array) –
maxlen (int, optional (default=None)) – If None, will calculate max length in the function.
padding (str, optional (default='post')) – If pre, will add 0 on the starting side, else add 0 on the end side.
pad_val – padding value.
float – padding value.
(default=0.0) (optional) – padding value.
dim (int, optional (default=1)) –

Returns

result

Return type

np.array

malaya_speech.utils.padding.tf_sequence_nd(seq, maxlen=None, padding: str = 'post', pad_val=0.0, dim: int = 1, return_len=False)[source]¶

padding sequence of nd to become (n+1)d array.

Parameters

seq (list of nd array) –
maxlen (int, optional (default=None)) – If None, will calculate max length in the function.
padding (str, optional (default='post')) – If pre, will add 0 on the starting side, else add 0 on the end side.
pad_val – padding value.
float – padding value.
(default=0.0) (optional) – padding value.
dim (int, optional (default=1)) –

Returns

result

Return type

np.array

malaya_speech.utils.read¶

malaya_speech.utils.read.resample(data, old_samplerate, new_samplerate)[source]¶

Resample signal.

Parameters

data (np.array) –
old_samplerate (int) – old sample rate.
new_samplerate (int) – new sample rate.

Returns

result

Return type

data

malaya_speech.utils.read.load(file: str, sr=16000, scale: bool = True)[source]¶

Read sound file, any format supported by soundfile.read

Parameters

file (str) –
sr (int, (default=16000)) – new sample rate. If input sample rate is not same, will resample automatically.
scale (bool, (default=True)) – Scale to -1 and 1.

Returns

result

Return type

(y, sr)

malaya_speech.utils.split¶

malaya_speech.utils.split.split_vad(frames, n: int = 3, negative_threshold: float = 0.1)[source]¶

Split a sample into multiple samples based n size of negative VAD.

Parameters

frames (List[Tuple[Frame, label]]) –
n (int, optional (default=3)) – n size of negative VAD to assume in one subsample.
negative_threshold (float, optional (default = 0.1)) – If negative_threshold is 0.1, means that, length negative samples must at least 0.1 second.

Returns

result

Return type

List[Frame]

malaya_speech.utils.split.split_vad_duration(frames, max_duration: float = 5.0, negative_threshold: float = 0.1)[source]¶

Split a sample into multiple samples based maximum duration of voice activities.

Parameters

frames (List[Tuple[Frame, label]]) –
max_duration (float, optional (default = 5.0)) – Maximum duration to assume one sample combined from voice activities.
negative_threshold (float, optional (default = 0.1)) – If negative_threshold is 0.1, means that, length negative samples must at least 0.1 second.

Returns

result

Return type

List[Frame]

malaya_speech.utils.subword¶

malaya_speech.utils.subword.generate_tokenizer(strings: List[str], target_vocab_size: int = 1024, max_subword_length: int = 4, max_corpus_chars=None, reserved_tokens=None)[source]¶: Build a subword dictionary.

malaya_speech.utils.subword.save(tokenizer, path: str)[source]¶: Save subword dictionary to a text file.

malaya_speech.utils.subword.load(path: str)[source]¶: Load text file into subword dictionary.

malaya_speech.utils.subword.encode(tokenizer, string: str, add_blank: bool = False)[source]¶

Encode string to integer representation based on ascii table or lookup variable.

Parameters

tokenizer (object) – tokenizer object
string (str) –
add_blank (bool, optional (default=False)) – add BLANK token at the starting of encoded, this is for transducer / transformer based.
lookup (List[str], optional (default=None)) – list of unique strings.

Returns

result

Return type

List[int]

malaya_speech.utils.subword.decode(tokenizer, ids)[source]¶

Decode integer representation to string based on tokenizer vocab.

Parameters

tokenizer (object) – tokenizer object
ids (List[int]) –

Returns

result

Return type

str

malaya_speech.utils.subword.decode_multilanguage(tokenizers, ids)[source]¶

Decode integer representation to string using list of tokenizer objects.

Parameters

tokenizers (List[object]) – List of tokenizer objects.
ids (List[int]) –

Returns

result

Return type

str

malaya_speech.utils.tf_featurization¶

malaya_speech.age_detection¶

malaya_speech.age_detection.available_model()[source]¶: List available age detection deep models.

malaya_speech.age_detection.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶

Load age detection deep model.

Parameters

model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
- 'vggvox-v2' - finetuned VGGVox V2.
- 'deep-speaker' - finetuned Deep Speaker.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.supervised.classification.load function

malaya_speech.diarization¶

malaya_speech.diarization.speaker_similarity(vad_results, speaker_vector, similarity_threshold: float = 0.8, norm_function: Callable = None, return_embedding: bool = False)[source]¶

Speaker diarization using L2-Norm similarity.

Parameters

vad_results (List[Tuple[Frame, label]]) – results from VAD.
speaker_vector (callable) – speaker vector object.
similarity_threshold (float, optional (default=0.8)) – if current voice activity sample similar at least 80%, we assumed it is from the same speaker.
norm_function (Callable, optional(default=None)) – normalize function for speaker vectors.
speaker_change_threshold (float, optional (default=0.5)) – in one voice activity sample can be more than one speaker, split it using this threshold.

Returns

result

Return type

List[Tuple[Frame, label]]

malaya_speech.diarization.n_clustering(vad_results, speaker_vector, model, norm_function: Callable = <function l2_normalize>, return_embedding=False)[source]¶

Speaker diarization using any clustering model.

Parameters

vad_results (List[Tuple[Frame, label]]) – results from VAD.
speaker_vector (callable) – speaker vector object.
model (callable) – Prefer any sklearn unsupervised clustering model. Required fit_predict or apply method.
norm_function (Callable, optional(default=malaya_speech.utils.dist.l2_normalize)) – normalize function for speaker vectors.
log_distance_metric (str, optional (default='cosine')) – post distance norm in log scale metrics.

Returns

result

Return type

List[Tuple[Frame, label]]

malaya_speech.diarization.affinity_propagation(vad_results, speaker_vector, norm_function: Callable = <function l2_normalize>, log_distance_metric: str = 'cosine', damping: float = 0.8, preference: float = None, return_embedding=False)[source]¶

Speaker diarization using sklearn Affinity Propagation.

Parameters

vad_results (List[Tuple[Frame, label]]) – results from VAD.
speaker_vector (callable) – speaker vector object.
norm_function (Callable, optional(default=malaya_speech.utils.dist.l2_normalize)) – normalize function for speaker vectors.
log_distance_metric (str, optional (default='cosine')) – post distance norm in log scale metrics.

Returns

result

Return type

List[Tuple[Frame, label]]

malaya_speech.diarization.spectral_cluster(vad_results, speaker_vector, min_clusters: int = None, max_clusters: int = None, norm_function: Callable = <function l2_normalize>, log_distance_metric: str = None, return_embedding=False, **kwargs)[source]¶

Speaker diarization using SpectralCluster, https://github.com/wq2012/SpectralCluster

Parameters

vad_results (List[Tuple[Frame, label]]) – results from VAD.
speaker_vector (callable) – speaker vector object.
min_clusters (int, optional (default=None)) – minimal number of clusters allowed (only effective if not None).
max_clusters (int, optional (default=None)) – maximal number of clusters allowed (only effective if not None). can be used together with min_clusters to fix the number of clusters.
norm_function (Callable, optional(default=malaya_speech.utils.dist.l2_normalize)) – normalize function for speaker vectors.
log_distance_metric (str, optional (default=None)) – post distance norm in log scale metrics.

Returns

result

Return type

List[Tuple[Frame, label]]

malaya_speech.emotion¶

malaya_speech.emotion.available_model()[source]¶: List available emotion detection deep models.

malaya_speech.emotion.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶

Load emotion detection deep model.

Parameters

model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
- 'vggvox-v2' - finetuned VGGVox V2.
- 'deep-speaker' - finetuned Deep Speaker.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.supervised.classification.load function

malaya_speech.force_alignment¶

malaya_speech.force_alignment.available_transducer()[source]¶: List available Encoder-Transducer Aligner models.

malaya_speech.force_alignment.available_ctc()[source]¶: List available Encoder-CTC Aligner models.

malaya_speech.force_alignment.available_huggingface()[source]¶: List available HuggingFace Malaya-Speech Aligner models.

malaya_speech.force_alignment.deep_transducer(model: str = 'conformer-transducer', quantized: bool = False, **kwargs)[source]¶

Load Encoder-Transducer Aligner model.

Parameters

model (str, optional (default='conformer-transducer')) –
Model architecture supported. Allowed values:
- 'conformer-transducer' - Conformer + RNNT trained on Malay STT dataset.
- 'conformer-transducer-mixed' - Conformer + RNNT trained on Mixed STT dataset.
- 'conformer-transducer-singlish' - Conformer + RNNT trained on Singlish STT dataset.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.transducer.TransducerAligner class

malaya_speech.force_alignment.deep_ctc(model: str = 'hubert-conformer', quantized: bool = False, **kwargs)[source]¶

Load Encoder-CTC ASR model.

Parameters

model (str, optional (default='hubert-conformer')) –

Model architecture supported. Allowed values:

'hubert-conformer-tiny' - Finetuned HuBERT Conformer TINY.
'hubert-conformer' - Finetuned HuBERT Conformer.
'hubert-conformer-large' - Finetuned HuBERT Conformer LARGE.
'hubert-conformer-large-3mixed' - Finetuned HuBERT Conformer LARGE for (Malay + Singlish + Mandarin) languages.
'best-rq-conformer-tiny' - Finetuned BEST-RQ Conformer TINY.
'best-rq-conformer' - Finetuned BEST-RQ Conformer.
'best-rq-conformer-large' - Finetuned BEST-RQ Conformer LARGE.

quantizedbool, optional (default=False): if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns: result
Return type: malaya_speech.model.wav2vec.Wav2Vec2_Aligner class

malaya_speech.force_alignment.huggingface(model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed')[source]¶

Load Finetuned models from HuggingFace.

Parameters

model (str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')) –

Model architecture supported. Allowed values:

'mesolitica/wav2vec2-xls-r-300m-mixed' - wav2vec2 XLS-R 300M finetuned on (Malay + Singlish + Mandarin) languages.

Returns

result

Return type

malaya_speech.model.huggingface.CTC class

malaya_speech.gender¶

malaya_speech.gender.available_model()[source]¶: List available gender detection deep models.

malaya_speech.gender.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶

Load gender detection deep model.

Parameters

model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
- 'vggvox-v2' - finetuned VGGVox V2.
- 'deep-speaker' - finetuned Deep Speaker.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.supervised.classification.load function

malaya_speech.language_detection¶

malaya_speech.language_detection.available_model()[source]¶: List available language detection deep models.

malaya_speech.language_detection.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶

Load language detection deep model.

Parameters

model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
- 'vggvox-v2' - finetuned VGGVox V2.
- 'deep-speaker' - finetuned Deep Speaker.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.supervised.classification.load function

malaya_speech.multispeaker_separation¶

malaya_speech.multispeaker_separation.available_deep_wav()[source]¶: List available FastSep models trained on raw 8k wav.

malaya_speech.multispeaker_separation.deep_wav(model: str = 'fastsep-4', quantized: bool = False, **kwargs)[source]¶

Load FastSep model, trained on raw 8k wav using SISNR PIT loss.

Parameters

model (str, optional (default='fastsep-4')) –
Model architecture supported. Allowed values:
- 'fastsep-2' - FastSep 2 layers trained on raw 8k wav.
- 'fastsep-4' - FastSep 4 layers trained on raw 8k wav.
- 'fastsep-6' - FastSep 6 layers trained on raw 8k wav.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.tf.Split class

malaya_speech.noise_reduction¶

malaya_speech.noise_reduction.available_model()[source]¶: List available Noise Reduction deep learning models.

malaya_speech.noise_reduction.deep_model(model: str = 'resnet-unet', quantized: bool = False, **kwargs)[source]¶

Load Noise Reduction deep learning model.

Parameters

model (str, optional (default='wavenet')) –
Model architecture supported. Allowed values:
- 'unet' - pretrained UNET.
- 'resnet-unet' - pretrained resnet-UNET.
- 'resnext' - pretrained resnext-UNET.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.tf.UNET_STFT class

malaya_speech.speaker_change¶

malaya_speech.speaker_change.available_model()[source]¶: List available speaker change deep models.

malaya_speech.speaker_change.deep_model(model: str = 'speakernet', quantized: bool = False, **kwargs)[source]¶

Load speaker change deep model.

Parameters

model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
- 'vggvox-v2' - finetuned VGGVox V2.
- 'speakernet' - finetuned SpeakerNet.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.supervised.classification.load function

malaya_speech.speaker_change.split_activities(vad_results, speaker_change_results, speaker_change_threshold: float = 0.5, sr: int = 16000, ignore_not_activity=True)[source]¶

split VAD based on speaker change threshold, worse-case O(N^2).

Parameters

vad_results (List[Tuple[Frame, label]]) – results from VAD.
speaker_change_results (List[Tuple[Frame, float]], optional (default=None)) – results from speaker change module, must in float result.
speaker_change_threshold (float, optional (default=0.5)) – in one voice activity sample can be more than one speaker, split it using this threshold.
sr (int, optional (default=16000)) – sample rate, classification model in malaya-speech use 16k.
ignore_not_activity (bool, optional (default=True)) – If True, will ignore if result VAD is False, else will try to split.

Returns

result

Return type

List[Tuple[Frame, label]]

malaya_speech.speaker_overlap¶

malaya_speech.speaker_overlap.available_model()[source]¶: List available speaker overlap deep models.

malaya_speech.speaker_overlap.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶

Load speaker overlap deep model.

Parameters

model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
- 'vggvox-v2' - finetuned VGGVox V2.
- 'speakernet' - finetuned SpeakerNet.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.supervised.classification.load function

malaya_speech.speaker_vector¶

malaya_speech.speaker_vector.available_model()[source]¶: List available speaker vector deep models.

malaya_speech.speaker_vector.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶

Load Speaker2Vec model.

Parameters

model (str, optional (default='speakernet')) –
Model architecture supported. Allowed values:
- 'vggvox-v1' - VGGVox V1, embedding size 1024, exported from https://github.com/linhdvu14/vggvox-speaker-identification
- 'vggvox-v2' - VGGVox V2, embedding size 512, exported from https://github.com/WeidiXie/VGG-Speaker-Recognition
- 'deep-speaker' - Deep Speaker, embedding size 512, exported from https://github.com/philipperemy/deep-speaker
- 'speakernet' - SpeakerNet, embedding size 7205, exported from https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_recognition
- 'conformer-base' - Conformer BASE size, embedding size 512.
- 'conformer-tiny' - Conformer TINY size, embedding size 512.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.supervised.classification.load function

malaya_speech.speech_enhancement¶

malaya_speech.speech_enhancement.available_deep_masking()[source]¶: List available Speech Enhancement STFT masking deep learning model.

malaya_speech.speech_enhancement.available_deep_enhance()[source]¶: List available Speech Enhancement UNET Waveform sampling deep learning model.

malaya_speech.speech_enhancement.deep_masking(model: str = 'resnet-unet', quantized: bool = False, **kwargs)[source]¶

Load Speech Enhancement STFT UNET masking deep learning model.

Parameters

model (str, optional (default='resnet-unet')) –
Model architecture supported. Allowed values:
- 'unet' - pretrained UNET.
- 'resnet-unet' - pretrained resnet-UNET.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.unet.UNETSTFT class

malaya_speech.speech_enhancement.deep_enhance(model: str = 'unet', quantized: bool = False, **kwargs)[source]¶

Load Speech Enhancement UNET Waveform sampling deep learning model.

Parameters

model (str, optional (default='unet')) –
Model architecture supported. Allowed values:
- 'unet' - pretrained UNET Speech Enhancement.
- 'resnet-unet' - pretrained resnet-UNET Speech Enhancement.
- 'resnext-unet' - pretrained resnext-UNET Speech Enhancement.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.unet.UNET1D class

malaya_speech.speechsplit_conversion¶

malaya_speech.speechsplit_conversion.available_deep_conversion(f0_mode='pysptk')[source]¶: List available Voice Conversion models.

malaya_speech.speechsplit_conversion.deep_conversion(model: str = 'fastspeechsplit-v2-vggvox-v2', f0_mode: str = 'pysptk', quantized: bool = False, **kwargs)[source]¶

Load Voice Conversion model.

Parameters

model (str, optional (default='fastvc-32-vggvox-v2')) –
Model architecture supported. Allowed values:
- 'fastspeechsplit-vggvox-v2' - FastSpeechSplit with VGGVox-v2 Speaker Vector.
- 'fastspeechsplit-v2-vggvox-v2' - FastSpeechSplit V2 with VGGVox-v2 Speaker Vector.
f0_mode (str, optional (default='pysptk)) –
F0 conversion supported. Allowed values:
- 'pysptk' - https://github.com/r9y9/pysptk, sensitive towards gender.
- 'pyworld' - https://pypi.org/project/pyworld/
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.splitter.FastSpeechSplit class

malaya_speech.stack¶

malaya_speech.stack.classification_stack(models)[source]¶

Stacking for classification models. All models should be in the same domain classification.

Parameters: models (List[Callable]) – list of models.
Returns: result
Return type: malaya_speech.stack.Stack class

malaya_speech.model.stack.Stack¶

class malaya_speech.stack.Stack[source]¶

predict_proba(inputs, aggregate: Callable = <function gmean>)[source]¶

Stacking for predictive models, will return probability.

Parameters

inputs (List[np.array]) –
aggregate (Callable, optional (default=scipy.stats.mstats.gmean)) –
function. (Aggregate) –

Returns

result

Return type

np.array

predict(inputs, aggregate: Callable = <function gmean>)[source]¶

Stacking for predictive models, will return labels.

Parameters

inputs (List[np.array]) –
aggregate (Callable, optional (default=scipy.stats.mstats.gmean)) –
function. (Aggregate) –

Returns

result

Return type

List[str]

malaya_speech.stt¶

malaya_speech.stt.available_ctc()[source]¶: List available Encoder-CTC ASR models.

malaya_speech.stt.available_language_model()[source]¶: List available Language Model for CTC.

malaya_speech.stt.available_transducer()[source]¶: List available Encoder-Transducer ASR models.

malaya_speech.stt.available_huggingface()[source]¶: List available HuggingFace Malaya-Speech ASR models.

malaya_speech.stt.language_model(model: str = 'dump-combined', **kwargs)[source]¶

Load KenLM language model.

Parameters

model (str, optional (default='dump-combined')) –

Model architecture supported. Allowed values:

'bahasa' - Gathered from malaya-speech ASR bahasa transcript.
'bahasa-news' - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).
'bahasa-combined' - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).
'redape-community' - Mirror for https://github.com/redapesolutions/suara-kami-community
'dump-combined' - Academia + News + IIUM + Parliament + Watpadd + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt.
'manglish' - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt.
'bahasa-manglish-combined' - Combined dump-combined and manglish.

Returns

result

Return type

str

malaya_speech.stt.deep_ctc(model: str = 'hubert-conformer', quantized: bool = False, **kwargs)[source]¶

Load Encoder-CTC ASR model.

Parameters

model (str, optional (default='hubert-conformer')) –

Model architecture supported. Allowed values:

'hubert-conformer-tiny' - Finetuned HuBERT Conformer TINY.
'hubert-conformer' - Finetuned HuBERT Conformer.
'hubert-conformer-large' - Finetuned HuBERT Conformer LARGE.
'hubert-conformer-large-3mixed' - Finetuned HuBERT Conformer LARGE for (Malay + Singlish + Mandarin) languages.
'best-rq-conformer-tiny' - Finetuned BEST-RQ Conformer TINY.
'best-rq-conformer' - Finetuned BEST-RQ Conformer.
'best-rq-conformer-large' - Finetuned BEST-RQ Conformer LARGE.

quantizedbool, optional (default=False): if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns: result
Return type: malaya_speech.model.wav2vec.Wav2Vec2_CTC class

malaya_speech.stt.deep_transducer(model: str = 'conformer', quantized: bool = False, **kwargs)[source]¶

Load Encoder-Transducer ASR model.

Parameters

model (str, optional (default='conformer')) –
Model architecture supported. Allowed values:
- 'tiny-conformer' - TINY size Google Conformer.
- 'small-conformer' - SMALL size Google Conformer.
- 'conformer' - BASE size Google Conformer.
- 'large-conformer' - LARGE size Google Conformer.
- 'conformer-stack-2mixed' - BASE size Stacked Google Conformer for (Malay + Singlish) languages.
- 'conformer-stack-3mixed' - BASE size Stacked Google Conformer for (Malay + Singlish + Mandarin) languages.
- 'small-conformer-singlish' - SMALL size Google Conformer for singlish language.
- 'conformer-singlish' - BASE size Google Conformer for singlish language.
- 'large-conformer-singlish' - LARGE size Google Conformer for singlish language.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.transducer.Transducer class

malaya_speech.stt.huggingface(model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed', **kwargs)[source]¶

Load Finetuned models from HuggingFace. Required Tensorflow >= 2.0.

Parameters

model (str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')) –

Model architecture supported. Allowed values:

'mesolitica/wav2vec2-xls-r-300m-mixed' - wav2vec2 XLS-R 300M finetuned on (Malay + Singlish + Mandarin) languages.

Returns

result

Return type

malaya_speech.model.huggingface.CTC class

malaya_speech.super_resolution¶

malaya_speech.super_resolution.available_model()[source]¶: List available Super Resolution 4x deep learning models.

malaya_speech.super_resolution.deep_model(model: str = 'srgan-256', quantized: bool = False, **kwargs)[source]¶

Load Super Resolution 4x deep learning model.

Parameters

model (str, optional (default='srgan-256')) –
Model architecture supported. Allowed values:
- 'srgan-128' - srgan with 128 filter size and 16 residual blocks.
- 'srgan-256' - srgan with 256 filter size and 16 residual blocks.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.tf.UNET1D class

malaya_speech.tts¶

malaya_speech.tts.available_tacotron2()[source]¶: List available Tacotron2, Text to Mel models.

malaya_speech.tts.available_fastspeech2()[source]¶: List available FastSpeech2, Text to Mel models.

malaya_speech.tts.available_fastpitch()[source]¶: List available FastPitch, Text to Mel models.

malaya_speech.tts.available_glowtts()[source]¶: List available GlowTTS, Text to Mel models.

malaya_speech.tts.available_lightspeech()[source]¶: List available LightSpeech, Text to Mel models.

malaya_speech.tts.load_text_ids(pad_to: int = 8, understand_punct: bool = True, is_lower: bool = True, **kwargs)[source]¶: Load text normalizer module use by Malaya-Speech TTS.

malaya_speech.tts.tacotron2(model: str = 'yasmin', quantized: bool = False, pad_to: int = 8, **kwargs)[source]¶

Load Tacotron2 TTS model.

Parameters

model (str, optional (default='yasmin')) –
Model architecture supported. Allowed values:
- 'female' - Tacotron2 trained on female voice.
- 'male' - Tacotron2 trained on male voice.
- 'husein' - Tacotron2 trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
- 'haqkiem' - Tacotron2 trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
- 'yasmin' - Tacotron2 trained on female Yasmin voice.
- 'osman' - Tacotron2 trained on male Osman voice.
- 'female-singlish' - Tacotron2 trained on female Singlish voice, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.
pad_to (int, optional (default=8)) – size of pad character with 0. Increase can stable up prediction on short sentence, we trained on 8.

Returns

result

Return type

malaya_speech.model.synthesis.Tacotron class

malaya_speech.tts.fastspeech2(model: str = 'male', quantized: bool = False, pad_to: int = 8, **kwargs)[source]¶

Load Fastspeech2 TTS model.

Parameters

model (str, optional (default='male')) –
Model architecture supported. Allowed values:
- 'female' - Fastspeech2 trained on female voice.
- 'male' - Fastspeech2 trained on male voice.
- 'husein' - Fastspeech2 trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
- 'haqkiem' - Fastspeech2 trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
- 'yasmin' - Fastspeech2 trained on female Yasmin voice.
- 'osman' - Fastspeech2 trained on male Osman voice.
- 'female-singlish' - Fastspeech2 trained on female Singlish voice, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.
pad_to (int, optional (default=8)) – size of pad character with 0. Increase can stable up prediction on short sentence, we trained on 8.

Returns

result

Return type

malaya_speech.model.synthesis.Fastspeech class

malaya_speech.tts.fastpitch(model: str = 'male', quantized: bool = False, pad_to: int = 8, **kwargs)[source]¶

Load Fastspitch TTS model.

Parameters

model (str, optional (default='male')) –
Model architecture supported. Allowed values:
- 'female' - Fastpitch trained on female voice.
- 'male' - Fastpitch trained on male voice.
- 'husein' - Fastpitch trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
- 'haqkiem' - Fastpitch trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.
pad_to (int, optional (default=8)) – size of pad character with 0. Increase can stable up prediction on short sentence, we trained on 8.

Returns

result

Return type

malaya_speech.model.synthesis.Fastpitch class

malaya_speech.tts.glowtts(model: str = 'yasmin', quantized: bool = False, pad_to: int = 2, **kwargs)[source]¶

Load GlowTTS TTS model.

Parameters

model (str, optional (default='yasmin')) –
Model architecture supported. Allowed values:
- 'female' - GlowTTS trained on female voice.
- 'male' - GlowTTS trained on male voice.
- 'haqkiem' - GlowTTS trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
- 'female-singlish' - GlowTTS trained on female Singlish voice, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus
- 'yasmin' - GlowTTS trained on female Yasmin voice.
- 'osman' - GlowTTS trained on male Osman voice.
- 'multispeaker' - Multispeaker GlowTTS trained on male, female, husein and haqkiem voices, also able to do voice conversion.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.
pad_to (int, optional (default=2)) – size of pad character with 0. Increase can stable up prediction on short sentence, we trained on 2.

Returns

result

Return type

malaya_speech.model.synthesis.GlowTTS class

malaya_speech.tts.lightspeech(model: str = 'male', quantized: bool = False, pad_to: int = 8, **kwargs)[source]¶

Load LightSpeech TTS model.

Parameters

model (str, optional (default='male')) –
Model architecture supported. Allowed values:
- 'yasmin' - LightSpeech trained on female Yasmin voice.
- 'osman' - LightSpeech trained on male Osman voice.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.
pad_to (int, optional (default=8)) – size of pad character with 0. Increase can stable up prediction on short sentence, we trained on 8.

Returns

result

Return type

malaya_speech.model.synthesis.Fastspeech class

malaya_speech.vad¶

malaya_speech.vad.available_model()[source]¶: List available VAD deep models.

malaya_speech.vad.webrtc(aggressiveness: int = 3, sample_rate: int = 16000, minimum_amplitude: int = 100)[source]¶

Load WebRTC VAD model.

Parameters

aggressiveness (int, optional (default=3)) – an integer between 0 and 3. 0 is the least aggressive about filtering out non-speech, 3 is the most aggressive.
sample_rate (int, optional (default=16000)) – sample rate for samples.
minimum_amplitude (int, optional (default=100)) – abs(minimum_amplitude) to assume a sample is a voice activity. Else, automatically False.

Returns

result

Return type

malaya_speech.model.webrtc.WebRTC class

malaya_speech.vad.deep_model(model: str = 'marblenet-factor1', quantized: bool = False, **kwargs)[source]¶

Load VAD model.

Parameters

model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
- 'vggvox-v1' - finetuned VGGVox V1.
- 'vggvox-v2' - finetuned VGGVox V2.
- 'speakernet' - finetuned SpeakerNet.
- 'marblenet-factor1' - Pretrained MarbleNet * factor 1.
- 'marblenet-factor3' - Pretrained MarbleNet * factor 3.
- 'marblenet-factor5' - Pretrained MarbleNet * factor 5.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.supervised.classification.load function

malaya_speech.vocoder¶

malaya_speech.vocoder.available_melgan()[source]¶: List available MelGAN Mel-to-Speech models.

malaya_speech.vocoder.available_mbmelgan()[source]¶: List available Multiband MelGAN Mel-to-Speech models.

malaya_speech.vocoder.available_hifigan()[source]¶: List available HiFiGAN Mel-to-Speech models.

malaya_speech.vocoder.melgan(model: str = 'universal-1024', quantized: bool = False, **kwargs)[source]¶

Load MelGAN Vocoder model.

Parameters

model (str, optional (default='universal-1024')) –
Model architecture supported. Allowed values:
- 'female' - MelGAN trained on female voice.
- 'male' - MelGAN trained on male voice.
- 'husein' - MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
- 'haqkiem' - MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
- 'yasmin' - MelGAN trained on female Yasmin voice.
- 'osman' - MelGAN trained on male Osman voice.
- 'female-singlish' - MelGAN trained on Female Singlish voice, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus
- 'universal' - Universal MelGAN trained on multiple speakers.
- 'universal-1024' - Universal MelGAN with 1024 filters trained on multiple speakers.
- 'universal-384' - Universal MelGAN with 384 filters trained on multiple speakers.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.synthesis.Vocoder class

malaya_speech.vocoder.mbmelgan(model: str = 'female', quantized: bool = False, **kwargs)[source]¶

Load Multiband MelGAN Vocoder model.

Parameters

model (str, optional (default='female')) –
Model architecture supported. Allowed values:
- 'female' - MBMelGAN trained on female voice.
- 'male' - MBMelGAN trained on male voice.
- 'husein' - MBMelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
- 'haqkiem' - MBMelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.synthesis.Vocoder class

malaya_speech.vocoder.hifigan(model: str = 'universal-768', quantized: bool = False, **kwargs)[source]¶

Load HiFiGAN Vocoder model.

Parameters

model (str, optional (default='universal-768')) –
Model architecture supported. Allowed values:
- 'female' - HiFiGAN trained on female voice.
- 'male' - HiFiGAN trained on male voice.
- 'universal-1024' - Universal HiFiGAN with 1024 filters trained on multiple speakers.
- 'universal-768' - Universal HiFiGAN with 768 filters trained on multiple speakers.
- 'universal-512' - Universal HiFiGAN with 512 filters trained on multiple speakers.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.synthesis.Vocoder class

malaya_speech.voice_conversion¶

malaya_speech.voice_conversion.available_deep_conversion()[source]¶: List available Voice Conversion models.

malaya_speech.voice_conversion.deep_conversion(model: str = 'fastvc-32-vggvox-v2', quantized: bool = False, **kwargs)[source]¶

Load Voice Conversion model.

Parameters

model (str, optional (default='fastvc-32-vggvox-v2')) –
Model architecture supported. Allowed values:
- 'fastvc-32-vggvox-v2' - FastVC bottleneck size 32 with VGGVox-v2 Speaker Vector.
- 'fastvc-64-vggvox-v2' - FastVC bottleneck size 64 with VGGVox-v2 Speaker Vector.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. Quantized model not necessary faster, totally depends on the machine.

Returns

result

Return type

malaya_speech.model.synthesis.FastVC class