API¶
malaya_speech¶
malaya_speech.augmentation.spectrogram¶
-
malaya_speech.augmentation.spectrogram.mask_frequency(features, n_freq_mask: int = 2, width_freq_mask: int = 8, random_band=True)[source]¶ Mask frequency.
- Parameters
features (np.array) –
n_freq_mask (int, optional (default=2)) – number of masks to apply (loop size for masking).
width_freq_mask (int, optional (default=8)) – masking size along the frequency axis.
- Returns
result
- Return type
np.array
-
malaya_speech.augmentation.spectrogram.mask_time(features, n_time_mask=2, width_time_mask=8, random_band=True)[source]¶ Mask time.
- Parameters
features (np.array) –
n_time_mask (int, optional (default=2)) – number of masks to apply (loop size for masking).
width_time_mask (int, optional (default=8)) – masking size along the time axis.
- Returns
result
- Return type
np.array
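The masking idea behind mask_frequency and mask_time can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name mask_band, the nested-list layout [time][frequency], the zero fill value, and the way band widths are drawn are all assumptions made for the example.

```python
import random


def mask_band(features, axis, n_mask=2, width=8, seed=None):
    """Zero out `n_mask` random bands of at most `width` bins along `axis`.

    `features` is a list of rows shaped [time][frequency] (assumed layout).
    axis=0 masks time steps, axis=1 masks frequency bins.
    Returns a masked copy and leaves the input untouched.
    """
    rng = random.Random(seed)
    masked = [list(row) for row in features]
    size = len(masked) if axis == 0 else len(masked[0])
    for _ in range(n_mask):
        band_width = rng.randint(0, min(width, size))  # assumed: random width
        start = rng.randint(0, size - band_width)
        for t, row in enumerate(masked):
            for f in range(len(row)):
                index = t if axis == 0 else f
                if start <= index < start + band_width:
                    row[f] = 0.0
    return masked


spec = [[1.0] * 20 for _ in range(30)]  # 30 time steps x 20 frequency bins
freq_masked = mask_band(spec, axis=1, n_mask=2, width=8, seed=0)
time_masked = mask_band(spec, axis=0, n_mask=2, width=8, seed=0)
```

Frequency masking blanks the same bins in every time step, while time masking blanks whole time steps; the output shape always matches the input.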
-
malaya_speech.augmentation.spectrogram.tf_mask_frequency(features, n_freq_mask=2, F=27)[source]¶ Mask frequency using Tensorflow.
- Parameters
features (np.array) –
F (int, optional (default=27)) – size of mask for frequency.
-
malaya_speech.augmentation.spectrogram.tf_mask_time(features, n_time_mask=2, T=80)[source]¶ Mask time using Tensorflow.
- Parameters
features (np.array) –
T (int, optional (default=80)) – size of mask for time.
malaya_speech.extra.rttm¶
-
malaya_speech.extra.rttm.load(file: str)[source]¶ Load RTTM file.
- Parameters
file (str) –
- Returns
result
- Return type
Dict[str, malaya_speech.model.annotation.Annotation]
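For orientation, an RTTM file stores one speaker turn per line with space-separated fields: type, file id, channel, onset, duration, then further fields of which the eighth is the speaker name. The sketch below groups turns by file id. It is a stand-in only: parse_rttm is a hypothetical helper and it returns plain tuples, whereas load returns Annotation objects.

```python
from collections import defaultdict


def parse_rttm(text):
    """Group SPEAKER turns by file id as (start, end, speaker) tuples."""
    turns = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and record types other than SPEAKER.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        file_id = fields[1]
        start, duration = float(fields[3]), float(fields[4])
        speaker = fields[7]
        turns[file_id].append((start, start + duration, speaker))
    return dict(turns)


sample = """SPEAKER meeting1 1 0.50 2.00 <NA> <NA> spk_a <NA> <NA>
SPEAKER meeting1 1 2.75 1.25 <NA> <NA> spk_b <NA> <NA>
SPEAKER meeting2 1 0.00 3.00 <NA> <NA> spk_a <NA> <NA>"""
result = parse_rttm(sample)
```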
malaya_speech.extra.visualization¶
-
malaya_speech.extra.visualization.visualize_vad(signal, preds: List[Tuple[malaya_speech.model.frame.Frame, bool]], sample_rate: int = 16000, figsize: Tuple[int, int] = (15, 3), ax=None, **kwargs)[source]¶ Visualize signal given VAD labels. Green marks voice activity, red marks no voice activity.
- Parameters
signal (list / np.array) –
preds (List[Tuple[Frame, bool]]) –
sample_rate (int, optional (default=16000)) –
figsize (Tuple[int, int], optional (default=(15, 3))) – matplotlib figure size.
-
malaya_speech.extra.visualization.plot_classification(preds, description, ax=None, fontsize_text=14, x_text=0.05, y_text=0.2, ylim=(0.1, 0.9), figsize: Tuple[int, int] = (15, 3), **kwargs)[source]¶ Visualize probability / boolean.
- Parameters
preds (List[Tuple[Frame, label]]) –
description (str) –
ax (ax, optional (default = None)) –
fontsize_text (int, optional (default = 14)) –
x_text (float, optional (default = 0.05)) –
y_text (float, optional (default = 0.2)) –
malaya_speech.model.classification.Speakernet¶
malaya_speech.model.classification.Speaker2Vec¶
malaya_speech.model.classification.SpeakernetClassification¶
-
class
malaya_speech.model.classification.SpeakernetClassification[source]¶
malaya_speech.model.classification.Classification¶
-
class
malaya_speech.model.classification.Classification[source]¶
malaya_speech.model.huggingface.HuggingFace_CTC¶
-
class
malaya_speech.model.huggingface.HuggingFace_CTC[source]¶ -
greedy_decoder(inputs)[source]¶ Transcribe inputs using greedy decoder.
- Parameters
inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
- Returns
result
- Return type
List[str]
-
predict(inputs)[source]¶ Transcribe inputs using greedy decoder.
- Parameters
inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
- Returns
result
- Return type
List[str]
-
predict_logits(inputs, norm_func=<function softmax>)[source]¶ Predict logits from inputs.
- Parameters
inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
norm_func (Callable, optional (default=malaya.utils.activation.softmax)) –
- Returns
result
- Return type
List[np.array]
-
gradio(record_mode: bool = True, lm_func: Callable = None, **kwargs)[source]¶ Transcribe an input using beam decoder on Gradio interface.
- Parameters
record_mode (bool, optional (default=True)) – if True, Gradio will use record mode, else, file upload mode.
lm_func (Callable, optional (default=None)) – if not None, will be called with logits of shape [T, D].
**kwargs (keyword arguments for iface.launch.) –
-
malaya_speech.model.huggingface.HuggingFace_Aligner¶
-
class
malaya_speech.model.huggingface.HuggingFace_Aligner[source]¶ -
predict(input, transcription: str, sample_rate: int = 16000)[source]¶ Align input with the given transcription, will return a dictionary.
- Parameters
input (np.array) – np.array or malaya_speech.model.frame.Frame.
transcription (str) – transcription of input audio.
sample_rate (int, optional (default=16000)) – sample rate for input.
- Returns
result
- Return type
Dict[chars_alignment, words_alignment, alignment]
-
malaya_speech.model.splitter.Split_Wav¶
malaya_speech.model.splitter.Split_Mel¶
malaya_speech.model.splitter.FastSpeechSplit¶
-
class
malaya_speech.model.splitter.FastSpeechSplit[source]¶ -
predict(original_audio, target_audio, modes=['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])[source]¶ Change original voice audio to follow targeted voice.
- Parameters
original_audio (np.array or malaya_speech.model.frame.Frame) –
target_audio (np.array or malaya_speech.model.frame.Frame) –
modes (List[str], optional (default = ['R', 'F', 'U', 'RF', 'RU', 'FU', 'RFU'])) –
R denotes rhythm, F denotes pitch target, U denotes speaker target (vector).
'R' - maintain original_audio F and U on target_audio R.
'F' - maintain original_audio R and U on target_audio F.
'U' - maintain original_audio R and F on target_audio U.
'RF' - maintain original_audio U on target_audio R and F.
'RU' - maintain original_audio F on target_audio R and U.
'FU' - maintain original_audio R on target_audio F and U.
'RFU' - no conversion happened, just do encoder-decoder on target_audio.
- Returns
result
- Return type
Dict[modes]
-
malaya_speech.model.synthesis.TTS¶
malaya_speech.model.synthesis.Vocoder¶
malaya_speech.model.synthesis.Tacotron¶
malaya_speech.model.synthesis.Fastspeech¶
-
class
malaya_speech.model.synthesis.Fastspeech[source]¶ -
predict(string, speed_ratio: float = 1.0, f0_ratio: float = 1.0, energy_ratio: float = 1.0, **kwargs)[source]¶ Change string to Mel.
- Parameters
string (str) –
speed_ratio (float, optional (default=1.0)) – Increasing this value increases the duration of the generated voice.
f0_ratio (float, optional (default=1.0)) – Increasing this value increases the frequency; a lower frequency generates a deeper voice.
energy_ratio (float, optional (default=1.0)) – Increasing this value increases loudness.
- Returns
result
- Return type
Dict[string, decoder-output, mel-output, universal-output]
-
malaya_speech.model.synthesis.FastVC¶
malaya_speech.model.synthesis.Fastpitch¶
-
class
malaya_speech.model.synthesis.Fastpitch[source]¶ -
predict(string, speed_ratio: float = 1.0, pitch_ratio: float = 1.0, pitch_addition: float = 0.0, **kwargs)[source]¶ Change string to Mel.
- Parameters
string (str) –
speed_ratio (float, optional (default=1.0)) – Increasing this value increases the duration of the generated voice.
pitch_ratio (float, optional (default=1.0)) – pitch = pitch * pitch_ratio, amplify existing pitch contour.
pitch_addition (float, optional (default=0.0)) – pitch = pitch + pitch_addition, change pitch contour.
- Returns
result
- Return type
Dict[string, decoder-output, mel-output, pitch-output, universal-output]
-
malaya_speech.model.transducer.Transducer¶
-
class
malaya_speech.model.transducer.Transducer[source]¶ -
predict_alignment(input, combined=True)[source]¶ Transcribe input and get timestamp, only support greedy decoder.
- Parameters
input (np.array) – np.array or malaya_speech.model.frame.Frame.
combined (bool, optional (default=True)) – If True, will combine subwords into words.
- Returns
result
- Return type
List[Dict[text, start, end]]
-
greedy_decoder(inputs)[source]¶ Transcribe inputs using greedy decoder.
- Parameters
inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
- Returns
result
- Return type
List[str]
-
beam_decoder(inputs, beam_width: int = 5, temperature: float = 0.0, score_norm: bool = True)[source]¶ Transcribe inputs using beam decoder.
- Parameters
inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
beam_width (int, optional (default=5)) – beam size for beam decoder.
temperature (float, optional (default=0.0)) – apply temperature-scaled noise to the logits, which can help in certain cases: logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
score_norm (bool, optional (default=True)) – sort beams in descending order of score / length of decoded sequence.
- Returns
result
- Return type
List[str]
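The temperature term above is Gumbel noise scaled by temperature. A minimal pure-Python sketch of that formula follows; add_gumbel_noise is a hypothetical helper written for illustration, and clamping the uniform sample away from 0 and 1 is an assumption added to keep both logarithms defined.

```python
import math
import random


def add_gumbel_noise(logits, temperature=0.0, seed=None):
    """Return logits + -log(-log(u)) * temperature with u drawn uniformly."""
    rng = random.Random(seed)
    noisy = []
    for value in logits:
        # Keep u strictly inside (0, 1) so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(value + -math.log(-math.log(u)) * temperature)
    return noisy


logits = [1.0, 2.0, 3.0]
unchanged = add_gumbel_noise(logits, temperature=0.0, seed=0)
perturbed = add_gumbel_noise(logits, temperature=1.0, seed=0)
```

With the default temperature of 0.0 the logits pass through unchanged, so decoding stays deterministic; larger values add more randomness to the candidate scores.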
-
beam_decoder_lm(inputs, language_model, beam_width: int = 5, token_min_logp: float = - 20.0, beam_prune_logp: float = - 5.0, temperature: float = 0.0, score_norm: bool = True)[source]¶ Transcribe inputs using beam decoder + KenLM.
- Parameters
inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
language_model (pyctcdecode.language_model.LanguageModel) – pyctcdecode language model, load from LanguageModel(kenlm_model, alpha = alpha, beta = beta).
beam_width (int, optional (default=5)) – beam size for beam decoder.
token_min_logp (float, optional (default=-20.0)) – minimum log probability to select a token.
beam_prune_logp (float, optional (default=-5.0)) – filter candidates >= max score lm + beam_prune_logp.
temperature (float, optional (default=0.0)) – apply temperature-scaled noise to the logits, which can help in certain cases: logits += -np.log(-np.log(uniform_noise_shape_logits)) * temperature
score_norm (bool, optional (default=True)) – sort beams in descending order of score / length of decoded sequence.
- Returns
result
- Return type
List[str]
-
malaya_speech.model.transducer.TransducerAligner¶
-
class
malaya_speech.model.transducer.TransducerAligner[source]¶ -
predict(input, transcription: str, sample_rate: int = 16000)[source]¶ Align input with the given transcription, will return a dictionary.
- Parameters
input (np.array) – np.array or malaya_speech.model.frame.Frame.
transcription (str) – transcription of input audio.
sample_rate (int, optional (default=16000)) – sample rate for input.
- Returns
result
- Return type
Dict[words_alignment, subwords_alignment, subwords, alignment]
-
malaya_speech.model.unet.UNET¶
malaya_speech.model.unet.UNETSTFT¶
malaya_speech.model.unet.UNET1D¶
malaya_speech.model.wav2vec.Wav2Vec2_CTC¶
-
class
malaya_speech.model.wav2vec.Wav2Vec2_CTC[source]¶ -
greedy_decoder(inputs)[source]¶ Transcribe inputs using greedy decoder.
- Parameters
inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
- Returns
result
- Return type
List[str]
-
beam_decoder(inputs, beam_width: int = 100, **kwargs)[source]¶ Transcribe inputs using beam decoder.
- Parameters
inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
beam_width (int, optional (default=100)) – beam size for beam decoder.
- Returns
result
- Return type
List[str]
-
predict(inputs)[source]¶ Transcribe inputs using greedy decoder.
- Parameters
inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
- Returns
result
- Return type
List[str]
-
predict_logits(inputs, norm_func=<function softmax>)[source]¶ Predict logits from inputs.
- Parameters
inputs (List[np.array]) – List[np.array] or List[malaya_speech.model.frame.Frame].
norm_func (Callable, optional (default=malaya.utils.activation.softmax)) –
- Returns
result
- Return type
List[np.array]
-
gradio(record_mode: bool = True, lm_func: Callable = None, **kwargs)[source]¶ Transcribe an input using beam decoder on Gradio interface.
- Parameters
record_mode (bool, optional (default=True)) – if True, Gradio will use record mode, else, file upload mode.
lm_func (Callable, optional (default=None)) – if not None, will be called with logits of shape [T, D].
**kwargs (keyword arguments for beam decoder and iface.launch.) –
-
malaya_speech.model.wav2vec.Wav2Vec2_Aligner¶
-
class
malaya_speech.model.wav2vec.Wav2Vec2_Aligner[source]¶ -
predict(input, transcription: str, sample_rate: int = 16000)[source]¶ Align input with the given transcription, will return a dictionary.
- Parameters
input (np.array) – np.array or malaya_speech.model.frame.Frame.
transcription (str) – transcription of input audio.
sample_rate (int, optional (default=16000)) – sample rate for input.
- Returns
result
- Return type
Dict[chars_alignment, words_alignment, alignment]
-
malaya_speech.model.webrtc.WebRTC¶
malaya_speech.pipeline¶
-
class
malaya_speech.pipeline.Pipeline[source]¶ -
visualize(filename='pipeline.png', **kwargs)[source]¶ Render the computation of this object’s task graph using graphviz.
Requires graphviz to be installed.
- Parameters
filename (str, optional) – The name of the file to write to disk.
kwargs – Graph attributes to pass to graphviz like rankdir="LR".
-
batching= <function batching>¶
-
flatten= <function flatten>¶
-
foreach_map= <function foreach_map>¶
-
map= <function map>¶
-
partition= <function partition>¶
-
sliding_window= <function sliding_window>¶
-
zip= <function zip>¶
-
malaya_speech.pipeline.map¶
-
class
malaya_speech.pipeline.map[source]¶ apply a function / method to the pipeline
Examples
>>> source = Pipeline()
>>> source.map(lambda x: x + 1).map(print)
>>> source.emit(1)
2
malaya_speech.pipeline.batching¶
-
class
malaya_speech.pipeline.batching[source]¶ Batching stream into tuples
Examples
>>> source = Pipeline()
>>> source.batching(2).map(print)
>>> source.emit([1,2,3,4,5])
([1, 2], [3, 4], [5])
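The batching behaviour in the example above can be reproduced outside a pipeline with a few lines of plain Python. This is a sketch of the semantics only; the standalone batching function here is hypothetical and is not the pipeline node itself.

```python
def batching(items, n):
    """Split one emitted list into a tuple of chunks of at most n items."""
    return tuple(items[i:i + n] for i in range(0, len(items), n))


batches = batching([1, 2, 3, 4, 5], 2)
print(batches)
```

The final chunk keeps whatever is left over, which is why the last batch in the example holds a single element.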
malaya_speech.pipeline.partition¶
-
class
malaya_speech.pipeline.partition[source]¶ Partition stream into tuples of equal size
Examples
>>> source = Pipeline()
>>> source.partition(3).map(print)
>>> for i in range(10):
...     source.emit(i)
(0, 1, 2)
(3, 4, 5)
(6, 7, 8)
malaya_speech.pipeline.sliding_window¶
-
class
malaya_speech.pipeline.sliding_window[source]¶ Produce overlapping tuples of size n
- Parameters
return_partial (bool) – If True, yield tuples as soon as any events come in, each tuple being smaller or equal to the window size. If False, only start yielding tuples once a full window has accrued.
Examples
>>> source = Pipeline()
>>> source.sliding_window(3, return_partial=False).map(print)
>>> for i in range(8):
...     source.emit(i)
(0, 1, 2)
(1, 2, 3)
(2, 3, 4)
(3, 4, 5)
(4, 5, 6)
(5, 6, 7)
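The difference between partition and sliding_window, and the effect of return_partial, can be sketched with plain generators. These standalone functions are illustrative stand-ins written for this page, not the pipeline classes; they only mirror the outputs shown in the examples.

```python
from collections import deque


def partition(events, n):
    """Yield non-overlapping tuples of exactly n events; a short tail is held back."""
    buffer = []
    for event in events:
        buffer.append(event)
        if len(buffer) == n:
            yield tuple(buffer)
            buffer = []


def sliding_window(events, n, return_partial=False):
    """Yield overlapping tuples of the last n events."""
    window = deque(maxlen=n)  # oldest event drops out automatically
    for event in events:
        window.append(event)
        if return_partial or len(window) == n:
            yield tuple(window)


print(list(partition(range(10), 3)))
print(list(sliding_window(range(8), 3)))
print(list(sliding_window(range(3), 3, return_partial=True)))
```

With return_partial=True the first tuples are shorter than the window, since they are emitted before a full window has accrued.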
malaya_speech.pipeline.foreach_map¶
-
class
malaya_speech.pipeline.foreach_map[source]¶ Apply a function to every element in a tuple in the stream.
- Parameters
func (callable) –
method (str, optional (default='sync')) –
method to process each element.
'sync' - loop one-by-one to process.
'async' - async process all elements at the same time.
'thread' - multithreading level to process all elements at the same time. Default is 1 worker. Override worker_size=n to increase.
'process' - multiprocessing level to process all elements at the same time. Default is 1 worker. Override worker_size=n to increase.
*args – The arguments to pass to the function.
**kwargs – Keyword arguments to pass to func.
Examples
>>> source = Pipeline()
>>> source.foreach_map(lambda x: 2*x).map(print)
>>> for i in range(3):
...     source.emit((i, i))
(0, 0)
(2, 2)
(4, 4)
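As a rough picture of what the 'sync' and 'thread' methods mean, the sketch below applies a function to every element of one tuple, either in a loop or through a thread pool. It is an assumption-laden stand-in: this standalone foreach_map is hypothetical, covers only two of the four methods, and uses concurrent.futures from the standard library rather than whatever the pipeline uses internally.

```python
from concurrent.futures import ThreadPoolExecutor


def foreach_map(func, elements, method="sync", worker_size=1):
    """Apply func to every element of a tuple and return a tuple of results."""
    if method == "sync":
        return tuple(func(element) for element in elements)
    if method == "thread":
        # pool.map returns results in input order, whatever finishes first.
        with ThreadPoolExecutor(max_workers=worker_size) as pool:
            return tuple(pool.map(func, elements))
    raise ValueError(f"unsupported method in this sketch: {method}")


doubled_sync = foreach_map(lambda x: 2 * x, (1, 1))
doubled_thread = foreach_map(lambda x: 2 * x, (0, 1, 2), method="thread", worker_size=2)
```

Threads help most when func waits on I/O; for CPU-bound work the 'process' method is the usual alternative.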
malaya_speech.pipeline.flatten¶
-
class
malaya_speech.pipeline.flatten[source]¶ Flatten streams of lists or iterables into a stream of elements
Examples
>>> source = Pipeline()
>>> source.flatten().map(print)
>>> source.emit([[1, 2, 3], [4, 5], [6, 7, 7]])
[1, 2, 3, 4, 5, 6, 7, 7]
malaya_speech.pipeline.zip¶
-
class
malaya_speech.pipeline.zip[source]¶ Combine 2 branches into 1 branch.
Examples
>>> source = Pipeline()
>>> left = source.map(lambda x: x + 1, name = 'left')
>>> right = source.map(lambda x: x + 10, name = 'right')
>>> left.zip(right).map(sum).map(print)
>>> source.emit(2)
15
malaya_speech.streaming¶
-
malaya_speech.streaming.record(vad, asr_model=None, classification_model=None, device=None, input_rate: int = 16000, sample_rate: int = 16000, blocks_per_second: int = 50, padding_ms: int = 300, ratio: float = 0.75, min_length: float = 0.1, filename: str = None, spinner: bool = False)[source]¶ Record audio using the pyaudio library. This record interface requires a VAD model.
- Parameters
vad (object) – vad model / pipeline.
asr_model (object) – ASR model / pipeline, will transcribe each subsample in real time.
classification_model (object) – classification pipeline, will classify each subsample in real time.
device (None) – device parameter for pyaudio, check available devices from sounddevice.query_devices().
input_rate (int, optional (default = 16000)) – sample rate of the input device; the audio is resampled automatically.
sample_rate (int, optional (default = 16000)) – output sample rate.
blocks_per_second (int, optional (default = 50)) – size of frame returned from pyaudio, frame size = sample rate / (blocks_per_second / 2). 50 is good for WebRTC, 30 or less is good for Malaya Speech VAD.
padding_ms (int, optional (default = 300)) – size of queue to store frames, size = padding_ms // (1000 * blocks_per_second // sample_rate)
ratio (float, optional (default = 0.75)) – if 75% of the queue is positive, it is assumed to be voice activity.
min_length (float, optional (default=0.1)) – minimum length (s) to accept a subsample.
filename (str, optional (default=None)) – if None, will auto generate name based on timestamp.
spinner (bool, optional (default=False)) – if True, will use spinner object from halo library.
- Returns
result
- Return type
[filename, samples]
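The padding_ms and ratio parameters describe a queue of recent VAD decisions that is polled for a majority. The sketch below evaluates the queue-size formula quoted above and shows one way such a vote could work. Both helper names are hypothetical, and requiring a full queue and using >= for the comparison are assumptions, not confirmed library behaviour.

```python
from collections import deque


def queue_size(padding_ms=300, blocks_per_second=50, sample_rate=16000):
    """Evaluate size = padding_ms // (1000 * blocks_per_second // sample_rate)."""
    return padding_ms // (1000 * blocks_per_second // sample_rate)


def is_voice(queue, ratio=0.75):
    """True once the queue is full and at least `ratio` of it is positive."""
    return len(queue) == queue.maxlen and sum(queue) >= ratio * queue.maxlen


default_size = queue_size()
mostly_voiced = deque([True, True, True, False], maxlen=4)
half_voiced = deque([True, True, False, False], maxlen=4)
print(default_size, is_voice(mostly_voiced), is_voice(half_voiced))
```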
malaya_speech.utils.aligner¶
-
malaya_speech.utils.aligner.put_comma(alignment, min_threshold: float = 0.5)[source]¶ Put commas in alignment from a force alignment model.
- Parameters
alignment (List[Dict[text, start, end]]) –
min_threshold (float, optional (default=0.5)) – minimum threshold, in seconds, to assume a comma.
- Returns
result
- Return type
List[str]
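One plausible reading of put_comma is that a comma is attached to a word when the pause before the next word reaches min_threshold. The sketch below implements that reading on the documented input shape; the gap rule (next start minus current end) is an assumption, and this standalone function is not the library's code.

```python
def put_comma(alignment, min_threshold=0.5):
    """Append ',' to a word when the pause before the next word is long enough."""
    words = []
    for current, following in zip(alignment, alignment[1:] + [None]):
        text = current["text"]
        # Assumed rule: pause = start of next word - end of current word.
        if following is not None and following["start"] - current["end"] >= min_threshold:
            text += ","
        words.append(text)
    return words


alignment = [
    {"text": "hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 1.0, "end": 1.5},
    {"text": "again", "start": 1.6, "end": 2.0},
]
print(put_comma(alignment))
```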
-
malaya_speech.utils.aligner.plot_alignments(alignment, subs_alignment, words_alignment, waveform, separator: str = ' ', sample_rate: int = 16000, figsize: tuple = (16, 9), plot_score_char: bool = False, plot_score_word: bool = True)[source]¶ plot alignment.
- Parameters
alignment (np.array) – usually alignment output.
subs_alignment (list) – usually chars_alignment or subwords_alignment output.
words_alignment (list) – usually words_alignment output.
waveform (np.array) – input audio.
separator (str, optional (default=' ')) – separator between words, only useful if subs_alignment is character based.
sample_rate (int, optional (default=16000)) –
figsize (tuple, optional (default=(16, 9))) – figure size for matplotlib figsize.
plot_score_char (bool, optional (default=False)) – plot score on top of character plots.
plot_score_word (bool, optional (default=True)) – plot score on top of word plots.
malaya_speech.utils.astype¶
-
malaya_speech.utils.astype.to_ndarray(array)[source]¶ Change list / tuple / bytes into np.array
- Parameters
array (list / tuple / bytes) –
- Returns
result
- Return type
np.array
-
malaya_speech.utils.astype.to_byte(array)[source]¶ Change list / tuple / np.array into bytes
- Parameters
array (list / tuple / np.array) –
- Returns
result
- Return type
bytes
-
malaya_speech.utils.astype.float_to_int(array, type=<class 'numpy.int16'>)[source]¶ Change np.array float32 / float64 into np.int16
- Parameters
array (np.array) –
type (np.int16) –
- Returns
result
- Return type
np.array
-
malaya_speech.utils.astype.int_to_float(array, type=<class 'numpy.float32'>)[source]¶ Change np.array int16 into np.float32
- Parameters
array (np.array) –
type (np.float32) –
- Returns
result
- Return type
np.array
malaya_speech.utils.char¶
-
malaya_speech.utils.char.generate_vocab(strings: List[str])[source]¶ Generate character vocab sorted based on frequency.
- Parameters
strings (List[str]) –
- Returns
result
- Return type
List[str]
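A frequency-sorted character vocabulary can be built with collections.Counter. This sketch assumes descending frequency with ties kept in first-seen order; whether generate_vocab adds special tokens or breaks ties differently is not stated here, so treat the details as assumptions.

```python
from collections import Counter


def generate_vocab(strings):
    """Return unique characters, most frequent first."""
    counts = Counter(char for string in strings for char in string)
    # most_common() sorts by count, keeping first-seen order among ties.
    return [char for char, _ in counts.most_common()]


vocab = generate_vocab(["aab", "abc"])
print(vocab)
```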
-
malaya_speech.utils.char.encode(string: str, add_eos: bool = True, add_blank: bool = False, lookup: List[str] = None)[source]¶ Encode string to integer representation based on ascii table or lookup variable.
- Parameters
string (str) –
add_eos (bool, optional (default=True)) – add EOS token at the end of the encoded sequence.
add_blank (bool, optional (default=False)) – add BLANK token at the start of the encoded sequence, this is for transducer / transformer based.
lookup (List[str], optional (default=None)) – list of unique strings.
- Returns
result
- Return type
List[int]
-
malaya_speech.utils.char.decode(ids, lookup: List[str] = None)[source]¶ Decode integer representation to string based on ascii table or lookup variable.
- Parameters
ids (List[int]) –
lookup (List[str], optional (default=None)) – list of unique strings.
- Returns
result
- Return type
str
malaya_speech.utils.combine¶
-
malaya_speech.utils.combine.without_silent(frames, threshold_to_stop: float = 0.1, silent_trail: int = 500)[source]¶ Group multiple frames based on label and threshold to stop.
- Parameters
frames (List[Tuple[Frame, label]]) – Output from VAD.
threshold_to_stop (float, optional (default = 0.1)) – If threshold_to_stop is 0.1, a run of samples with the same label must be at least 0.1 second long.
silent_trail (int, optional (default = 500)) – if silence is detected, will append first N frames and last N frames.
- Returns
result
- Return type
np.array
malaya_speech.utils.featurization¶
-
malaya_speech.utils.featurization.normalize_signal(signal, gain=None)[source]¶ Normalize float32 signal to [-1, 1] range
malaya_speech.utils.generator¶
-
malaya_speech.utils.generator.frames(audio, frame_duration_ms: int = 30, sample_rate: int = 16000, append_ending_trail: bool = True)[source]¶ Generates audio frames from audio. Takes the desired frame duration in milliseconds, the audio, and the sample rate.
- Parameters
audio (np.array) –
frame_duration_ms (int, optional (default=30)) –
sample_rate (int, optional (default=16000)) –
append_ending_trail (bool, optional (default=True)) – if True, will append the last trail, which might not have the same length as frame_duration_ms.
- Returns
result
- Return type
List[malaya_speech.model.frame.Frame]
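Framing by duration comes down to slicing the signal into chunks of sample_rate * frame_duration_ms / 1000 samples. The sketch below shows that arithmetic and the effect of append_ending_trail. It returns plain (timestamp, samples) tuples as a stand-in for Frame objects, which is an assumption made to keep the example self-contained.

```python
def frames(audio, frame_duration_ms=30, sample_rate=16000, append_ending_trail=True):
    """Slice audio into fixed-duration chunks, optionally keeping a short tail."""
    n = int(sample_rate * frame_duration_ms / 1000)  # samples per frame
    result = []
    for offset in range(0, len(audio), n):
        chunk = audio[offset:offset + n]
        if len(chunk) == n or append_ending_trail:
            result.append((offset / sample_rate, chunk))  # (seconds, samples)
    return result


audio = [0.0] * 1000
with_trail = frames(audio)
without_trail = frames(audio, append_ending_trail=False)
print([len(chunk) for _, chunk in with_trail])
```

At 16000 Hz a 30 ms frame holds 480 samples, so 1000 samples give two full frames plus a 40-sample trail.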
-
malaya_speech.utils.generator.mel_sampling(audio, frame_duration_ms=1200, overlap_ms=200, sample_rate=16000)[source]¶ Generates audio frames from audio. This is for melspectrogram generative model. Takes the desired frame duration in milliseconds, the audio, and the sample rate.
- Parameters
audio (np.array) –
frame_duration_ms (int, optional (default=1200)) –
overlap_ms (int, optional (default=200)) –
sample_rate (int, optional (default=16000)) –
- Returns
result
- Return type
List[np.array]
-
malaya_speech.utils.generator.combine_mel_sampling(samples, overlap_ms=200, sample_rate=16000, padded_ms=50)[source]¶ To combine results from mel_sampling, output from melspectrogram generative model.
- Parameters
samples (List[np.array]) –
overlap_ms (int, optional (default=200)) –
sample_rate (int, optional (default=16000)) –
- Returns
result
- Return type
List[np.array]
malaya_speech.utils.griffin_lim¶
-
malaya_speech.utils.griffin_lim.from_mel(mel_, sr=16000, n_fft=2048, n_iter=32, win_length=1000, hop_length=100)[source]¶ Change melspectrogram into waveform using Librosa.
- Parameters
mel_ (np.array) –
- Returns
result
- Return type
np.array
-
malaya_speech.utils.griffin_lim.from_mel_vocoder(mel, sr=22050, n_fft=1024, n_mels=256, fmin=80, fmax=7600, n_iter=32, win_length=None, hop_length=256)[source]¶ Change melspectrogram into waveform using Librosa.
- Parameters
mel (np.array) –
- Returns
result
- Return type
np.array
malaya_speech.utils.group¶
-
malaya_speech.utils.group.combine_frames(frames: List[malaya_speech.model.frame.Frame])[source]¶ Combine multiple frames into one frame.
- Parameters
frames (List[Frame]) –
- Returns
result
- Return type
Frame
-
malaya_speech.utils.group.group_frames(frames)[source]¶ Group multiple frames based on label.
- Parameters
frames (List[Tuple[Frame, label]]) –
- Returns
result
- Return type
List[Tuple[Frame, label]]
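Grouping consecutive frames that share a label is a run-length merge, which itertools.groupby expresses directly. In this sketch each frame is a plain list of samples rather than a Frame object, and the standalone group_frames is illustrative only.

```python
from itertools import groupby


def group_frames(frames):
    """Merge consecutive (samples, label) pairs that share the same label."""
    grouped = []
    for label, run in groupby(frames, key=lambda pair: pair[1]):
        samples = [sample for chunk, _ in run for sample in chunk]
        grouped.append((samples, label))
    return grouped


vad_output = [([1, 2], True), ([3], True), ([4], False), ([5], True)]
print(group_frames(vad_output))
```

Only adjacent frames merge, so a label that reappears later starts a new group.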
-
malaya_speech.utils.group.group_frames_threshold(frames, threshold_to_stop: float = 0.3)[source]¶ Group multiple frames based on label and threshold to stop.
- Parameters
frames (List[Tuple[Frame, label]]) –
threshold_to_stop (float, optional (default = 0.3)) – If threshold_to_stop is 0.3, a run of samples with the same label must be at least 0.3 second long.
- Returns
result
- Return type
List[Tuple[Frame, label]]
malaya_speech.utils.padding¶
-
malaya_speech.utils.padding.sequence_1d(seq, maxlen=None, padding: str = 'post', pad_int=0, return_len=False)[source]¶ padding sequence of 1d to become 2d array.
- Parameters
seq (List[List[int]]) –
maxlen (int, optional (default=None)) – If None, will calculate max length in the function.
padding (str, optional (default='post')) – If pre, will add 0 on the starting side, else add 0 on the end side.
pad_int (int, optional (default=0)) – padding value.
- Returns
result
- Return type
np.array
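The 'pre' and 'post' padding modes can be sketched with nested lists. This stand-in returns Python lists instead of np.array, assumes maxlen is at least the longest sequence, and assumes return_len yields the original lengths; none of these details are confirmed by the text above.

```python
def sequence_1d(seq, maxlen=None, padding="post", pad_int=0, return_len=False):
    """Pad variable-length 1d sequences into equal-length rows."""
    if maxlen is None:
        maxlen = max(len(s) for s in seq)
    padded = []
    for s in seq:
        # Assumes len(s) <= maxlen; no truncation is attempted in this sketch.
        fill = [pad_int] * (maxlen - len(s))
        padded.append(fill + list(s) if padding == "pre" else list(s) + fill)
    if return_len:
        return padded, [len(s) for s in seq]
    return padded


post = sequence_1d([[1, 2, 3], [4]])
pre = sequence_1d([[1, 2, 3], [4]], padding="pre")
rows, lengths = sequence_1d([[1, 2, 3], [4]], return_len=True)
```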
-
malaya_speech.utils.padding.sequence_nd(seq, maxlen=None, padding: str = 'post', pad_val=0.0, dim: int = 1, return_len=False)[source]¶ padding sequence of nd to become (n+1)d array.
- Parameters
seq (list of nd array) –
maxlen (int, optional (default=None)) – If None, will calculate max length in the function.
padding (str, optional (default='post')) – If pre, will add 0 on the starting side, else add 0 on the end side.
pad_val (float, optional (default=0.0)) – padding value.
dim (int, optional (default=1)) –
- Returns
result
- Return type
np.array
-
malaya_speech.utils.padding.tf_sequence_nd(seq, maxlen=None, padding: str = 'post', pad_val=0.0, dim: int = 1, return_len=False)[source]¶ padding sequence of nd to become (n+1)d array.
- Parameters
seq (list of nd array) –
maxlen (int, optional (default=None)) – If None, will calculate max length in the function.
padding (str, optional (default='post')) – If pre, will add 0 on the starting side, else add 0 on the end side.
pad_val (float, optional (default=0.0)) – padding value.
dim (int, optional (default=1)) –
- Returns
result
- Return type
np.array
malaya_speech.utils.read¶
-
malaya_speech.utils.read.resample(data, old_samplerate, new_samplerate)[source]¶ Resample signal.
- Parameters
data (np.array) –
old_samplerate (int) – old sample rate.
new_samplerate (int) – new sample rate.
- Returns
result
- Return type
data
-
malaya_speech.utils.read.load(file: str, sr=16000, scale: bool = True)[source]¶ Read sound file, any format supported by soundfile.read
- Parameters
file (str) –
sr (int, (default=16000)) – new sample rate. If the input sample rate differs, will resample automatically.
scale (bool, (default=True)) – Scale to the [-1, 1] range.
- Returns
result
- Return type
(y, sr)
malaya_speech.utils.split¶
-
malaya_speech.utils.split.split_vad(frames, n: int = 3, negative_threshold: float = 0.1)[source]¶ Split a sample into multiple samples based on n size of negative VAD.
- Parameters
frames (List[Tuple[Frame, label]]) –
n (int, optional (default=3)) – n size of negative VAD to assume in one subsample.
negative_threshold (float, optional (default = 0.1)) – If negative_threshold is 0.1, a run of negative samples must be at least 0.1 second long.
- Returns
result
- Return type
List[Frame]
-
malaya_speech.utils.split.split_vad_duration(frames, max_duration: float = 5.0, negative_threshold: float = 0.1)[source]¶ Split a sample into multiple samples based on maximum duration of voice activities.
- Parameters
frames (List[Tuple[Frame, label]]) –
max_duration (float, optional (default = 5.0)) – Maximum duration to assume one sample combined from voice activities.
negative_threshold (float, optional (default = 0.1)) – If negative_threshold is 0.1, a run of negative samples must be at least 0.1 second long.
- Returns
result
- Return type
List[Frame]
malaya_speech.utils.subword¶
-
malaya_speech.utils.subword.generate_tokenizer(strings: List[str], target_vocab_size: int = 1024, max_subword_length: int = 4, max_corpus_chars=None, reserved_tokens=None)[source]¶ Build a subword dictionary.
-
malaya_speech.utils.subword.save(tokenizer, path: str)[source]¶ Save subword dictionary to a text file.
-
malaya_speech.utils.subword.encode(tokenizer, string: str, add_blank: bool = False)[source]¶ Encode string to integer representation based on tokenizer vocab.
- Parameters
tokenizer (object) – tokenizer object
string (str) –
add_blank (bool, optional (default=False)) – add BLANK token at the start of the encoded sequence, this is for transducer / transformer based.
- Returns
result
- Return type
List[int]
-
malaya_speech.utils.subword.decode(tokenizer, ids)[source]¶ Decode integer representation to string based on tokenizer vocab.
- Parameters
tokenizer (object) – tokenizer object
ids (List[int]) –
- Returns
result
- Return type
str
-
malaya_speech.utils.subword.decode_multilanguage(tokenizers, ids)[source]¶ Decode integer representation to string using list of tokenizer objects.
- Parameters
tokenizers (List[object]) – List of tokenizer objects.
ids (List[int]) –
- Returns
result
- Return type
str
malaya_speech.utils.tf_featurization¶
malaya_speech.age_detection¶
-
malaya_speech.age_detection.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶ Load age detection deep model.
- Parameters
model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
'vggvox-v2' - finetuned VGGVox V2.
'deep-speaker' - finetuned Deep Speaker.
quantized (bool, optional (default=False)) – if True, will load 8-bit quantized model. A quantized model is not necessarily faster; it totally depends on the machine.
- Returns
result
- Return type
malaya_speech.supervised.classification.load function
malaya_speech.diarization¶
-
malaya_speech.diarization.speaker_similarity(vad_results, speaker_vector, similarity_threshold: float = 0.8, norm_function: Callable = None, return_embedding: bool = False)[source]¶ Speaker diarization using L2-Norm similarity.
- Parameters
vad_results (List[Tuple[Frame, label]]) – results from VAD.
speaker_vector (callable) – speaker vector object.
similarity_threshold (float, optional (default=0.8)) – if the current voice activity sample is at least 80% similar, it is assumed to be from the same speaker.
norm_function (Callable, optional(default=None)) – normalize function for speaker vectors.
speaker_change_threshold (float, optional (default=0.5)) – in one voice activity sample can be more than one speaker, split it using this threshold.
- Returns
result
- Return type
List[Tuple[Frame, label]]
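Threshold-based diarization can be pictured as a greedy assignment: each new speaker vector joins the most similar known speaker if the similarity reaches the threshold, otherwise it starts a new speaker. The sketch below uses cosine similarity as a swapped-in measure for illustration; the function names, the 'speaker N' label format, and the greedy strategy are assumptions rather than the behaviour of speaker_similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_speakers(vectors, similarity_threshold=0.8):
    """Greedily label each vector with an existing or a new speaker."""
    references, labels = [], []
    for vector in vectors:
        scores = [cosine_similarity(vector, ref) for ref in references]
        if scores and max(scores) >= similarity_threshold:
            labels.append(f"speaker {scores.index(max(scores))}")
        else:
            references.append(vector)  # first vector of a new speaker
            labels.append(f"speaker {len(references) - 1}")
    return labels


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
labels = assign_speakers(vectors)
print(labels)
```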
-
malaya_speech.diarization.n_clustering(vad_results, speaker_vector, model, norm_function: Callable = <function l2_normalize>, return_embedding=False)[source]¶ Speaker diarization using any clustering model.
- Parameters
vad_results (List[Tuple[Frame, label]]) – results from VAD.
speaker_vector (callable) – speaker vector object.
model (callable) – Prefer any sklearn unsupervised clustering model. Required fit_predict or apply method.
norm_function (Callable, optional(default=malaya_speech.utils.dist.l2_normalize)) – normalize function for speaker vectors.
log_distance_metric (str, optional (default='cosine')) – post distance norm in log scale metrics.
- Returns
result
- Return type
List[Tuple[Frame, label]]
-
malaya_speech.diarization.affinity_propagation(vad_results, speaker_vector, norm_function: Callable = <function l2_normalize>, log_distance_metric: str = 'cosine', damping: float = 0.8, preference: float = None, return_embedding=False)[source]¶ Speaker diarization using sklearn Affinity Propagation.
- Parameters
vad_results (List[Tuple[Frame, label]]) – results from VAD.
speaker_vector (callable) – speaker vector object.
norm_function (Callable, optional(default=malaya_speech.utils.dist.l2_normalize)) – normalize function for speaker vectors.
log_distance_metric (str, optional (default='cosine')) – post distance norm in log scale metrics.
- Returns
result
- Return type
List[Tuple[Frame, label]]
-
malaya_speech.diarization.spectral_cluster(vad_results, speaker_vector, min_clusters: int = None, max_clusters: int = None, norm_function: Callable = <function l2_normalize>, log_distance_metric: str = None, return_embedding=False, **kwargs)[source]¶ Speaker diarization using SpectralCluster, https://github.com/wq2012/SpectralCluster
- Parameters
vad_results (List[Tuple[Frame, label]]) – results from VAD.
speaker_vector (callable) – speaker vector object.
min_clusters (int, optional (default=None)) – minimal number of clusters allowed (only effective if not None).
max_clusters (int, optional (default=None)) – maximal number of clusters allowed (only effective if not None). Can be used together with min_clusters to fix the number of clusters.
norm_function (Callable, optional(default=malaya_speech.utils.dist.l2_normalize)) – normalize function for speaker vectors.
log_distance_metric (str, optional (default=None)) – post distance norm in log scale metrics.
- Returns
result
- Return type
List[Tuple[Frame, label]]
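How min_clusters and max_clusters can fix the number of clusters is easiest to see as a clamp on an estimated count. The sketch below only illustrates that idea; bound_cluster_count is a hypothetical helper, not part of malaya-speech or SpectralCluster, and the real estimation logic may differ.

```python
from typing import Optional


def bound_cluster_count(
    estimated: int,
    min_clusters: Optional[int] = None,
    max_clusters: Optional[int] = None,
) -> int:
    """Clamp an estimated cluster count; a bound of None has no effect."""
    if min_clusters is not None:
        estimated = max(estimated, min_clusters)
    if max_clusters is not None:
        estimated = min(estimated, max_clusters)
    return estimated


unbounded = bound_cluster_count(5)
raised_to_min = bound_cluster_count(1, min_clusters=2)
# Setting both bounds to the same value pins the result to that value.
fixed = bound_cluster_count(7, min_clusters=3, max_clusters=3)
```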
malaya_speech.emotion¶
-
malaya_speech.emotion.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶ Load emotion detection deep model.
- Parameters
model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
'vggvox-v2' - finetuned VGGVox V2.
'deep-speaker' - finetuned Deep Speaker.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.supervised.classification.load function
malaya_speech.force_alignment¶
-
malaya_speech.force_alignment.available_transducer()[source]¶ List available Encoder-Transducer Aligner models.
-
malaya_speech.force_alignment.available_huggingface()[source]¶ List available HuggingFace Malaya-Speech Aligner models.
-
malaya_speech.force_alignment.deep_transducer(model: str = 'conformer-transducer', quantized: bool = False, **kwargs)[source]¶ Load Encoder-Transducer Aligner model.
- Parameters
model (str, optional (default='conformer-transducer')) –
Model architecture supported. Allowed values:
'conformer-transducer' - Conformer + RNNT trained on Malay STT dataset.
'conformer-transducer-mixed' - Conformer + RNNT trained on Mixed STT dataset.
'conformer-transducer-singlish' - Conformer + RNNT trained on Singlish STT dataset.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.transducer.TransducerAligner class
-
malaya_speech.force_alignment.deep_ctc(model: str = 'hubert-conformer', quantized: bool = False, **kwargs)[source]¶ Load Encoder-CTC Aligner model.
- Parameters
model (str, optional (default='hubert-conformer')) –
Model architecture supported. Allowed values:
'hubert-conformer-tiny' - Finetuned HuBERT Conformer TINY.
'hubert-conformer' - Finetuned HuBERT Conformer.
'hubert-conformer-large' - Finetuned HuBERT Conformer LARGE.
'hubert-conformer-large-3mixed' - Finetuned HuBERT Conformer LARGE for (Malay + Singlish + Mandarin) languages.
'best-rq-conformer-tiny' - Finetuned BEST-RQ Conformer TINY.
'best-rq-conformer' - Finetuned BEST-RQ Conformer.
'best-rq-conformer-large' - Finetuned BEST-RQ Conformer LARGE.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.wav2vec.Wav2Vec2_Aligner class
-
malaya_speech.force_alignment.huggingface(model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed')[source]¶ Load Finetuned models from HuggingFace.
- Parameters
model (str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')) –
Model architecture supported. Allowed values:
'mesolitica/wav2vec2-xls-r-300m-mixed' - wav2vec2 XLS-R 300M finetuned on (Malay + Singlish + Mandarin) languages.
- Returns
result
- Return type
malaya_speech.model.huggingface.CTC class
malaya_speech.gender¶
-
malaya_speech.gender.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶ Load gender detection deep model.
- Parameters
model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
'vggvox-v2' - finetuned VGGVox V2.
'deep-speaker' - finetuned Deep Speaker.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.supervised.classification.load function
malaya_speech.language_detection¶
-
malaya_speech.language_detection.available_model()[source]¶ List available language detection deep models.
-
malaya_speech.language_detection.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶ Load language detection deep model.
- Parameters
model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
'vggvox-v2' - finetuned VGGVox V2.
'deep-speaker' - finetuned Deep Speaker.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.supervised.classification.load function
malaya_speech.multispeaker_separation¶
-
malaya_speech.multispeaker_separation.available_deep_wav()[source]¶ List available FastSep models trained on raw 8k wav.
-
malaya_speech.multispeaker_separation.deep_wav(model: str = 'fastsep-4', quantized: bool = False, **kwargs)[source]¶ Load FastSep model, trained on raw 8k wav using SISNR PIT loss.
- Parameters
model (str, optional (default='fastsep-4')) –
Model architecture supported. Allowed values:
'fastsep-2' - FastSep 2 layers trained on raw 8k wav.
'fastsep-4' - FastSep 4 layers trained on raw 8k wav.
'fastsep-6' - FastSep 6 layers trained on raw 8k wav.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.tf.Split class
malaya_speech.noise_reduction¶
-
malaya_speech.noise_reduction.available_model()[source]¶ List available Noise Reduction deep learning models.
-
malaya_speech.noise_reduction.deep_model(model: str = 'resnet-unet', quantized: bool = False, **kwargs)[source]¶ Load Noise Reduction deep learning model.
- Parameters
model (str, optional (default='resnet-unet')) –
Model architecture supported. Allowed values:
'unet' - pretrained UNET.
'resnet-unet' - pretrained resnet-UNET.
'resnext' - pretrained resnext-UNET.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.tf.UNET_STFT class
malaya_speech.speaker_change¶
-
malaya_speech.speaker_change.deep_model(model: str = 'speakernet', quantized: bool = False, **kwargs)[source]¶ Load speaker change deep model.
- Parameters
model (str, optional (default='speakernet')) –
Model architecture supported. Allowed values:
'vggvox-v2' - finetuned VGGVox V2.
'speakernet' - finetuned SpeakerNet.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.supervised.classification.load function
-
malaya_speech.speaker_change.split_activities(vad_results, speaker_change_results, speaker_change_threshold: float = 0.5, sr: int = 16000, ignore_not_activity=True)[source]¶ Split VAD results based on the speaker change threshold, worst-case O(N^2).
- Parameters
vad_results (List[Tuple[Frame, label]]) – results from VAD.
speaker_change_results (List[Tuple[Frame, float]], optional (default=None)) – results from the speaker change module, must be float results.
speaker_change_threshold (float, optional (default=0.5)) – a single voice activity sample can contain more than one speaker; split it using this threshold.
sr (int, optional (default=16000)) – sample rate; classification models in malaya-speech use 16k.
ignore_not_activity (bool, optional (default=True)) – if True, segments where the VAD result is False are not split; otherwise they are split as well.
- Returns
result
- Return type
List[Tuple[Frame, label]]
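The splitting idea can be sketched without the library. The example below is a simplified illustration, not the actual split_activities code: it represents each segment as a plain (start, end, is_voice) tuple in seconds instead of a (Frame, label) pair, and the comparison against the threshold (>=) is an assumption. Each VAD segment scans every speaker-change score, which is where a worst-case O(N^2) cost comes from.

```python
from typing import List, Tuple

Segment = Tuple[float, float, bool]       # (start, end, is_voice), hypothetical
ChangeScore = Tuple[float, float, float]  # (start, end, probability), hypothetical


def split_on_speaker_change(
    vad: List[Segment],
    changes: List[ChangeScore],
    threshold: float = 0.5,
    ignore_not_activity: bool = True,
) -> List[Segment]:
    results: List[Segment] = []
    for start, end, is_voice in vad:
        if ignore_not_activity and not is_voice:
            # Non-voice segments are passed through unsplit.
            results.append((start, end, is_voice))
            continue
        # Cut wherever a likely speaker change starts inside this segment.
        cut_points = sorted(
            c_start
            for c_start, _, prob in changes
            if prob >= threshold and start < c_start < end
        )
        previous = start
        for cut in cut_points:
            results.append((previous, cut, is_voice))
            previous = cut
        results.append((previous, end, is_voice))
    return results


vad = [(0.0, 1.0, False), (1.0, 4.0, True)]
changes = [(0.5, 1.0, 0.9), (2.0, 2.5, 0.8), (3.0, 3.5, 0.2)]
segments = split_on_speaker_change(vad, changes)
```

With these made-up inputs only the voiced segment is split, at the 2.0 second mark; passing ignore_not_activity=False would also split the silent segment at 0.5.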
malaya_speech.speaker_overlap¶
-
malaya_speech.speaker_overlap.available_model()[source]¶ List available speaker overlap deep models.
-
malaya_speech.speaker_overlap.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶ Load speaker overlap deep model.
- Parameters
model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
'vggvox-v2' - finetuned VGGVox V2.
'speakernet' - finetuned SpeakerNet.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.supervised.classification.load function
malaya_speech.speaker_vector¶
-
malaya_speech.speaker_vector.deep_model(model: str = 'vggvox-v2', quantized: bool = False, **kwargs)[source]¶ Load Speaker2Vec model.
- Parameters
model (str, optional (default='vggvox-v2')) –
Model architecture supported. Allowed values:
'vggvox-v1' - VGGVox V1, embedding size 1024, exported from https://github.com/linhdvu14/vggvox-speaker-identification
'vggvox-v2' - VGGVox V2, embedding size 512, exported from https://github.com/WeidiXie/VGG-Speaker-Recognition
'deep-speaker' - Deep Speaker, embedding size 512, exported from https://github.com/philipperemy/deep-speaker
'speakernet' - SpeakerNet, embedding size 7205, exported from https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_recognition
'conformer-base' - Conformer BASE size, embedding size 512.
'conformer-tiny' - Conformer TINY size, embedding size 512.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.supervised.classification.load function
malaya_speech.speech_enhancement¶
-
malaya_speech.speech_enhancement.available_deep_masking()[source]¶ List available Speech Enhancement STFT masking deep learning model.
-
malaya_speech.speech_enhancement.available_deep_enhance()[source]¶ List available Speech Enhancement UNET Waveform sampling deep learning model.
-
malaya_speech.speech_enhancement.deep_masking(model: str = 'resnet-unet', quantized: bool = False, **kwargs)[source]¶ Load Speech Enhancement STFT UNET masking deep learning model.
- Parameters
model (str, optional (default='resnet-unet')) –
Model architecture supported. Allowed values:
'unet' - pretrained UNET.
'resnet-unet' - pretrained resnet-UNET.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.unet.UNETSTFT class
-
malaya_speech.speech_enhancement.deep_enhance(model: str = 'unet', quantized: bool = False, **kwargs)[source]¶ Load Speech Enhancement UNET Waveform sampling deep learning model.
- Parameters
model (str, optional (default='unet')) –
Model architecture supported. Allowed values:
'unet' - pretrained UNET Speech Enhancement.
'resnet-unet' - pretrained resnet-UNET Speech Enhancement.
'resnext-unet' - pretrained resnext-UNET Speech Enhancement.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.unet.UNET1D class
malaya_speech.speechsplit_conversion¶
-
malaya_speech.speechsplit_conversion.available_deep_conversion(f0_mode='pysptk')[source]¶ List available Voice Conversion models.
-
malaya_speech.speechsplit_conversion.deep_conversion(model: str = 'fastspeechsplit-v2-vggvox-v2', f0_mode: str = 'pysptk', quantized: bool = False, **kwargs)[source]¶ Load Voice Conversion model.
- Parameters
model (str, optional (default='fastspeechsplit-v2-vggvox-v2')) –
Model architecture supported. Allowed values:
'fastspeechsplit-vggvox-v2' - FastSpeechSplit with VGGVox-v2 Speaker Vector.
'fastspeechsplit-v2-vggvox-v2' - FastSpeechSplit V2 with VGGVox-v2 Speaker Vector.
f0_mode (str, optional (default='pysptk')) –
F0 conversion supported. Allowed values:
'pysptk' - https://github.com/r9y9/pysptk, sensitive towards gender.
'pyworld' - https://pypi.org/project/pyworld/
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.splitter.FastSpeechSplit class
malaya_speech.stack¶
-
malaya_speech.stack.classification_stack(models)[source]¶ Stacking for classification models. All models should be in the same domain classification.
- Parameters
models (List[Callable]) – list of models.
- Returns
result
- Return type
malaya_speech.stack.Stack class
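Stacking classification models from the same domain can be pictured as averaging their per-label probabilities. The sketch below is an illustration only: the models are hypothetical callables returning label-to-probability dictionaries, the arithmetic mean is an assumed aggregate, and the actual Stack class may expose a different interface and aggregate.

```python
from typing import Callable, Dict, List, Sequence

Model = Callable[[Sequence[float]], Dict[str, float]]


def stack_predict_proba(models: List[Model], audio: Sequence[float]) -> Dict[str, float]:
    """Average per-label probabilities across models with identical labels."""
    outputs = [model(audio) for model in models]
    labels = set(outputs[0])
    for output in outputs[1:]:
        if set(output) != labels:
            # Models from different classification domains cannot be stacked.
            raise ValueError('all models must predict the same set of labels')
    return {
        label: sum(output[label] for output in outputs) / len(outputs)
        for label in outputs[0]
    }


# Hypothetical stand-ins for two loaded gender detection models.
def model_a(audio):
    return {'male': 0.8, 'female': 0.2}


def model_b(audio):
    return {'male': 0.6, 'female': 0.4}


averaged = stack_predict_proba([model_a, model_b], [0.0, 0.1, -0.1])
best_label = max(averaged, key=averaged.get)
```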
malaya_speech.model.stack.Stack¶
-
class malaya_speech.stack.Stack[source]¶
malaya_speech.stt¶
-
malaya_speech.stt.available_huggingface()[source]¶ List available HuggingFace Malaya-Speech ASR models.
-
malaya_speech.stt.language_model(model: str = 'dump-combined', **kwargs)[source]¶ Load KenLM language model.
- Parameters
model (str, optional (default='dump-combined')) –
Model architecture supported. Allowed values:
'bahasa' - Gathered from malaya-speech ASR bahasa transcript.
'bahasa-news' - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences).
'bahasa-combined' - Gathered from malaya-speech ASR bahasa transcript + Bahasa News (Random sample 300k sentences) + Bahasa Wikipedia (Random sample 150k sentences).
'redape-community' - Mirror for https://github.com/redapesolutions/suara-kami-community
'dump-combined' - Academia + News + IIUM + Parliament + Wattpad + Wikipedia + Common Crawl + training set from https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt.
'manglish' - Manglish News + Manglish Reddit + Manglish forum + training set from https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt.
'bahasa-manglish-combined' - Combined dump-combined and manglish.
- Returns
result
- Return type
str
-
malaya_speech.stt.deep_ctc(model: str = 'hubert-conformer', quantized: bool = False, **kwargs)[source]¶ Load Encoder-CTC ASR model.
- Parameters
model (str, optional (default='hubert-conformer')) –
Model architecture supported. Allowed values:
'hubert-conformer-tiny' - Finetuned HuBERT Conformer TINY.
'hubert-conformer' - Finetuned HuBERT Conformer.
'hubert-conformer-large' - Finetuned HuBERT Conformer LARGE.
'hubert-conformer-large-3mixed' - Finetuned HuBERT Conformer LARGE for (Malay + Singlish + Mandarin) languages.
'best-rq-conformer-tiny' - Finetuned BEST-RQ Conformer TINY.
'best-rq-conformer' - Finetuned BEST-RQ Conformer.
'best-rq-conformer-large' - Finetuned BEST-RQ Conformer LARGE.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.wav2vec.Wav2Vec2_CTC class
-
malaya_speech.stt.deep_transducer(model: str = 'conformer', quantized: bool = False, **kwargs)[source]¶ Load Encoder-Transducer ASR model.
- Parameters
model (str, optional (default='conformer')) –
Model architecture supported. Allowed values:
'tiny-conformer' - TINY size Google Conformer.
'small-conformer' - SMALL size Google Conformer.
'conformer' - BASE size Google Conformer.
'large-conformer' - LARGE size Google Conformer.
'conformer-stack-2mixed' - BASE size Stacked Google Conformer for (Malay + Singlish) languages.
'conformer-stack-3mixed' - BASE size Stacked Google Conformer for (Malay + Singlish + Mandarin) languages.
'small-conformer-singlish' - SMALL size Google Conformer for Singlish language.
'conformer-singlish' - BASE size Google Conformer for Singlish language.
'large-conformer-singlish' - LARGE size Google Conformer for Singlish language.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.transducer.Transducer class
-
malaya_speech.stt.huggingface(model: str = 'mesolitica/wav2vec2-xls-r-300m-mixed', **kwargs)[source]¶ Load Finetuned models from HuggingFace. Requires TensorFlow >= 2.0.
- Parameters
model (str, optional (default='mesolitica/wav2vec2-xls-r-300m-mixed')) –
Model architecture supported. Allowed values:
'mesolitica/wav2vec2-xls-r-300m-mixed' - wav2vec2 XLS-R 300M finetuned on (Malay + Singlish + Mandarin) languages.
- Returns
result
- Return type
malaya_speech.model.huggingface.CTC class
malaya_speech.super_resolution¶
-
malaya_speech.super_resolution.available_model()[source]¶ List available Super Resolution 4x deep learning models.
-
malaya_speech.super_resolution.deep_model(model: str = 'srgan-256', quantized: bool = False, **kwargs)[source]¶ Load Super Resolution 4x deep learning model.
- Parameters
model (str, optional (default='srgan-256')) –
Model architecture supported. Allowed values:
'srgan-128' - srgan with 128 filter size and 16 residual blocks.
'srgan-256' - srgan with 256 filter size and 16 residual blocks.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.tf.UNET1D class
malaya_speech.tts¶
-
malaya_speech.tts.load_text_ids(pad_to: int = 8, understand_punct: bool = True, is_lower: bool = True, **kwargs)[source]¶ Load text normalizer module used by Malaya-Speech TTS.
-
malaya_speech.tts.tacotron2(model: str = 'yasmin', quantized: bool = False, pad_to: int = 8, **kwargs)[source]¶ Load Tacotron2 TTS model.
- Parameters
model (str, optional (default='yasmin')) –
Model architecture supported. Allowed values:
'female' - Tacotron2 trained on female voice.
'male' - Tacotron2 trained on male voice.
'husein' - Tacotron2 trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
'haqkiem' - Tacotron2 trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
'yasmin' - Tacotron2 trained on female Yasmin voice.
'osman' - Tacotron2 trained on male Osman voice.
'female-singlish' - Tacotron2 trained on female Singlish voice, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
pad_to (int, optional (default=8)) – padding size, with 0 as the pad value. Increasing it can stabilize predictions on short sentences; the models were trained with 8.
- Returns
result
- Return type
malaya_speech.model.synthesis.Tacotron class
-
malaya_speech.tts.fastspeech2(model: str = 'male', quantized: bool = False, pad_to: int = 8, **kwargs)[source]¶ Load Fastspeech2 TTS model.
- Parameters
model (str, optional (default='male')) –
Model architecture supported. Allowed values:
'female' - Fastspeech2 trained on female voice.
'male' - Fastspeech2 trained on male voice.
'husein' - Fastspeech2 trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
'haqkiem' - Fastspeech2 trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
'yasmin' - Fastspeech2 trained on female Yasmin voice.
'osman' - Fastspeech2 trained on male Osman voice.
'female-singlish' - Fastspeech2 trained on female Singlish voice, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
pad_to (int, optional (default=8)) – padding size, with 0 as the pad value. Increasing it can stabilize predictions on short sentences; the models were trained with 8.
- Returns
result
- Return type
malaya_speech.model.synthesis.Fastspeech class
-
malaya_speech.tts.fastpitch(model: str = 'male', quantized: bool = False, pad_to: int = 8, **kwargs)[source]¶ Load FastPitch TTS model.
- Parameters
model (str, optional (default='male')) –
Model architecture supported. Allowed values:
'female' - Fastpitch trained on female voice.
'male' - Fastpitch trained on male voice.
'husein' - Fastpitch trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
'haqkiem' - Fastpitch trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
pad_to (int, optional (default=8)) – padding size, with 0 as the pad value. Increasing it can stabilize predictions on short sentences; the models were trained with 8.
- Returns
result
- Return type
malaya_speech.model.synthesis.Fastpitch class
-
malaya_speech.tts.glowtts(model: str = 'yasmin', quantized: bool = False, pad_to: int = 2, **kwargs)[source]¶ Load GlowTTS TTS model.
- Parameters
model (str, optional (default='yasmin')) –
Model architecture supported. Allowed values:
'female' - GlowTTS trained on female voice.
'male' - GlowTTS trained on male voice.
'haqkiem' - GlowTTS trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
'female-singlish' - GlowTTS trained on female Singlish voice, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus
'yasmin' - GlowTTS trained on female Yasmin voice.
'osman' - GlowTTS trained on male Osman voice.
'multispeaker' - Multispeaker GlowTTS trained on male, female, husein and haqkiem voices, also able to do voice conversion.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
pad_to (int, optional (default=2)) – padding size, with 0 as the pad value. Increasing it can stabilize predictions on short sentences; the models were trained with 2.
- Returns
result
- Return type
malaya_speech.model.synthesis.GlowTTS class
-
malaya_speech.tts.lightspeech(model: str = 'male', quantized: bool = False, pad_to: int = 8, **kwargs)[source]¶ Load LightSpeech TTS model.
- Parameters
model (str, optional (default='male')) –
Model architecture supported. Allowed values:
'yasmin' - LightSpeech trained on female Yasmin voice.
'osman' - LightSpeech trained on male Osman voice.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
pad_to (int, optional (default=8)) – padding size, with 0 as the pad value. Increasing it can stabilize predictions on short sentences; the models were trained with 8.
- Returns
result
- Return type
malaya_speech.model.synthesis.Fastspeech class
malaya_speech.vad¶
-
malaya_speech.vad.webrtc(aggressiveness: int = 3, sample_rate: int = 16000, minimum_amplitude: int = 100)[source]¶ Load WebRTC VAD model.
- Parameters
aggressiveness (int, optional (default=3)) – an integer between 0 and 3. 0 is the least aggressive about filtering out non-speech, 3 is the most aggressive.
sample_rate (int, optional (default=16000)) – sample rate for samples.
minimum_amplitude (int, optional (default=100)) – minimum absolute amplitude for a sample to be considered voice activity. Samples below it are automatically classified as False.
- Returns
result
- Return type
malaya_speech.model.webrtc.WebRTC class
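The role of minimum_amplitude can be sketched as a gate that runs before the actual detector. This is a hedged illustration rather than the library's code: it assumes the gate compares the peak absolute amplitude of a frame of integer PCM samples, whereas the real class may use a different statistic, and is_speech_gated and always_voice are hypothetical names.

```python
from typing import Callable, Sequence


def is_speech_gated(
    samples: Sequence[int],
    detector: Callable[[Sequence[int]], bool],
    minimum_amplitude: int = 100,
) -> bool:
    """Return False for quiet frames without consulting the detector."""
    peak = max((abs(sample) for sample in samples), default=0)
    if peak < abs(minimum_amplitude):
        return False
    return detector(samples)


# Hypothetical detector standing in for the WebRTC VAD decision.
def always_voice(samples):
    return True


quiet = is_speech_gated([3, -12, 40, -7], always_voice)
loud = is_speech_gated([300, -1200, 4000, -700], always_voice)
```

A gate like this keeps low-level background noise from being reported as voice activity, regardless of what the underlying detector would say.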
-
malaya_speech.vad.deep_model(model: str = 'marblenet-factor1', quantized: bool = False, **kwargs)[source]¶ Load VAD model.
- Parameters
model (str, optional (default='marblenet-factor1')) –
Model architecture supported. Allowed values:
'vggvox-v1' - finetuned VGGVox V1.
'vggvox-v2' - finetuned VGGVox V2.
'speakernet' - finetuned SpeakerNet.
'marblenet-factor1' - Pretrained MarbleNet * factor 1.
'marblenet-factor3' - Pretrained MarbleNet * factor 3.
'marblenet-factor5' - Pretrained MarbleNet * factor 5.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.supervised.classification.load function
malaya_speech.vocoder¶
-
malaya_speech.vocoder.available_mbmelgan()[source]¶ List available Multiband MelGAN Mel-to-Speech models.
-
malaya_speech.vocoder.melgan(model: str = 'universal-1024', quantized: bool = False, **kwargs)[source]¶ Load MelGAN Vocoder model.
- Parameters
model (str, optional (default='universal-1024')) –
Model architecture supported. Allowed values:
'female' - MelGAN trained on female voice.
'male' - MelGAN trained on male voice.
'husein' - MelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
'haqkiem' - MelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
'yasmin' - MelGAN trained on female Yasmin voice.
'osman' - MelGAN trained on male Osman voice.
'female-singlish' - MelGAN trained on female Singlish voice, https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus
'universal' - Universal MelGAN trained on multiple speakers.
'universal-1024' - Universal MelGAN with 1024 filters trained on multiple speakers.
'universal-384' - Universal MelGAN with 384 filters trained on multiple speakers.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.synthesis.Vocoder class
-
malaya_speech.vocoder.mbmelgan(model: str = 'female', quantized: bool = False, **kwargs)[source]¶ Load Multiband MelGAN Vocoder model.
- Parameters
model (str, optional (default='female')) –
Model architecture supported. Allowed values:
'female' - MBMelGAN trained on female voice.
'male' - MBMelGAN trained on male voice.
'husein' - MBMelGAN trained on Husein voice, https://www.linkedin.com/in/husein-zolkepli/
'haqkiem' - MBMelGAN trained on Haqkiem voice, https://www.linkedin.com/in/haqkiem-daim/
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.synthesis.Vocoder class
-
malaya_speech.vocoder.hifigan(model: str = 'universal-768', quantized: bool = False, **kwargs)[source]¶ Load HiFiGAN Vocoder model.
- Parameters
model (str, optional (default='universal-768')) –
Model architecture supported. Allowed values:
'female' - HiFiGAN trained on female voice.
'male' - HiFiGAN trained on male voice.
'universal-1024' - Universal HiFiGAN with 1024 filters trained on multiple speakers.
'universal-768' - Universal HiFiGAN with 768 filters trained on multiple speakers.
'universal-512' - Universal HiFiGAN with 512 filters trained on multiple speakers.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.synthesis.Vocoder class
malaya_speech.voice_conversion¶
-
malaya_speech.voice_conversion.available_deep_conversion()[source]¶ List available Voice Conversion models.
-
malaya_speech.voice_conversion.deep_conversion(model: str = 'fastvc-32-vggvox-v2', quantized: bool = False, **kwargs)[source]¶ Load Voice Conversion model.
- Parameters
model (str, optional (default='fastvc-32-vggvox-v2')) –
Model architecture supported. Allowed values:
'fastvc-32-vggvox-v2' - FastVC bottleneck size 32 with VGGVox-v2 Speaker Vector.
'fastvc-64-vggvox-v2' - FastVC bottleneck size 64 with VGGVox-v2 Speaker Vector.
quantized (bool, optional (default=False)) – if True, will load the 8-bit quantized model. A quantized model is not necessarily faster; it depends on the machine.
- Returns
result
- Return type
malaya_speech.model.synthesis.FastVC class