\n",
"\n",
"This tutorial is available as an IPython notebook at [malaya-speech/example/gpt2-lm](https://github.com/huseinzol05/malaya-speech/tree/master/example/gpt2-lm).\n",
"\n",
"This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Purpose"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"During CTC or RNNT beam decoding, we want to add a language bias while searching for the optimal alignment, using a GPT2 causal language model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### List available GPT2\n",
"\n",
"We provide a few GPT2 models for our ASR models:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import malaya_speech"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr>\n",
"      <th></th>\n",
"      <th>Size (MB)</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>mesolitica/gpt2-117m-bahasa-cased</th>\n",
"      <td>454</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Size (MB)\n",
"mesolitica/gpt2-117m-bahasa-cased 454"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"malaya_speech.language_model.available_gpt2()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load GPT2 Model\n",
"\n",
"```python\n",
"def gpt2(model: str = 'mesolitica/gpt2-117m-bahasa-cased', force_check: bool = True, **kwargs):\n",
"    \"\"\"\n",
"    Load GPT2 language model.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    model: str, optional (default='mesolitica/gpt2-117m-bahasa-cased')\n",
"        Check available models at `malaya_speech.language_model.available_gpt2()`.\n",
"    force_check: bool, optional (default=True)\n",
"        Force check that the model is one of the malaya models.\n",
"        Set to False if you have your own huggingface model.\n",
"\n",
"    Returns\n",
"    -------\n",
"    result: malaya.torch_model.gpt2_lm.LM class\n",
"    \"\"\"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lm = malaya_speech.language_model.gpt2()\n",
"lm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The malaya-speech GPT2 LM needs to be combined with `pyctcdecode` to decode CTC logits."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use pyctcdecode\n",
"\n",
"#### From PyPI\n",
"\n",
"```bash\n",
"pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121\n",
"```\n",
"\n",
"#### From source\n",
"\n",
"See https://github.com/kensho-technologies/pyctcdecode for how to build from source in case no wheel is available for your operating system.\n",
"\n",
"Building from source should only take a few minutes."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from pyctcdecode import Alphabet, BeamSearchDecoderCTC\n",
"from malaya_speech.utils.char import CTC_VOCAB\n",
"\n",
"# Append '_' as the CTC token; it sits at the last index of the label list.\n",
"labels = CTC_VOCAB + ['_']\n",
"ctc_token_idx = len(CTC_VOCAB)\n",
"# Build the decoder alphabet, then attach the GPT2 LM loaded above.\n",
"alphabet = Alphabet.build_alphabet(labels, ctc_token_idx=ctc_token_idx)\n",
"decoder = BeamSearchDecoderCTC(alphabet, lm)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}