GPT2-Nepali
Overview
A GPT2 model pretrained on a 12.5 GB Nepali dataset from the NepBERTa project, featuring custom tokenizer training and a comprehensive preprocessing pipeline.
Project Structure
📁 1_preprocessing
This directory contains scripts for preprocessing the NepBERTa dataset:
- Data cleaning
- Pre-tokenization
- Data preparation with context_length = stride = 512, i.e. non-overlapping 512-token chunks (see the sketch after this list)
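Because the stride equals the context length, the pre-tokenized corpus can be cut into non-overlapping 512-token windows, with targets shifted by one token for next-token prediction. Below is a minimal sketch of such a dataset; the class name and the assumption that the corpus is already a flat list of token IDs are illustrative, not taken from the repository.

```python
import torch
from torch.utils.data import Dataset

class NepaliGPTDataset(Dataset):
    """Slices a flat stream of pre-tokenized IDs into fixed-length
    input/target pairs. With stride == context_length the input windows
    do not overlap. Illustrative sketch, not the repo's exact code."""

    def __init__(self, token_ids, context_length=512, stride=512):
        self.inputs, self.targets = [], []
        for start in range(0, len(token_ids) - context_length, stride):
            chunk = token_ids[start : start + context_length + 1]
            # next-token prediction: targets are the inputs shifted left by one
            self.inputs.append(torch.tensor(chunk[:-1]))
            self.targets.append(torch.tensor(chunk[1:]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```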
📁 2_tokenizer
This directory includes tools and scripts for:
- Training a custom tokenizer for the Nepali dataset (see the sketch after this list)
- Visualizing and analyzing token distributions
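As a rough illustration of what the tokenizer training can look like, the snippet below uses the Hugging Face tokenizers library to train a GPT2-style byte-level BPE tokenizer. The corpus path, vocabulary size, and output directory are placeholders, not the project's actual settings.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a GPT2-style byte-level BPE tokenizer on the cleaned Nepali corpus.
# File path, vocab_size, and output directory are placeholders.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["nepali_corpus_clean.txt"],   # hypothetical cleaned corpus file
    vocab_size=50_000,                   # assumed vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],    # GPT2's end-of-text token
)
tokenizer.save_model("tokenizer_nepali") # writes vocab.json and merges.txt

# Quick sanity check on a Nepali sentence
print(tokenizer.encode("नेपाली भाषा").tokens)
```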
📁 3_GPT2-Nepali
This directory contains the core code for:
- Training the GPT2 model on the Nepali dataset
- Running inference with the trained model (a training/inference sketch follows the note below)
Note: Most of the code in this section is adapted from the book Build a Large Language Model (From Scratch) by Sebastian Raschka and the corresponding GitHub repository, LLMs-from-scratch.
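For orientation, here is a minimal sketch of a training epoch and a greedy decoding loop in the spirit of that book. It assumes model(inputs) returns logits of shape (batch, seq_len, vocab_size); the function names and defaults are illustrative rather than the repository's actual code.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, optimizer, device="cuda"):
    """One pass over DataLoader batches of (inputs, targets) from the
    preprocessing step; standard next-token cross-entropy loss."""
    model.train()
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)                      # (batch, seq_len, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

@torch.no_grad()
def generate(model, token_ids, max_new_tokens=50, context_length=512, device="cuda"):
    """Greedy decoding: repeatedly feed the last `context_length` tokens
    and append the most probable next token."""
    model.eval()
    ids = torch.tensor(token_ids, device=device).unsqueeze(0)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_length:])
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids.squeeze(0).tolist()
```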
Todo
- ☐ Multi-GPU training with PyTorch DistributedDataParallel (DDP); see the setup sketch after this list
- ☐ Train on a larger dataset and scale up the model size
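As a rough outline of the planned multi-GPU work, wrapping the model in PyTorch's DistributedDataParallel typically looks like the sketch below. This is not code from this repository; it assumes the script is launched with torchrun, which sets the RANK/LOCAL_RANK/WORLD_SIZE environment variables.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model):
    """Planned (not yet implemented) multi-GPU setup: one process per GPU,
    gradients synchronized across processes by DDP."""
    dist.init_process_group(backend="nccl")      # NCCL backend for GPU training
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    return DDP(model, device_ids=[local_rank])
```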
References
Primary References
1. NepBERTa: https://nepberta.github.io/
2. Book: Build a Large Language Model (From Scratch) by Sebastian Raschka
3. GitHub: rasbt/LLMs-from-scratch
4. GitHub: karpathy/nanoGPT
Other Nepali Language Models
Technologies Used
Python · GPT2 · PyTorch · Transformers · Custom Tokenizer