
GPT2-Nepali

A GPT2 model pretrained on a 12.5GB Nepali dataset from the NepBERTa project, featuring custom tokenizer training and a comprehensive preprocessing pipeline.

Python · GPT2 · PyTorch · Transformers · Custom Tokenizer

Overview

GPT2-Nepali is a GPT2 model pretrained on the 12.5GB Nepali corpus released by the NepBERTa project. The project covers the full workflow: cleaning and pre-tokenizing the raw text, training a custom tokenizer for Nepali, and pretraining the GPT2 model and running inference with it.

Project Structure

📁 1_preprocessing

This directory contains scripts for preprocessing the NepBERTa dataset:

  • Data cleaning
  • Pre-tokenizing
  • Data preparation: splitting the pre-tokenized text into fixed-length training windows with context_length = stride = 512 (a minimal sketch follows this list)
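
The sketch below illustrates the data-preparation idea, assuming the corpus has already been pre-tokenized into a flat list of token IDs. The window size and stride of 512 come from the project; the dataset class name and its structure are hypothetical, not the project's exact code.

```python
# Minimal sketch of fixed-window data preparation (not the project's exact code).
# Assumes `token_ids` is a flat list of token IDs from the pre-tokenizing step.
import torch
from torch.utils.data import Dataset


class NepaliGPTDataset(Dataset):  # hypothetical name
    def __init__(self, token_ids, context_length=512, stride=512):
        self.inputs, self.targets = [], []
        # stride == context_length -> non-overlapping windows
        for start in range(0, len(token_ids) - context_length, stride):
            chunk = token_ids[start : start + context_length + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))   # model input
            self.targets.append(torch.tensor(chunk[1:]))   # next-token targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```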

📁 2_tokenizer

This directory includes tools and scripts for:

  • Training a custom tokenizer for the Nepali dataset (see the sketch after this list)
  • Visualizing and analyzing token distributions
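
For illustration, here is a minimal sketch of training a custom byte-level BPE tokenizer, assuming the Hugging Face tokenizers library is used; the corpus path, vocabulary size, and output directory are placeholders, not the project's actual settings.

```python
# Minimal tokenizer-training sketch (not the project's exact script).
import os
from tokenizers import ByteLevelBPETokenizer

corpus_files = ["data/cleaned/nepali_corpus.txt"]  # hypothetical path

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=50_000,                    # assumption: GPT2-sized vocabulary
    min_frequency=2,
    special_tokens=["<|endoftext|>"],     # GPT2-style end-of-text token
)

os.makedirs("2_tokenizer/nepali-bpe", exist_ok=True)
tokenizer.save_model("2_tokenizer/nepali-bpe")  # writes vocab.json and merges.txt
```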

📁 3_GPT2-Nepali

This directory contains the core code for:

  • Training the GPT2 model on the Nepali dataset
  • Running inference with the trained model (a hedged example follows the note below)

Note: Most of the code in this section is adapted from Sebastian Raschka's book Build a Large Language Model (From Scratch) and its companion GitHub repository, LLMs-from-scratch.
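
As a rough illustration of the inference step (not the project's exact script), the sketch below runs greedy decoding with a GPT2-style PyTorch model; the model and tokenizer objects, prompt, and generation length are placeholders.

```python
# Minimal greedy-decoding sketch. Assumes `model` is a trained GPT2-style
# PyTorch module returning logits of shape [batch, seq_len, vocab_size],
# and `tokenizer` encodes/decodes Nepali text to/from token IDs.
import torch


@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=50, context_length=512):
    model.eval()
    ids = torch.tensor([tokenizer.encode(prompt)])        # shape: [1, seq_len]
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_length:])           # crop to context window
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```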

Todo

  • Multi-GPU training (PyTorch DDP)
  • Train on a larger dataset and scale up the model size

Technologies Used

  • Python
  • GPT2
  • PyTorch
  • Transformers
  • Custom Tokenizer