
GPT2-Nepali

A GPT2 model pretrained on a 12.5GB Nepali dataset from the NepBERTa project, featuring custom tokenizer training and a comprehensive preprocessing pipeline.

Python · GPT2 · PyTorch · Transformers · Custom Tokenizer

Overview

GPT2-Nepali is a GPT2 model pretrained on the 12.5GB Nepali corpus released by the NepBERTa project. The project covers the full workflow: cleaning and pre-tokenizing the raw text, training a custom tokenizer for Nepali, and pretraining the GPT2 model and running inference with it.

Project Structure

📁 1_preprocessing

This directory contains scripts for preprocessing the NepBERTa dataset:

  • Data cleaning
  • Pre-tokenizing
  • Data preparation: splitting the pre-tokenized text into fixed-length training windows with context_length = stride = 512 (a minimal sketch follows this list)
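
The sketch below illustrates the data-preparation idea, assuming the corpus has already been pre-tokenized into a flat list of token IDs. The window size and stride of 512 come from the project; the dataset class name and its structure are hypothetical, not the project's exact code.

```python
# Minimal sketch of fixed-window data preparation (not the project's exact code).
# Assumes `token_ids` is a flat list of token IDs from the pre-tokenizing step.
import torch
from torch.utils.data import Dataset


class NepaliGPTDataset(Dataset):  # hypothetical name
    def __init__(self, token_ids, context_length=512, stride=512):
        self.inputs, self.targets = [], []
        # stride == context_length -> non-overlapping windows
        for start in range(0, len(token_ids) - context_length, stride):
            chunk = token_ids[start : start + context_length + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))   # model input
            self.targets.append(torch.tensor(chunk[1:]))   # next-token targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```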

📁 2_tokenizer

This directory includes tools and scripts for:

  • Training a custom tokenizer for the Nepali dataset (see the sketch after this list)
  • Visualizing and analyzing token distributions
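
For illustration, here is a minimal sketch of training a custom byte-level BPE tokenizer, assuming the Hugging Face tokenizers library is used; the corpus path, vocabulary size, and output directory are placeholders, not the project's actual settings.

```python
# Minimal tokenizer-training sketch (not the project's exact script).
import os
from tokenizers import ByteLevelBPETokenizer

corpus_files = ["data/cleaned/nepali_corpus.txt"]  # hypothetical path

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=50_000,                    # assumption: GPT2-sized vocabulary
    min_frequency=2,
    special_tokens=["<|endoftext|>"],     # GPT2-style end-of-text token
)

os.makedirs("2_tokenizer/nepali-bpe", exist_ok=True)
tokenizer.save_model("2_tokenizer/nepali-bpe")  # writes vocab.json and merges.txt
```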

📁 3_GPT2-Nepali

This directory contains the core code for:

  • Training the GPT2 model on the Nepali dataset
  • Running inference with the trained model (a hedged example follows the note below)

Note: Most of the code in this section is adapted from Sebastian Raschka's book Build a Large Language Model (From Scratch) and its companion GitHub repository, LLMs-from-scratch.
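
As a rough illustration of the inference step (not the project's exact script), the sketch below runs greedy decoding with a GPT2-style PyTorch model; the model and tokenizer objects, prompt, and generation length are placeholders.

```python
# Minimal greedy-decoding sketch. Assumes `model` is a trained GPT2-style
# PyTorch module returning logits of shape [batch, seq_len, vocab_size],
# and `tokenizer` encodes/decodes Nepali text to/from token IDs.
import torch


@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=50, context_length=512):
    model.eval()
    ids = torch.tensor([tokenizer.encode(prompt)])        # shape: [1, seq_len]
    for _ in range(max_new_tokens):
        logits = model(ids[:, -context_length:])           # crop to context window
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```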

Todo

  • Multi-GPU training (PyTorch DDP)
  • Train on a larger dataset and scale up the model size

Technologies Used

  • Python
  • GPT2
  • PyTorch
  • Transformers
  • Custom Tokenizer