A Short Intro

August 13, 2025 · 3 min read

I wanted to start my journey into the LLM-verse by going back to the basics and revisiting the key ideas from “Attention Is All You Need” and “Language Models are Unsupervised Multitask Learners”.

My plan for this first step was simple:

  1. Prepare a dataset for training
  2. Build a small GPT-style model from scratch
  3. Run a single training pass to see it in action

This felt like a natural starting point before diving into more advanced topics around large language models.


Data

Data is the lifeblood of machine learning. My goal was to mimic the process of training an actual LLM as closely as possible while staying within the limits of the hardware I have on hand.

After exploring Hugging Face datasets, I chose Cosmopedia-v2 — a synthetic dataset of 39 million examples spanning textbooks, blog posts, and more. It’s designed to be clean and diverse, which makes it a good first step before tackling noisier, real-world data. Hugging Face’s blog post covers the dataset in more detail.
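
For anyone following along, the dataset can be streamed straight from the Hub. Here is a minimal sketch, assuming Cosmopedia-v2 is still published as the “cosmopedia-v2” config of the HuggingFaceTB/smollm-corpus repo with a text column (check the dataset card for the current layout):

```python
from datasets import load_dataset

# Stream Cosmopedia-v2 so the full 39M examples never need to be downloaded up front.
dataset = load_dataset(
    "HuggingFaceTB/smollm-corpus",
    "cosmopedia-v2",
    split="train",
    streaming=True,
)

# Peek at the first document.
for example in dataset.take(1):
    print(example["text"][:200])
```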

For tokenization, I wanted something proven in production. I selected Google’s Gemma tokenizer, which has a large vocabulary and efficient subword tokenization.

Since I’m training on a single GPU, I pre-tokenized and stored the dataset to disk. This allows for high-throughput loading during training without having to tokenize on the fly — a simple step that makes a big difference in keeping the GPU fed with data.
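
Concretely, the pre-tokenization step just runs every document through the tokenizer and appends the IDs to a single flat file. Here is a rough sketch of the idea, reusing the streamed dataset from the snippet above; the checkpoint ID, document limit, and file name are placeholders rather than the exact pipeline from the repo:

```python
import numpy as np
from transformers import AutoTokenizer

# The Gemma tokenizer (gated on the Hub, so this requires accepting the license
# and logging in; the exact checkpoint ID here is an assumption).
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

def pretokenize(dataset, out_path: str, limit: int = 100_000) -> None:
    """Tokenize documents and append the IDs into one flat array on disk."""
    ids = []
    for i, example in enumerate(dataset):
        if i >= limit:
            break
        ids.extend(tokenizer.encode(example["text"]))
        ids.append(tokenizer.eos_token_id)  # document separator
    # Gemma's vocabulary is far larger than 65k, so uint16 would overflow.
    np.array(ids, dtype=np.uint32).tofile(out_path)

pretokenize(dataset, "train_tokens.bin")  # `dataset` is the streamed Cosmopedia-v2 from above
```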


Model

For the model, I implemented a decoder-only transformer — the same general architecture used by GPT models.

The main components are:

  • Token and Positional Embeddings — Convert token IDs to dense vectors and add positional information so the model knows the order of tokens.
  • Masked Multi-Head Attention — Each token attends only to previous tokens in the sequence, ensuring the model predicts the next token without “peeking ahead.”
  • Feed-Forward Layers — Process each position independently after attention.
  • Residual Connections and Layer Normalization — Stabilize training and help information flow across layers.

The model is small by modern standards — just a few layers and heads — but it’s built with the same structure and components you’d find in larger GPT models.
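
Stripped to its essentials, that stack of components fits in a few dozen lines of PyTorch. The sketch below is not the exact model from the repo; the sizes are placeholders, and I picked a pre-norm, GPT-2-style block with learned positional embeddings:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.size(1)
        # Causal mask: True marks positions a token is NOT allowed to attend to (the future).
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out              # residual connection around attention
        x = x + self.ff(self.ln2(x))  # residual connection around the feed-forward
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4,
                 n_layers: int = 4, max_len: int = 512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)     # learned positional embeddings
        self.blocks = nn.ModuleList([DecoderBlock(d_model, n_heads) for _ in range(n_layers)])
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        t = idx.size(1)
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)         # (batch, t, d_model)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))                 # logits over the vocabulary
```

The causal mask in DecoderBlock.forward is what makes this decoder-only: each position can see itself and everything before it, and nothing after.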


Training Pass

For training, I used mixed precision with torch.amp and GradScaler. This lets the GPU process most operations in half precision (FP16) while keeping critical operations in full precision (FP32), reducing memory usage and improving speed without significantly affecting accuracy.
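
The core pattern is only a few lines. Here is a minimal, self-contained sketch with a stand-in model and random data (the real loop has more going on, but the autocast and GradScaler mechanics are the same):

```python
import torch

device = "cuda"
model = torch.nn.Linear(128, 128).to(device)   # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.amp.GradScaler(device)

x = torch.randn(8, 128, device=device)
optimizer.zero_grad(set_to_none=True)

# The forward pass and loss run in FP16 where it is safe to do so.
with torch.autocast(device_type=device, dtype=torch.float16):
    loss = model(x).pow(2).mean()

# GradScaler scales the loss so small FP16 gradients do not underflow,
# then unscales them before the optimizer step.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```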

The training loop is straightforward:

  1. Load pre-tokenized batches from disk
  2. Forward pass through the model
  3. Compute cross-entropy loss against the next-token targets
  4. Backpropagate and update weights with AdamW
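
Put together, a sketch of those four steps might look like the following. It assumes the train_tokens.bin file and the TinyGPT class from the earlier snippets, and the batch size, learning rate, and step count are placeholders rather than the values I actually trained with:

```python
import numpy as np
import torch
import torch.nn.functional as F

device = "cuda"
block_size, batch_size = 512, 16

# Step 1: memory-map the pre-tokenized file so each batch is a cheap random slice.
tokens = np.memmap("train_tokens.bin", dtype=np.uint32, mode="r")

def get_batch():
    starts = torch.randint(len(tokens) - block_size - 1, (batch_size,)).tolist()
    x = torch.stack([torch.from_numpy(tokens[i : i + block_size].astype(np.int64)) for i in starts])
    # Targets are the same tokens shifted one position left: the "next token" at each step.
    y = torch.stack([torch.from_numpy(tokens[i + 1 : i + 1 + block_size].astype(np.int64)) for i in starts])
    return x.to(device), y.to(device)

model = TinyGPT(vocab_size=256_000).to(device)   # roughly Gemma's vocab size; use len(tokenizer) in practice
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scaler = torch.amp.GradScaler(device)

for step in range(1_000):
    x, y = get_batch()
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=torch.float16):   # Step 2: forward pass
        logits = model(x)
        # Step 3: cross-entropy between the predicted logits and the next-token targets.
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    scaler.scale(loss).backward()   # Step 4: backpropagate...
    scaler.step(optimizer)          # ...and update the weights with AdamW
    scaler.update()
```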

I ran a single epoch over the dataset to get a sense of the computational demands for further experiments. Even at this small scale, it’s clear that training LLMs — or even small GPT-style models — quickly becomes compute-bound.


Final Thoughts

This isn’t production code, and it’s not research work — it’s a starting point.

The goal here was to set up a reproducible environment for experimenting with transformer architectures, data pipelines, and training methods. From here, I can explore scaling up, experimenting with different attention mechanisms, or trying out alternative tokenization strategies.

It feels good to have taken the first step. The concepts from “Attention Is All You Need” are now running in code I wrote myself, and that’s a solid foundation for the rest of this journey.

The code can be found on GitHub.