
Tiny Karpathy: Pretraining a Character-Level GPT on Andrej Karpathy's Deep Learning Series

Overview

Andrej Karpathy’s YouTube series is hands down the best free resource out there for learning deep learning (3Blue1Brown comes in as a close second). In one of the videos in the series, Karpathy walks through pre-training a GPT model on Tiny Shakespeare, which is awesome, but I wanted to take things up a notch. What if I trained a tiny GPT model on Karpathy’s own deep learning lectures? Thus, the birth of what I’m calling Tiny Karpathy!

Building the Dataset

The dataset-building process was as simple as it gets. I copied all the transcripts from Karpathy’s deep learning series into a text file, did some light formatting—fixed up the paragraphs, added some punctuation—and that was it.

import torch
import torch.nn as nn
from torch.nn import functional as F

# Read the full transcript file
with open('/kaggle/input/karpathy/tiny.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()

Since this is a character-level model, I had to map each character to a unique integer and vice versa.

chars = sorted(list(set(text)))

# Encoder and Decoder

stoi = {s:i for i,s in enumerate(chars)}
itos = {i:s for i,s in enumerate(chars)}
encode = lambda txt: [stoi[c] for c in txt]
decode = lambda idx: "".join([itos[i] for i in idx])
# Data Loader
def get_batch(split):
    data = training_data if split == 'train' else val_data

    # Pick batch_size random starting positions, then slice out contexts and shifted targets
    idx = torch.randint(len(data)-block_size, (batch_size,))
    xb = torch.stack([data[i:i+block_size] for i in idx])
    yb = torch.stack([data[i+1:i+block_size+1] for i in idx])
    return xb.to(device), yb.to(device)
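
One thing the snippets above don't show: get_batch expects training_data and val_data as long tensors of encoded characters. A minimal sketch of that step, assuming the usual 90/10 train/validation split from Karpathy's video:

# Encode the full text and split it (assumed 90/10 split, as in Karpathy's video)
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
training_data = data[:n]
val_data = data[n:]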
  

Setting Up the Transformer Model

I followed Karpathy’s Let's build GPT: from scratch, in code, spelled out video almost to the letter. The only real modification? I trained the model on Kaggle’s dual-GPU setup, which meant tweaking the code to use both GPUs: wrapping the model in DataParallel and averaging the per-GPU losses during training.

# Hyper-parameters block
vocab_size = len(chars) # Number of unique characters (output classes)
batch_size = 64 # No. of sequences per batch
block_size = 256 # No. of tokens in each context
device = 'cuda' if torch.cuda.is_available() else 'cpu'

n_embd = 384 # No. of embedding dimensions for each token
learning_rate = 3e-4 # Optimizer step size

max_iters = 5000 # Total no. of training iterations
eval_iters = 200 # No. of batches used to estimate the losses
eval_interval = 500 # No. of iters between calls to the loss estimator

n_head = 6 # No. of heads in multi-head attention
n_layer = 6 # No. of transformer blocks
dropout = 0.2 # Regularization to avoid overfitting

The model itself follows a standard transformer structure: multi-head self-attention, layer normalization, and feedforward layers.

# Transformers implementation
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)


    def forward(self, x):
        B,T,C = x.shape
        
        q = self.query(x) # B,T,head_size
        k = self.key(x) # B,T,head_size

        # Scaled dot-product attention with a causal mask
        affinity = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # B,T,head_size @ B,head_size,T -> B,T,T
        wei = affinity.masked_fill(self.tril[:T,:T] == 0, float('-inf')) # Mask out future positions
        wei = F.softmax(wei, dim=-1)

        wei = self.dropout(wei)

        v = self.value(x) # B,T,head_size
        out = wei @ v # B,T,T @ B,T,head_size -> B,T,head_size
        return out

class MultiHead(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_head)])
        self.proj = nn.Linear(head_size * n_head, n_embd)
        self.dropout = nn.Dropout(dropout)


    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedForward(nn.Module):
    # MLP layer
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4*n_embd),
            nn.ReLU(),
            nn.Linear(4*n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

# Block that calls all the above classes
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd//n_head
        self.sa = MultiHead(n_head, head_size)
        self.ff = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x)) # LayerNorm before passing into attention
        x = x + self.ff(self.ln2(x)) # Residual connection with the addition
        return x
# GPT Implementation
class LanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, n_embd) # Each vocab will have a row of n_embd dim. vector
        self.pos_embedding_table = nn.Embedding(block_size, n_embd) # Each index is embedded based on the context
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.ln_head = nn.Linear(n_embd, vocab_size) # Converting the embeddings back to vocab_size logits at the end
        
    def forward(self, idx, targets=None):
        B,T = idx.shape
        tok_embd = self.embedding_table(idx) # Token embeddings: (B,T) -> (B,T,C)
        pos_embd = self.pos_embedding_table(torch.arange(T, device=device)) # Positional embeddings: (T,C)
        x = tok_embd + pos_embd
        x = self.blocks(x)
        x = self.ln_f(x)
        
        logits = self.ln_head(x) # (B,T,C) @ (C,vocab_size) -> (B,T,vocab_size)
        if targets is None:
            loss = None
        else:
            B,T,C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_tokens=10):
        for _ in range(max_tokens):
            idx_cond = idx[:,-block_size:] # Crop the context to the last block_size tokens
            logits, loss = self(idx_cond) # B,T,C
            logits = logits[:,-1,:] # Keep only the logits for the last time step
            probs = F.softmax(logits, dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1) # Sample the next character
            idx = torch.cat((idx, next_idx), dim=-1)

        return idx

# Initializing the model
model = LanguageModel()

# Condition to use more than 1 GPU if available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = nn.DataParallel(model)
model = model.to(device)

# Printing the number of parameters
print(sum(p.numel() for p in model.parameters())/1e6, 'M parameters')

# Adam as the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
Using 2 GPUs!
10.808154 M parameters

Training Time!

With Kaggle’s dual-GPU setup, training was relatively quick. It took around 15 minutes to reach iteration 2500 of 5000, at which point the training loss had dropped below 1 and the gap between training and validation loss was starting to widen. Rather than risk overfitting, I stopped the training early.
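
The loop below leans on an estimate_loss() helper that isn't shown in the snippets above: it averages the loss over eval_iters batches from each split with the model in eval mode. A minimal sketch, following the version in Karpathy's video plus the extra .mean() for DataParallel:

@torch.no_grad()
def estimate_loss():
    # Average the loss over eval_iters random batches for each split
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits, loss = model(xb, yb)
            losses[k] = loss.mean().item() # .mean() collapses the per-GPU losses from DataParallel
        out[split] = losses.mean()
    model.train()
    return out

And here's the training loop itself: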

# Training loop
for itr in range(max_iters):

    if itr % eval_interval == 0 or itr == max_iters - 1:
        loss = estimate_loss()
        print(f"Iteration: {itr} Training Loss: {loss.get('train'):.4f}, Validation Loss: {loss.get('val'):.4f}")
        
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    loss = loss.mean() # DataParallel returns one loss per GPU; average them into a scalar
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# Generate from the underlying model (DataParallel keeps it under .module)
context = model.module.generate(torch.zeros((1,1), dtype=torch.long, device=device), max_tokens=100)[0].tolist()
print(decode(context))
Iteration: 0 Training Loss: 4.6603, Validation Loss: 4.6615
Iteration: 500 Training Loss: 1.5698, Validation Loss: 1.6308
Iteration: 1000 Training Loss: 1.2064, Validation Loss: 1.3264
Iteration: 1500 Training Loss: 1.0940, Validation Loss: 1.2387
Iteration: 2000 Training Loss: 1.0233, Validation Loss: 1.1922
Iteration: 2500 Training Loss: 0.9722, Validation Loss: 1.1606

Generating Some Karpathy-Style Text

After training, I let the model loose and generated 10,000 tokens of AI-Karpathy. Here’s a small snippet; the full file is on GitHub.

Okay so the tokens otherwise or is this little bit and exactly talling of the matrix multiplic plus and on humant this is three and the actual transformers and we make the one one-dimensionalizative memorize this is one of the gpt2 term dires thrivative by a single like prefectively single node optimize these lines and that's now how many the element of the way that is powed following it seefitely and pytorch we're doing that we ent so the um problems will be low by d so that's then we have because of the chain here before we go into the projectivations so on.
guessive I can've train into previous convolution I can over Yask linearity to think the letters we can looked some supeound of like a just find the loss B and it's a too instead torch mlpful brach with and so now we see the inkild the that instead we would have to be a tund then then than you go to have to implemend in my the bring case on if you need to an text of tensor in a biable how default are deceively so this madded.

The results? Well, obviously the text doesn’t make any real sense. But the model has clearly picked up the structure of the lectures, and it’s definitely better than I expected it to be.
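
The longer sample comes from the same generate call with a larger max_tokens. Something along these lines writes it out (a sketch; the output filename is my own placeholder):

# Sketch: generate a longer sample and save it to disk (filename is a placeholder)
start = torch.zeros((1, 1), dtype=torch.long, device=device) # Start from a single zero token
sample = model.module.generate(start, max_tokens=10000)[0].tolist()
with open('tiny_karpathy_sample.txt', 'w', encoding='utf-8') as f:
    f.write(decode(sample))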

Final Thoughts

This whole experience was a blast. Training GPT on Karpathy’s own lectures was a great way to reinforce what I had learned from him. The results, while not perfect (it’s still character-level, after all), were fun to see in action.

What’s next? I don’t plan to improve this model anytime soon.

GitHub Link: https://github.com/bimalpaudels/tiny-karpathy

Last updated: 3/26/2025
Tags: Deep Learning, GPT, Andrej Karpathy, Transformers, NLP, AI, Kaggle, PyTorch