Tiny Karpathy: Pretraining a Character-Level GPT on Andrej Karpathy's Deep Learning Series
Overview
Andrej Karpathy’s YouTube series is hands down the best free resource out there for learning deep learning (3Blue1Brown comes in as a close second). In one of the videos, Karpathy walks through pre-training a GPT model on Tiny Shakespeare, which is awesome, but I wanted to take things up a notch. What if I trained a tiny GPT model on Karpathy’s own deep learning lectures? Thus, the birth of what I’m calling Tiny Karpathy!
Building the Dataset
The dataset-building process was as simple as it gets. I copied all the transcripts from Karpathy’s deep learning series into a text file, did some light formatting—fixed up the paragraphs, added some punctuation—and that was it.
with open('/kaggle/input/karpathy/tiny.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
Since this is a character-level model, I had to map each character to a unique integer and vice versa.
chars = sorted(list(set(text)))
# Encoder and Decoder
stoi = {s:i for i,s in enumerate(chars)}
itos = {i:s for i,s in enumerate(chars)}
encode = lambda txt: [stoi[c] for c in txt]
decode = lambda idx: "".join([itos[i] for i in idx])
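One thing the snippet above skips is splitting the encoded text into training_data and val_data, which get_batch below expects. A minimal sketch, assuming the same 90/10 split Karpathy uses in the video:

import torch

# Encode the full text as a tensor of integers and split it 90/10 into train and validation sets
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
training_data = data[:n]
val_data = data[n:]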
# Data Loader
def get_batch(split):
    data = training_data if split == 'train' else val_data
    idx = torch.randint(len(data)-block_size, (batch_size,))
    xb = torch.stack([data[i:i+block_size] for i in idx])
    yb = torch.stack([data[i+1:i+block_size+1] for i in idx])
    return xb.to(device), yb.to(device)
Setting Up the Transformer Model
I followed Karpathy’s "Let's build GPT: from scratch, in code, spelled out" video almost to the letter. The only real modification? I trained the model on Kaggle’s dual-GPU setup, which meant tweaking the code to fully utilize both GPUs: wrapping the model in DataParallel and taking the mean of the per-GPU losses.
# Hyper-parameters block
vocab_size = len(chars) # Number of unique characters (classes)
batch_size = 64 # No. of sequences processed in parallel per batch
block_size = 256 # No. of tokens in each context
device = 'cuda' if torch.cuda.is_available() else 'cpu'
n_embd = 384 # No. of embedding dimensions for each token
learning_rate = 3e-4 # Step size for the optimizer
max_iters = 5000 # Total no. of training iterations
eval_iters = 200 # No. of batches used to estimate the losses
eval_interval = 500 # No. of iterations between calls to the estimator function
n_head = 6 # No. of heads in multi-head attention
n_layer = 6 # No. of transformer blocks
dropout = 0.2 # Regularization to avoid overfitting
The model itself follows a standard transformer structure: multi-head self-attention, layer normalization, and feedforward layers.
# Transformers implementation
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        q = self.query(x) # B,T,head_size
        k = self.key(x)   # B,T,head_size
        # Scaled dot-product attention; the 1/sqrt(head_size) factor keeps the softmax from saturating
        affinity = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # B,T,head_size @ B,head_size,T -> B,T,T
        wei = affinity.masked_fill(self.tril[:T,:T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x) # B,T,head_size
        out = wei @ v     # B,T,T @ B,T,head_size -> B,T,head_size
        return out
class MultiHead(nn.Module):
    def __init__(self, n_head, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_head)])
        self.proj = nn.Linear(head_size * n_head, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
class FeedForward(nn.Module):
    # MLP layer
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4*n_embd),
            nn.ReLU(),
            nn.Linear(4*n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
# Block that calls all the above classes
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd//n_head
        self.sa = MultiHead(n_head, head_size)
        self.ff = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x)) # LayerNorm before passing into attention
        x = x + self.ff(self.ln2(x)) # Residual connection with the addition
        return x
# GPT Implementation
class LanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, n_embd) # Each vocab item gets a row of n_embd dim. vector
        self.pos_embedding_table = nn.Embedding(block_size, n_embd) # Each position in the context gets its own embedding
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.ln_head = nn.Linear(n_embd, vocab_size) # Converting the embeddings back to vocab_size at the end

    def forward(self, idx, targets=None):
        B,T = idx.shape
        tok_embd = self.embedding_table(idx) # (B,T) indices embedded to (B,T,C)
        pos_embd = self.pos_embedding_table(torch.arange(T, device=idx.device)) # idx.device keeps each DataParallel replica on its own GPU
        x = tok_embd + pos_embd
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.ln_head(x) # (B,T,C) @ (C,vocab_size) -> (B,T,vocab_size)
        if targets is None:
            loss = None
        else:
            B,T,C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_tokens=10):
        for _ in range(max_tokens):
            idx_cond = idx[:,-block_size:] # Crop the context to the last block_size tokens
            logits, loss = self(idx_cond) # B,T,C
            logits = logits[:,-1,:] # Keep only the last time step
            probs = F.softmax(logits, dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_idx), dim=-1)
        return idx
# Initializing the model
model = LanguageModel()

# Condition to use more than 1 GPU if available
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    model = nn.DataParallel(model)
model = model.to(device)

# Printing the number of parameters
print(sum(p.numel() for p in model.parameters())/1e6, 'M parameters')

# Adam as the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
Using 2 GPUs!
10.808154 M parameters
Training Time!
With Kaggle’s dual-GPU setup, training was relatively quick. It took around 15 minutes to reach iteration 2500 out of 5000, at which point the training loss dropped below 1 and a clear gap was opening up against the validation loss. Rather than risk overfitting, I stopped the training early.
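The loop below calls an estimate_loss() helper that I haven’t shown; here is a minimal sketch of it, assuming the usual pattern from Karpathy’s video of averaging the loss over eval_iters random batches for each split (the .mean() also averages the per-GPU losses that DataParallel returns):

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits, loss = model(xb, yb)
            losses[k] = loss.mean().item() # average the per-GPU losses, then record the scalar
        out[split] = losses.mean()
    model.train()
    return out

With that in place, here’s the training loop: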
# Training loop
for itr in range(max_iters):
    if itr % eval_interval == 0 or itr == max_iters - 1:
        loss = estimate_loss()
        print(f"Iteration: {itr} Training Loss: {loss.get('train'):.4f}, Validation Loss: {loss.get('val'):.4f}")
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    loss = loss.mean() # DataParallel returns one loss per GPU, so average them
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# Generate a sample from the trained model (model.module because of the DataParallel wrapper)
context = model.module.generate(torch.zeros((1,1), dtype=torch.long, device=device), max_tokens=100)[0].tolist()
print(decode(context))
Iteration: 0 Training Loss: 4.6603, Validation Loss: 4.6615
Iteration: 500 Training Loss: 1.5698, Validation Loss: 1.6308
Iteration: 1000 Training Loss: 1.2064, Validation Loss: 1.3264
Iteration: 1500 Training Loss: 1.0940, Validation Loss: 1.2387
Iteration: 2000 Training Loss: 1.0233, Validation Loss: 1.1922
Iteration: 2500 Training Loss: 0.9722, Validation Loss: 1.1606
Generating Some Karpathy-Style Text
After training, I let the model loose and generated 10,000 tokens of AI-Karpathy; the full file is on GitHub.
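The generation call looked roughly like this sketch (the output filename is just a placeholder):

# Generate 10,000 tokens from an empty (zero) context and write them to a file
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated = model.module.generate(context, max_tokens=10000)[0].tolist()

with open('tiny_karpathy.txt', 'w', encoding='utf-8') as f: # placeholder filename
    f.write(decode(generated))

Here’s a small snippet of the output: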
Okay so the tokens otherwise or is this little bit and exactly talling of the matrix multiplic plus and on humant this is three and the actual transformers and we make the one one-dimensionalizative memorize this is one of the gpt2 term dires thrivative by a single like prefectively single node optimize these lines and that's now how many the element of the way that is powed following it seefitely and pytorch we're doing that we ent so the um problems will be low by d so that's then we have because of the chain here before we go into the projectivations so on.
guessive I can've train into previous convolution I can over Yask linearity to think the letters we can looked some supeound of like a just find the loss B and it's a too instead torch mlpful brach with and so now we see the inkild the that instead we would have to be a tund then then than you go to have to implemend in my the bring case on if you need to an text of tensor in a biable how default are deceively so this madded.
The results? Well, obviously the text doesn’t make sense at all, but the model has learnt the structure of the lectures pretty well and is definitely better than I expected it to be.
Final Thoughts
This whole experience was a blast. Training GPT on Karpathy’s own lectures was a great way to reinforce what I had learned from him. The results, while not perfect (it’s still character-level, after all), were fun to see in action.
What’s next? I don’t plan to improve this model anytime soon.
GitHub Link: https://github.com/bimalpaudels/tiny-karpathy