Large Language Models (LLMs)

The complete technical guide to ChatGPT, Claude, and the AI revolution

Part 1: What Are Large Language Models?

Large Language Models (LLMs) are AI systems trained on massive amounts of text data to understand and generate human-like text. They're called "large" because they have billions of parameters (GPT-3: 175B, GPT-4: ~1.7T estimated).

The Scale of Modern LLMs

  • GPT-4: ~1.7T parameters (estimated)
  • Claude 3: parameter count undisclosed, likely in the 100B-500B range
  • LLaMA 2: 70B parameters (largest open-weight model)

How LLMs Differ from Traditional NLP

Traditional NLP

  • Task-specific models
  • Rule-based systems
  • Limited context understanding
  • Requires structured data
  • Narrow capabilities

Modern LLMs

  • General-purpose models
  • Pattern-based learning
  • Deep context awareness
  • Works with any text
  • Emergent abilities

The Key Innovation: Next Token Prediction

LLMs are trained on a deceptively simple task: predict the next token (word/subword) given all previous tokens.

Input:  "The capital of France is"
Prediction: "Paris" (probability: 0.95)
            "Lyon"  (probability: 0.02)
            "the"   (probability: 0.01)
            ...

Training objective: Maximize P(next_token | previous_tokens)

This simple objective, when scaled to billions of parameters and trillions of tokens, creates models that can write code, solve math problems, and engage in complex reasoning.
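To see this objective in action, here is a minimal sketch that inspects next-token probabilities using a small open model (GPT-2 via Hugging Face transformers; the model choice and the exact probabilities you will see are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a small causal LM purely for illustration
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # [1, seq_len, vocab_size]
probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode([int(idx)])!r}: {p.item():.3f}")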

Part 2: The Transformer Revolution

The Transformer architecture, introduced in 2017's "Attention Is All You Need" paper, is the foundation of all modern LLMs. It replaced RNNs and LSTMs with a more efficient parallel architecture.

Transformer Architecture

[Diagram: Transformer architecture. Input embeddings with positional encoding feed a 6-layer encoder stack (multi-head attention + feed-forward) and a 6-layer decoder stack (masked attention, cross-attention, feed-forward) that produces the output.]

Key Components Explained

🔤 Embeddings

Converts tokens (words/subwords) into high-dimensional vectors that capture semantic meaning.

"cat" → [0.2, -0.5, 0.8, ..., 0.3]  # 768-dimensional vector
"dog" → [0.3, -0.4, 0.7, ..., 0.4]  # Similar to cat (both animals)
"car" → [-0.8, 0.2, -0.1, ..., 0.9] # Very different (vehicle)

📍 Positional Encoding

Since Transformers process all positions in parallel, they need position information injected.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

# This creates unique position patterns the model can learn
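Those two formulas translate almost directly into code; here is a small NumPy sketch that builds the full position-by-dimension matrix:

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build the sin/cos position matrix from the formulas above."""
    pos = np.arange(max_len)[:, None]                 # [max_len, 1]
    i = np.arange(d_model // 2)[None, :]              # [1, d_model/2]
    angles = pos / np.power(10000, 2 * i / d_model)   # one frequency per pair of dims

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cos
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=768)
print(pe.shape)  # (128, 768): a unique, smoothly varying pattern per position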

🎯 Multi-Head Attention

The core innovation - allows the model to attend to different positions simultaneously.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads        # 64 dims per head

        self.W_q = nn.Linear(d_model, d_model)  # Query projection
        self.W_k = nn.Linear(d_model, d_model)  # Key projection
        self.W_v = nn.Linear(d_model, d_model)  # Value projection
        self.W_o = nn.Linear(d_model, d_model)  # Output projection

    def forward(self, x):
        B, T, _ = x.shape
        # Project, then split into heads: [B, n_heads, T, d_k]
        Q = self.W_q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention per head
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        attention = F.softmax(scores, dim=-1)
        output = attention @ V               # [B, n_heads, T, d_k]

        # Re-combine heads and project back to d_model
        output = output.transpose(1, 2).reshape(B, T, -1)
        return self.W_o(output)

Part 3: Attention Is All You Need

The attention mechanism allows models to focus on relevant parts of the input when generating each output token. This is what gives LLMs their incredible context understanding.

Self-Attention in Action

Consider the sentence: "The cat sat on the mat because it was tired"

The model needs to understand what "it" refers to. Through attention, it learns:

it → cat: 0.85
it → mat: 0.10
it → sat: 0.05

The Attention Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:
- Q: Query matrix (what information am I looking for?)
- K: Key matrix (what information do I have?)
- V: Value matrix (what is the actual content?)
- d_k: Dimension of key vectors (for scaling)

Step by step:
1. Compute attention scores: QK^T
2. Scale by √d_k to prevent vanishing gradients
3. Apply softmax to get probabilities
4. Multiply by V to get weighted values
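The four steps map one-to-one onto a few lines of PyTorch; the tensors below are random placeholders just to show the shapes and that each attention row sums to 1:

import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # steps 1-2: QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)             # step 3: probabilities
    return weights @ V, weights                     # step 4: weighted values

Q, K, V = torch.randn(3, 4), torch.randn(3, 4), torch.randn(3, 4)  # 3 tokens, d_k=4
out, attn = attention(Q, K, V)
print(attn.sum(dim=-1))  # tensor([1., 1., 1.]): each query's weights sum to 1
print(out.shape)         # torch.Size([3, 4])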

Visualizing Attention Patterns

💡 Key Insight

Plotting the full attention matrix as a heatmap makes these patterns visible. In a sentence that mentions a "student", an "assignment", and a later "it", the row for "it" places most of its weight on "student" and "assignment": the model has learned that pronouns refer to nouns, and context suggests "it" most likely refers to the assignment.

Part 4: Tokenization & Embeddings

Before text can be processed by an LLM, it must be converted into tokens - the fundamental units that the model understands.

Common Tokenization Methods

Word-Level

Split by spaces/punctuation

"Hello world" →
["Hello", "world"]

❌ Huge vocabulary needed

Character-Level

Individual characters

"Hello" →
["H","e","l","l","o"]

❌ Very long sequences

BPE (Subword)

Smart subword units

"unbelievable" →
["un","believ","able"]

✅ Best of both worlds

Byte-Pair Encoding (BPE) in Action

import tiktoken  # OpenAI's tokenizer

# Initialize GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

# Example tokenization
text = "The quick brown fox jumps over the lazy dog"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")

# Output:
# Tokens: [791, 4062, 14198, 39935, 35308, 927, 279, 16053, 5679]
# Decoded: ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog']

# Interesting cases
examples = [
    "artificial intelligence",  # Common phrase
    "🤖",                       # Emoji
    "GPT-4",                    # Model name
    "antidisestablishmentarianism"  # Long word
]

for text in examples:
    tokens = enc.encode(text)
    print(f"\n'{text}'")
    print(f"  Tokens: {len(tokens)}")
    print(f"  Breakdown: {[enc.decode([t]) for t in tokens]}")

# Output:
# 'artificial intelligence'
#   Tokens: 2
#   Breakdown: ['artificial', ' intelligence']
#
# '🤖'
#   Tokens: 1
#   Breakdown: ['🤖']
#
# 'GPT-4'
#   Tokens: 3
#   Breakdown: ['G', 'PT', '-4']
#
# 'antidisestablishmentarianism'
#   Tokens: 6
#   Breakdown: ['ant', 'idis', 'establish', 'ment', 'arian', 'ism']

Token Embeddings: From IDs to Meaning

Each token ID is mapped to a high-dimensional vector that encodes semantic meaning:

import math
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model
        
    def forward(self, token_ids):
        # Scale embeddings by sqrt(d_model)
        return self.embedding(token_ids) * math.sqrt(self.d_model)

# Example usage
embed = TokenEmbedding()
tokens = torch.tensor([791, 4062, 14198])  # "The quick brown"
embeddings = embed(tokens)
print(embeddings.shape)  # [3, 768] - 3 tokens, 768 dimensions each

# These 768-dimensional vectors encode:
# - Semantic meaning (cat ≈ dog)
# - Syntactic role (noun, verb, etc.)
# - Contextual relationships

📊 Token Economics

Most LLMs charge by tokens. Understanding tokenization helps optimize costs:

  • English: ~1.3 tokens per word on average
  • Code: often more efficient (keywords are single tokens)
  • Chinese/Japanese: less efficient (~2-3 tokens per character)
  • Numbers: each digit is often a separate token
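A simple habit is to count tokens before sending a prompt; the sketch below does that with tiktoken, using a placeholder price rather than any real rate:

import tiktoken

def estimate_cost(text, model="gpt-4", usd_per_1k_tokens=0.03):
    """Count tokens and estimate cost (the price here is a placeholder)."""
    enc = tiktoken.encoding_for_model(model)
    n_tokens = len(enc.encode(text))
    return n_tokens, n_tokens * usd_per_1k_tokens / 1000

n, cost = estimate_cost("Summarize the attached meeting notes in three bullets.")
print(f"{n} tokens, ~${cost:.5f}")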

Part 5: How LLMs Are Trained

Training an LLM is a massive undertaking involving petabytes of text, thousands of GPUs, and months of computation. Here's how it works.

The Three-Stage Training Process

Stage 1: Pre-training (Foundation)

Train on massive unlabeled text datasets using next-token prediction.

Dataset Examples
  • Common Crawl (petabytes of web data)
  • Wikipedia (all languages)
  • Books (Project Gutenberg, etc.)
  • Academic papers (arXiv)
  • Code (GitHub)

Training Stats
  • Duration: 3-6 months
  • Cost: $5-100M+
  • GPUs: 1000s of A100s/H100s
  • Tokens: 1-10 trillion
  • Parameters: 7B-1T+

Stage 2: Supervised Fine-Tuning (SFT)

Train on high-quality instruction-response pairs to make the model helpful.

# Example training data
{
    "instruction": "Explain quantum computing to a 10-year-old",
    "response": "Imagine if your computer could try many solutions at once, like having multiple helpers working on a puzzle simultaneously. That's quantum computing..."
}

# Fine-tuning objective
loss = CrossEntropy(model_output, expected_response)
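One detail the loss line glosses over: in practice the cross-entropy is usually computed only on the response tokens, with the prompt positions masked out via the ignore index. A minimal sketch of how those labels might be built (the token IDs are placeholders):

import torch

def build_sft_labels(prompt_ids, response_ids, ignore_index=-100):
    """Concatenate prompt and response; only response tokens contribute to the loss."""
    input_ids = torch.cat([prompt_ids, response_ids])
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = ignore_index   # CrossEntropyLoss skips -100 by default
    return input_ids, labels

prompt_ids = torch.tensor([101, 2054, 2003, 1029])   # placeholder IDs for the instruction
response_ids = torch.tensor([3000, 3001, 102])       # placeholder IDs for the response
input_ids, labels = build_sft_labels(prompt_ids, response_ids)
print(labels)  # tensor([-100, -100, -100, -100, 3000, 3001,  102])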

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

Use human preferences to align the model with human values and intentions.

  1. Collect comparison data: humans rank model outputs
  2. Train a reward model to predict human preferences
  3. Use PPO (Proximal Policy Optimization) to optimize for reward
  4. Balance reward maximization against staying close to the original language model

The Training Loop

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.nn.parallel import DistributedDataParallel as DDP

class LLMTrainer:
    def __init__(self, model, config):
        self.model = DDP(model)  # Distributed training across GPUs
        self.optimizer = AdamW(
            model.parameters(),
            lr=config.learning_rate,
            betas=(0.9, 0.95),
            weight_decay=0.1
        )
        self.scheduler = CosineAnnealingLR(self.optimizer, T_max=config.total_steps)
        
    def train_step(self, batch):
        # Forward pass
        input_ids = batch['input_ids']  # [batch_size, seq_length]
        attention_mask = batch['attention_mask']
        
        # Shift targets for next-token prediction
        labels = input_ids[:, 1:].contiguous()
        input_ids = input_ids[:, :-1].contiguous()
        
        # Model forward
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask[:, :-1]
        )
        
        # Calculate loss (cross-entropy)
        logits = outputs.logits  # [batch_size, seq_length-1, vocab_size]
        loss = nn.CrossEntropyLoss()(
            logits.view(-1, logits.size(-1)),
            labels.view(-1)
        )
        
        # Backward pass
        loss.backward()
        
        # Gradient clipping to prevent explosion
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        
        # Optimizer step
        self.optimizer.step()
        self.optimizer.zero_grad()
        self.scheduler.step()
        
        return loss.item()

# Training configuration for a 7B model
config = {
    'batch_size': 4096,          # Large batch for stability
    'sequence_length': 2048,      # Context window
    'learning_rate': 6e-4,        # Peak learning rate
    'warmup_steps': 2000,         # Learning rate warmup
    'total_steps': 300000,        # Total training steps
    'gradient_accumulation': 16,  # Micro-batches accumulated per optimizer step
    'num_gpus': 128,             # A100 80GB GPUs
    'precision': 'bfloat16',     # Mixed precision training
}

⚠️ Training Challenges

  • Loss spikes: Model can suddenly diverge and need restart
  • Hardware failures: With 1000s of GPUs, failures are constant
  • Data quality: Bad data can poison the entire model
  • Checkpoint management: Each checkpoint can be 100s of GBs

Part 6: Capabilities & Limitations

LLMs exhibit remarkable capabilities that emerge with scale, but they also have fundamental limitations that are important to understand.

Emergent Capabilities

✅ What LLMs Can Do

  • In-context learning: Learn from examples without training
  • Chain-of-thought: Break down complex reasoning
  • Code generation: Write functional programs
  • Translation: Between 100+ languages
  • Summarization: Condense long documents
  • Creative writing: Stories, poems, scripts
  • Question answering: Across many domains
  • Math reasoning: Solve word problems

❌ Current Limitations

  • Hallucinations: Makes up plausible-sounding facts
  • No true understanding: Pattern matching, not reasoning
  • Knowledge cutoff: No real-time information
  • Context limits: Can't process infinite text
  • Consistency: May contradict itself
  • Math accuracy: Struggles with complex calculations
  • Causal reasoning: Poor at cause-effect chains
  • Self-awareness: No true consciousness

The Scaling Laws

Model capabilities follow predictable scaling laws discovered by OpenAI and others:

Loss = L(N, D, C)

Where:
- N: Number of parameters
- D: Dataset size (tokens)
- C: Compute budget (FLOPs)

Key findings:
1. Loss decreases as a power law with each factor
2. Optimal N ∝ C^0.73 (parameters scale with compute)
3. Optimal D ∝ C^0.27 (data scales with compute)
4. Bigger models are more sample-efficient

Emergent abilities appear at certain thresholds:
- 1B params: Basic language understanding
- 10B params: Simple reasoning, translation
- 100B params: Complex reasoning, code generation
- 1T+ params: Near-human performance on many tasks
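To get a feel for what those exponents imply, the toy sketch below splits a compute budget using N ∝ C^0.73 and D ∝ C^0.27; the constants are arbitrary, so only the relative growth rates are meaningful:

def optimal_allocation(compute_flops, k_params=1.0, k_tokens=1.0):
    """Toy compute-optimal split; real fits use calibrated coefficients."""
    n_params = k_params * compute_flops ** 0.73
    n_tokens = k_tokens * compute_flops ** 0.27
    return n_params, n_tokens

for c in (1e21, 1e22, 1e23):
    n, d = optimal_allocation(c)
    print(f"C={c:.0e}  ->  params ~ {n:.2e} (relative), tokens ~ {d:.2e} (relative)")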

Understanding Hallucinations

Why LLMs Hallucinate

LLMs are trained to predict likely next tokens, not to be factual. They optimize for plausibility, not truth.

Example Hallucination:

Prompt: "Tell me about the 2021 discovery of water on the sun"
LLM: "In 2021, scientists at NASA made the groundbreaking discovery of water molecules in the sun's corona, using the Solar Dynamics Observatory..."

❌ This is completely false - the model generated plausible-sounding scientific text

Mitigation Strategies:
  • Retrieval-Augmented Generation (RAG): ground responses in real documents
  • Chain-of-thought prompting: have the model show its reasoning
  • Lower temperature: more deterministic, less prone to creative embellishment
  • Fine-tuning on factual data: improve accuracy in specific domains

Part 7: Fine-Tuning & RLHF

Fine-tuning adapts pre-trained models for specific tasks or behaviors. RLHF (Reinforcement Learning from Human Feedback) aligns models with human preferences.

Fine-Tuning Approaches

Full Fine-Tuning

Update all model parameters

# All parameters updated
model = load_pretrained()
for param in model.parameters():
    param.requires_grad = True
optimizer = AdamW(model.parameters())

✅ Best performance

❌ Expensive, risk of forgetting

LoRA (Low-Rank Adaptation)

Add small trainable matrices

# Only the LoRA weights are updated
for p in base_model.parameters():
    p.requires_grad = False
lora_A = nn.Linear(768, 16, bias=False)
lora_B = nn.Linear(16, 768, bias=False)
# h = h + lora_B(lora_A(x))

✅ 1000x fewer parameters

✅ Multiple adapters possible
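Here is a slightly fuller sketch of the same idea: wrap a frozen linear layer with a trainable low-rank update. The rank and scaling are illustrative; in practice libraries such as PEFT implement this for you.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus trainable low-rank update: h = Wx + (alpha/r) * B(A(x))."""
    def __init__(self, base, r=16, alpha=32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # base weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)     # update starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the small A and B matrices are trainable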

RLHF: Teaching Models Human Preferences

The RLHF Pipeline

Step 1: Collect Comparison Data

Response A (Preferred)

"Here's a clear explanation of recursion with an example..."

Response B

"Recursion is when a function calls itself. That's it."

Step 2: Train Reward Model

reward_model = RewardModel(base_model)
# Learns: P(response_A > response_B)
loss = -log_sigmoid(r_A - r_B)
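In PyTorch, that pairwise objective can be written in a couple of lines; the reward values below are placeholders standing in for reward-model outputs on preferred (A) and rejected (B) responses:

import torch
import torch.nn.functional as F

r_A = torch.tensor([1.3, 0.2, 0.9])    # placeholder rewards for preferred responses
r_B = torch.tensor([0.4, 0.5, -0.1])   # placeholder rewards for rejected responses

# Maximize P(A preferred over B) = sigmoid(r_A - r_B)
loss = -F.logsigmoid(r_A - r_B).mean()
print(loss)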
Step 3: PPO Training

for prompt in prompts:
    response = model.generate(prompt)
    reward = reward_model(prompt, response)
    
    # PPO objective
    ratio = new_prob / old_prob
    clipped = clip(ratio, 1-ε, 1+ε)
    loss = -min(ratio * advantage, clipped * advantage)
    
    # KL penalty to prevent drift
    loss += β * KL(new_model || old_model)

🎯 Constitutional AI (Claude's Approach)

Instead of human feedback, use AI feedback based on constitutional principles:

  • AI generates multiple responses
  • AI critiques them based on constitutional principles (helpful, harmless, honest)
  • AI revises the responses based on the critiques
  • Train on the AI-improved responses

Part 8: Using LLMs in Practice

From API calls to local deployment, here's how to actually use LLMs in your projects.

Working with LLM APIs

from openai import OpenAI
from anthropic import Anthropic
import google.generativeai as genai

# OpenAI GPT-4 (openai>=1.0 client)
client = OpenAI(api_key="your-key")

def call_gpt4(prompt, temperature=0.7, max_tokens=1000):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=temperature,  # 0=deterministic, 2=creative
        max_tokens=max_tokens,
        top_p=0.9,  # Nucleus sampling
        frequency_penalty=0.0,  # Reduce repetition
        presence_penalty=0.0    # Encourage new topics
    )
    return response.choices[0].message.content

# Anthropic Claude
anthropic = Anthropic(api_key="your-key")

def call_claude(prompt):
    message = anthropic.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1000,
        temperature=0.7,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text  # content is a list of content blocks

# Google Gemini
genai.configure(api_key="your-key")

def call_gemini(prompt):
    model = genai.GenerativeModel('gemini-pro')
    response = model.generate_content(prompt)
    return response.text

# Streaming responses for better UX
def stream_gpt4(prompt):
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

Running LLMs Locally

Using Ollama

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run models
ollama pull llama2:7b
ollama run llama2:7b "Hello!"

# Python integration
import ollama
response = ollama.chat(
    model='llama2:7b',
    messages=[{
        'role': 'user',
        'content': 'Why is the sky blue?'
    }]
)
print(response['message']['content'])

Using Transformers

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline
)

# Load model
model_name = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(
    model_name
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Generate
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)
result = pipe("Once upon a time", max_new_tokens=50)
print(result[0]["generated_text"])

Advanced Techniques

🔍 Retrieval-Augmented Generation (RAG)

Combine LLMs with external knowledge bases for accurate, up-to-date responses.

import chromadb  # backing store for the Chroma vector database
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Create vector database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=your_documents,
    embedding=embeddings
)

# Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# Query with context
result = qa_chain({"query": "What is our refund policy?"})
print(result['result'])  # Answer grounded in your docs
print(result['source_documents'])  # Citations

🔗 LangChain for Complex Workflows

Build sophisticated LLM applications with chains, agents, and tools.

from langchain.agents import initialize_agent, Tool
from langchain.tools import DuckDuckGoSearchRun
from langchain.chat_models import ChatOpenAI

# Define tools
search = DuckDuckGoSearchRun()
tools = [
    Tool(
        name="Search",
        func=search.run,
        description="Search the web for information"
    ),
    Tool(
        name="Calculator",
        func=lambda x: eval(x),  # demo only: eval is unsafe on untrusted input
        description="Perform calculations"
    )
]

# Create agent
agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(model="gpt-4"),
    agent="zero-shot-react-description",
    verbose=True
)

# Agent automatically uses tools as needed
result = agent.run(
    "What's the population of Tokyo multiplied by 2?"
)
# Agent will: 1) Search for Tokyo population
#            2) Use calculator to multiply by 2

Best Practices & Tips

💡 Production Best Practices

Performance
  • Cache common responses
  • Use streaming for long outputs
  • Batch similar requests
  • Implement retry logic with backoff (see the sketch after this list)

Reliability
  • Validate outputs with rules/regex
  • Use temperature=0 for consistency
  • Implement fallback models
  • Monitor for quality degradation

Cost Optimization
  • Use smaller models when possible
  • Compress prompts (remove redundancy)
  • Fine-tune for specific tasks
  • Implement token limits

Safety
  • Content filtering on inputs/outputs
  • Rate limiting per user
  • Audit logs for compliance
  • Human-in-the-loop for critical tasks
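As a concrete example of the retry advice above, a minimal exponential-backoff wrapper might look like the sketch below; the exception type you catch should be narrowed to your client library's errors, and call_gpt4 refers to the earlier API example.

import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:                      # narrow to your client's error types
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))

# Usage:
# result = call_with_backoff(lambda: call_gpt4("Summarize this ticket"))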

What You've Learned

Core Concepts

  • ✓ Transformer architecture and attention
  • ✓ Tokenization and embeddings
  • ✓ Pre-training and fine-tuning
  • ✓ RLHF and alignment

Practical Skills

  • ✓ Using LLM APIs effectively
  • ✓ Running models locally
  • ✓ Building RAG systems
  • ✓ Production best practices


Keep Learning

📚 Papers to Read

  • Attention Is All You Need (2017)
  • GPT-3 Paper (2020)
  • Constitutional AI (2022)
  • Sparks of AGI (2023)

🛠️ Tools to Try

  • Hugging Face Transformers
  • LangChain / LlamaIndex
  • Ollama for local models
  • OpenAI Playground

🎯 Projects to Build

  • RAG chatbot for docs
  • Code review assistant
  • Custom fine-tuned model
  • Multi-agent system