The OpenAI Model Craft Challenge: Parameter Golf – A Deep Dive into the 16MB Constraint
Introduction to the OpenAI Model Craft Challenge
The OpenAI Model Craft Challenge, colloquially known as Parameter Golf, is a high-stakes competition designed to push the boundaries of language model (LM) training under extreme constraints. Inspired by the NanoGPT Speedrunning challenge, it asks participants to train the most efficient and effective language models that fit entirely within 16 megabytes (MB) of compressed artifact size while adhering to a 10-minute training limit on an 8x H100 GPU cluster. The goal is not merely to minimize computational overhead but to achieve the lowest possible validation loss on the FineWeb dataset, reported in nats and, equivalently, in bits per byte (bpb).
This challenge is deeply rooted in the principles of neural scaling laws, where the objective shifts from optimizing for model size or compute time to maximizing performance within a fixed parameter budget. Participants are encouraged to explore unconventional architectures, quantization techniques, and evaluation strategies that could redefine how we train and evaluate language models.
Core Objectives and Constraints
1. Technical Constraints
- Artifact Size Limit: The submission must fit entirely within 16MB of compressed bytes (code + model weights). This includes all Python code in `train_gpt.py`, tokenizer configurations, and the trained model. External downloads or network access during evaluation are prohibited.
- Training Time Constraint: Leaderboard submissions must complete training within 10 minutes on an 8x H100 GPU cluster. Non-record submissions can exceed this limit but should still be justified as computationally feasible.
2. Evaluation Metrics
The primary metric is the FineWeb validation loss: the mean per-token cross-entropy, reported in nats (the natural logarithm of perplexity), with the equivalent bits per byte (bpb) also logged by the training scripts. The challenge aims to minimize this loss while adhering to the constraints, with a focus on improving over existing records.
- Validation Loss: Lower values indicate better model performance.
- Example: A score of 1.1428 nats is considered strong, whereas 1.2244 nats represents a baseline.
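Since the training scripts report both nats and bits per byte, it can help to convert between the two. Here is a minimal sketch, assuming illustrative token and byte counts (the real ratio depends on the tokenizer):

```python
import math

def nats_to_bits_per_byte(loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert a mean per-token cross-entropy in nats to bits per byte.

    bits per token = nats per token / ln(2); scaling by tokens-per-byte
    gives bits per byte, the bpb figure quoted above.
    """
    bits_per_token = loss_nats / math.log(2)
    return bits_per_token * (total_tokens / total_bytes)

# Illustrative numbers only: assume roughly 4.3 bytes of raw text per token.
print(nats_to_bits_per_byte(1.1428, total_tokens=1_000_000, total_bytes=4_300_000))
```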
3. Key Challenges and Innovations
Participants are encouraged to explore:
- Architectural Innovations: Novel neural network designs that maximize efficiency.
- Examples include multi-layer perceptrons (MLPs) with sparse attention, recurrent architectures, or hybrid models combining transformer layers with simpler feed-forward networks.
- Quantization Techniques: Reducing precision to minimize model size while preserving performance.
- Common methods: Int8/Int6 quantization, mixed-precision training, and low-rank factorization (LRF).
- Tokenization Strategies: Optimizing vocabulary sizes and encoding schemes.
- Examples include BigramHash, zstd compression, or custom tokenizers that reduce embedding dimensions.
- Training Optimization: Techniques like weight decay, stochastic weight averaging (SWA), and learning rate scheduling.
- Evaluation Methods: Novel ways to assess model performance without violating constraints.
- Some submissions use sliding-window evaluation or test-time training (TTT).
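As a rough illustration of the sliding-window evaluation mentioned above, the sketch below scores a long sequence in overlapping windows so that every token is evaluated exactly once with partial left context. The model interface and window sizes are assumptions, not the challenge's official harness:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, tokens, window=1024, stride=512):
    """Mean per-token loss (nats) over a long token sequence.

    Overlapping windows give each token up to `window - 1` tokens of left
    context; only positions not yet scored contribute to the running sum.
    `model(x)` is assumed to return next-token logits of shape (B, T, vocab).
    """
    assert stride < window
    nll_sum, n_scored, scored_upto, begin = 0.0, 0, 1, 0
    while scored_upto < tokens.size(0):
        end = min(begin + window, tokens.size(0))
        inputs = tokens[begin:end - 1].unsqueeze(0)
        targets = tokens[begin + 1:end].unsqueeze(0)
        logits = model(inputs)
        nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), reduction="none")
        new = end - scored_upto            # tokens not counted by earlier windows
        nll_sum += nll[-new:].sum().item()
        n_scored += new
        scored_upto = end
        begin += stride
    return nll_sum / n_scored
```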
The Leaderboard: A Closer Look
Below is a detailed breakdown of the top-performing submissions as of March 2026, highlighting their architectural and optimization choices:
1. Top-Ranking Submissions (March 2026)
| Rank | Run Name | Author | Summary | Date |
|------|----------|--------|---------|------|
| 1 | 10L Int5-MLP + BigramHash(10240) | thwu1 | 10 layers, mixed int5/int6 quantization, BigramHash(10240), SWA(0.4), WD=0.04 | 2026-03-20 |
| 2 | Int6 MLP3x + SmearGate | Raahil Shah | 3x MLP + SmearGate + BigramHash + OrthoInit + Muon WD + SWA | 2026-03-20 |
| 3 | 11L MLP3x + Int6 QAT | aruniyer | 11 layers, 3x MLP, int6 QAT, zstd-22, WD=0.04, sliding eval | 2026-03-20 |
| 4 | SmearGate + OrthoInit | aquariouseworkman | SmearGate + BigramHash + 3x MLP + int6 STE QAT + sliding eval | 2026-03-19 |
Key Innovations in Top Submissions
Quantization and Precision:
The top submissions frequently employ mixed-precision quantization (e.g., Int5/Int6 for weights, Int8 for embeddings).
Techniques like quantization-aware training (QAT) with straight-through estimators (STE) help preserve performance while reducing model size.
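A minimal sketch of that idea, fake-quantizing weights to an int6 grid in the forward pass while letting gradients pass straight through; this is written against PyTorch and is not the exact scheme used by any leaderboard entry:

```python
import torch
import torch.nn.functional as F

class FakeQuantSTE(torch.autograd.Function):
    """Round weights to a symmetric int grid on the forward pass; identity gradient on backward."""

    @staticmethod
    def forward(ctx, w, bits=6):
        qmax = 2 ** (bits - 1) - 1                      # 31 for int6
        scale = w.abs().max().clamp(min=1e-8) / qmax    # per-tensor symmetric scale
        q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
        return q * scale                                # dequantized values used downstream

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                           # straight-through estimator

def quantized_linear(x, weight, bias=None, bits=6):
    """Linear layer whose weights are trained quantization-aware via the STE above."""
    return F.linear(x, FakeQuantSTE.apply(weight, bits), bias)
```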
Architectural Tweaks:
Architectures that lean on multi-layer perceptrons (MLPs), with fewer attention heads but deeper stacks, are common because they reduce the computational overhead of self-attention.
SmearGate and Orthogonal Initialization improve gradient flow and convergence in constrained settings.
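SmearGate is not documented in detail here, but orthogonal initialization is a standard technique; a small sketch of applying it to a PyTorch module (the layer sizes are illustrative):

```python
import torch.nn as nn

def ortho_init(module: nn.Module) -> None:
    """Orthogonally initialize every linear layer's weight matrix and zero its bias."""
    for m in module.modules():
        if isinstance(m, nn.Linear):
            nn.init.orthogonal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

mlp = nn.Sequential(nn.Linear(256, 768), nn.GELU(), nn.Linear(768, 256))
ortho_init(mlp)
```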
Tokenization and Compression:
BigramHash reduces vocabulary size by hashing tokens into a smaller space (e.g., 10,240 tokens).
zstd compression further minimizes the model’s footprint during evaluation.
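The exact BigramHash construction is not spelled out here; the general idea, hashing (previous, current) token pairs into a fixed number of buckets such as 10,240 so the extra embedding table stays small, might look like this (the mixing constants are arbitrary assumptions):

```python
def bigram_hash(prev_token: int, token: int, num_buckets: int = 10_240) -> int:
    """Hash a (previous, current) token pair into a small, fixed id space.

    The resulting ids can index an auxiliary embedding table with only
    `num_buckets` rows, which helps keep the artifact under the 16MB budget.
    """
    h = (prev_token * 1_000_003 + token * 7_919) & 0xFFFFFFFF
    return h % num_buckets

# Example: attach a bigram-bucket id to every position in a token sequence.
tokens = [17, 402, 9, 55]
bigram_ids = [bigram_hash(p, t) for p, t in zip([0] + tokens[:-1], tokens)]
```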
2. Notable Non-Record Runs
Some submissions go beyond the official constraints (for example, the 10-minute training budget) but still demonstrate creative approaches:
| Rank | Run Name | Author | Summary | Date |
|------|----------|--------|---------|------|
| - | 4-Hour Baseline | Will DePue | Unlimited compute, 4 hours on 8xH100 (for comparison) | 2026-03-18 |
| - | Long Context (4k seq length) | Spokane Way | Extended sequence lengths with optimized hyperparameters | 2026-03-19 |
Key Insights
- The 4-hour baseline demonstrates that even without aggressive optimization, models can achieve competitive performance when given more compute.
- Long-context submissions (e.g., 4k or 2048-token sequences) explore how to handle extended inputs within the constraints.
Getting Started: Training Your First Model
Local Development on Apple Silicon
For participants with an Apple M1/M2 Mac, OpenAI provides a pre-configured MLX training script:
```bash
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install mlx numpy sentencepiece huggingface-hub datasets tqdm
```
Downloading FineWeb Dataset
```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 10
```
- This downloads a subset of the dataset with a 1024-token vocabulary (default: full validation + 8B tokens).
Running a Small MLX Training Job
```bash
RUN_ID=mlx_smoke ITERATIONS=200 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 python3 train_gpt_mlx.py
```
- This skips intermediate validation and prints the final `val_loss` and `val_bpb`.
Scaling Up to Remote Machines
For larger-scale training, participants can use Runpod, a GPU cloud provider:
Steps for Launching a 1xH100 Pod
- Create an account on Runpod.
- Deploy a new pod with the official Parameter Golf template.
- Clone the repository and run:
```bash
cd /workspace
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024
torchrun --standalone --nproc_per_node=1 train_gpt.py
```
Key Configuration Flags
- `--train-batch-tokens`: Controls the batch size in tokens (e.g., 8192).
- `--val-loss-every`: Frequency of validation checks.
- `--max-wallclock-seconds=0`: Overrides the 10-minute limit.
FAQ: Addressing Common Concerns
What Counts Toward the 16MB Artifact?
The artifact size is calculated as:
Code bytes (train_gpt.py) + Compressed model bytes
- No external downloads or network access during evaluation are allowed.
- The limit is decimal 16MB (16,000,000 bytes), not MiB.
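A quick self-check of that budget, assuming the `zstandard` Python package and illustrative file names; the official scorer may compute the total differently:

```python
import os
import zstandard  # pip install zstandard

LIMIT = 16_000_000  # decimal 16MB, not MiB

def artifact_size(code_path="train_gpt.py", weights_path="model.bin", level=22) -> int:
    """Code bytes plus zstd-compressed model bytes, mirroring the formula above."""
    code_bytes = os.path.getsize(code_path)
    with open(weights_path, "rb") as f:
        compressed = zstandard.ZstdCompressor(level=level).compress(f.read())
    return code_bytes + len(compressed)

size = artifact_size()
print(f"{size:,} bytes ({'OK' if size <= LIMIT else 'over budget'})")
```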
Evaluation Rules
- No validation data leakage: Models cannot be trained on the validation set before evaluation.
- Test-time training (TTT): Allowed only if the tokens have already been evaluated in a previous run.
- Computation limits: Evaluation must complete within 10 minutes on 8x H100.
Submission Requirements
To qualify as an SOTA record:
- Beat the current leaderboard by ≥0.005 nats (statistically significant).
- Provide train logs and reproducibility proofs.
- Include a `README.md` with detailed methodology.
Architectural and Optimization Strategies
1. Quantization Techniques
- Int8/Int6 Mixed Precision: Reduces model size while preserving accuracy.
- Example: Int5 for weights, Int8 for embeddings.
- Quantization-Aware Training (QAT): Trains models with quantized weights to minimize performance loss.
2. Tokenization and Vocabulary Reduction
- BigramHash: Hashes tokens into a smaller space (e.g., 10,240 tokens).
- zstd Compression: Further reduces model size during evaluation.
3. Architectural Innovations
- Multi-Layer Perceptrons (MLPs): Simpler than transformers but effective in constrained settings.
- Sparse Attention: Reduces attention computation overhead.
- Recurrent Architectures: Useful for long sequences within limited memory.
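As a minimal sketch of the MLP idea from the list above, here is a pre-norm feed-forward block with a 3x hidden expansion (matching the "3x MLP" naming on the leaderboard); the actual submissions' layer layouts are not reproduced here:

```python
import torch.nn as nn

class MLPBlock(nn.Module):
    """Pre-norm residual feed-forward block with a 3x hidden expansion and no attention."""

    def __init__(self, dim: int, expansion: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, expansion * dim),
            nn.GELU(),
            nn.Linear(expansion * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))   # residual connection
```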
4. Training Optimization
- Weight Decay (WD): Regularizes training to prevent overfitting.
- Stochastic Weight Averaging (SWA): Improves generalization in constrained settings.
- Sliding Window Evaluation: Gradually increases context length during evaluation.
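The SWA(0.4) notation on the leaderboard is read here as averaging checkpoints from the last stretch of training; below is a minimal sketch of uniform weight averaging over saved state dicts (the fraction semantics and file names are assumptions):

```python
import torch

def average_checkpoints(paths):
    """Uniformly average parameter tensors across checkpoint files (stochastic weight averaging)."""
    avg = None
    for i, path in enumerate(paths, start=1):
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += (v.float() - avg[k]) / i   # incremental running mean
    return avg

# e.g. average the checkpoints saved during the final ~40% of training steps
# swa_state = average_checkpoints(["ckpt_0600.pt", "ckpt_0800.pt", "ckpt_1000.pt"])
```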
Conclusion: The Future of Parameter Golf
The OpenAI Model Craft Challenge represents a bold experiment in pushing the boundaries of language model training within extreme constraints. By encouraging creativity in architecture, quantization, and evaluation strategies, this challenge has spurred innovations that could redefine how we train models for efficiency and performance.
Participants have demonstrated that even with 16MB of compressed artifact size, it is possible to achieve competitive validation losses by leveraging unconventional techniques like:
- Mixed-precision quantization
- Custom tokenizers (e.g., BigramHash)
- Architectural simplifications (MLPs, sparse attention)
As the challenge progresses, we can expect even more daring experiments—such as long-context training, test-time training, and novel compression schemes—to further push the limits of what’s possible within a fixed parameter budget.
For those eager to contribute, the repository provides a solid foundation for experimentation. Whether you’re an academic researcher or a software engineer, this challenge offers a unique opportunity to tackle one of AI’s most pressing questions: How small can we make a model that still performs well?
Final Note: The leaderboard is dynamic, and submissions are evaluated based on statistical significance. Always ensure your results are reproducible before claiming a record!
Repository: https://github.com/openai/parameter-golf