On May 28, 2024, Andrej Karpathy published a guide to reproducing the GPT-2 (124M) model using llm.c, a concise C/CUDA implementation of GPT training. This model, originally released by OpenAI in 2019, is the smallest member of the GPT-2 series. Remarkably, training it on modern hardware is economically viable even for startups and SMEs.
Efficient Model Training
The 124M model has 12 layers, 12 attention heads, and a 768-dimensional embedding, and is trained on 10 billion tokens from the FineWeb dataset. With llm.c, this setup trains on a single 8x A100 80GB SXM node in roughly 90 minutes, for a cost of around $20.
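For reference, here is a minimal sketch of the 124M configuration together with the cost arithmetic; the model and training figures come from the guide, while the per-node price is an assumption based on typical on-demand rates.

```python
# Sketch of the GPT-2 (124M) configuration and rough cost arithmetic.
# The node price is an assumption (~$14/hr for an 8x A100 80GB SXM node);
# the other figures are quoted in the guide.
gpt2_124m = {
    "n_layer": 12,        # transformer blocks
    "n_head": 12,         # attention heads per block
    "n_embd": 768,        # embedding / hidden dimension
    "vocab_size": 50257,  # GPT-2 BPE vocabulary
    "block_size": 1024,   # maximum context length
}

train_tokens = 10e9          # tokens drawn from FineWeb
wall_hours = 1.5             # ~90 minutes on the 8x A100 node
node_price_per_hour = 14.0   # assumed on-demand rate, USD

print(f"approximate cost: ${wall_hours * node_price_per_hour:.0f}")  # ~$21
```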

Comparison with OpenAI
Comparing against the GPT-2 (124M) checkpoint released by OpenAI on a withheld FineWeb validation split, Karpathy notes that the reproduction performs better. The comparison is not entirely fair, however, because the original GPT-2 was trained on the unreleased "WebText" dataset, so the data distributions differ and the internet itself has changed in the intervening years. A further evaluation on HellaSwag shows the reproduced model reaching 29.9 accuracy, surpassing the original GPT-2 (124M) at 29.4, despite being trained on far fewer tokens (10B, versus the much larger WebText training run).
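HellaSwag is a completion-style benchmark: each example supplies a context and four candidate endings, and the model scores a hit when its most likely ending is the labelled one. Below is a minimal sketch of that scoring rule, assuming a hypothetical completion_loss(context, ending) helper that returns the model's average per-token loss on the ending.

```python
# Hedged sketch of HellaSwag-style scoring; completion_loss is a hypothetical
# helper returning the model's mean per-token loss on `ending` given `context`.
def hellaswag_accuracy(examples, completion_loss):
    correct = 0
    for context, endings, label in examples:   # four candidate endings per example
        losses = [completion_loss(context, e) for e in endings]
        predicted = min(range(len(endings)), key=lambda i: losses[i])
        correct += int(predicted == label)
    return correct / len(examples)
```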
Training Process
The guide targets Linux x86 64-bit Ubuntu 22.04 with CUDA 12; comments in the README point to support for other systems. Key hyperparameters are adopted from the GPT-3 paper, which documents them in far more detail than the GPT-2 paper, giving a sensible configuration for training this model.
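As a rough illustration, the GPT-3 paper's settings for a model of this size look approximately like the sketch below; these values are paraphrased from the paper, not copied from the guide's exact command line, and the 1024-token context is the GPT-2 choice that llm.c keeps.

```python
# Approximate hyperparameters for a ~125M-parameter model as documented in the
# GPT-3 paper (a sketch, not the exact llm.c invocation).
hparams = {
    "learning_rate": 6e-4,        # peak LR, decayed with a cosine schedule
    "batch_size_tokens": 524288,  # ~0.5M tokens per optimizer step
    "adam_betas": (0.9, 0.95),    # Adam beta1 / beta2
    "weight_decay": 0.1,
    "grad_clip": 1.0,             # clip the global gradient norm at 1.0
    "context_length": 1024,       # GPT-2's context length, retained by llm.c
}
```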
Key Considerations When Running Out of Memory
If memory becomes a constraint during training, a few steps can keep the run going (a hypothetical retry loop is sketched after this list):
- Enable the -r 1 option, which recomputes some activations in the backward pass, trading a little speed for memory.
- Halve the batch size (-b) repeatedly until the run succeeds.
- Once the run fits, try reverting to -r 0 to regain some speed, if memory allows.
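Here is a hypothetical retry loop illustrating that procedure; the binary name and the -r / -b flags follow the llm.c README, but the wrapper itself is only an illustration, not part of the project.

```python
# Hypothetical wrapper around the llm.c training binary: turn on -r 1, then keep
# halving the batch size until the run no longer fails (e.g. with CUDA OOM).
import subprocess

def launch_with_fallback(batch_size: int = 64) -> int:
    while batch_size >= 1:
        cmd = ["./train_gpt2cu", "-r", "1", "-b", str(batch_size)]
        if subprocess.run(cmd).returncode == 0:
            return batch_size   # run succeeded at this batch size
        batch_size //= 2        # halve and retry
    raise RuntimeError("could not fit the model even at batch size 1")
```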
Insights into the Training Journey
During an example training run on a single A100 40GB PCIe GPU (at $1.29/hr), the process takes ~20K steps over 10B tokens and reaches roughly 178K tokens/second of throughput (a quick consistency check of these figures follows the list). Important metrics include:
- Loss: starting at 7.577 and falling substantially by the end of optimization.
- Model FLOPs Utilization (MFU): approximately 60% on an A100 80GB SXM.
- Gradient norm: spiking early but settling as training stabilizes, with gradient clipping at the standard value of 1.0.
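These headline numbers are easy to sanity-check with a little arithmetic; the sketch below uses the figures quoted above, plus a ~0.5M-tokens-per-step batch size implied by ~20K steps over 10B tokens.

```python
# Back-of-the-envelope check of the single-GPU figures quoted above.
tokens_total = 10e9
tokens_per_step = 524288              # ~0.5M tokens per optimizer step (implied)
steps = tokens_total / tokens_per_step
throughput = 178e3                    # tokens/second in the example run
wall_seconds = tokens_total / throughput
gpu_price_per_hour = 1.29             # USD, as quoted for the A100 40GB PCIe

print(f"steps: ~{steps:,.0f}")                                    # ~19,073 (~20K)
print(f"wall time: ~{wall_seconds / 3600:.1f} hours")              # ~15.6 hours
print(f"cost: ~${wall_seconds / 3600 * gpu_price_per_hour:.0f}")   # ~$20
```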
Visualizing Training Progress
Visualization is key to understanding and optimizing the training run. By parsing the log files in a Jupyter notebook, one can chart loss and other metrics over the course of training and use them to fine-tune performance.
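The guide does this with a Jupyter notebook, but the idea is simple enough to sketch. The snippet below assumes a hypothetical log format with one "step loss" pair per line; the actual layout and name of the llm.c log file may differ, so adapt the parsing accordingly.

```python
# Hedged sketch: parse a training log and plot the loss curve.
# Assumes a hypothetical "step loss" pair per line; adapt to the real log format.
import matplotlib.pyplot as plt

steps, losses = [], []
with open("main.log") as f:          # log file name is an assumption
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            try:
                steps.append(int(parts[0]))
                losses.append(float(parts[1]))
            except ValueError:
                continue             # skip lines that are not step/loss pairs

plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("training loss")
plt.title("GPT-2 (124M) training loss")
plt.show()
```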
Tokenizer and Sampling
The GPT-2 tokenizer is exported to the .bin file the C code needs, so that sampled token ids can be decoded back into text. Although llm.c is not primarily designed for inference, conditional sampling can be experimented with in a somewhat hacky manner to inspect model outputs.
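If you just want to turn sampled token ids back into text outside the C code, the standard GPT-2 BPE is also available in Python via the tiktoken library; the snippet below is a minimal sketch of that alternative, with placeholder token ids.

```python
# Minimal sketch: decode GPT-2 token ids to text with tiktoken, as an alternative
# to the .bin tokenizer file consumed by the C code.
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # standard GPT-2 BPE vocabulary
sample_ids = [15496, 11, 995]         # placeholder ids; should decode to "Hello, world"
print(enc.decode(sample_ids))
```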
Further Developments and Acknowledgments
Karpathy acknowledges significant contributions to the llm.c project, especially in CUDA kernel optimization and distributed optimization. As the project evolves, there are plans to reproduce larger models, such as the 350M, 774M, and eventually the 1558M GPT-2, aiming for cleaner, better-tested code with multi-node training support.
Conclusion
For startups and SMEs, this guide represents a significant milestone in economically reproducing advanced AI models. By leveraging efficient coding practices and accessible hardware, smaller enterprises can harness the power of GPT-2, unlocking new potential in a cost-effective manner.
Edworking is the best and smartest decision for SMEs and startups to be more productive. Edworking is a FREE superapp of productivity that includes all you need for work powered by AI in the same superapp, connecting Task Management, Docs, Chat, Videocall, and File Management. Save money today by not paying for Slack, Trello, Dropbox, Zoom, and Notion.
---
Remember these 3 key ideas for your startup:
- Cost-Efficiency: Utilize affordable resources like Lambda Labs to achieve high-efficiency training for advanced models. With llm.c, reproducing GPT-2 (124M) can cost as little as $20.
- Optimized Training: Focus on key hyperparameters and batch configurations to manage memory and optimize training performance. Efficient training results in better accuracy and lower losses.
- Tools and Visualization: Implement visualization tools to track training progress and optimize your models. Detailed insights help in enhancing model performance and making informed decisions.
Feel free to explore the detailed documentation and guides on platforms like GitHub and Lambda Labs for an in-depth understanding and practical implementation.
For more details, see the original source.






