OpenAI Unveils o1 Chain-of-Thought AI Models

BY Mark Howell · 6 days ago · 6 MINS READ

OpenAI released two major new preview models: o1-preview and o1-mini. These models, previously rumored under the codename “strawberry,” are not just a simple upgrade from GPT-4o but introduce significant trade-offs in cost and performance for improved reasoning capabilities. OpenAI's elevator pitch provides a good starting point: "We’ve developed a new series of AI models designed to spend more time thinking before they respond."

Training and Reasoning Capabilities

OpenAI's article, Learning to Reason with LLMs, explains how these new models were trained using a large-scale reinforcement learning algorithm. This algorithm teaches the model to think productively using its chain of thought in a highly data-efficient training process. The performance of o1 consistently improves with more reinforcement learning (train-time compute) and more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and OpenAI continues to investigate them.
Through reinforcement learning, the o1 models learn to hone their chain of thought and refine their strategies. They recognize and correct mistakes, break down tricky steps into simpler ones, and try different approaches when the current one isn’t working. This process dramatically improves the model’s ability to reason, meaning the models can handle significantly more complicated prompts where a good result requires backtracking and "thinking" beyond just next token prediction.

Trade-offs and API Documentation

Some of the most intriguing details about the new models and their trade-offs can be found in their API documentation. For applications needing image inputs, function calling, or consistently fast response times, the GPT-4o and GPT-4o mini models will continue to be the right choice. However, for applications that demand deep reasoning and can accommodate longer response times, the o1 models could be an excellent choice.
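To make that trade-off concrete, here is a minimal sketch of how an application might route requests between the two model families, based only on the guidance above. The helper name and flags are purely illustrative, not part of OpenAI's API.

```python
def pick_model(needs_image_input: bool = False,
               needs_function_calling: bool = False,
               needs_fast_response: bool = False,
               needs_deep_reasoning: bool = False) -> str:
    """Illustrative routing helper reflecting OpenAI's published guidance."""
    # o1-preview currently lacks image inputs and function calling, and its
    # responses are slower, so GPT-4o remains the right choice for those cases.
    if needs_image_input or needs_function_calling or needs_fast_response:
        return "gpt-4o"
    # Deep, multi-step reasoning is where the extra "thinking" time pays off.
    if needs_deep_reasoning:
        return "o1-preview"
    # Cheap default for simpler tasks.
    return "gpt-4o-mini"
```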

Reasoning Tokens

One of the most interesting aspects is the introduction of reasoning tokens—tokens that are not visible in the API response but are still billed and counted as output tokens. These tokens are where the new magic happens. OpenAI suggests allocating a budget of around 25,000 of these for prompts that benefit from the new models. Consequently, the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini. This is a significant increase from the GPT-4o and GPT-4o-mini models, which both currently have a 16,384 output token limit.
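As a rough sketch of what this looks like in practice (assuming the current openai Python SDK; the max_completion_tokens parameter and the completion_tokens_details.reasoning_tokens usage field are taken from OpenAI's API reference at the time of writing and may change), a call to o1-preview caps output with max_completion_tokens and reports the hidden reasoning tokens in the usage block:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "Devise a step-by-step proof strategy for this conjecture ..."}],
    # o1 models take max_completion_tokens rather than max_tokens; the limit
    # must leave room for the invisible reasoning tokens (~25,000 suggested).
    max_completion_tokens=32_000,
)

print(response.choices[0].message.content)

# Reasoning tokens never appear in the response text, but they are billed
# as output tokens and reported in the usage details.
details = response.usage.completion_tokens_details
print("reasoning tokens used:", details.reasoning_tokens)
```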

OpenAI's new chain-of-thought model enhances reasoning capabilities.

Limit Additional Context in RAG

Another key tip from the API documentation is to limit additional context in retrieval-augmented generation (RAG). When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response. This is a big change from how RAG is usually implemented, where the advice is often to cram as many potentially relevant documents as possible into the prompt.
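Here is a minimal sketch of what that looks like in code, assuming you already have a retriever of some kind; the keyword-overlap scorer below is a hypothetical stand-in for whatever embedding similarity or search backend you actually use.

```python
def score_relevance(query: str, document: str) -> float:
    """Crude keyword-overlap score; swap in your real retriever or embeddings."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) / (len(query_terms) or 1)

def build_prompt(query: str, documents: list[str], top_k: int = 3) -> str:
    """Include only the few most relevant documents, per the o1 guidance."""
    ranked = sorted(documents, key=lambda d: score_relevance(query, d), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {query}"

# Example: only the most relevant documents make it into the prompt.
docs = [
    "o1-preview reserves reasoning tokens that are billed but not shown.",
    "GPT-4o supports image inputs and function calling.",
    "Our cafeteria menu for Friday is lasagna.",
]
print(build_prompt("How are reasoning tokens billed?", docs, top_k=2))
```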

Hidden Reasoning Tokens

A frustrating detail is that these reasoning tokens remain invisible in the API: you get billed for them, but you don't get to see what they were. OpenAI explains why in Hiding the Chains of Thought. Assuming it is faithful and legible, the hidden chain of thought lets OpenAI "read the mind" of the model and understand its thought process; for example, they may in future want to monitor the chain of thought for signs of the model manipulating the user. For this to work, the model must have the freedom to express its thoughts in an unaltered form, so OpenAI cannot train any policy compliance or user preferences onto the chain of thought, and they also do not want to make an unaligned chain of thought directly visible to users. After weighing multiple factors, including user experience, competitive advantage, and the option to pursue chain-of-thought monitoring, OpenAI decided not to show the raw chains of thought to users.

Community and Future Applications

OpenAI provides some initial examples in the Chain of Thought section of their announcement, covering tasks like generating Bash scripts, solving crossword puzzles, and calculating the pH of a moderately complex solution of chemicals. These examples show that the ChatGPT UI version of these models does expose details of the chain of thought but doesn’t show the raw reasoning tokens, instead using a separate mechanism to summarize the steps into a more human-readable form.

Training AI models with reinforcement learning enhances their reasoning capabilities.
OpenAI also has two new cookbooks with more sophisticated examples. I asked on Twitter for examples of prompts that people had found which failed on GPT-4o but worked on o1-preview. A couple of my favorites were shared, but great examples are still a bit thin on the ground. OpenAI researcher Jason Wei noted that while results on AIME and GPQA are strong, that doesn't necessarily translate to something a user can feel: even as someone working in science, he finds it hard to identify the slice of prompts where GPT-4o fails, o1 does well, and he can grade the answer. When he does find such prompts, though, o1 feels "totally magical." His conclusion: we all need to find harder prompts.

Initial Impressions and Future Outlook

Ethan Mollick has been previewing the models for a few weeks and published his initial impressions. His crossword example is particularly interesting for the visible reasoning steps, which include notes like: "I noticed a mismatch between the first letters of 1 Across and 1 Down. Considering 'CONS' instead of 'LIES' for 1 Across to ensure alignment."
It’s going to take a while for the community to shake out the best practices for when and where these models should be applied. I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet), but it’s going to be really interesting to see us collectively expand our mental model of what kind of tasks can be solved using LLMs given this new class of model. I expect we’ll see other AI labs, including the open model weights community, start to replicate some of these results with their own versions of models that are specifically trained to apply this style of chain-of-thought reasoning.
Remember these 3 key ideas for your startup:

  1. Enhanced Reasoning Capabilities: The new o1 models are designed to spend more time thinking before they respond, making them ideal for applications that require deep reasoning. This can be particularly useful for startups developing complex AI-driven solutions.

  2. Increased Token Allowance: With the introduction of reasoning tokens and a significant increase in output token allowance, the o1 models can handle more complex prompts and provide more detailed responses. This can be a game-changer for SMEs looking to leverage AI for intricate tasks.

  3. Selective Context Inclusion: When using retrieval-augmented generation (RAG), include only the most relevant information to prevent the model from overcomplicating its response. This approach can help startups streamline their AI processes and improve efficiency.


Edworking is the best and smartest decision for SMEs and startups that want to be more productive. Edworking is a FREE, AI-powered productivity superapp that includes everything you need for work in one place, connecting Task Management, Docs, Chat, Videocall, and File Management. Save money today by not paying for Slack, Trello, Dropbox, Zoom, and Notion.


About the Author: Mark Howell Linkedin

Mark Howell is a talented content writer for Edworking's blog, consistently producing high-quality articles on a daily basis. As a Sales Representative, he brings a unique perspective to his writing, providing valuable insights and actionable advice for readers in the education industry. With a keen eye for detail and a passion for sharing knowledge, Mark is an indispensable member of the Edworking team. His expertise in task management ensures that he is always on top of his assignments and meets strict deadlines. Furthermore, Mark's skills in project management enable him to collaborate effectively with colleagues, contributing to the team's overall success and growth. As a reliable and diligent professional, Mark Howell continues to elevate Edworking's blog and brand with his well-researched and engaging content.
