OpenAI has released a new language model called o1 that can draw logical conclusions and solve mathematical puzzles. This model significantly outperforms its predecessor, GPT-4o, in logical reasoning tasks.

Why all the excitement?
OpenAI’s new AI model, with the rather sober designation o1, is making quite a splash. The AI can draw logical conclusions much better than other large language models. AI specialist Ethan Mollick presents a particularly impressive example in his popular newsletter “One Useful Thing”: he had the AI solve a crossword puzzle.
In a detailed blog post, however, OpenAI particularly highlighted the model’s ability to solve math problems. The model solved the problems of the American Invitational Mathematics Examination (AIME), a competition for school students that serves as a qualifier on the road to the International Mathematical Olympiad, with an accuracy of about 83 percent. OpenAI’s best model to date, GPT-4o, only achieves about 15 percent on this test.
Why are math problems so important at this point?
That solving competition mathematics for school students should be a sign of special machine intelligence sounds surprising at first. But mathematics is the benchmark for logical thinking, and a generative artificial intelligence that can not only formulate coherent sentences but also reason logically would indeed be sensational. Such a machine could learn to complete complex tasks autonomously from very generally formulated instructions. It could weigh opportunities and risks and grasp abstract ideas that have so far been reserved for us humans. It could even help tackle the great problems of our time.
Why can’t language models do this so far?
At their core, language models do just one thing: they determine the most likely next word in a text. That ChatGPT, Claude and the like nevertheless appear so astonishingly human-like, and by now can do surprisingly much, comes down to huge training datasets, clever optimization strategies, and a great deal of fine-tuning.
But there are still big problems:
- Language models work sequentially from input to output. They generate their responses word by word and cannot revise what they have already written (see the sketch after this list). They are therefore not built for tasks that require holding intermediate results in memory or understanding non-linear relationships.
- Language models work statistically. Logical connections that humans explicitly know and use are only implicitly present for language models – if they appear in the training data.
- Language models hallucinate. Especially when the training data contains little that relates to the prompt they are supposed to complete, they produce output that looks plausible but is factually wrong.
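To make the word-by-word setup concrete, here is a minimal sketch of autoregressive generation. The function `next_word_distribution` merely stands in for a real neural network and its training data; the words and probabilities are invented for illustration.

```python
import random

def next_word_distribution(context: list[str]) -> dict[str, float]:
    """Stand-in for the model: invented probabilities for the next word."""
    # A real language model would compute this distribution from the full context.
    if context[-1] == "sky":
        return {"is": 0.7, "was": 0.2, "turns": 0.1}
    return {"blue": 0.6, "clear": 0.3, "falling": 0.1}

def generate(prompt: list[str], length: int = 2) -> list[str]:
    words = list(prompt)
    for _ in range(length):
        dist = next_word_distribution(words)
        # Each word is sampled once and then fixed: earlier choices can
        # never be revised in the light of what comes later.
        words.append(random.choices(list(dist), list(dist.values()))[0])
    return words

print(generate(["the", "sky"]))  # e.g. ['the', 'sky', 'is', 'blue']
```

The model only ever extends the sequence; nothing in this loop lets it go back and reconsider an earlier word.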
How can computers crack math problems and logic puzzles?
This question has occupied AI research since the 1950s. Popular at the time was the approach of symbolic artificial intelligence: human knowledge about a particular problem is brought into a formal, machine-readable form, for example a search tree. Such a tree consists of nodes connected by edges – an edge links two nodes whenever one state of the problem can be transformed into the other by applying one of the permitted rules. Using this method, the computer program Logic Theorist succeeded in proving 38 of the first 52 theorems of Principia Mathematica as early as 1956. In this three-volume work, the mathematicians Alfred North Whitehead and Bertrand Russell had attempted to derive all of mathematics from as few premises as possible. Essentially, the software proved a theorem by finding an unbroken path in the search tree between the theorem and a statement that had already been proven.
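To make the search-tree idea tangible, here is a small sketch – not Logic Theorist itself – that looks for a chain of rule applications connecting an already proven statement to a target theorem. The statements and their connections are entirely made up.

```python
from collections import deque

# Invented proof graph: each statement maps to the statements that follow
# from it by applying a single inference rule.
proof_graph = {
    "axiom": ["lemma_1", "lemma_2"],
    "lemma_1": ["lemma_3"],
    "lemma_2": ["lemma_4"],
    "lemma_3": ["theorem"],
    "lemma_4": [],
}

def find_proof_path(start: str, goal: str) -> list[str] | None:
    """Breadth-first search for an unbroken path from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for successor in proof_graph.get(path[-1], []):
            if successor not in visited:
                visited.add(successor)
                queue.append(path + [successor])
    return None  # no proof found

print(find_proof_path("axiom", "theorem"))
# ['axiom', 'lemma_1', 'lemma_3', 'theorem']
```

The path returned by the search is, in essence, the proof: a sequence of rule applications leading from something already known to the statement in question.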
How did OpenAI approach the problem?
We don’t know exactly; OpenAI has published only a few details. What we do know is that OpenAI built on a large language model and combined two techniques to solve the problem: Chain of Thought (CoT) and reinforcement learning.
CoT means having the machine work step by step – much like solving a math problem with the rule of three: we know this, from that follows this, and from that follows the answer. The technique itself is not new, and large language models are known to deliver more accurate results this way.
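As an illustration, this is roughly what the difference between a direct prompt and a chain-of-thought prompt looks like. The function `query_model` is only a placeholder for whatever LLM API is used; the math problem is a simple rule-of-three example.

```python
def query_model(prompt: str) -> str:
    """Placeholder for a call to an arbitrary LLM API."""
    raise NotImplementedError

# Direct prompt: the model has to jump straight to the answer.
direct_prompt = "A train travels 240 km in 3 hours. How far does it travel in 5 hours?"

# Chain-of-thought prompt: the model is asked to lay out the intermediate steps,
# e.g. 240 km / 3 h = 80 km/h, then 80 km/h * 5 h = 400 km.
cot_prompt = (
    "A train travels 240 km in 3 hours. How far does it travel in 5 hours?\n"
    "Think step by step: first work out the speed, then apply it to the new "
    "duration, and only then state the final answer."
)

# answer = query_model(cot_prompt)
```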
Reinforcement learning means letting the model try out different solution paths, which are initially chosen randomly. If the model arrives at the correct solution, the intermediate steps on this path are given a higher probability value.
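A toy version of this idea, stripped down to a tabular setting, might look as follows: candidate reasoning steps carry weights, paths are sampled in proportion to those weights, and steps that end in the correct answer are reinforced. The problem, the steps, and the numbers are invented; a real model would use far more elaborate policy-optimization methods.

```python
import random

# Invented toy problem: evaluate 3 * (4 + 2). Only one of the two candidate
# step sequences leads to the correct result, 18.
weights = {
    "add_first": 1.0,       # 4 + 2 = 6, then 3 * 6 = 18  (correct)
    "multiply_first": 1.0,  # 3 * 4 = 12, then 12 + 2 = 14 (wrong)
}

def sample_step() -> str:
    """Sample a solution path in proportion to its current weight."""
    total = sum(weights.values())
    return random.choices(list(weights), [w / total for w in weights.values()])[0]

def run_path(step: str) -> int:
    """Return the final answer produced by the chosen step sequence."""
    return 3 * (4 + 2) if step == "add_first" else 3 * 4 + 2

for _ in range(1000):            # many random trials, as in RL training
    step = sample_step()
    if run_path(step) == 18:     # reward only paths that reach the right answer
        weights[step] += 0.1     # the steps on that path become more likely

print(weights)  # the weight of "add_first" ends up far above "multiply_first"
```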
Subbarao Kambhampati of Arizona State University speculates in a post on X about how the two components could interlock: to solve a given problem, the language model presumably generates a large number of CoT prompts and inches forward piece by piece, trying out many possible solution paths. In a dedicated training phase, the paths that lead to correct solutions are then weighted more heavily. All of this is repeated billions of times during training – probably also with the help of synthetic data – until the model has learned enough.
In production use, the model likewise generates many internal CoT prompts from the user’s input and then follows the solution paths that, according to its training, are most likely to lead to the solution. From these it then – presumably – picks the shortest one and shows excerpts of it to the user.
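Read as pseudocode, this speculation amounts to a generate-and-rank loop at inference time. The helpers below – `generate_chain` and the `score` attribute – are hypothetical stand-ins for whatever the model does internally, not a known OpenAI interface.

```python
from dataclasses import dataclass

@dataclass
class Chain:
    steps: list[str]   # the intermediate reasoning steps
    score: float       # learned estimate that the chain ends in a correct answer

def generate_chain(prompt: str) -> Chain:
    """Hypothetical: sample one chain-of-thought candidate from the model."""
    raise NotImplementedError

def solve(prompt: str, n_candidates: int = 64) -> Chain:
    # 1. Generate many internal chain-of-thought candidates.
    candidates = [generate_chain(prompt) for _ in range(n_candidates)]
    # 2. Keep the candidates the training has rated as most promising.
    promising = sorted(candidates, key=lambda c: c.score, reverse=True)[:8]
    # 3. Among those, prefer the shortest chain; excerpts of it are shown to the user.
    return min(promising, key=lambda c: len(c.steps))
```

Nothing in such a loop checks the chain against formal rules of logic or mathematics, which is exactly the limitation described next.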
However, this also means that, once again, there is no guarantee that the solution is actually correct. There is no genuine logical or mathematical verification of the result. The model can still hallucinate.
Is this now the breakthrough towards artificial general intelligence (AGI)?
No. Experts don’t even agree on what a real AGI would be, and not even OpenAI claims that Project Strawberry – the internal codename under which o1 was developed – is such a thing.