First impressions of OpenAI o1: An AI designed to think too much
OpenAI released its new o1 models on Thursday, giving ChatGPT users their first chance to try out AI models that pause to “think” before responding. There’s been a lot of hype about these models, which were codenamed “Strawberry” inside OpenAI. But does Strawberry live up to the hype?
In a way, yes.
Compared to GPT-4o, the o1 models feel like one step forward and two steps back. OpenAI o1 excels at reasoning and answering complex questions, but the model is about four times more expensive to use than GPT-4o. OpenAI’s latest model lacks the tools, multimodal capabilities, and speed that made GPT-4o so impressive. In fact, OpenAI even admits that “GPT-4o is still the best option for most prompts” on its help page and points out elsewhere that o1 has problems with simpler tasks.
“It’s impressive, but I think the improvement is not very significant,” said Ravid Shwartz Ziv, an NYU professor who studies AI models. “It’s better on certain problems, but there’s no improvement across the board.”
For all these reasons, it’s important to use o1 only for the questions it’s really meant to help with: big questions. To be clear, most people don’t use generative AI to answer these kinds of questions today, mainly because today’s AI models aren’t very good at it. However, o1 is a cautious step in that direction.
Thinking through big ideas
OpenAI o1 is unique because it “thinks” before answering, breaking big problems down into small steps and trying to recognize when it gets one of those steps right or wrong. This “multi-step thinking” is not entirely new (researchers have been proposing it for years, and You.com uses it for complex queries), but it was not practical until recently.
“There’s a lot of excitement in the AI community,” Kian Katanforoosh, CEO of Workera and an adjunct professor at Stanford University who teaches machine learning courses, said in an interview. “If you can train a reinforcement learning algorithm combined with some of the language modeling techniques that OpenAI offers, you can technically create incremental reasoning and allow the AI model to work backwards from the big ideas that you’re working through.”
OpenAI o1 is also uniquely expensive. With most models, you pay for input and output tokens. However, o1 adds a hidden process (the small steps the model breaks big problems down into) that consumes a large amount of compute you never fully see. OpenAI hides some of the details of this process to maintain its competitive advantage. Even so, you are charged for it in the form of “reasoning tokens.” This again underscores why you need to be careful when using OpenAI o1, lest you get charged for a ton of tokens for asking what the capital of Nevada is.
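To make the billing point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that reasoning tokens are billed at the same rate as visible output tokens. The per-token rates and token counts below are placeholder numbers chosen for illustration, not quoted prices or measured usage, and the `Usage` and `estimate_cost` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int            # tokens in the prompt
    visible_output_tokens: int   # tokens in the answer you actually see
    reasoning_tokens: int        # hidden "thinking" tokens you never see


def estimate_cost(usage: Usage, input_rate: float, output_rate: float) -> float:
    """Estimate the cost of one request in dollars.

    Rates are in dollars per 1 million tokens. ASSUMPTION: reasoning
    tokens are charged at the output rate, on top of the visible answer.
    """
    billed_output = usage.visible_output_tokens + usage.reasoning_tokens
    total = usage.input_tokens * input_rate + billed_output * output_rate
    return total / 1_000_000


# Hypothetical short question: a 20-token prompt and a 10-token answer,
# with and without 500 hidden reasoning tokens (all numbers are placeholders).
with_reasoning = estimate_cost(Usage(20, 10, 500), input_rate=15.0, output_rate=60.0)
without_reasoning = estimate_cost(Usage(20, 10, 0), input_rate=15.0, output_rate=60.0)
print(f"with reasoning tokens:    ${with_reasoning:.4f}")
print(f"without reasoning tokens: ${without_reasoning:.4f}")
```

Under these made-up numbers, nearly all of the bill for a short question comes from tokens the user never sees, which is why simple lookups are a poor fit for the model.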
However, the idea of an AI model that helps you “work backwards from big ideas” is powerful. In practice, the model does this quite well.
In one example, I asked ChatGPT o1 preview to help my family plan Thanksgiving, a task that could benefit from a bit of unbiased logic and reasoning. Specifically, I wanted help figuring out if two ovens would be enough to cook Thanksgiving dinner for 11 people, and wanted to discuss whether we should consider renting an Airbnb to get access to a third oven.
After 12 seconds of “thinking,” ChatGPT wrote me a 750+ word response, ultimately telling me that two ovens should be enough with some careful strategy and that my family could save on costs and spend more time together. But it broke down its thinking for me at each step, explaining how it took all of these external factors into account, including cost, family time, and oven management.
ChatGPT o1 preview showed me how to make the most of the oven space in the house hosting the event, which was a smart idea. Oddly enough, it suggested I rent a portable oven for the day. Still, it performed much better than GPT-4o, which needed multiple follow-up questions about exactly what dishes I would bring and even then gave me poor advice that I found less useful.
It may seem silly to ask about Thanksgiving dinner, but you can see how helpful this tool could be for breaking down complicated tasks.
I also asked o1 to help me plan a busy day where I had to travel between the airport, several face-to-face meetings in different locations, and my office. I received a very detailed plan, but that may have been a bit much. Sometimes all the extra steps can be a bit overwhelming.
On a simpler question, o1 does way too much: it doesn’t know when to stop overthinking. I asked where you can find cedars in America, and it provided an 800+ word answer describing every species of cedar in the country, including their scientific names. For some reason, it even had to consult OpenAI’s guidelines at one point. GPT-4o answered this question much better, providing me with about three sentences explaining that you can find the trees all over the country.
Dampened expectations
In some ways, Strawberry never lived up to the hype. Reports of OpenAI’s reasoning models date back to November 2023, around the time everyone was looking for an answer to why OpenAI’s board fired Sam Altman. This got the rumor mill in the AI world buzzing, with some speculating that Strawberry was a form of AGI, the enlightened version of AI that OpenAI ultimately hopes to create.
To clear up any doubts, Altman confirmed that o1 is not AGI, not that you would be confused after using the thing. The CEO also scaled back expectations for this launch, tweeting that “o1 is still buggy and limited, and still seems more impressive the first time you use it than it does later when you spend more time with it.”
The rest of the AI world is coming to terms with a less exciting launch than expected.
“The hype kind of got out of control for OpenAI,” says Rohan Pandey, a research engineer at AI startup ReWorkd, which builds web scrapers using OpenAI’s models.
He hopes that o1’s reasoning power is good enough to solve a number of complicated, niche problems that GPT-4 fails to solve. That is probably how most people in the industry see o1: as useful, but not quite the revolutionary step forward that GPT-4 represented for the industry.
“Everyone is waiting for a leap in capabilities, and it’s not clear if this represents that. I think it’s as simple as that,” Mike Conover, CEO of Brightwave, who previously co-developed Databricks’ AI model Dolly, said in an interview.
What is the value here?
The underlying principles used to develop o1 go back years. Google used similar techniques in 2016 to develop AlphaGo, the first AI system to beat a world champion in the board game Go, as pointed out by former Google employee and CEO of venture capital firm S32, Andy Harrison. AlphaGo trained by playing against itself countless times, essentially teaching itself until it reached superhuman abilities.
Harrison points out that this raises an age-old debate in the AI world.
“Camp one believes that you can automate workflows through this agent-based process. Camp two believes that if you had general intelligence and judgment, you wouldn’t need the workflow and the AI would just make a judgment like a human,” Harrison said in an interview.
Harrison says he’s in camp one, and that camp two requires trusting AI to make the right decisions. He doesn’t think we’re there yet.
Others see o1 less as a decision-making aid and more as a tool to challenge your thinking when making important decisions.
Katanforoosh, the CEO of Workera, described an example in which he wants to interview a data scientist for his company. He can tell OpenAI o1 that he has only 30 minutes and wants to assess a certain number of skills. He can then work backwards with the AI model to check whether he is thinking about the interview correctly, and o1 will take the time constraints and other limits into account.
The question is whether this helpful tool is worth the high price. While AI models keep getting cheaper, o1 is one of the first in a long time to get more expensive.