Scientists impressed by latest ChatGPT model o1
Scientists praise OpenAI's new ChatGPT model o1 for its impressive advances in supporting scientific work.

Researchers who helped test OpenAI's new large language model, OpenAI o1, say it represents a big step forward in the usefulness of chatbots for science.
“In my field of quantum physics, it gives significantly more detailed and coherent answers” than the previous model, GPT-4o, did, says Mario Krenn, head of the Artificial Scientist Lab at the Max Planck Institute for the Physics of Light in Erlangen, Germany. Krenn was one of the scientists on the 'Red Team' that tested the pre-release version of o1 for OpenAI, a technology company based in San Francisco, California, putting the bot through its paces and checking for safety concerns.
Since the public launch of ChatGPT in 2022, the large language models that power such chatbots have, on average, become larger and better, with more parameters, larger training data sets and stronger performance on a variety of standardized tests.
OpenAI explains that the o1 series represents a fundamental change in the company's approach. Observers report that this AI model stands out because it spent more time in certain phases of its training and “thinks” longer about its answers, making it slower but more capable, especially in areas where right and wrong answers are clearly defined. The company adds that o1 can “think through complex tasks and solve more difficult problems than previous models in science, programming and mathematics”. At present, o1-preview and o1-mini (a smaller, more cost-effective version suited to programming) are available as previews to paying customers and certain developers. The company has not published details of the o1 models' parameters or computing power.
Outperforming graduate students
Andrew White, a chemist at FutureHouse, a San Francisco nonprofit focused on applying AI to molecular biology, says that in the year and a half since the public release of GPT-4, observers have been surprised and disappointed by a general lack of improvement in how chatbots support scientific tasks. The o1 series, he believes, has changed this.
Remarkably, o1 is the first large language model to beat graduate students on the hardest set of questions, the ‘Diamond’ set, in a test called the Graduate-Level Google-Proof Q&A Benchmark (GPQA)1. OpenAI says its researchers scored just under 70% on GPQA Diamond, while o1 scored 78% overall, with a particularly high score of 93% in physics (see ‘Next Level’). That is “significantly higher than the next-best documented [chatbot] performance”, says David Rein, who was part of the team that developed the GPQA. Rein now works at Model Evaluation and Threat Research, a nonprofit in Berkeley, California, that assesses the risks of AI. “It seems plausible to me that this represents a significant and fundamental improvement in the core capabilities of the model,” he adds.
OpenAI also tested o1 in a qualifying exam for the International Mathematics Olympiad. The previous best model, GPT-4o, solved only 13% of the tasks correctly, while o1 scored 83%.
Thinking in processes
OpenAI o1 works with a chain of thinking steps: it talks itself through a series of considerations as it tries to solve a problem, correcting itself as it goes.
OpenAI has chosen to keep the details of any given chain of thought hidden, partly because the chain might contain errors or socially unacceptable “thoughts”, and partly to protect company secrets about how the model works. Instead, o1 presents the user with a reconstructed summary of its reasoning alongside its answers. It is unclear, White says, whether the full chain of thought, if revealed, would bear any resemblance to human thinking.
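For developers who do have access, querying an o1 model looks much like any other chat request; the reasoning happens out of sight on OpenAI's servers. Below is a minimal Python sketch, assuming the official OpenAI SDK and the o1-preview model name; the physics prompt is purely illustrative.

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{
        "role": "user",
        "content": (
            "Unpolarized light passes through two polarizers whose axes "
            "differ by 30 degrees. What fraction of the intensity emerges?"
        ),
    }],
)

# Only the final answer (and, in ChatGPT, a reconstructed summary of the
# reasoning) is exposed; the raw chain of thought stays hidden.
print(response.choices[0].message.content)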
The new abilities also have their downsides. OpenAI reports that it has received anecdotal feedback that o1 models “hallucinate” — invent false answers — more frequently than their predecessors (although the company's internal testing for o1 showed slightly lower hallucination rates).
Red Team scientists noted numerous ways in which o1 was helpful in developing protocols for scientific experiments, but OpenAI says testers also "highlighted a lack of safety information about harmful steps, such as not highlighting explosion hazards or suggesting inappropriate chemical safety methods, indicating the model's inadequacy when it comes to safety-critical tasks."
“It’s still not perfect or reliable enough to not need scrutiny,” White says. He adds that o1 is better suited to leading experts than to beginners. “For a beginner, it is beyond their immediate ability to look at a log generated by o1 and realize that it is nonsense,” he says.
Science problem solver
Krenn believes o1 will accelerate science by helping to scan the literature, identify gaps and suggest interesting avenues for future research. He has integrated o1 into a tool he co-developed that does just this, called SciMuse2. “It generates much more interesting ideas than GPT-4 or GPT-4o,” he says.
Kyle Kabasares, a data scientist at the Bay Area Environmental Research Institute in Moffett Field, California, used o1 to reproduce some of the code from his doctoral project, which calculated the mass of black holes. “I was just blown away,” he says, noting that o1 needed about an hour to accomplish what had taken him many months.
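Kabasares's actual analysis code is not public; as a rough illustration of the kind of calculation involved, the Python sketch below estimates the central mass enclosed by orbiting gas from its circular speed and radius, using the standard dynamical relation M = v^2 * r / G. This is a textbook approximation, not his method.

# Hedged illustration: dynamical mass estimate M = v^2 * r / G.
# A textbook approximation, not Kabasares's actual analysis code.
G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30     # solar mass, kg
PARSEC = 3.086e16    # metres per parsec

def enclosed_mass_msun(v_kms: float, r_pc: float) -> float:
    """Mass (in solar masses) inside radius r_pc (parsecs) for gas
    on circular orbits at speed v_kms (km/s), assuming Keplerian rotation."""
    v = v_kms * 1e3       # km/s -> m/s
    r = r_pc * PARSEC     # pc  -> m
    return v**2 * r / G / M_SUN

# Gas rotating at 400 km/s at 10 pc from the centre implies a
# central mass of roughly 4e8 solar masses.
print(f"{enclosed_mass_msun(400.0, 10.0):.2e}")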
Catherine Brownstein, a geneticist at Boston Children's Hospital in Massachusetts, says the hospital is currently testing several AI systems, including o1-preview, for applications such as uncovering connections between patient characteristics and rare disease genes. She says o1 “is more accurate and offers options that I didn’t think were possible from a chatbot.”
References
1. Rein, D. et al. Preprint at arXiv https://doi.org/10.48550/arXiv.2311.12022 (2023).
2. Gu, X. & Krenn, M. Preprint at arXiv https://doi.org/10.48550/arXiv.2405.17044 (2024).