AI fails to match top mathematicians in landmark research-level test

Elite maths trial shows AI still falls short of top human problem-solvers

Four leading AI models have fallen short of top human mathematicians when put to the ultimate test - real, unpublished research problems.

Dubai: Artificial intelligence (AI) has been on a run of late, seeking to prove it can outpace humans by solving decades-old puzzles, mastering games, and writing code at superhuman speed. But when it comes to the cutting edge of mathematical research, humans still have the upper hand.

That's the verdict from First Proof, a new project that has put four AI systems through what may be the most demanding maths test ever devised for machines.

The test posed ten genuine research-level problems - questions that working mathematicians had recently solved but not yet published.

A panel of anonymous experts in each relevant field then assessed the AI responses. The results were published on the First Proof website on June 10, and they were clear - not one model matched the standard of a top human mathematician.

This was the first test of its kind to satisfy three key conditions at once: research-level questions, problems absent from training data, and formal grading by expert mathematicians.

According to Nature, previous AI benchmarks have often been criticised for using problems that models may have encountered during training, meaning a high score could reflect memorisation rather than genuine reasoning.

First Proof closed that loophole by sourcing problems directly from researchers' unpublished work, making it virtually impossible for the models to have seen them before.

Who was in the running?

OpenAI was the only major tech company to participate with a commercially available model, its ChatGPT 5.5 Pro. The remaining three systems came from academic groups at UCLA, Princeton University, and ETH Zurich.

Notably absent were Google's Aletheia, a system designed specifically for mathematical problem-solving, and the full, unreleased version of Anthropic's Claude Mythos. First Proof required independent verification that no human assistance was involved; since that could not be confirmed for those models, they could not be officially entered.

The models also displayed a familiar weakness - hallucination. Even when explicitly instructed to double-check their references, the AI systems produced factually incorrect outputs, a persistent issue with large language models that becomes particularly damaging in a field where precision is everything.

AI shows promise but mathematicians remain ahead

In May, an OpenAI chatbot made headlines by cracking an 80-year-old problem posed by the late Hungarian mathematician Paul Erdős, a genuine feat that sparked excitement about AI's potential in pure mathematics.

First Proof's results don't take the shine off that achievement, but they do put it in perspective - cracking an old puzzle is a very different thing from solving a brand-new research problem.

First Proof conducted an initial trial in February using a different set of questions. Anyone was free to evaluate their preferred AI system, and many people did, but the outcomes were never formally validated. That changed in the June round, which introduced proper oversight to ensure the comparison was meaningful.

The First Proof team sees future rounds of the test as a valuable tool for the mathematical community - not to determine whether AI can replace mathematicians but to understand how it might assist them.

Potential roles include checking proofs for errors, suggesting avenues of inquiry, and eventually tackling problems autonomously in narrow domains. For now, though, the message from the world's hardest maths test is unambiguous - the humans are still winning.