
OpenAI's FrontierScience Benchmark Tests AI Research Capabilities

OpenAI's FrontierScience benchmark reveals the gap between structured problems and real research work. GPT-5.2 scores 77% on Olympiad tasks but only 25% on open-ended research—key insights for AI support teams.

Inkeep Team

What's New

OpenAI released FrontierScience, a benchmark that tests whether AI models can perform PhD-level scientific research tasks—not just answer multiple choice questions.

GPT-5.2 scored 77% on Olympiad-style problems and 25% on open-ended research tasks, outperforming Claude Opus 4.5 and Gemini 3 Pro.

Why It Matters

This benchmark reveals the gap between "knowing facts" and "doing real work."

The 52-point spread between structured problems (77%) and research tasks (25%) shows that even frontier models struggle when problems aren't neatly defined. For teams building AI assistants, this matters: your users don't ask textbook questions.

The research track uses 10-point rubrics graded by GPT-5 to evaluate reasoning quality, not just final answers. This approach—assessing intermediate steps, not just outputs—offers a template for evaluating any AI system handling complex queries.

The Big Picture

AI benchmarks are shifting from "can it pass the test" to "can it do the job."

When GPQA launched in November 2023, GPT-4 scored 39% on PhD-level science questions. Two years later, GPT-5.2 hit 92%. That benchmark is now saturated.

FrontierScience represents a new generation of evaluations: 700+ questions written by 42 international Olympiad medalists and 45 PhD scientists across physics, chemistry, and biology. The research tasks mimic what scientists actually encounter—multi-step problems requiring synthesis across domains.

Go Deeper: Implications for AI-Powered Support

The FrontierScience methodology holds lessons for anyone building or evaluating AI assistants.

Rubric-based grading scales evaluation. OpenAI couldn't hire experts to grade every response, so they designed rubrics with "objectively assessable items" that a model grader could check. Teams evaluating support AI face the same constraint. Breaking quality into discrete, verifiable components—did the response cite a source? Did it address the user's actual question?—makes automated QA practical.
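As a rough sketch of what that looks like in practice, here's a minimal rubric checker in Python. The rubric items, point weights, and string-matching checks are illustrative placeholders, not OpenAI's rubric; a production grader would swap in an LLM judge or stricter per-item heuristics.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricItem:
    """One objectively assessable component of response quality."""
    description: str
    points: int
    check: Callable[[str, str], bool]  # (query, response) -> pass / fail

# Illustrative rubric for a support answer; real items would be product-specific.
RUBRIC = [
    RubricItem("Cites at least one source", 3,
               lambda q, r: "http" in r),
    RubricItem("Addresses terms from the user's question", 4,
               lambda q, r: any(w in r.lower() for w in q.lower().split() if len(w) > 4)),
    RubricItem("Offers a concrete next step", 3,
               lambda q, r: any(w in r.lower() for w in ("try", "run", "click", "go to"))),
]

def grade(query: str, response: str) -> float:
    """Return the fraction of rubric points earned (0.0 to 1.0)."""
    earned = sum(item.points for item in RUBRIC if item.check(query, response))
    return earned / sum(item.points for item in RUBRIC)

if __name__ == "__main__":
    score = grade(
        "How do I rotate my API key?",
        "Go to Settings > API Keys and click Rotate. Docs: https://docs.example.com/keys",
    )
    print(f"Rubric score: {score:.0%}")  # 100% on this toy example
```

The point isn't the specific checks; it's that each item is cheap to verify automatically, so you can grade every response rather than a hand-reviewed sample.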

Structured vs. open-ended performance diverges. The 52-point gap between Olympiad (77%) and Research (25%) scores shows that models handle constrained problems far better than ambiguous ones. Support queries span both: some customers ask specific how-to questions, others describe vague symptoms. Understanding where your AI excels—and fails—lets you route appropriately.
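To make the routing idea concrete, here's a toy example. The keyword heuristic and the two destinations are stand-ins; in practice you'd use a trained classifier or an LLM judge and your own routing targets.

```python
STRUCTURED_MARKERS = ("how do i", "where is", "what is the", "error code")

def classify_query(query: str) -> str:
    """Crude heuristic: short, specific how-to phrasing -> 'structured';
    everything else -> 'open_ended'. A real system would use a trained
    classifier or an LLM judge instead of keyword matching."""
    q = query.lower()
    if any(marker in q for marker in STRUCTURED_MARKERS) and len(q.split()) < 25:
        return "structured"
    return "open_ended"

def route(query: str) -> str:
    """Send structured queries to the AI assistant; flag ambiguous ones for people."""
    return "ai_assistant" if classify_query(query) == "structured" else "human_review"

print(route("How do I reset my password?"))                               # ai_assistant
print(route("Things have felt slow and flaky since last week's deploy"))  # human_review
```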

Selection bias affects benchmarks. OpenAI notes they "discarded tasks that models successfully got right" during development, biasing results against their own models. Any evaluation process that iterates on failures will overfit to weaknesses. For support teams measuring resolution rates, the same risk applies: testing only on escalated tickets skews your picture of overall performance.
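One way to avoid that skew when assembling an evaluation set is to sample proportionally across outcomes rather than only from escalations. A minimal sketch, assuming a hypothetical ticket log with an "outcome" field:

```python
import random

def sample_eval_set(tickets: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Draw an evaluation sample proportional to each outcome's share of
    all tickets, instead of sampling only escalations (which overweights
    the hardest cases and skews measured resolution rates)."""
    random.seed(seed)
    by_outcome: dict[str, list[dict]] = {}
    for t in tickets:
        by_outcome.setdefault(t["outcome"], []).append(t)
    sample: list[dict] = []
    for group in by_outcome.values():
        k = max(1, round(n * len(group) / len(tickets)))
        sample.extend(random.sample(group, min(k, len(group))))
    return sample

# Hypothetical log: 80% of tickets auto-resolve, 20% escalate.
tickets = [{"id": i, "outcome": "escalated" if i % 5 == 0 else "auto_resolved"}
           for i in range(100)]
print(len(sample_eval_set(tickets, n=20)))  # ~20 tickets, mostly auto-resolved
```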

Reasoning effort correlates with accuracy. GPT-5.2 at "xhigh" reasoning effort scored 77% on Olympiad tasks; at "low" effort, it dropped to 67.5%. For latency-sensitive support applications, this tradeoff matters. Not every query needs maximum compute.
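How you act on that depends on your provider's API, but the routing decision itself can be simple. The sketch below picks an effort level per query; the intent names, tier logic, and effort labels are assumptions you'd replace with your own taxonomy and whatever settings your model actually accepts.

```python
LOW_EFFORT_INTENTS = {"password_reset", "billing_address", "plan_pricing"}

def pick_reasoning_effort(intent: str, user_tier: str) -> str:
    """Choose how much 'thinking time' to request per query. The labels
    mirror the effort levels cited above ('low' ... 'xhigh'); map them to
    whatever your model provider actually exposes."""
    if intent in LOW_EFFORT_INTENTS:
        return "low"     # routine lookups: favor latency
    if user_tier == "enterprise":
        return "xhigh"   # high-stakes, complex accounts: favor accuracy
    return "medium"

print(pick_reasoning_effort("password_reset", "free"))        # low
print(pick_reasoning_effort("data_migration", "enterprise"))  # xhigh
```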

Failure modes are instructive. OpenAI's transcript analysis found models made "reasoning, logic, and calculation errors, didn't understand niche scientific concepts, and made factual inaccuracies." These failure categories—reasoning errors vs. knowledge gaps vs. hallucinations—require different mitigations. RAG helps with knowledge gaps; it doesn't fix reasoning errors.
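A lightweight way to make that distinction operational is to tag each failure with a category and map categories to mitigations. The taxonomy and mitigation strings below are illustrative, drawn from the failure types OpenAI describes plus common practice, not a prescribed fix list.

```python
from enum import Enum

class FailureMode(Enum):
    REASONING_ERROR = "reasoning_error"  # wrong logic or calculation
    KNOWLEDGE_GAP = "knowledge_gap"      # missing or niche concept
    HALLUCINATION = "hallucination"      # confident factual inaccuracy

# Each category calls for a different mitigation; RAG only addresses one of them.
MITIGATIONS = {
    FailureMode.REASONING_ERROR: "decompose the task and add verification steps",
    FailureMode.KNOWLEDGE_GAP: "retrieval over your own docs (RAG)",
    FailureMode.HALLUCINATION: "require citations and ground answers in sources",
}

def triage(mode: FailureMode) -> str:
    """Map a tagged failure to the mitigation worth trying first."""
    return MITIGATIONS[mode]

print(triage(FailureMode.KNOWLEDGE_GAP))  # retrieval over your own docs (RAG)
```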

What's Next

For teams building AI support systems:

  • Audit your evaluation approach. Are you measuring final answers only, or intermediate reasoning? Rubric-based grading catches failure modes that binary pass/fail misses.

  • Map your query complexity. Identify which support questions are "Olympiad-style" (structured, clear answers) vs. "Research-style" (open-ended, multi-step). Set expectations accordingly.

  • Track the reasoning-latency tradeoff. As models offer configurable "thinking time," you'll need to decide which queries justify the cost.

OpenAI plans to iterate on FrontierScience and expand to new domains. For now, the benchmark confirms what practitioners already know: AI handles structured tasks well but struggles with the messy, ambiguous problems that define real work. Building systems that acknowledge these limits—through citations, confidence signals, and human escalation paths—remains the practical path forward.
