OpenAI launches program to design new domain-specific AI benchmarks

OpenAI, like many AI labs, thinks AI benchmarks are broken. It says it wants to fix them through a new program.

Called the OpenAI Pioneers Program, the initiative will focus on creating evaluations for AI models that “set the bar for what good looks like,” as OpenAI phrased it in a blog post.

“As the pace of AI adoption accelerates across industries, there is a need to understand and improve its impact in the world,” the company continued in its post. “Creating domain-specific evals are one way to better reflect real-world use cases, helping teams assess model performance in practical, high-stakes environments.”

As the recent controversy involving the crowdsourced benchmark LM Arena and Meta’s Maverick model illustrates, it’s tough to know these days precisely what differentiates one model from another. Many widely used AI benchmarks measure performance on esoteric tasks, like solving doctorate-level math problems. Others can be gamed, or don’t align well with most people’s preferences.

Through the Pioneers Program, OpenAI hopes to create benchmarks for specific domains like legal, finance, insurance, healthcare, and accounting. The lab says that, in the coming months, it’ll work with “multiple companies” to design tailored benchmarks and eventually share those benchmarks publicly, along with “industry-specific” evaluations.

“The first cohort will focus on startups who will help lay the foundations of the OpenAI Pioneers Program,” OpenAI wrote in the blog post. “We’re selecting a handful of startups for this initial cohort, each working on high-value, applied use cases where AI can drive real-world impact.”

Companies in the program will also have the opportunity to work with OpenAI’s team to improve models via reinforcement fine-tuning, a technique that optimizes models for a narrow set of tasks, OpenAI says.

The big question is whether the AI community will embrace benchmarks whose creation was funded by OpenAI. OpenAI has supported benchmarking efforts financially before, and designed its own evaluations. But partnering with customers to release AI tests may be seen as an ethical bridge too far.
