Researchers to assess how AI "scientists" reason about error
Summary
A University of Exeter researcher has secured Leverhulme funding for a four-year project to build a theory of scientific error, assemble a database of error types and strategies, and create benchmarks to test how AI systems reason about experimental error.
Content
Stephan Guttinger at the University of Exeter has received a Research Leadership Award from the Leverhulme Trust for a four-year project on the reasoning abilities of AI "scientists." These AI scientists are software agents that can propose ideas, review literature, write code, run experiments and draft papers. The project brings together philosophers, natural scientists and computer scientists to address how these systems detect and respond to error. Guttinger notes that much of scientists' everyday error-handling happens in informal settings and is underrepresented in the data used to train AI.
Key facts:
- The project is funded by a Leverhulme Research Leadership Award and is planned to run for four years.
- The team will develop a detailed theory of scientific error and compile a systematic database of error types and the strategies researchers use to address them.
- Two benchmarks will be produced: a traditional benchmark with more than 500 question-and-answer pairs for isolated AI agents, and a separate benchmark designed to assess human–AI teams (see the illustrative sketch after this list).
- The effort responds to a gap in current AI evaluation: existing benchmarks do not systematically test for scientific error-reasoning.
- The work aims to create conceptual, mathematical and data tools to assess how AI agents handle error, whether independently or in collaboration with humans.
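To make the question-and-answer benchmark idea concrete, here is a minimal sketch of what one entry and a naive scoring routine might look like. The field names, the ErrorBenchmarkItem class and the grade_agent function are illustrative assumptions for this article, not the project's published design, and a real benchmark would likely use expert or model-based grading rather than exact string matching.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch of one benchmark entry; the schema is an assumption,
# not the project's published format.
@dataclass
class ErrorBenchmarkItem:
    scenario: str          # short description of an experimental setup
    question: str          # e.g. "What kind of error could explain this result?"
    reference_answer: str  # expert-written answer used for scoring
    error_type: str        # category drawn from the planned error database

def grade_agent(agent: Callable[[str, str], str],
                items: List[ErrorBenchmarkItem]) -> float:
    """Score an isolated AI agent by naive string match against reference answers."""
    correct = 0
    for item in items:
        answer = agent(item.scenario, item.question)
        if answer.strip().lower() == item.reference_answer.strip().lower():
            correct += 1
    return correct / len(items) if items else 0.0

# Example usage with a trivial stand-in agent.
if __name__ == "__main__":
    items = [
        ErrorBenchmarkItem(
            scenario="A western blot shows a band at an unexpected molecular weight.",
            question="Name one error that could explain the result.",
            reference_answer="antibody cross-reactivity",
            error_type="reagent error",
        )
    ]
    dummy_agent = lambda scenario, question: "antibody cross-reactivity"
    print(f"Accuracy: {grade_agent(dummy_agent, items):.2f}")
```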
Summary:
The project aims to establish methods and data for evaluating how AI-driven research agents identify and work through errors in scientific practice, with the intention of informing the reliable development of such systems. Over the next four years the team will build a theory of scientific error, assemble a database of error types and strategies, and produce two benchmarks to test isolated AI agents and human–AI collaborations. Initial work will focus on assembling the interdisciplinary team and beginning construction of the theory and database.
