A recent study has uncovered serious flaws in the tests used to evaluate the safety and effectiveness of artificial intelligence, The Guardian reports.
Researchers from the UK’s AI Security Institute, along with experts from Stanford, Berkeley, and Oxford universities, analyzed over 440 tests assessing AI safety.
They found flaws that, in their view, undermine the credibility of the results: nearly all of the reviewed tests have "weaknesses in at least one area," and their outcomes may be "irrelevant or even misleading."
Many of these tests are used to assess new AI models developed by major tech companies, as highlighted by researcher Andrew Bean from the Oxford Internet Institute.
In the absence of national AI regulations in the UK and the US, these tests are employed to determine whether new models are safe, align with public interests, and achieve claimed capabilities in reasoning, mathematics, and coding.
"Tests form the foundation of nearly all claims of progress in artificial intelligence. However, without unified definitions and reliable measurement methods, it’s challenging to determine whether models are genuinely improving or merely appearing to do so," emphasized Bean.
The study examined only publicly available tests; leading AI firms also maintain internal benchmarks that were not scrutinized.
Bean noted that "a shocking finding was that only a small minority (16%) of tests used uncertainty estimates or statistical methods to indicate how likely their results are to be accurate. In other cases, where criteria were set to evaluate AI characteristics, including its ‘harmlessness,’ the definitions were often contested or vague, which diminished the usefulness of the tests."
The study concludes that there is an "urgent need for shared standards and best practices" regarding AI.
