In a groundbreaking study, Google DeepMind has unveiled an artificial intelligence system that outperforms human fact-checkers in assessing the accuracy of information produced by large language models. The system, known as the Search-Augmented Factuality Evaluator (SAFE), uses a multi-step process to break text into individual claims and verify each one against Google Search results.
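That multi-step process can be summarized as: split a long response into its individual facts, issue search queries for each one, and judge whether the results support the claim. The sketch below is a simplified illustration of such a pipeline, not DeepMind's released code; the call_llm and search_google helpers, the prompt wording, and the verdict parsing are all assumptions made for the example.

```python
# Illustrative sketch of a SAFE-style pipeline (not DeepMind's implementation).
# `call_llm` and `search_google` are hypothetical helpers standing in for a
# language-model API and a Google Search client.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    raise NotImplementedError

def search_google(query: str) -> list[str]:
    """Placeholder for a Google Search call returning result snippets."""
    raise NotImplementedError

def split_into_facts(response: str) -> list[str]:
    """Ask the model to list the individual factual claims in a response."""
    listing = call_llm(
        "List each individual factual claim in the following text, "
        f"one per line:\n\n{response}"
    )
    return [line.strip() for line in listing.splitlines() if line.strip()]

def rate_fact(fact: str) -> str:
    """Issue a search query for one claim and judge it against the results."""
    query = call_llm(f"Write a Google Search query to verify this claim: {fact}")
    snippets = "\n".join(search_google(query))
    verdict = call_llm(
        f"Claim: {fact}\nSearch results:\n{snippets}\n"
        "Answer 'supported' or 'not supported'."
    )
    return "not supported" if "not supported" in verdict.lower() else "supported"

def evaluate_response(response: str) -> dict:
    """Run the multi-step check and tally supported vs. unsupported claims."""
    facts = split_into_facts(response)
    ratings = {fact: rate_fact(fact) for fact in facts}
    supported = sum(1 for r in ratings.values() if r == "supported")
    return {"ratings": ratings, "supported": supported, "total": len(facts)}
```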
Evaluating Superhuman Performance
In the study, titled "Long-form factuality in large language models" and published on arXiv, SAFE's ratings matched human annotators' 72% of the time, and in a sample of cases where the two disagreed, SAFE's judgment was found to be correct 76% of the time. Nevertheless, the label of "superhuman" performance is sparking lively discussion, with some experts pointing out that the comparison was made against crowdworkers rather than expert fact-checkers.
Cost-Effective Verification
One of SAFE's significant advantages is its cost-effectiveness. The study revealed that using SAFE was approximately 20 times cheaper than employing human fact-checkers. As language models generate ever-greater volumes of text, an affordable and scalable way to verify their claims becomes increasingly important.
Benchmarking Top Language Models
The DeepMind team used SAFE to evaluate the factual accuracy of 13 leading language models from four families (Gemini, GPT, Claude, and PaLM-2) on the LongFact benchmark. Larger models generally made fewer factual errors, yet even the top-performing models still produced a significant number of false claims. This underscores the importance of automatic fact-checking tools in mitigating the risks of misinformation.
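To compare models on a benchmark like LongFact, the per-fact verdicts have to be rolled up into a single score per response and per model. The paper does this with an F1@K-style metric that balances the share of supported facts against a target number of facts K; the snippet below is a simplified reading of that idea, and the verdict counts and choice of K are illustrative assumptions, not figures from the study.

```python
# Simplified roll-up of per-fact verdicts into a model-level factuality score.
# The verdict counts and K value below are made-up examples for illustration.

def factuality_score(supported: int, total: int, k: int = 64) -> float:
    """Balance the share of supported facts against a target fact count K."""
    if total == 0 or supported == 0:
        return 0.0
    precision = supported / total          # how many provided facts held up
    recall_at_k = min(supported / k, 1.0)  # did the model supply enough supported facts?
    return 2 * precision * recall_at_k / (precision + recall_at_k)

# Hypothetical verdict counts for two models on a single prompt.
results = {
    "model_a": {"supported": 52, "total": 60},
    "model_b": {"supported": 30, "total": 45},
}

for model, counts in results.items():
    print(f"{model}: {factuality_score(counts['supported'], counts['total']):.3f}")
```

Under this kind of scoring, a model is rewarded both for keeping its claims accurate and for providing enough supported detail, which is why larger, more capable models tend to come out ahead.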
Prioritizing Transparency and Accountability
While the SAFE code and LongFact dataset have been made available for scrutiny on GitHub, further transparency is needed regarding the human baselines used in the study. Understanding the qualifications and annotation process of the crowdworkers who provided the human ratings is essential for accurately assessing SAFE's capabilities.