June 5, 2025 | Redmond, WA – Microsoft Research today unveiled BenchmarkQED, a new automated benchmarking tool designed to evaluate and compare retrieval-augmented generation (RAG) systems more accurately and reliably. The open-source platform is now available on GitHub and ships with modular components for query generation, answer evaluation, and data preparation.
The tool allows developers and researchers to evaluate RAG system performance across datasets and quality metrics without requiring bespoke evaluation setups.
What BenchmarkQED Provides for RAG Benchmarking
BenchmarkQED facilitates automatic testing of large language model (LLM)-based QA systems, especially for answering questions over private or large-scale datasets. It complements Microsoft’s earlier GraphRAG library and consists of three primary components:
AutoQ: An automated query-generation module that synthesizes queries along two axes: source (data-driven vs. activity-driven) and scope (local vs. global).
AutoE: An automatic evaluation system that uses GPT-4.1 to compare answer quality along dimensions such as comprehensiveness, diversity, empowerment, and relevance.
AutoD: A data-sampling module that normalizes datasets to a uniform structure so benchmarking comparisons are fairer.
The evaluation suite is designed to make testing of novel RAG approaches reproducible; a conceptual sketch of this pipeline follows.
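For readers who want a concrete picture of what such a suite automates, below is a minimal, illustrative sketch of the end-to-end loop described above: sample a corpus, synthesize queries, run two RAG systems, and tally judge verdicts. It does not use BenchmarkQED’s actual API; the function names and signatures (prepare_dataset, generate_queries, judge_answers, run_benchmark) are assumptions made purely for illustration.

```python
# Illustrative sketch of an automated RAG benchmarking loop.
# NOTE: these functions are hypothetical stand-ins, not BenchmarkQED's API.

from typing import Callable, Dict, List

def prepare_dataset(raw_docs: List[str], target_size: int) -> List[str]:
    """Stand-in for an AutoD-style sampler: trims the corpus to a
    uniform size so systems are compared on equal footing."""
    return raw_docs[:target_size]

def generate_queries(docs: List[str], per_class: int) -> Dict[str, List[str]]:
    """Stand-in for an AutoQ-style generator: in the real tool an LLM
    synthesizes queries; here we return placeholders per query class."""
    classes = ["data-local", "activity-local", "data-global", "activity-global"]
    return {c: [f"[{c} query {i}]" for i in range(per_class)] for c in classes}

def judge_answers(query: str, answer_a: str, answer_b: str) -> str:
    """Stand-in for an AutoE-style LLM judge: returns 'a', 'b', or 'tie'.
    A real judge would prompt a model such as GPT-4.1 with scoring criteria."""
    return "tie"  # placeholder verdict

def run_benchmark(docs: List[str],
                  rag_a: Callable[[str], str],
                  rag_b: Callable[[str], str]) -> Dict[str, Dict[str, int]]:
    """Runs both RAG systems on every query and tallies judge verdicts."""
    corpus = prepare_dataset(docs, target_size=1000)
    queries = generate_queries(corpus, per_class=50)
    results = {}
    for query_class, qs in queries.items():
        tally = {"a": 0, "b": 0, "tie": 0}
        for q in qs:
            verdict = judge_answers(q, rag_a(q), rag_b(q))
            tally[verdict] += 1
        results[query_class] = tally
    return results
```

With 50 queries per class, this sketch mirrors the 200-query, four-class setup described in the benchmark results below.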
LazyGraphRAG Performs Best in BenchmarkQED Tests
In comparative benchmarks, Microsoft’s LazyGraphRAG model achieved substantially higher win rates across all query classes than conventional vector-based RAG systems.
Benchmarks were performed using 1,397 health-related articles from the AP News dataset.
A total of 200 synthetic queries were generated with AutoQ, evenly distributed across the four predefined query categories.
In head-to-head evaluations across 96 distinct testing setups, LazyGraphRAG consistently delivered superior results, even when benchmarked against configurations leveraging context windows as large as one million tokens in vector-based retrieval systems.
According to the findings, LazyGraphRAG was particularly strong at producing comprehensive and diverse responses for both local and global queries. Its advantage held even against configurations with longer context windows and best-practice chunking mechanisms.
“This is an important advance in benchmarking generative QA systems beyond simple retrieval. BenchmarkQED brings together disciplined methodology and expansive reach,” noted Darren Edge, a Senior Director at Microsoft Research.
How AutoQ Augments Query Testing In BenchmarkQED
The AutoQ module generates synthetic queries in four categories:
- Data-local – Questions about specific content in the dataset (e.g., “Why are junior doctors going on strike in South Korea?”)
- Activity-local – Local questions framed around user tasks and activities (e.g., the public health consequences of emerging viruses)
- Data-global – Questions about content themes spanning the whole dataset
- Activity-global – Questions about larger-scale activities and initiatives across the dataset
Together, these categories probe different capabilities of RAG systems and provide a structured basis for comparison, as the sketch below illustrates.
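As a rough illustration of this two-by-two taxonomy (query source crossed with query scope), the snippet below models the four classes as simple data. The structure is an assumption made for illustration, not AutoQ’s internal representation, and the example questions for the two global classes are invented; only the local examples come from the article.

```python
# Illustrative 2x2 query taxonomy: source (data vs. activity) x scope (local vs. global).
# This structure is an assumption for illustration; it is not AutoQ's internal format.

from dataclasses import dataclass

@dataclass(frozen=True)
class QueryClass:
    source: str   # "data" (derived from corpus content) or "activity" (derived from user tasks)
    scope: str    # "local" (specific details) or "global" (dataset-wide themes)
    example: str  # example question (the global examples below are invented)

QUERY_CLASSES = [
    QueryClass("data", "local",
               "Why are junior doctors going on strike in South Korea?"),
    QueryClass("activity", "local",
               "What are the public health consequences of an emerging virus?"),
    QueryClass("data", "global",
               "What recurring themes appear across the health news articles?"),
    QueryClass("activity", "global",
               "Which large-scale public health initiatives does the dataset describe?"),
]

for qc in QUERY_CLASSES:
    print(f"{qc.source}-{qc.scope}: {qc.example}")
```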
LLMs as Judges: AutoE in Action
BenchmarkQED integrates a novel evaluation method within AutoE, leveraging GPT-4.1 as an autonomous assessor to gauge the depth and precision of responses. The evaluation process includes:
- Pairwise comparison of answers across multiple trials
- Scoring on four core metrics (comprehensiveness, diversity, empowerment, and relevance)
- Aggregating results into win rates for comparison
LLM-based evaluation at this scale enables finer-grained benchmarking, especially for global reasoning tasks, where classical RAG tends to falter; the sketch below shows how pairwise verdicts roll up into win rates.
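To make the win-rate idea concrete, here is a small sketch of how pairwise judge verdicts could be aggregated into a win rate per query class. The input format and the convention of counting ties as half a win are assumptions for illustration, not AutoE’s actual implementation.

```python
# Aggregating pairwise LLM-judge verdicts into win rates per query class.
# The input format and tie-handling convention are illustrative assumptions.

from typing import Dict, List

def win_rate(verdicts: List[str], system: str = "a") -> float:
    """Fraction of trials won by `system`; ties count as half a win
    (a common convention, assumed here rather than taken from AutoE)."""
    if not verdicts:
        return 0.0
    wins = sum(1.0 for v in verdicts if v == system)
    ties = sum(0.5 for v in verdicts if v == "tie")
    return (wins + ties) / len(verdicts)

def summarize(results: Dict[str, List[str]]) -> Dict[str, float]:
    """Win rate of system 'a' for each query class."""
    return {query_class: win_rate(vs) for query_class, vs in results.items()}

# Example: hypothetical verdicts over 4 trials per class.
example = {
    "data-local":      ["a", "a", "tie", "b"],
    "activity-local":  ["a", "b", "a", "a"],
    "data-global":     ["a", "a", "a", "tie"],
    "activity-global": ["b", "a", "tie", "a"],
}
print(summarize(example))  # e.g. {'data-local': 0.625, ...}
```

Under this convention, a win rate above 0.5 in every class corresponds to the “over 50% win rates for all types of queries” reported for LazyGraphRAG below.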
External and Open-Source Datasets Facilitate Community Adoption
To meet expanding demand, Microsoft has also published an updated dataset of transcripts from its Behind the Tech podcast and an open health-news test dataset. Both are available directly in the BenchmarkQED repository.
These datasets enable the broader research community to reproduce Microsoft’s experiments and apply BenchmarkQED to other areas.
Broader Implications for AI Benchmarking and Development
BenchmarkQED has the potential to become a go-to tool for RAG evaluation, especially for developers looking for accurate, scalable ways to validate AI-generated responses.
The launch comes as demand for enterprise-grade QA solutions increases, particularly in sectors such as healthcare, legal research, and enterprise search.
Industry experts expect the framework to influence AI model selection, tuning, and deployment strategies in both academia and the enterprise.
“The capability to analyze generative systems beyond retrieval accuracy will be crucial to establishing trust in AI outputs,” explained Ha Trinh, Senior Data Scientist at Microsoft.
FAQs
1. What is Microsoft’s BenchmarkQED?
BenchmarkQED is an open-source RAG benchmarking tool from Microsoft Research. It automatically tests retrieval-augmented generation (RAG) systems using its query generation (AutoQ), answer evaluation (AutoE), and dataset sampling (AutoD) modules.
2. How does BenchmarkQED evaluate RAG systems?
BenchmarkQED uses LLMs like GPT-4.1 to assess answer quality based on metrics such as relevance, comprehensiveness, diversity, and empowerment. It compares generated answers across query types and datasets to produce reproducible benchmark results.
3. What makes LazyGraphRAG outperform other RAG systems?
LazyGraphRAG outperforms standard vector-based retrieval by combining concise text chunks with a graph-structured, query-time retrieval process rather than relying on a flat vector index alone. In BenchmarkQED tests, it achieved win rates above 50% across all query classes, even against configurations with 1M-token context windows.
4. Can I use BenchmarkQED with my datasets?
Yes. BenchmarkQED can be customized to fit a developer’s or researcher’s needs: you can supply your own data to the AutoD module, generate structured queries with AutoQ, and evaluate answers with AutoE, all within a reproducible pipeline.
5. Where do I download or contribute to BenchmarkQED?
BenchmarkQED is hosted on GitHub. The repository contains the source code, sample datasets, evaluation instructions, and contribution guidelines for developers and researchers.