June 5, 2025 | Redmond, WA – Microsoft Research today unveiled BenchmarkQED, a new automated benchmarking tool designed to evaluate and compare retrieval-augmented generation (RAG) systems more accurately and reliably. The open-source platform is now available on GitHub and ships with modular components for query generation, answer evaluation, and data preparation.
The tool allows developers and researchers to evaluate RAG system performance across datasets and quality metrics without requiring bespoke evaluation setups.
What BenchmarkQED Provides for RAG Benchmarking
BenchmarkQED facilitates automatic testing of large language model (LLM)-based QA systems, especially for answering questions over private or large-scale datasets. It complements Microsoft’s earlier GraphRAG library and consists of three primary components:
AutoQ: An automated query-generation module that synthesizes queries along two axes: source (data-driven vs. activity-driven) and scope (local vs. global).
AutoE: An automatic evaluation system that uses GPT-4.1 to compare answer quality along dimensions such as comprehensiveness, diversity, empowerment, and relevance.
AutoD: A data-sampling module that normalizes datasets to a uniform structure so benchmarking comparisons are fairer.
The evaluation suite is designed to make testing of novel RAG approaches reproducible; a conceptual sketch of this pipeline follows.
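For readers who want a concrete picture of what such a suite automates, below is a minimal, illustrative sketch of the end-to-end loop described above: sample a corpus, synthesize queries, run two RAG systems, and tally judge verdicts. It does not use BenchmarkQED’s actual API; the function names and signatures (prepare_dataset, generate_queries, judge_answers, run_benchmark) are assumptions made purely for illustration.

```python
# Illustrative sketch of an automated RAG benchmarking loop.
# NOTE: these functions are hypothetical stand-ins, not BenchmarkQED's API.

from typing import Callable, Dict, List

def prepare_dataset(raw_docs: List[str], target_size: int) -> List[str]:
    """Stand-in for an AutoD-style sampler: trims the corpus to a
    uniform size so systems are compared on equal footing."""
    return raw_docs[:target_size]

def generate_queries(docs: List[str], per_class: int) -> Dict[str, List[str]]:
    """Stand-in for an AutoQ-style generator: in the real tool an LLM
    synthesizes queries; here we return placeholders per query class."""
    classes = ["data-local", "activity-local", "data-global", "activity-global"]
    return {c: [f"[{c} query {i}]" for i in range(per_class)] for c in classes}

def judge_answers(query: str, answer_a: str, answer_b: str) -> str:
    """Stand-in for an AutoE-style LLM judge: returns 'a', 'b', or 'tie'.
    A real judge would prompt a model such as GPT-4.1 with scoring criteria."""
    return "tie"  # placeholder verdict

def run_benchmark(docs: List[str],
                  rag_a: Callable[[str], str],
                  rag_b: Callable[[str], str]) -> Dict[str, Dict[str, int]]:
    """Runs both RAG systems on every query and tallies judge verdicts."""
    corpus = prepare_dataset(docs, target_size=1000)
    queries = generate_queries(corpus, per_class=50)
    results = {}
    for query_class, qs in queries.items():
        tally = {"a": 0, "b": 0, "tie": 0}
        for q in qs:
            verdict = judge_answers(q, rag_a(q), rag_b(q))
            tally[verdict] += 1
        results[query_class] = tally
    return results
```

With 50 queries per class, this sketch mirrors the 200-query, four-class setup described in the benchmark results below.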
LazyGraphRAG Performs Best in BenchmarkQED Tests
In comparative benchmarks, Microsoft’s LazyGraphRAG model achieved substantially higher win rates across all query classes than conventional vector-based RAG systems.
Benchmarks were performed using 1,397 health-related articles from the AP News dataset.
A total of 200 synthetic queries were generated with AutoQ, evenly distributed across the four predefined query categories.
In head-to-head evaluations across 96 distinct testing setups, LazyGraphRAG consistently delivered superior results, even when benchmarked against configurations leveraging context windows as large as one million tokens in vector-based retrieval systems.
According to the findings, LazyGraphRAG was particularly strong at producing comprehensive and diverse responses for both local and global queries. Its advantage held even against configurations with longer context windows and best-practice chunking mechanisms.
“This is an important advance in benchmarking generative QA systems beyond simple retrieval. BenchmarkQED brings together disciplined methodology and expansive reach,” noted Darren Edge, a Senior Director at Microsoft Research.
How AutoQ Augments Query Testing In BenchmarkQED
The AutoQ module generates synthetic queries in four categories:
- Data-local – Questions about specific content in the dataset (e.g., “Why are junior doctors going on strike in South Korea?”)
- Activity-local – Local questions framed around user tasks and activities (e.g., the public health consequences of emerging viruses)
- Data-global – Questions about content themes spanning the whole dataset
- Activity-global – Questions about larger-scale activities and initiatives across the dataset
Together, these categories probe different capabilities of RAG systems and provide a structured basis for comparison, as the sketch below illustrates.
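As a rough illustration of this two-by-two taxonomy (query source crossed with query scope), the snippet below models the four classes as simple data. The structure is an assumption made for illustration, not AutoQ’s internal representation, and the example questions for the two global classes are invented; only the local examples come from the article.

```python
# Illustrative 2x2 query taxonomy: source (data vs. activity) x scope (local vs. global).
# This structure is an assumption for illustration; it is not AutoQ's internal format.

from dataclasses import dataclass

@dataclass(frozen=True)
class QueryClass:
    source: str   # "data" (derived from corpus content) or "activity" (derived from user tasks)
    scope: str    # "local" (specific details) or "global" (dataset-wide themes)
    example: str  # example question (the global examples below are invented)

QUERY_CLASSES = [
    QueryClass("data", "local",
               "Why are junior doctors going on strike in South Korea?"),
    QueryClass("activity", "local",
               "What are the public health consequences of an emerging virus?"),
    QueryClass("data", "global",
               "What recurring themes appear across the health news articles?"),
    QueryClass("activity", "global",
               "Which large-scale public health initiatives does the dataset describe?"),
]

for qc in QUERY_CLASSES:
    print(f"{qc.source}-{qc.scope}: {qc.example}")
```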
LLMs as Judges: AutoE in Action
BenchmarkQED integrates a novel evaluation method within AutoE, leveraging GPT-4.1 as an autonomous assessor to gauge the depth and precision of responses. The evaluation process includes:
- Pairwise comparison of answers across multiple trials
- Scoring on four core metrics (comprehensiveness, diversity, empowerment, and relevance)
- Aggregating results into win rates for comparison
LLM-based evaluation at this scale enables finer-grained benchmarking, especially for global reasoning tasks, where classical RAG tends to falter; the sketch below shows how pairwise verdicts roll up into win rates.
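To make the win-rate idea concrete, here is a small sketch of how pairwise judge verdicts could be aggregated into a win rate per query class. The input format and the convention of counting ties as half a win are assumptions for illustration, not AutoE’s actual implementation.

```python
# Aggregating pairwise LLM-judge verdicts into win rates per query class.
# The input format and tie-handling convention are illustrative assumptions.

from typing import Dict, List

def win_rate(verdicts: List[str], system: str = "a") -> float:
    """Fraction of trials won by `system`; ties count as half a win
    (a common convention, assumed here rather than taken from AutoE)."""
    if not verdicts:
        return 0.0
    wins = sum(1.0 for v in verdicts if v == system)
    ties = sum(0.5 for v in verdicts if v == "tie")
    return (wins + ties) / len(verdicts)

def summarize(results: Dict[str, List[str]]) -> Dict[str, float]:
    """Win rate of system 'a' for each query class."""
    return {query_class: win_rate(vs) for query_class, vs in results.items()}

# Example: hypothetical verdicts over 4 trials per class.
example = {
    "data-local":      ["a", "a", "tie", "b"],
    "activity-local":  ["a", "b", "a", "a"],
    "data-global":     ["a", "a", "a", "tie"],
    "activity-global": ["b", "a", "tie", "a"],
}
print(summarize(example))  # e.g. {'data-local': 0.625, ...}
```

Under this convention, a win rate above 0.5 in every class corresponds to the “over 50% win rates for all types of queries” reported for LazyGraphRAG below.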
External and Open-Source Datasets Facilitate Community Adoption
To meet expanding demand, Microsoft has also published an updated dataset of transcripts from its Behind the Tech podcast and an open health-news test dataset. Both are available directly in the BenchmarkQED repository.
These datasets enable the broader research community to reproduce Microsoft’s experiments and apply BenchmarkQED to other areas.
Broader Implications for AI Benchmarking and Development
BenchmarkQED has the potential to become a go-to tool for RAG evaluation, especially for developers looking for accurate, scalable ways to validate AI-generated responses.
The launch comes as demand for enterprise-grade QA solutions increases, particularly in sectors such as healthcare, legal research, and enterprise search.
Industry experts expect the framework to influence AI model selection, tuning, and deployment strategies in both academia and the enterprise.
“The capability to analyze generative systems beyond retrieval accuracy will be crucial to establishing trust in AI outputs,” explained Ha Trinh, Senior Data Scientist at Microsoft.
FAQs
1. What is Microsoft’s BenchmarkQED?
BenchmarkQED is an open-source RAG benchmarking tool from Microsoft Research. It automatically tests retrieval-augmented generation (RAG) systems using its query generation (AutoQ), answer evaluation (AutoE), and dataset sampling (AutoD) modules.
2. How does BenchmarkQED evaluate RAG systems?
BenchmarkQED uses LLMs like GPT-4.1 to assess answer quality based on metrics such as relevance, comprehensiveness, diversity, and empowerment. It compares generated answers across query types and datasets to produce reproducible benchmark results.
3. What makes LazyGraphRAG outperform other RAG systems?
LazyGraphRAG outperforms standard vector-based retrieval by combining concise text chunks with a graph-structured, query-time retrieval process rather than relying on a flat vector index alone. In BenchmarkQED tests, it achieved win rates above 50% across all query classes, even against configurations with 1M-token context windows.
4. Can I use BenchmarkQED with my datasets?
Yes. BenchmarkQED can be customized to fit a developer’s or researcher’s needs: you can supply your own data to the AutoD module, generate structured queries with AutoQ, and evaluate answers with AutoE, all within a reproducible pipeline.
5. Where do I download or contribute to BenchmarkQED?
BenchmarkQED is hosted on GitHub. The repository contains the source code, sample datasets, evaluation instructions, and contribution guidelines for developers and researchers.