Modern Information Retrieval Evaluation In The RAG Era

Why this topic matters

Traditional IR benchmarks fall short for real-world RAG applications because of stale data, incomplete labels, and unrealistic queries. This talk introduces FreshStack, a new benchmark built from recent Stack Overflow and GitHub content and designed to reflect real programming queries.

Overview

Information Retrieval (IR) is not new

Traditional IR Evaluation

Examples

BEIR was created to provide a more realistic assessment of retrieval models: it comprises 18 datasets covering nine retrieval tasks, which span various domains, query types, and document formats.
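To make the evaluation setup concrete, here is a minimal sketch of nDCG@10, the headline metric BEIR reports, computed from a ranked list and a set of relevance judgments (qrels). The function names and the toy data are hypothetical, and this version uses linear gain; evaluation toolkits differ on details such as the gain function, so treat it as an illustration rather than a reference implementation. Note that any document missing from the qrels counts as non-relevant, which is exactly where incomplete labels can penalize a good retriever.

```python
import math

def dcg_at_k(gains, k):
    # Discounted cumulative gain: the gain at rank r (1-based) is divided by log2(r + 1).
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    # qrels maps doc_id -> graded relevance; unjudged documents are assumed non-relevant (0).
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg

def mean_ndcg(run, qrels_by_query, k=10):
    # run maps query_id -> ranked list of doc_ids; average over all judged queries.
    scores = [ndcg_at_k(run.get(qid, []), qrels, k)
              for qid, qrels in qrels_by_query.items()]
    return sum(scores) / len(scores) if scores else 0.0

# Toy example (hypothetical data): one query ranked perfectly, one with the relevant doc second.
qrels_by_query = {"q1": {"d1": 1}, "q2": {"d7": 1}}
run = {"q1": ["d1", "d2"], "q2": ["d3", "d7"]}
print(mean_ndcg(run, qrels_by_query))
```

Averaging a score like this per dataset, and then across datasets, is one common way to summarize a benchmark made of many collections.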