Modern Information Retrieval Evaluation in the RAG Era

Why this topic matters
Traditional IR benchmarks fall short for real-world retrieval-augmented generation (RAG) applications because of stale data, incomplete relevance labels, and unrealistic queries. This talk introduces FreshStack, a new benchmark built from recent Stack Overflow and GitHub content and designed to reflect real programming queries.
Overview

Information Retrieval (IR) is not new

Traditional IR Evaluation

Examples

BEIR was created to provide a more realistic assessment of retrieval models by including 18 datasets that span diverse retrieval tasks, domains, query types, and document formats.