arxiv:2603.20990

ECI_{sem}: Semantic Residual Effective Contrastive Information for Evaluating Hard Negatives

Published on Jun 5

Submitted by Aarush on Jun 8


Authors:

Aarush Sinha

Abstract

ECI_{sem}, a semantic residual variant of Effective Contrastive Information, ranks negative sources for dense retrieval using frozen embeddings without requiring training; its rankings of MS MARCO negative sources match the strongest aggregate BEIR transfer results.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose ECI_{sem}, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using frozen target-encoder embeddings. ECI_{sem} is training-free, not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. ECI_{sem} builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family ECI_{sem} ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show stability under sample-size, temperature, tokenizer, and IDF-corpus perturbations. The theory gives a local linearized link to loss reduction, while the empirical study treats downstream evaluation as the final test.
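
The abstract does not give the scoring formula, so the following is only a minimal sketch of what a log-determinant diversity objective over a weighted residual information matrix could look like. Everything specific here is an assumption for illustration: the function name `logdet_diversity_score`, the residual definition (each negative embedding with its component along the positive removed), the `ridge` parameter, and the form log det(I + M / ridge). The `weights` argument is a placeholder for whatever per-example weights the paper derives from target consistency, semantic locality, and lexical residuality; the query embedding would presumably enter through those weights and is not modeled here.

```python
import numpy as np


def logdet_diversity_score(positive, negatives, weights, ridge=1.0):
    """Hypothetical score: log det(I + M / ridge), M = sum_i w_i r_i r_i^T.

    positive:  (d,) frozen-encoder embedding of the labeled positive.
    negatives: (n, d) frozen-encoder embeddings of candidate negatives.
    weights:   (n,) non-negative per-negative weights (placeholder values).
    """
    positive = np.asarray(positive, dtype=float)
    negatives = np.asarray(negatives, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Unit-normalize so the score reflects direction, not embedding scale.
    p = positive / np.linalg.norm(positive)
    n = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)

    # Assumed residual: the part of each negative orthogonal to the positive.
    residuals = n - np.outer(n @ p, p)

    # Weighted residual information matrix, shape (d, d).
    info = (residuals * weights[:, None]).T @ residuals

    # slogdet is numerically safer than log(det(...)).
    d = info.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(d) + info / ridge)
    return float(logdet)


# Toy usage: two negatives that span different directions score higher
# than two copies of the same negative.
positive = np.array([1.0, 0.0, 0.0])
duplicated = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
diverse = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
ones = np.ones(2)

score_duplicated = logdet_diversity_score(positive, duplicated, ones)
score_diverse = logdet_diversity_score(positive, diverse, ones)
```

Under these assumptions, a negative source whose residuals cover more independent directions receives a larger score, and a negative identical to the positive contributes nothing. This illustrates the general log-determinant idea only and should not be read as the paper's actual ECI_{sem} definition.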