Abstract
Conventional vision-based document retrievers operate at page-level granularity, compelling downstream reranking models to process documents that contain substantial irrelevant information. We introduce RerAnchor, a post-retrieval, OCR-free reranking module designed to address this limitation. At the core of RerAnchor is Context Anchoring: a token-level classifier built on a vision-language model assigns query-conditioned relevance scores to image patches, and a subsequent masking step suppresses low-scoring patches, effectively denoising the document before a late-interaction retriever performs the final scoring. To enable robust evaluation, we construct new visual reranking testbeds derived from the Paper-VISA and PDF-MVQA datasets. Experimentally, RerAnchor demonstrates significant ranking improvements: on PDF-MVQA, it raises Recall@1 from 0.660 to 0.738 and MRR@3 from 0.735 to 0.789; on Paper-VISA, it improves Recall@1 from 0.640 to 0.677 and MRR@3 from 0.709 to 0.740, while achieving competitive fine-grained grounding performance (F1 = 0.526) without relying on generative decoding. Further analysis identifies a stable masking regime that balances evidence retention with noise reduction, and a case study shows that the model can identify multiple non-contiguous relevant regions despite being trained exclusively with single-bounding-box supervision. RerAnchor thus transforms coarse, page-level documents into precise, token-budget-friendly contexts, enhancing vision-based Retrieval-Augmented Generation (RAG). The code, data, and model checkpoints will be made publicly available.
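For intuition, the mask-then-score pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the top-k keep-ratio masking policy, and the ColBERT-style MaxSim late-interaction scorer are all hypothetical stand-ins for the abstract's description.

```python
import torch

def context_anchoring_mask(patch_embs, rel_scores, keep_ratio=0.5):
    """Suppress low-scoring patches before late-interaction scoring.

    patch_embs: (P, D) patch embeddings from a vision-language model.
    rel_scores: (P,) query-conditioned relevance scores from the
                token-level classifier (hypothetical head).
    keep_ratio: fraction of patches retained; the actual masking
                regime in the paper is tuned empirically (assumption).
    """
    k = max(1, int(keep_ratio * rel_scores.numel()))
    keep = rel_scores.topk(k).indices   # indices of highest-relevance patches
    return patch_embs[keep]             # denoised document tokens

def late_interaction_score(query_embs, doc_embs):
    """ColBERT-style MaxSim: each query token is matched to its best
    surviving document patch; per-token maxima are summed."""
    sim = query_embs @ doc_embs.T       # (Q, P') similarity matrix
    return sim.max(dim=1).values.sum()

# Toy usage with random tensors standing in for model outputs.
P, Q, D = 196, 16, 128
patch_embs = torch.randn(P, D)
rel_scores = torch.rand(P)              # classifier outputs in [0, 1]
query_embs = torch.randn(Q, D)
doc_embs = context_anchoring_mask(patch_embs, rel_scores, keep_ratio=0.4)
print(late_interaction_score(query_embs, doc_embs))
```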