Abstract
Effective video retrieval has historically relied heavily on textual descriptions and metadata. However, recent advances in vision-language models (VLMs) raise the question: are text features still essential for effective video retrieval? In this work, we investigate this question using a unified multimodal retrieval framework based on advanced VLM embeddings. Evaluating on the comprehensive, multilingual MultiVENT 2.0 dataset from the MAGMaR shared task, we show that multimodal retrieval systems that combine visual frames, audio signals, and textual descriptions outperform traditional text-only retrieval. Remarkably, our results demonstrate that retrieval based solely on non-text modalities (vision and audio) achieves performance comparable to text-based methods, indicating that explicit text input may no longer be strictly necessary.