Do We Still Need Text for Video Retrieval in the Era of Vision-Language Models?

This abstract is open access.
Abstract Summary
Effective video retrieval has historically relied heavily on textual descriptions and metadata. However, recent advances in vision-language models (VLMs) prompt the question: Are text features still essential for effective video retrieval? In this work, we investigate this question using a unified multimodal retrieval framework based on advanced VLM embeddings. Evaluating on the comprehensive and multilingual MultiVENT 2.0 dataset from the MAGMaR shared task, we show that multimodal retrieval systems, combining visual frames, audio signals, and textual descriptions, surpass traditional text-only retrieval performance. Remarkably, our results demonstrate that retrieval based solely on non-text modalities (vision and audio) achieves performance comparable to text-based methods, indicating that explicit text input may no longer be strictly necessary.
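The unified multimodal framework described above can be illustrated with a minimal sketch: each modality (visual frames, audio, text) yields an embedding from a VLM encoder, the L2-normalized embeddings are fused by a weighted sum, and videos are ranked by cosine similarity against the fused query vector. The function names, the equal-weight default, and the late-fusion-by-averaging strategy are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def fuse_embeddings(modality_embs, weights=None):
    """Fuse per-modality embedding vectors into a single unit vector.

    Each input embedding is L2-normalized, combined by a weighted sum
    (equal weights by default -- an assumption for this sketch), and the
    result is renormalized so cosine similarity reduces to a dot product.
    """
    embs = [e / np.linalg.norm(e) for e in modality_embs]
    if weights is None:
        weights = [1.0 / len(embs)] * len(embs)
    fused = sum(w * e for w, e in zip(weights, embs))
    return fused / np.linalg.norm(fused)

def rank_videos(query_emb, video_embs):
    """Return video indices sorted by descending cosine similarity.

    Rows of video_embs are assumed to be L2-normalized fused video
    embeddings; query_emb is a unit vector from fuse_embeddings.
    """
    sims = video_embs @ query_emb
    return np.argsort(-sims)

# Toy usage: fuse a "vision" and an "audio" embedding for one query,
# then rank three hypothetical videos represented as unit vectors.
query = fuse_embeddings([np.array([1.0, 0.0, 0.0]),
                         np.array([0.8, 0.2, 0.0])])
videos = np.eye(3)  # three orthogonal dummy video embeddings
order = rank_videos(query, videos)
```

Dropping the text embedding from the fusion list is then the non-text (vision plus audio) configuration that the abstract reports as competitive with text-based retrieval.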
Abstract ID: NKDR84

Authors: PhD Candidate, University of Waterloo; Professor, University of Waterloo

Abstracts With Same Type

Abstract ID | Abstract Topics | Submission Type | Primary Author
NKDR99  | Machine learning; Search and ranking | Short papers | Mr. Amir Khosrojerdi
NKDR115 | IR applications; Large Language Models | Short papers | Omar Adjali
NKDR108 | IR evaluation; Search and ranking | Short papers | Ms. Payel Santra
NKDR112 | Machine learning; Search and ranking | Short papers | Amirabbas Afzali
NKDR82  | Generative IR; IR applications; Large Language Models; Retrieval-Augmented Generation; System aspects | Short papers | Saisab Sadhu
NKDR102 | | Short papers | Mehmet Erdeniz Aydoğdu