Abstract
Effective video retrieval has historically relied heavily on textual descriptions and metadata. However, recent advances in vision-language models (VLMs) raise the question: are text features still essential for effective video retrieval? In this work, we investigate this question using a unified multimodal retrieval framework based on advanced VLM embeddings. Evaluating on the comprehensive, multilingual MultiVENT 2.0 dataset from the MAGMaR shared task, we show that multimodal retrieval systems that combine visual frames, audio signals, and textual descriptions outperform traditional text-only retrieval. Remarkably, our results demonstrate that retrieval based solely on non-text modalities (vision and audio) achieves performance comparable to text-based methods, indicating that explicit text input may no longer be strictly necessary.