Abstract
With the surge of multimodal models and the growing demand for effective image Information Retrieval (IR) systems, high-quality text-to-image datasets have become paramount. However, most existing datasets are primarily in English, limiting their applicability in multilingual settings. To address this, we introduce the pt-image-ir-dataset, a manually annotated resource for text-based image IR in European Portuguese. The dataset comprises 80 diverse queries and a curated pool of 5,201 images, each annotated for relevance by multiple human judges. The proposed dataset is a step forward in supporting the development and evaluation of image IR systems for European Portuguese, addressing a clear gap in multilingual multimodal research. We make the dataset publicly available alongside baseline experimental results that demonstrate its suitability for the image IR task across different retrieval paradigms, including traditional lexical text-based IR methods, semantic dense retrieval models based on language embeddings, state-of-the-art vision-language models, and end-to-end image retrieval systems. Results show that vision-language models, particularly OpenCLIP/xlm-roberta-base-ViT-B-32, significantly outperform the other approaches (MRR = 0.610).