ClaimPT: A Portuguese Dataset of Annotated Claims in News Articles

This abstract has open access

Abstract Summary

Fact-checking remains a demanding and time-consuming task, still largely dependent on manual verification and unable to keep pace with the rapid spread of misinformation online. Speed matters because corrections typically take longer to reach consumers than the original misinformation does; accelerating corrections through automation can therefore help combat misinformation more effectively. Although many organizations perform manual fact-checking, this approach is difficult to scale given the growing volume of digital content. These limitations have motivated interest in automating fact-checking, where identifying claim sentences is a crucial first step. However, progress has been uneven across languages, with English dominating due to abundant annotated data. European Portuguese, like other low-resource languages, still lacks accessible and licensed datasets, limiting both research and the development of Natural Language Processing (NLP) tools. In this paper, we introduce ClaimPT, a new dataset of annotated claims from European Portuguese news articles, comprising 1308 articles and 6875 individual annotations. Unlike most existing resources, which are based on social media or parliamentary transcripts, ClaimPT focuses on journalistic content, collected through a partnership with LUSA, the Portuguese News Agency. To ensure high-quality annotations, each article was manually annotated by two trained annotators and validated by a curator, following a newly proposed annotation scheme. We also provide baseline models for claim detection, establishing initial performance benchmarks and enabling future applications of NLP and Information Retrieval (IR) techniques. By releasing ClaimPT, we aim to advance research on low-resource fact-checking and enhance understanding of misinformation in news media.

Abstract ID

NKDR131

Submission Type

Resource

Submission Topics

Machine Learning and Large Language Models; Societally-motivated IR research