Abstract
This paper investigates the potential of LLMs for automatically annotating the usefulness, supportiveness, and credibility of search results. These aspects, while essential to the construction of misinformation benchmarks, are expensive and difficult to obtain at scale. Our comparative study suggests that, under certain conditions, LLMs can provide reasonable estimates of usefulness and supportiveness. In contrast, credibility judgments generated by LLMs show almost no agreement with human assessments. This raises concerns about the use of LLMs to assist in constructing collections that require annotations beyond relevance.