An Open SERP Mining Infrastructure for the Archive Query Log

This abstract has open access

Abstract Summary

Query logs are key resources for studying search engine interactions and improving retrieval effectiveness but are rarely publicly available. In the past, search providers only shared small subsets of their own logs to curb competition and to ensure privacy. The Archive Query Log (AQL) will become an open alternative by mining query logs from archived search engine result pages (SERPs). While the AQL-22 prototype demonstrated the feasibility of this approach, its limited scalability and maintainability hindered widespread adoption by the research community. We re-implement the crawling and parsing of the AQL on open infrastructure, using standard tools and a new framework for storing SERPs, and following FAIR data principles. The extended and continuously crawled AQL-25 corpus contains 553 million SERPs from 775 search providers, mined from six web archives; so far, 223 million of these SERPs (44 TB; 40%) have been downloaded and parsed. We demonstrate the use of this new AQL mining framework in two typical analysis scenarios: a temporal analysis, now implemented as a single Elasticsearch query, and a batch-processing analysis using Ray. Our resource equips researchers with all the tools needed to analyze SERPs.
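
As an illustration of the temporal analysis scenario, the following minimal sketch builds the body of a single Elasticsearch search request that counts SERPs per month for one search provider using a `date_histogram` aggregation. The field names `capture.timestamp` and `provider.name`, the provider value `google`, and the helper name `monthly_serp_count_query` are hypothetical and are not taken from the AQL schema; the actual index layout may differ.

```python
import json


def monthly_serp_count_query(
    provider,
    date_field="capture.timestamp",  # hypothetical field name, an assumption
    provider_field="provider.name",  # hypothetical field name, an assumption
):
    """Build a search body that counts documents per calendar month."""
    return {
        # Return no individual hits, only the aggregation result.
        "size": 0,
        # Restrict the count to a single search provider.
        "query": {"term": {provider_field: provider}},
        "aggs": {
            "serps_per_month": {
                "date_histogram": {
                    "field": date_field,
                    "calendar_interval": "month",
                }
            }
        },
    }


body = monthly_serp_count_query("google")
print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the JSON body of a search request against whichever index holds the parsed SERPs; the index name and connection details are left out because they are not stated here.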

Abstract ID:

NKDR132

Submission Type

Resource

Submission Topics