Browser-based Crawling of News Websites Behind Paywalls: IIPC-IFLA News Media Workshop, February 13, 2025
The IFLA News Media and International Internet Preservation Consortium (IIPC) are teaming up again to host a series of workshops focusing on archiving news media. As most news is now published online, there is growing interest in understanding current best practices in web archiving. The main goal of our workshops is to examine and compare how organizations of varying sizes tackle this topic and to learn from their collective experiences. Through presentations and informal discussions, we will showcase diverse organizational approaches to archiving news media, including audiovisual content and social media, highlight key challenges, and explore innovative solutions.
In a digital age characterized by information abundance, access to online content remains a significant challenge. Many valuable resources are hidden behind paywalls and login screens, making it difficult for researchers, archivists, and data enthusiasts to retrieve, preserve, and analyze this content. In this workshop, experts from web archives across Europe aim to equip participants with the knowledge, skills, and tools needed to navigate paywalls, conduct web crawls effectively, ensure data quality, and foster ongoing communication with site owners.
The following key components will be covered in this workshop:
- Accessing Paywalled Content:
  - Techniques to bypass paywalls and access restricted websites
  - Negotiating with newspapers and publishers to obtain login credentials
  - Strategies for requesting IP Authentication from site administrators
  - Browser plugins and user agent customization to enhance access
- Actually Crawling Content:
  - Exploration of web crawling tools, including Heritrix and Browsertrix
  - Utilizing Browsertrix Cloud and Browsertrix Crawler for efficient and scalable crawling
  - Using Browsertrix Behaviors for harvesting special content, such as videos, podcasts and flipbooks
  - Introduction to other essential tools for web harvesting
- Quality Assurance of Content:
  - Deduplication techniques and best practices
  - Implementing dashboards for IP-validation to ensure data integrity
  - Workshop segment on setting up the initial infrastructure and performing proxy at home
- Communication with Site Owners:
  - Emphasizing the importance of communication with site owners
  - Highlighting the direct correlation between effective communication and access privileges
  - Strategies for maintaining ongoing relationships with content providers
FEBRUARY 13, 2025 10:00-12:00 EST | 15:00-17:00 UTC | 16:00-18:00 CET
AGENDA
10:00-10:05: Introduction
10:05-11:25: Presentations
11:25-11:55: Q&A
11:55-12:00: Wrap-up
SPEAKERS
Anders Klindt Myrvoll, Royal Danish Library
Anders Klindt Myrvoll has been the Programme Manager at the national Danish web archive, Netarkivet, at the Royal Danish Library since 2018. Together with colleagues, he collects, preserves and provides access to the Danish web. Prior to web archiving, Anders worked for more than 13 years in the broadcast, film and media industry, collaborating globally on high-end localization, original content for children, the preservation of digital cultural heritage, strategy, optimization, leadership and much more. You can find him on LinkedIn or @andersklindt on X/Twitter.
Antares Reich, Austrian National Library
Antares is a crawl engineer responsible for the set-up and quality assurance of all crawls at the Austrian National Library. Previously, he worked as a software developer for cashier systems and as a local assistant to a member of the European Parliament. He loves books and playing music.
Joel Nieminen, National Library of Finland
László Tóth, National Library of Luxembourg
László is a software engineer who develops web archiving tools at the National Library of Luxembourg, covering web crawling, ingest workflows and playback. Previously, he worked as a developer for a European media company, specializing in software for broadcasting, media and post-production. László holds an MSc in Advanced Computing Science from the University of East Anglia (United Kingdom). Outside of software development, he is mainly interested in mathematics and classical music.
Samuli Sairanen, National Library of Finland
IFLA/IIPC WORKSHOP PLANNING COMMITTEE
Alexia Bauville, BnF (France)
Ana Krahmer, University of North Texas Libraries (United States)
Deborah Thomas, Library of Congress (United States)
Karolina Holub, National and University Library in Zagreb (Croatia)
Mark Phillips, University of North Texas Libraries (United States)
Natalia Márquez-Bustos, Pontificia Universidad Javeriana (Colombia)
Olga Holownia, IIPC (United States)
Perla Olivia Rodriguez Reséndiz, UNAM (Mexico)
Kopana Terry, University of Kentucky Libraries (United States)
Contact: [email protected]