Regular Topology

Designed for explorative crawling, the regular crawling pipeline consumes URLs from the frontier, applies the general pipeline elements for URL crawling, appends the extracted URLs to the frontier, and archives new pages. Because its scope is unrestricted, it carries the highest risk of security and licensing infractions. Some risks, such as inadvertent interaction with a botnet, are inevitable in explorative crawling. This crawling mode can also draw requests and complaints from webmasters. Although we developed a web page that informs webmasters about our crawling activity and how to control our crawler (see https:/, explorative crawling can still trigger requests from webmasters (currently handled via e-mail). This crawling mode therefore needs stricter governance.
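
The loop described above can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the function `regular_crawl`, the injected `fetch` callable (a stand-in for the general pipeline elements), the in-memory `archive`, and the `max_pages` cap are all hypothetical names introduced here for illustration.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical signature: fetch(url) returns (page_content, extracted_urls).
FetchFn = Callable[[str], Tuple[str, List[str]]]


def regular_crawl(seeds: Iterable[str], fetch: FetchFn, max_pages: int = 100) -> Dict[str, str]:
    """Explorative crawl loop: consume, fetch, extend the frontier, archive."""
    frontier = deque(dict.fromkeys(seeds))  # URLs waiting to be crawled, de-duplicated
    seen = set(frontier)                    # URLs already queued, to avoid re-adding
    archive: Dict[str, str] = {}            # url -> content of archived pages

    # max_pages is an assumed safety cap, not something the source specifies.
    while frontier and len(archive) < max_pages:
        url = frontier.popleft()            # consume the next URL from the frontier
        content, links = fetch(url)         # stand-in for the general pipeline elements
        archive[url] = content              # archive the new page
        for link in links:                  # append extracted URLs to the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return archive


# Toy in-memory "web" used only to exercise the loop.
toy_web = {
    "a": ("page A", ["b", "c"]),
    "b": ("page B", ["a"]),
    "c": ("page C", []),
}
archive = regular_crawl(["a"], lambda url: toy_web[url])
print(sorted(archive))  # ['a', 'b', 'c']
```

Because every extracted URL is fed back into the frontier, nothing in the loop itself bounds where the crawl goes. In this sketch the only limit is the assumed `max_pages` cap, which is one way to see why the unrestricted scope calls for governance outside the pipeline.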


