Sitemap Topology

Unlike the regular pipeline, which discovers new webpages through hyperlink extraction, this pipeline relies on an alternative link discovery method: Sitemaps. Sitemaps are crawler-friendly files that webmasters publish to help robots reach website content quickly. The crawler retrieves and parses Sitemaps, then fetches the webpages they list. Sitemaps also signal changes on a website, which allows content to be refreshed efficiently and in a timely manner. This pipeline has a lower risk profile because webmasters explicitly list the pages in their Sitemaps, so those pages are expected to be crawled.
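To make the Sitemap mechanism more concrete, here is a minimal Python sketch of the parsing and refresh steps. It is an illustration only, not the topology's actual implementation: the function names `parse_sitemap` and `needs_refetch` are hypothetical, and the refresh rule (refetch when `lastmod` is missing, unparseable, or not older than the last fetch date) is an assumption chosen for simplicity. The XML namespace and the `loc` and `lastmod` elements come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (kind, entries) for a Sitemap document (hypothetical helper).

    kind is "index" for a <sitemapindex> (entries point to further Sitemaps)
    or "urlset" for a <urlset> (entries point to webpages).
    Each entry is a dict with "loc" and an optional "lastmod" string.
    """
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        kind, item_tag = "index", "sm:sitemap"
    else:
        kind, item_tag = "urlset", "sm:url"

    entries = []
    for item in root.findall(item_tag, NS):
        loc = item.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue  # skip entries without a usable URL
        lastmod = item.findtext("sm:lastmod", default=None, namespaces=NS)
        entries.append({"loc": loc, "lastmod": lastmod.strip() if lastmod else None})
    return kind, entries


def needs_refetch(entry, last_fetched):
    """Assumed refresh rule: refetch unless lastmod is clearly older than
    the date of the last fetch (a datetime.date)."""
    lastmod = entry["lastmod"]
    if not lastmod:
        return True
    try:
        # Only the date part (YYYY-MM-DD) is compared in this sketch.
        modified = date.fromisoformat(lastmod[:10])
    except ValueError:
        return True  # coarser formats such as "2024" are treated as unknown
    return modified >= last_fetched
```

A crawler built along these lines would call `parse_sitemap` on each fetched Sitemap, queue the entries of an index for further parsing, and pass the entries of a URL set through `needs_refetch` before scheduling them.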



[Figure: Sitemap Topology]