Open Web Crawler
This documentation is for OWLer version 0.1.0, which is in its early stages of development. Please be aware that features and functionality may change in future versions.
OWLer is a wide-ranging web crawler whose objective is to make the internet's expansive scope more accessible, searchable, and navigable. It is under ongoing development, focused on optimizing its functionality while ensuring compliance with the digital rights and guidelines established by the web community.
Building on the power and flexibility of StormCrawler, the Open Web Crawler (OWLer) is an enhanced, tailored adaptation designed to meet the specific needs of the OpenWebSearch.eu project.
OWLer inherits the robust and scalable nature of its progenitor, and further extends the topologies with additional spouts and bolts to fit the demands of the project. To manage and optimize the crawling process, we have fine-tuned these topologies based on our requirements, creating a system that balances efficiency and respect for the websites we crawl.
In OWLer, we utilize the inherent flexibility of StormCrawler’s topologies to create bespoke spouts and bolts. For instance, we have developed spouts to handle various data sources like Sitemaps and externally provided WARC files (e.g. Common Crawl). Our bolts have also been tailored for processes like parsing the HTML content, extracting hyperlinks, filtering URLs, and managing the cache system to avoid duplications.
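To make this concrete, a StormCrawler topology can be declared with Apache Storm's Flux YAML format, wiring spouts and bolts into streams. The sketch below uses standard StormCrawler component class names for illustration; it is not OWLer's actual topology definition, and the ids, parallelism, and groupings are placeholder values:

```yaml
# Illustrative Flux topology sketch (not OWLer's shipped topology).
name: "owler-sketch"

spouts:
  - id: "filespout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1

bolts:
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "parser"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 1

streams:
  - from: "filespout"
    to: "fetcher"
    grouping:
      type: SHUFFLE
  - from: "fetcher"
    to: "parser"
    grouping:
      type: LOCAL_OR_SHUFFLE
```

A production topology would additionally include URL partitioning, status tracking, and the custom spouts and bolts described above.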
OWLer supports many features, such as:

- Regular crawling of a fixed list of web sites
- Continuous re-crawling of sitemaps to discover new pages
- Discovery and crawling of new relevant web sites through automatic link prioritization
- Indexing of crawled pages using OpenSearch
- Web interface for navigating crawled pages in real time and for crawler monitoring
- Configuration of different types of URL classifiers (machine learning, regex, etc.)
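As a minimal illustration of what a regex-based URL classifier does (this is a sketch, not OWLer's implementation; the rules and domain are made up), classification boils down to ordered pattern/verdict pairs where the first match wins:

```python
import re

# Hypothetical rules: (pattern, verdict) pairs evaluated in order.
# Unmatched URLs are rejected by default.
RULES = [
    (re.compile(r"\.(jpg|png|gif|css|js)(\?|$)", re.I), False),  # skip static assets
    (re.compile(r"^https?://([^/]+\.)?example\.org/"), True),    # allow a seed domain
]

def classify(url: str) -> bool:
    """Return True if the URL should be crawled, False otherwise."""
    for pattern, verdict in RULES:
        if pattern.search(url):
            return verdict
    return False

print(classify("https://example.org/page.html"))  # True
print(classify("https://example.org/logo.png"))   # False
```

A machine-learning classifier plugs into the same decision point, replacing the rule table with a trained model.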
OWLer 0.1.0 uses the following dependencies:

- Apache Storm 2.3.0
- Apache ZooKeeper 3.4.13
For a successful crawl using OWLer, there are several key components that need to be in place.
Firstly, an operating instance of OpenSearch is required. OpenSearch is used to store the crawl status and metrics, providing a comprehensive overview of the crawl process and its results. This can be either local (Dockerized or installed on your machine) or remote.
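For local development, a single-node OpenSearch instance can be started with Docker. The image tag and the security-plugin switch below are illustrative and may need adjusting for your OpenSearch version:

```shell
docker run -d --name opensearch -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "DISABLE_SECURITY_PLUGIN=true" \
  opensearchproject/opensearch:2.4.0
```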
Secondly, an S3 endpoint is needed to store the crawled files. This could be an actual AWS S3 bucket or an S3-compatible storage service like Minio. It is crucial to ensure that this endpoint is reachable from the machine where the crawler is running.
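For a local S3-compatible endpoint, MinIO can likewise be run with Docker; the ports and data path here are MinIO's defaults:

```shell
docker run -d --name minio -p 9000:9000 -p 9001:9001 \
  minio/minio server /data --console-address ":9001"
```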
Lastly, a running URL Frontier is necessary to manage and monitor all the URLs that are to be crawled. The URL Frontier is a critical component of any web crawler: it stores the URLs to be crawled, schedules them for fetching, and avoids re-crawling URLs. We recommend using our URL Frontier component, a robust and scalable implementation derived from the Crawler-Commons URL Frontier. It supports multiple crawl jobs, distributed operation, and a variety of queue scheduling policies.
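The upstream Crawler-Commons URL Frontier ships a Docker image whose gRPC service listens on port 7071 by default; starting it locally looks like the following (our derived Frontier may be packaged differently):

```shell
docker run -d --name frontier -p 7071:7071 crawlercommons/url-frontier
```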
By ensuring these components are correctly set up and operational, you can ensure a successful and efficient crawl using OWLer.
OWLer is part of the OpenWebSearch.eu project.