You can build OWLer from the source code and run it locally or in a server environment, or use Docker to build an image and run it in a container. OWLer comes with an Ansible playbook that installs and configures most of the dependencies needed to run the crawler.
Deploying in a Server Environment
Prerequisites: Please refer to the Prerequisites in the Storm deployment guide.
Clone or download the OWLer Installation Toolkit repository to your local drive.
# Clone repository
git clone https://opencode.it4i.eu/openwebsearcheu-public/owler-installation-toolkit
cd owler-installation-toolkit
Use ssh-add to load the SSH key needed to connect to the server.
Edit the hosts file to specify the IP addresses of the servers (i.e., storm_master and storm_worker(s)).
Run the following command inside this directory. The owler-playbook.yml playbook will start OWLer in the server environment.
ansible-playbook -i hosts owler-playbook.yml -u username
where username is the name of the user on the remote server. If the servers use different user names, specify them individually in the hosts file.
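To illustrate the hosts file, including per-host user names, the sketch below writes an INI-style Ansible inventory to a separate file named hosts.example, so the real hosts file is not overwritten. It assumes that storm_master and storm_worker are inventory group names. The IP addresses come from the documentation-only 192.0.2.0/24 range and the user names are made up. ansible_user is the standard Ansible inventory variable for setting the remote user of a host.

```shell
# Write an example inventory to hosts.example (not the real hosts file).
# Group names are assumed from the text above; IPs and users are placeholders.
cat > hosts.example <<'EOF'
[storm_master]
192.0.2.10 ansible_user=alice

[storm_worker]
192.0.2.11 ansible_user=bob
192.0.2.12 ansible_user=carol
EOF

# Show the result
cat hosts.example
```

With per-host ansible_user entries like these, the -u option should no longer be needed on the ansible-playbook command line.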
The owler-playbook.yml playbook performs several tasks to set up and configure the Storm Crawler environment on both the master and worker nodes:
i. Update apt cache on all hosts: This is important to ensure that the latest versions of all packages are available for installation.
ii. Enable the Storm Supervisor service on the worker node: The Storm Supervisor is responsible for creating worker processes that execute the tasks assigned by Nimbus.
iii. Prompt the user to enter a password for the Storm Crawler user on the master node: This user is then created with the entered password and added to the ‘storm’ and ‘sudo’ groups. The password is hashed using md5_crypt before it is stored.
iv. Install several necessary packages on the master node: These include curl, git, wget, gunzip, gawk, and maven.
v. Enable several services on the master node: These include Zookeeper, Storm UI, and Storm Nimbus. Zookeeper is used for coordination between Storm nodes, Storm UI provides a web interface for monitoring the cluster, and Storm Nimbus is the master node daemon that assigns tasks to the workers.
vi. Clone the OWLer repository to the home directory on the master node.
vii. Create an ‘input’ directory in the cloned OWLer repository.
viii. Download and extract a file named ‘warc.paths.gz’ from the Common Crawl data repository to the ‘input’ directory. This file contains a list of paths to WARC (Web ARChive) files. These WARC files are part of the Common Crawl data and contain raw web page data crawled from the internet.
ix. Modify the ‘warc.paths’ file to prepend each line with ‘https://data.commoncrawl.org/’.
x. Clone the Storm Crawler repository to the home directory on the master node.
xi. Perform a clean install of Storm Crawler: If the installation fails, the playbook retries it with the tests skipped.
xii. Perform a clean package of OWLer.
xiii. Override the ‘dev.properties’ file in the OWLer directory with a new one.
xiv. Copy the OWLer script to ‘/usr/local/bin/owler’: This script is used to start or stop the OWLer service. It accepts a mode argument that selects the pipeline to use and an action argument that determines whether to start or stop the service.
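Two of the steps above lend themselves to a short shell sketch: prepending the Common Crawl base URL to each line of warc.paths (step ix) and retrying the Maven build without tests (step xi). This is not taken from the playbook itself. The sample paths are made up, the output file name warc.urls and the function name build_with_fallback are hypothetical, and the download of warc.paths.gz is left out. Unlike the playbook, which modifies warc.paths, the sketch writes to a second file so the input stays untouched.

```shell
# Step ix, sketched with sed on a small sample file.
# The two paths below are invented stand-ins for real warc.paths entries.
workdir=$(mktemp -d)
printf '%s\n' \
    'crawl-data/EXAMPLE/segments/0001/warc/a.warc.gz' \
    'crawl-data/EXAMPLE/segments/0002/warc/b.warc.gz' > "$workdir/warc.paths"

# Prepend the base URL to every line; '|' is the delimiter so that the
# slashes in the URL need no escaping.
sed 's|^|https://data.commoncrawl.org/|' "$workdir/warc.paths" > "$workdir/warc.urls"
cat "$workdir/warc.urls"

# Step xi, sketched as a function: run the full build and, only if it
# fails, retry with the tests skipped. The subshell keeps the cd local.
build_with_fallback() {
    (
        cd "$1" || exit 1
        mvn clean install || mvn clean install -DskipTests
    )
}
# Usage (placeholder path): build_with_fallback "$HOME/path/to/checkout"
```

The function is only defined here, not called, so the block runs without Maven installed.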
For starting in warc mode:
owler start -mode=warc
For stopping in warc mode:
owler stop -mode=warc
For starting in the default mode (regular):
For stopping in the default mode (regular):