You can either build Storm from source and run it locally or in a server environment, or use Docker to build an image and run it in a container. OWLer comes with an Ansible playbook that installs and configures most of the dependencies required to run the crawler.
Ansible can be installed from PyPI:
```shell
# Optional: create a fresh venv
python3 -m venv venv && source venv/bin/activate
pip install ansible
```
See the Ansible installation guide for details on how to install Ansible on your platform.
Deploying in a Server Environment
Install a recent version of Java (JDK 11) and Git.
Set up an instance of OpenSearch. This can be either local (Dockerized or installed on your machine) or remote. This is needed to store the metrics.
Establish an S3 endpoint (e.g., Minio) for storing the crawled files.
Ensure a URL Frontier is running. This is required for managing the queue of URLs to be crawled.
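For local testing, the three services above (OpenSearch, an S3-compatible store, and a URL Frontier) can be started together with Docker Compose. The snippet below is a minimal sketch; the image tags, ports, and credentials are assumptions and should be adapted to your setup.

```yaml
services:
  opensearch:
    image: opensearchproject/opensearch:2
    environment:
      - discovery.type=single-node
      - DISABLE_SECURITY_PLUGIN=true   # local testing only
    ports:
      - "9200:9200"
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      - MINIO_ROOT_USER=minioadmin      # change for real deployments
      - MINIO_ROOT_PASSWORD=minioadmin
    ports:
      - "9000:9000"
      - "9001:9001"
  urlfrontier:
    image: crawlercommons/url-frontier
    ports:
      - "7071:7071"
```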
Clone or download the OWLer Installation Toolkit repository to your local drive.
```shell
# Clone the repository
git clone https://opencode.it4i.eu/openwebsearcheu-public/owler-installation-toolkit
cd owler-installation-toolkit
```
Use ssh-add to load the SSH key needed to connect to the servers.
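For example (the key path below is a placeholder; use the key that is authorized on your servers):

```shell
# Start an SSH agent if one is not already running
eval "$(ssh-agent -s)"

# Load the private key used to reach the Storm hosts
# (~/.ssh/id_ed25519 is a placeholder path)
ssh-add ~/.ssh/id_ed25519

# Verify the key is loaded
ssh-add -l
```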
Edit the hosts file to specify the IP addresses of the servers (i.e., storm_master and storm_worker(s)).
Edit the dev.properties file in the OWLer directory with the appropriate variables.
Run the following command inside this directory. The storm-playbook.yml playbook will start Storm in the server environment.
```shell
ansible-playbook -i hosts storm-playbook.yml -u username
```
where username is the name of the user on the remote server. If the servers use different usernames, specify them individually in the hosts file.
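An Ansible inventory for this layout might look like the following (the hostnames, IP addresses, and per-host usernames are illustrative):

```ini
[storm_master]
192.0.2.10 ansible_user=admin

[storm_worker]
192.0.2.11 ansible_user=crawler
192.0.2.12 ansible_user=crawler
```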
The Ansible playbook performs several tasks to set up and configure the Storm Crawler environment on both master and worker nodes. Here’s a detailed breakdown of what each task does:
i. Set hosts, storm_version, and master_ip variables: This step configures the playbook to execute on all hosts, sets the storm_version variable to the desired version of Apache Storm, and sets the master_ip variable to the FQDN of the master host.
ii. Install necessary software: This step ensures that Python (required by Ansible) and the OpenJDK 11 Java Development Kit are installed on the host.
iii. Create Storm and Zookeeper users and groups: This step creates new system users ‘storm’ and ‘zookeeper’ and adds them to their respective groups.
iv. Download and setup Apache Storm and Zookeeper: This step downloads the specified versions of Apache Storm and Zookeeper, extracts them to the appropriate directories, and sets the necessary permissions.
v. Configure Storm and Zookeeper: This step copies the necessary configuration files to their respective directories and creates symbolic links to the necessary binaries.
vi. Create necessary directories: This step creates the /var/log/storm and /var/log/storm/workers-artifacts/ directories for storing log files.
vii. Start necessary services: This step starts the storm-logviewer, zookeeper, storm-nimbus, and storm-ui services.
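As an illustration, the service-start step (vii) could be expressed as an Ansible task like the sketch below. The module and parameters are standard Ansible; the exact task in the playbook may differ.

```yaml
- name: Start and enable the Zookeeper and Storm services
  ansible.builtin.systemd:
    name: "{{ item }}"
    state: started
    enabled: true
  loop:
    - zookeeper
    - storm-nimbus
    - storm-ui
    - storm-logviewer
```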
After running this playbook, Zookeeper and Storm will be installed and running as services. You should be able to see the Storm UI on the master node’s port 8080.
Deploying with Docker

Prerequisite: You will need to install a recent version of Docker. See the Docker installation guide for details on how to install Docker on your platform.
Clone or download OWLer’s repository to your local drive.
```shell
# Clone the repository
git clone https://opencode.it4i.eu/openwebsearcheu-public/owler
cd owler
```
Run the following commands inside this directory.
```shell
mkdir input

# Download and extract the WARC paths from Common Crawl
# (e.g. https://commoncrawl.org/2022/10/sep-oct-2022-crawl-archive-now-available/)
wget -c https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-40/warc.paths.gz -P input && gunzip input/warc.paths.gz

# Export environment variables (adjust the paths to your installation)
export PATH=/Users/changeme/opt/apache-maven-3.8.6/bin:$PATH
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.17.jdk/Contents/Home

# Clone and install storm-crawler
git clone https://github.com/DigitalPebble/storm-crawler.git
cp -fr files/storm-crawler/pom.xml storm-crawler
cd storm-crawler
mvn clean install
cd ..

# Build OWLer and start the containers
mvn clean package
docker-compose --env-file .env -f docker-compose.yml up --build --renew-anon-volumes --remove-orphans

# In a separate shell, submit the crawl topology
docker-compose run --rm storm-crawler storm jar warc-crawler.jar org.apache.storm.flux.Flux topology/warc-elastic-crawler/owler.flux -e
```
In the upcoming version, we are planning to introduce an Ansible-friendly installation for Docker as well.