Deploying Storm
You can build Storm from the source code and run it locally or in a server environment, or use Docker to build an image and run it in a container. OWLer comes with an Ansible playbook that installs and configures most of the dependencies needed to run the crawler.
Ansible Playbook
Ansible can be installed from PyPI:
# Optional: Create a fresh venv
python3 -m venv venv && source venv/bin/activate
pip install ansible
See the Ansible installation guide for details on how to install Ansible for your platform.
Deploying in a Server Environment
Prerequisites:
Install a recent version of Java (JDK 11) and Git.
Set up an instance of OpenSearch. This can be either local (Dockerized or installed on your machine) or remote. This is needed to store the metrics.
Establish an S3 endpoint (e.g., Minio) for storing the crawled files.
Ensure a URL Frontier is running. This is required for managing the queue of URLs to be crawled.
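Before deploying, it can help to confirm that the command-line prerequisites are actually on the PATH. The following is a minimal sketch, not part of the OWLer toolkit: the helper name check_cmd is hypothetical, and the tool list (java, git) simply mirrors the prerequisites above. It does not check OpenSearch, the S3 endpoint, or the URL Frontier.

```shell
#!/bin/sh
# Report whether a required command-line tool is available on PATH.
# check_cmd is a hypothetical helper, not something shipped with OWLer.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check the tools named in the prerequisites; remember if any is absent.
missing=0
for tool in java git; do
    check_cmd "$tool" || missing=1
done
```

If both tools are reported as ok, running java -version afterwards shows whether the installed JDK is version 11.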
Deployment:
Clone or download the OWLer Installation Toolkit repository to your local drive.
# Clone repository
git clone https://opencode.it4i.eu/openwebsearcheu-public/owler-installation-toolkit
cd owler-installation-toolkit
Use ssh-add to load the SSH key needed to connect to the server, for example:
ssh-add ~/.ssh/mykey.pem
Edit the hosts file to specify the IP addresses of the servers (i.e., storm_master and storm_worker(s)).
Edit the dev.properties file in the OWLer directory with the appropriate variables.
Run the following command inside this directory. The storm-playbook.yml playbook will start Storm in the server environment.
ansible-playbook -i hosts storm-playbook.yml -u username
where username is the name of the user on the remote server. If the servers use different user names, specify them individually in the hosts file.
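As an illustration of the hosts file and per-host user names mentioned above, the sketch below writes a sample inventory to hosts.example so that the hosts file shipped with the toolkit is left untouched. The group names are taken from the step above, but their exact spelling in the toolkit is an assumption; the IP addresses and user names are placeholders. The ansible_user variable is the standard Ansible inventory setting for the remote login name.

```shell
#!/bin/sh
# Write a sample inventory to hosts.example (not hosts, to avoid clobbering
# the file shipped with the toolkit). All addresses and user names below
# are placeholders; the group names are assumed from the text above.
cat > hosts.example <<'EOF'
[storm_master]
192.0.2.10 ansible_user=ubuntu

[storm_worker]
192.0.2.11 ansible_user=ubuntu
192.0.2.12 ansible_user=admin
EOF

# Optional checks before the real run (uncomment once the inventory is real):
# ansible -i hosts all -m ping
# ansible-playbook -i hosts storm-playbook.yml --syntax-check
```

With ansible_user set for every host, the -u option on the ansible-playbook command line is no longer needed. The ping module only verifies that Ansible can log in and run Python on each host; it does not change anything on the servers.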
The Ansible playbook performs several tasks to set up and configure the StormCrawler environment on both master and worker nodes. Here’s a detailed breakdown of what each task does:
i. Set hosts, storm_version, and master_ip variables: This step configures the playbook to execute on all hosts, sets the storm_version variable to the desired version of Apache Storm, and sets the master_ip variable to the FQDN of the master host.
ii. Install necessary software: This step ensures that Python for Ansible and OpenJDK 11 Java Development Kit are installed on the host.
iii. Create Storm and Zookeeper users and groups: This step creates new system users ‘storm’ and ‘zookeeper’ and adds them to their respective groups.
iv. Download and setup Apache Storm and Zookeeper: This step downloads the specified versions of Apache Storm and Zookeeper, extracts them to the appropriate directories, and sets the necessary permissions.
v. Configure Storm and Zookeeper: This step copies the necessary configuration files to their respective directories and creates symbolic links to the necessary binaries.
vi. Create necessary directories: This step creates the /var/log/storm and /var/log/storm/workers-artifacts/ directories for storing log files.
vii. Start necessary services: This step starts the storm-logviewer, zookeeper, storm-nimbus, and storm-ui services.
After running this playbook, Zookeeper and Storm will be installed and running as services. You should be able to see the Storm UI on the master node’s port 8080.
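One way to confirm that the UI is up without opening a browser is to query the Storm UI REST API. This is a rough sketch under a few assumptions: the helper names storm_ui_url and check_storm_ui are hypothetical, the address is a placeholder, and the port defaults to 8080 as stated above. The /api/v1/cluster/summary path is the cluster summary endpoint of the Storm UI REST API.

```shell
#!/bin/sh
# Build the URL of the Storm UI cluster summary endpoint.
# $1 = master host, $2 = UI port (defaults to 8080).
storm_ui_url() {
    echo "http://$1:${2:-8080}/api/v1/cluster/summary"
}

# Print the cluster summary as JSON; returns non-zero if the UI
# cannot be reached or answers with an HTTP error.
check_storm_ui() {
    curl --fail --silent --show-error --max-time 10 "$(storm_ui_url "$@")"
}

# Example (replace the placeholder address with your master node):
# check_storm_ui 192.0.2.10
```

A JSON response indicates that storm-ui and storm-nimbus are both answering; a connection error usually means the service is not running or the port is blocked by a firewall.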
Using Docker
Prerequisite: You will need to install a recent version of Docker. See the Docker installation guide for details on how to install Docker for your platform.
Deployment:
Clone or download OWLer’s repository to your local drive.
# Clone repository
git clone https://opencode.it4i.eu/openwebsearcheu-public/owler
cd owler
Run the following commands inside this directory.
mkdir input
# Download and extract the WARC paths from Common Crawl (e.g. https://commoncrawl.org/2022/10/sep-oct-2022-crawl-archive-now-available/)
wget -c https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-40/warc.paths.gz -P input && gunzip input/warc.paths.gz
# Export environment variables (adjust both paths to your own Maven and JDK installation)
export PATH=/Users/changeme/opt/apache-maven-3.8.6/bin:$PATH
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.0.17.jdk/Contents/Home
# Clone and install storm-crawler
git clone https://github.com/DigitalPebble/storm-crawler.git
cp -fr files/storm-crawler/pom.xml storm-crawler
cd storm-crawler
mvn clean install
cd ..
mvn clean package
docker-compose --env-file .env -f docker-compose.yml up --build --renew-anon-volumes --remove-orphans
# In a separate shell
docker-compose run --rm storm-crawler
storm jar warc-crawler.jar org.apache.storm.flux.Flux topology/warc-elastic-crawler/owler.flux -e
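The warc.paths file downloaded in the steps above lists paths relative to the Common Crawl data host rather than full URLs. The sketch below shows how such a path maps to a download URL and previews the first few entries; the helper name warc_url is hypothetical, and how OWLer itself consumes the file is not shown here.

```shell
#!/bin/sh
# Turn a relative path from warc.paths into a full download URL.
# warc_url is a hypothetical helper used only for this illustration.
warc_url() {
    echo "https://data.commoncrawl.org/$1"
}

# Show the first few entries as URLs, if the file has been downloaded.
if [ -f input/warc.paths ]; then
    head -n 3 input/warc.paths | while read -r path; do
        warc_url "$path"
    done
fi
```

If the preview prints nothing, check that the wget and gunzip step completed and that input/warc.paths exists.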
Warning
In the upcoming version, we are planning to introduce an Ansible-friendly installation for Docker as well.