Crawler Options

Apache Storm and StormCrawler both expose a large number of configuration options, and a Flux topology definition can set options for Flux, Storm, and StormCrawler alike. The table below gives an overview of some of the main options OWLer sets in its Flux configuration.

This list is not exhaustive, but it should give you a good starting point:

| Option Name | Description | Default Value | OWLer Value |
| --- | --- | --- | --- |
| fetchInterval.default | Default fetch interval for URLs | 90 (seconds) | 10080 |
| fetchInterval.max | Max time allowed for fetch interval | 365 (days) | 180 |
| http.agent.name | Name of the HTTP agent | stormcrawler | Owler@ows.eu/1 |
| http.protocol.implementation | Class of protocol implementation | - | com.mycompany.CustomHttpProtocol |
| http.store.headers | Store HTTP headers in the metadata | false | true |
| http.timeout | Connection timeout in ms | 10000 | 5000 |
| parsefilters.config.file | File name of the parse filters | parsefilters.json | parsefilters.json |
| sitemap.discovery | Discovers sitemaps from robots.txt | true | true |
| urlfilters.config.file | File name for URL filters | urlfilters.json | urlfilters.json |
| topology.message.timeout.secs | Timeout in seconds for a tuple | Varies per topology | 300 |
| topology.metrics.consumer.register | Classes to register as metric consumers | - | org.apache.storm.metric.LoggingMetricsConsumer |
| topology.worker.childopts | JVM options for Storm workers | - | -Xmx8G -Djava.net.preferIPv4Stack=true |

The table above is a guide to some of the configuration options you might use in OWLer, particularly with the Flux framework. Each option can be set in the config section of the Flux YAML file. Consult the StormCrawler and Apache Storm documentation for the complete list and a detailed explanation of every option.
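As a rough sketch of how this might look, the fragment below places a selection of the OWLer values from the table under the config section of a Flux file. The topology name is a placeholder chosen for illustration, the parallelism hint on the metrics consumer is an assumption, and the spouts, bolts, and streams that a complete topology needs are left out.

```yaml
# Illustrative Flux fragment; "owler-crawler" is a placeholder name.
name: "owler-crawler"

config:
  # Storm options
  topology.message.timeout.secs: 300
  topology.worker.childopts: "-Xmx8G -Djava.net.preferIPv4Stack=true"
  topology.metrics.consumer.register:
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1   # assumed value, not taken from the table

  # StormCrawler options (OWLer values from the table)
  fetchInterval.default: 10080
  http.agent.name: "Owler@ows.eu/1"
  http.store.headers: true
  http.timeout: 5000
  parsefilters.config.file: "parsefilters.json"
  urlfilters.config.file: "urlfilters.json"
  sitemap.discovery: true
```

Any option omitted from the config section keeps whatever value the loaded StormCrawler and Storm defaults provide, so only the settings that differ from those defaults need to be listed.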