Crawler Options
Apache Storm, StormCrawler, and Flux each expose a large number of configuration options. The table below gives an overview of the main options to consider when configuring OWLer through Flux.
This list is not exhaustive, but it should give you a good starting point:
| Option Name | Description | Default Value | OWLer Value |
|---|---|---|---|
| | Default fetch interval for URLs | | |
| | Max time allowed for fetch interval | | |
| | Name of the HTTP agent | | |
| | Class of protocol implementation | | |
| | Store HTTP headers in the metadata | | |
| | Connection timeout in ms | | |
| | File name of the parse filters | | |
| | Discovers sitemaps from robots.txt | | |
| | File name for URL filters | | |
| | Timeout in seconds for a tuple | Varies per topology | |
| | Classes to register as metric consumers | | |
| | JVM options for Storm workers | | |
This table is a guide to some of the configuration options you might use in OWLer with the Flux framework. Each option can be set in the Flux YAML file under the `config` section. Consult the StormCrawler and Apache Storm documentation for a comprehensive list and a detailed explanation of all configuration options.
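As a rough illustration of where such options live, the fragment below sketches a `config` section in a Flux YAML file. The key names are assumptions taken from the commonly documented StormCrawler and Apache Storm settings that appear to match the descriptions above; they are not confirmed by this page. The topology name and every value are placeholders, not the actual OWLer settings.

```yaml
# Hypothetical Flux topology fragment. All keys and values are illustrative
# assumptions, not the real OWLer configuration.
name: "example-crawler"            # placeholder topology name

config:
  # StormCrawler options (assumed key names)
  fetchInterval.default: 1440      # default fetch interval for URLs, in minutes
  http.agent.name: "example-bot"   # name of the HTTP agent (placeholder)
  http.store.headers: false        # store HTTP headers in the metadata
  http.timeout: 10000              # connection timeout in ms
  parsefilters.config.file: "parsefilters.json"   # file name of the parse filters
  urlfilters.config.file: "urlfilters.json"       # file name for URL filters
  sitemap.discovery: true          # discover sitemaps from robots.txt

  # Apache Storm options (assumed key names)
  topology.message.timeout.secs: 300              # timeout in seconds for a tuple
  topology.worker.childopts: "-Xmx2g"             # JVM options for Storm workers
  topology.metrics.consumer.register:             # classes to register as metric consumers
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1
```

Before relying on any of these keys, check them against the StormCrawler and Storm versions you deploy, since names and defaults can differ between releases.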