Building your Topology using OWLer#

Building an OWLer Topology with Java#

OWLer is built around the concept of topologies. A topology defines how data flows between spouts and bolts, the two main component types in Apache Storm.

1. Spouts and Bolts#

Spout: Generates the stream of data. In the context of a crawler, a spout might emit URLs to be fetched.
Bolt: Processes the stream of data. This could be fetching the page at a URL, parsing the content, storing it, or any other task.



Figure: Storm topology made up of spouts and bolts. Spouts are the input nodes; bolts are the nodes that process the tuples.
Source: Andreoni, Martin. (2018). A Monitoring and Threat Detection System Using Stream Processing as a Virtual Function for Big Data.
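To make the spout/bolt split concrete, here is a minimal plain-Java analogy. It deliberately does not use the Storm API: `SpoutBoltSketch`, `seedUrls` and `fetchStub` are hypothetical names, with an iterator standing in for a spout and a function standing in for a bolt.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Plain-Java analogy only: no Storm classes are used here.
class SpoutBoltSketch {

    // "Spout": a source that emits URLs one at a time (hypothetical seed list).
    static Iterator<String> seedUrls() {
        return List.of("https://example.org/a", "https://example.org/b").iterator();
    }

    // "Bolt": a processing step applied to each emitted item.
    // A real fetcher would download the page; this stub only labels the URL.
    static String fetchStub(String url) {
        return "fetched:" + url;
    }

    // Drain the source and pass every item through the processing step.
    static List<String> run(Iterator<String> spout, Function<String, String> bolt) {
        List<String> out = new ArrayList<>();
        while (spout.hasNext()) {
            out.add(bolt.apply(spout.next()));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(seedUrls(), SpoutBoltSketch::fetchStub));
    }
}
```

In a real topology the source never "runs dry" in this way: a spout keeps emitting as long as the topology is alive, and bolts run in parallel rather than in a single loop.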



2. Defining your Topology#

Let’s create a basic crawling topology with the following components:

  • FeedSpout: Generates URLs to be fetched.

  • FetcherBolt: Fetches web pages.

  • ParserBolt: Parses the content of web pages.

  • StoreBolt: Stores the parsed content.

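import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();

// Arguments: component id, component instance, parallelism hint.
builder.setSpout("spout", new FeedSpout(), 1);
// shuffleGrouping("x") subscribes the bolt to the output of component "x".
builder.setBolt("fetch", new FetcherBolt(), 2).shuffleGrouping("spout");
builder.setBolt("parse", new ParserBolt(), 2).shuffleGrouping("fetch");
builder.setBolt("store", new StoreBolt(), 1).shuffleGrouping("parse");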
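Shuffle grouping spreads tuples across the tasks of the receiving bolt so that each task gets roughly the same share. The sketch below is a simplified model of that behavior, assuming a randomized task order followed by round-robin selection; it is not Storm's implementation, and `ShuffleGroupingSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Simplified model of shuffle grouping; not Storm's implementation.
class ShuffleGroupingSketch {
    private final List<Integer> taskOrder = new ArrayList<>();
    private int next = 0;

    ShuffleGroupingSketch(int numTasks, long seed) {
        for (int i = 0; i < numTasks; i++) {
            taskOrder.add(i);
        }
        // Randomize the order once, then cycle through it.
        Collections.shuffle(taskOrder, new Random(seed));
    }

    // Pick the task that receives the next tuple.
    int chooseTask() {
        int task = taskOrder.get(next);
        next = (next + 1) % taskOrder.size();
        return task;
    }

    // Count how many of `tuples` tuples each task would receive.
    static int[] distribute(int numTasks, int tuples, long seed) {
        ShuffleGroupingSketch grouping = new ShuffleGroupingSketch(numTasks, seed);
        int[] counts = new int[numTasks];
        for (int i = 0; i < tuples; i++) {
            counts[grouping.chooseTask()]++;
        }
        return counts;
    }
}
```

With two "fetch" tasks, as in the topology above, ten tuples end up split five and five in this model, whichever task happens to come first.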

3. Customizing your Bolts#

Often, you’ll want to modify or extend the built-in Bolts. Here’s how you can customize FetcherBolt:

FetcherBolt customFetcher = new FetcherBolt() {
    @Override
    public void execute(Tuple input) {
        // Your custom logic here
        super.execute(input);
    }
};
// Register this in place of the earlier "fetch" bolt; Storm rejects duplicate component ids.
builder.setBolt("fetch", customFetcher, 2).shuffleGrouping("spout");
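As an illustration of what could go where the comment says "Your custom logic here", the sketch below checks a URL before it is handed on for fetching. `FetchPreCheck` and `shouldFetch` are hypothetical names, and how the URL is read from the tuple depends on the fields your spout declares, so that part is left out.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: a check a customized bolt could run before delegating.
class FetchPreCheck {

    // Returns true only for absolute http/https URLs that have a host.
    static boolean shouldFetch(String url) {
        if (url == null) {
            return false;
        }
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            if (scheme == null || uri.getHost() == null) {
                return false;
            }
            return scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https");
        } catch (URISyntaxException e) {
            // Malformed URL: report it as not fetchable instead of throwing.
            return false;
        }
    }
}
```

In a real bolt, a tuple that is skipped this way would still need to be acked or failed explicitly, otherwise Storm treats it as pending until it times out.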

4. Configuring Your Topology#

StormCrawler provides configuration files (crawler-conf.yaml) to tune various parameters.

For instance, you can set the user agent:

http.agent.name: "MyCustomBot"
http.agent.version: "1.0"

5. Submitting Your Topology#

Finally, once your topology is defined and configured, you need to submit it to the Storm cluster.

// submitTopology declares checked exceptions, so the calling method must handle or declare them.
Config conf = new Config();
StormSubmitter.submitTopology("myCrawler", conf, builder.createTopology());
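Storm's `Config` class is a `HashMap<String, Object>`, so a freshly created instance carries none of the crawler settings; they have to be put into it before submission. The sketch below uses a plain `HashMap` as a stand-in to show how defaults and overrides could be combined. `ConfSketch`, its methods and the default values are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for org.apache.storm.Config, which is itself a HashMap<String, Object>.
class ConfSketch {

    // Defaults a crawler might ship with (values are illustrative).
    static Map<String, Object> defaults() {
        Map<String, Object> conf = new HashMap<>();
        conf.put("topology.workers", 1);
        conf.put("http.agent.name", "Anonymous");
        return conf;
    }

    // Later entries win, so overrides replace defaults key by key.
    static Map<String, Object> merge(Map<String, Object> base, Map<String, Object> overrides) {
        Map<String, Object> merged = new HashMap<>(base);
        merged.putAll(overrides);
        return merged;
    }
}
```

The same `putAll` call works on a real `Config` instance, since it inherits the map interface.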

6. Extending with External Modules#

You can extend your topology with various external modules such as OpenSearch for storage or monitoring. This requires additional configuration and integrating the corresponding StormCrawler modules.

For OpenSearch:

builder.setBolt("indexer", new OpenSearchIndexerBolt(), 1).shuffleGrouping("parse");

Building an OWLer Topology with Flux#

Flux is a framework that ships with Apache Storm for defining topologies in YAML instead of Java code. It also covers more advanced use cases such as multi-lang components, component dependencies, and external configuration files.

A Flux topology definition consists of three main sections:

  • Spouts: Defines the data sources.

  • Bolts: Defines data processing components.

  • Streams: Defines how data flows between components.

For our example, we’ll keep the same Spouts and Bolts (FeedSpout, FetcherBolt, ParserBolt, and StoreBolt).

Here’s what a Flux YAML looks like:

name: "crawler-topology"

spouts:
  - id: "spout"
    className: "com.example.storm.FeedSpout"
    parallelism: 1

bolts:
  - id: "fetch"
    className: "com.example.storm.FetcherBolt"
    parallelism: 2
  - id: "parse"
    className: "com.example.storm.ParserBolt"
    parallelism: 2
  - id: "store"
    className: "com.example.storm.StoreBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "fetch"
    grouping:
      type: SHUFFLE
  - from: "fetch"
    to: "parse"
    grouping:
      type: SHUFFLE
  - from: "parse"
    to: "store"
    grouping:
      type: SHUFFLE
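Because Flux wires components purely by id, a typo in a stream's from or to value is an easy mistake to make. The sketch below shows one way such a definition could be checked before deployment. `StreamCheckSketch` and its nested `Stream` class are hypothetical and are not part of Flux.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a wiring check for a topology definition; not part of Flux.
class StreamCheckSketch {

    // One "from -> to" edge, mirroring an entry of the streams section.
    static final class Stream {
        final String from;
        final String to;

        Stream(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    // Returns one message per problem found; an empty list means the wiring is consistent.
    static List<String> findProblems(Set<String> spoutIds, Set<String> boltIds, List<Stream> streams) {
        Set<String> all = new HashSet<>(spoutIds);
        all.addAll(boltIds);
        List<String> problems = new ArrayList<>();
        for (Stream s : streams) {
            if (!all.contains(s.from)) {
                problems.add("unknown source: " + s.from);
            }
            if (!all.contains(s.to)) {
                problems.add("unknown target: " + s.to);
            }
            // Spouts only emit; they cannot be the target of a stream.
            if (spoutIds.contains(s.to)) {
                problems.add("spout used as target: " + s.to);
            }
        }
        return problems;
    }
}
```

For the definition above, the spout id set is {"spout"}, the bolt id set is {"fetch", "parse", "store"}, and the three streams pass the check.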

3. External Configuration#

You can also link your Flux topology to an external Storm configuration file:

includes:
    - resource: false
      file: "/path/to/crawler-conf.yaml"
      override: false

4. Deploying with Flux#

To deploy a topology with Flux, use:

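storm jar my-crawler.jar org.apache.storm.flux.Flux --local --filter crawler.properties crawler-flux.yaml

Here:

  • --local: Runs the topology in local mode (omit it, or use --remote where your Flux version requires it, to deploy to a cluster).

  • --filter <file>: Performs property substitution. Placeholders of the form ${property.name} in the YAML are replaced with values from the given properties file (crawler.properties is a placeholder name).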
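The idea behind property substitution can be sketched in a few lines of Java. This is a minimal illustration of replacing ${key} placeholders from a `Properties` object, not Flux's own code; `PropertyFilterSketch` is a hypothetical name, and leaving unknown keys untouched is a choice made for this sketch.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of ${key} substitution; Flux's own implementation may differ in details.
class PropertyFilterSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${key} with its value from props; unknown keys are left untouched.
    static String substitute(String yaml, Properties props) {
        Matcher m = PLACEHOLDER.matcher(yaml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(m.group(1));
            String replacement = (value != null) ? value : m.group(0);
            // quoteReplacement keeps "$" and "\" in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a properties file containing fetcher.parallelism=2, a YAML line such as parallelism: ${fetcher.parallelism} would become parallelism: 2, which lets one topology file serve several environments.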

5. Advanced Flux Configurations#

Flux supports various advanced configurations:

  • Multi-lang components: You can easily incorporate spouts and bolts written in other languages.

  • Topology Configurations: Allows for specifying Storm topology configs like workers, max spout pending, etc.

For instance, to specify the number of workers:

config:
  topology.workers: 4
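To get a feel for what a worker count means, it helps to compare it with the total parallelism of the topology. The back-of-the-envelope sketch below assumes the four components and parallelism values used earlier on this page and an even spread of executors across workers; it ignores system executors such as ackers, and `WorkerMathSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Back-of-the-envelope arithmetic; ignores system executors such as ackers.
class WorkerMathSketch {

    // Sum the parallelism hints of all components.
    static int totalExecutors(Map<String, Integer> parallelism) {
        int total = 0;
        for (int p : parallelism.values()) {
            total += p;
        }
        return total;
    }

    // Upper bound on executors per worker if they are spread as evenly as possible.
    static int maxExecutorsPerWorker(int totalExecutors, int workers) {
        return (totalExecutors + workers - 1) / workers; // ceiling division
    }

    public static void main(String[] args) {
        Map<String, Integer> p = new LinkedHashMap<>();
        p.put("spout", 1);
        p.put("fetch", 2);
        p.put("parse", 2);
        p.put("store", 1);
        int total = totalExecutors(p);
        System.out.println(total + " executors, at most "
                + maxExecutorsPerWorker(total, 4) + " per worker");
    }
}
```

Here 1 + 2 + 2 + 1 gives 6 executors, so with 4 workers no worker would host more than 2 of them under this assumption. Adding workers beyond the number of executors brings no extra parallelism.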

6. Extending with External Modules using Flux#

To integrate an external module such as Elasticsearch, declare its bolt and add a stream that feeds it:

bolts:
  ...
  - id: "indexer"
    className: "com.example.storm.IndexerBolt"
    parallelism: 1

streams:
  ...
  - from: "parse"
    to: "indexer"
    grouping:
      type: SHUFFLE

Best Practices#

  1. Scalability: Increase the parallelism factor for those bolts that are more resource-intensive.

  2. Robustness: Handle exceptions and failures gracefully. Use Storm’s ack/fail mechanism: fail a tuple when its processing does not succeed, so that it can be replayed and retried.

  3. Politeness: Be respectful to servers. Don’t overload them with too many requests. Set appropriate delays in the configuration.
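The politeness rule boils down to spacing out requests to the same host. The sketch below illustrates that idea with a fixed per-host delay. It is not the crawler's actual scheduling code: `PolitenessSketch` and `reserve` are hypothetical names, and in practice the delay would come from the configuration (StormCrawler, for example, has a fetcher.server.delay setting; check which keys your version supports).

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of per-host politeness; not the crawler's actual scheduling code.
class PolitenessSketch {
    private final long delayMillis;
    // Earliest time (ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Returns how long to wait (ms) before fetching from `host` at time `now`,
    // and reserves the slot so the following request to that host waits its turn.
    long reserve(String host, long now) {
        long earliest = nextAllowed.getOrDefault(host, now);
        long start = Math.max(earliest, now);
        nextAllowed.put(host, start + delayMillis);
        return start - now;
    }
}
```

Requests to different hosts do not delay each other in this model, which is why a crawl over many hosts can stay fast while remaining gentle on each individual server.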

Conclusion#

The architecture you choose for a StormCrawler topology depends on the specific requirements of your crawling task. Work with the latest versions, monitor performance metrics, and adjust parallelism and configuration as your data demands grow or change.