WARC2WARC Topology#
This topology ingests externally provided WARC files into the index and populates the frontier with the links it identifies. This allows crawling results from external crawlers, most prominently Common Crawl, to be integrated continuously into our pipelines. Functionally, it retrieves WARC files from S3-compliant storage, parses them, extracts the embedded content and relevant hyperlinks, submits the extracted URLs to the frontier, and archives the WARC entries without modification. Because this non-fetching crawling pipeline relies solely on web content that has already been obtained, it eliminates the direct risks associated with unauthorized or impolite crawling. For data and high-performance computing centers, this crawling mode also eliminates potential security impacts and alarms (e.g. due to accessing bot sites while crawling) as well as complaints from webmasters.
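The per-record work described above (extract links, feed the frontier, archive the entry untouched) can be sketched roughly as follows. This is an illustration only, not the topology's actual implementation: `LinkCollector`, `extract_links`, and `process_record` are hypothetical names, the frontier and archive are stood in for by plain lists, and the WARC record is assumed to have been parsed already into a URL and an HTML payload.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(page_url, html_bytes):
    """Return the absolute http(s) links of one archived page, deduplicated in order."""
    collector = LinkCollector()
    collector.feed(html_bytes.decode("utf-8", errors="replace"))
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links against the page URL and drop the fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def process_record(page_url, html_bytes, frontier, archive):
    """Handle one already-fetched record; no network request is made here."""
    frontier.extend(extract_links(page_url, html_bytes))  # hypothetical frontier: a list
    archive.append((page_url, html_bytes))                # payload is stored unmodified
```

Note that nothing in this sketch contacts the origin server: the only inputs are bytes that an external crawler has already collected.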
Configuring and Tuning S3A Fast Upload#
Note
These tuning recommendations are experimental and may change in the future.
Because of the nature of the S3 object store, data written to an S3A OutputStream is not written incrementally. Instead, by default, it is buffered to disk until the stream is closed in its close() method.
This can make output slow because the execution time for OutputStream.close() is proportional to the amount of data buffered and inversely proportional to the available bandwidth.
To enable the fast upload mechanism, set the fs.s3a.fast.upload property (it is disabled by default).
When this is set, the incremental block upload mechanism is used, with the buffering mechanism set in fs.s3a.fast.upload.buffer. The number of threads performing uploads in the filesystem is defined by fs.s3a.threads.max; the queue of waiting uploads is limited by fs.s3a.max.total.tasks. The size of each buffer is set by fs.s3a.multipart.size.
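As a rough illustration of how these properties fit together, the sketch below renders them as Hadoop-style XML `<property>` entries. It assumes the settings are supplied through a Hadoop configuration file such as core-site.xml, which this section does not state. The helper `to_hadoop_xml` is hypothetical, and every value except the 500M multipart size used elsewhere in this section is a placeholder rather than a recommendation.

```python
from xml.sax.saxutils import escape

# Property names are the ones discussed above. Values are assumed
# placeholders, apart from fs.s3a.multipart.size (500M in this section).
S3A_FAST_UPLOAD = {
    "fs.s3a.fast.upload": "true",
    "fs.s3a.fast.upload.buffer": "disk",  # assumed; Hadoop also accepts "array" and "bytebuffer"
    "fs.s3a.multipart.size": "500M",
    "fs.s3a.threads.max": "16",           # assumed placeholder
    "fs.s3a.max.total.tasks": "32",       # assumed placeholder
}


def to_hadoop_xml(properties):
    """Render a dict of settings as a Hadoop-style <configuration> document."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


print(to_hadoop_xml(S3A_FAST_UPLOAD))
```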
Parameter | Value | Description |
---|---|---|
 |  | The |
 | 500M | Defines the size (in bytes) of the chunks into which upload or copy operations are split. |
 | 8 | Defines the maximum number of blocks a single output stream can have active, whether uploading or queued to the central FileSystem instance’s pool of queued operations. |
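To get a feel for what the two values in the table imply together, the following back-of-the-envelope sketch computes how much data a single output stream could have buffered at once. It assumes that each active block is one multipart-sized buffer and that the `M` suffix is a binary multiplier, as in Hadoop's size notation. `parse_size` and `max_pending_bytes` are hypothetical helpers, not part of any library.

```python
# Binary multipliers for size suffixes (assumed to follow Hadoop's K/M/G/T convention).
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size such as '500M' or '1048576' to a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def max_pending_bytes(block_size, active_blocks, open_streams=1):
    """Upper bound on the data held in buffers while waiting for upload."""
    return parse_size(block_size) * active_blocks * open_streams


per_stream = max_pending_bytes("500M", 8)
print(per_stream)            # 4194304000 bytes
print(per_stream / 1024**3)  # 3.90625 (GiB)
```

Under these assumptions, each open stream may need close to 4 GiB of buffer space, so the local disk (or memory, for an in-memory buffer) should be sized for the number of streams expected to be open concurrently.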