What is the meaning of "bucket" in the StormCrawler project? I have seen buckets in different spouts of the project; for example, they are used in the Solr- and SQL-based spouts.
A bucket is simply a way of partitioning the data from the backend in order to guarantee a good diversity of sources while crawling. The bucket values are usually the hostnames, domains, or IP addresses of the pages.
Without buckets, the spout could get a lot of URLs for the same website. The FetcherBolt enforces politeness and internally stores URLs in queues, so in the worst-case scenario it would have a single queue holding loads of URLs and would fetch them one by one, with a politeness delay between each fetch.
With bucketing, you get a number of URLs from various sites and fetch them in parallel. Internally, the FetcherBolt would have a lot of queues with a few URLs in each of them.
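The idea above can be sketched in plain Java. This is not StormCrawler code; the class and method names are made up for illustration. It groups URLs into buckets keyed by hostname and then emits them round-robin across buckets, so the first pass yields at most one URL per host:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch (not actual StormCrawler code): bucketing by hostname
// and round-robin emission across buckets, so that no single site
// dominates the batch handed to the FetcherBolt.
public class BucketSketch {

    // Group URLs into buckets keyed by their hostname.
    static Map<String, Deque<String>> bucketByHost(List<String> urls) {
        Map<String, Deque<String>> buckets = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            buckets.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        return buckets;
    }

    // Take at most maxPerBucket URLs from each bucket, interleaved,
    // so the resulting batch alternates between hosts.
    static List<String> nextBatch(Map<String, Deque<String>> buckets,
                                  int maxPerBucket) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < maxPerBucket; i++) {
            for (Deque<String> queue : buckets.values()) {
                if (!queue.isEmpty()) {
                    batch.add(queue.poll());
                }
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "https://a.example/1", "https://a.example/2", "https://a.example/3",
            "https://b.example/1", "https://c.example/1");
        // One URL per host in the first pass: diversity of sources.
        System.out.println(nextBatch(bucketByHost(urls), 1));
        // prints [https://a.example/1, https://b.example/1, https://c.example/1]
    }
}
```

With the same input but no bucketing, the three `a.example` URLs would sit at the front of a single queue and be fetched sequentially, each waiting out the politeness delay; the interleaved batch lets the fetcher work on the other hosts in parallel instead.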
You can see the number of internal queues and active threads of the FetcherBolt on the Grafana dashboard (or the Kibana one).
Performance-wise, it is better to have the best possible diversity of sources.