I have a few URLs I need to scrape using StormCrawler. I followed all the steps in this blog post (https://medium.com/analytics-vidhya/web-scraping-and-indexing-with-stormcrawler-and-elasticsearch-a105cb9c02ca), and the scraped content was loaded into Elasticsearch.

In that blog, the author uses Flux to define and submit the injection topology to ES:
```yaml
spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "stormcrawlertest-master/"
      - "seeds.txt"
      - true

streams:
  - from: "spout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
      streamId: "status"
```
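For reference, this Flux file is submitted with the Storm CLI. Assuming the jar name from my build and a Flux file called `es-injector.flux` (the file name is from the blog, not something I can confirm here), the command looks like:

```shell
# Run the Flux-defined injector topology in local mode.
# org.apache.storm.flux.Flux is Storm's standard Flux entry point;
# the jar and flux file names are assumptions from my project layout.
storm jar target/stormcrawlertest-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-injector.flux
```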
This injects the URLs into ES. I then ported the same Flux definition to Java and created a main class:

```java
public class SiteScraper {
    public static void main(String[] args) throws Exception {
        String[] argsa = new String[] { "-conf", "/crawler-conf.yaml", "-conf", "/es-conf.yaml", "-local" };
        ConfigurableTopology.start(new InjectorTopology(), argsa);
    }
}
```
```java
public class InjectorTopology extends ConfigurableTopology {
    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new FileSpout("stormcrawlertest-master/", "seeds.txt", true), 1);
        builder.setBolt("status", new StatusUpdaterBolt(), 1)
               .customGrouping("spout", new URLStreamGrouping(Constants.PARTITION_MODE_HOST));
        return submit("ESInjectorInstance", conf, builder);
    }
}
```
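Since the Java version fails silently, one thing I checked was whether the seed file is even visible from the working directory the topology runs in. `SeedCheck` below is just my own throwaway helper for that, not part of StormCrawler; it resolves the same path that `new FileSpout("stormcrawlertest-master/", "seeds.txt", true)` would use:

```java
import java.io.File;

// Throwaway debugging helper (not part of StormCrawler): checks whether
// the seed file resolves from the current working directory, using the
// same (dir, file) pair passed to FileSpout in the topology above.
public class SeedCheck {
    public static void main(String[] args) {
        File seeds = new File("stormcrawlertest-master/", "seeds.txt");
        System.out.println("looking for: " + seeds.getAbsolutePath());
        System.out.println("exists: " + seeds.exists());
    }
}
```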
I then ran `mvn clean package` and submitted the jar with:

```shell
python storm.py jar target/stormcrawlertest-1.0-SNAPSHOT.jar com.my.sitescraper.main.SiteScraper
```

but this does not inject any URLs into ES. What am I missing?