Apache Nutch - How to store crawl data under the folder with the page name/url


I'm currently using Apache Nutch to crawl a website. I usually dump the crawled data with this command:

bin/nutch dump -segment crawl/segments -outputDir test_data

However, the resulting output folders are named like this: a1, af, d3, etc.

I want to configure the crawl so that the folders are named after the website section, such as "About Us" or "News", instead of "ca" or "d3". Thank you.

I tried editing nutch-site.xml and adding a property, but as a first-timer I lack the know-how to make it work properly.
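
To make the desired naming concrete, here is a minimal Python sketch of the mapping from a page URL to a section folder name. It is not part of Nutch and does not read Nutch's dump output; the function name section_folder, the "home" fallback, and the example.com URLs are all hypothetical. It assumes the section is the first path segment of the URL, which may not hold for every site.

```python
import re
from urllib.parse import unquote, urlparse


def section_folder(url: str, default: str = "home") -> str:
    """Derive a folder name from the first path segment of a URL.

    Hypothetical helper for illustration only; it assumes the site
    section is the first path segment (e.g. /news/... -> "news").
    """
    path = urlparse(url).path
    # Drop empty pieces produced by leading, trailing, or doubled slashes.
    segments = [s for s in path.split("/") if s]
    if not segments:
        return default
    # Decode percent-escapes such as %20 before cleaning the name.
    name = unquote(segments[0])
    # Replace characters that are awkward in folder names with "_".
    name = re.sub(r"[^\w\-. ]", "_", name).strip()
    return name or default


if __name__ == "__main__":
    for u in (
        "https://example.com/about-us/team.html",
        "https://example.com/News/2024/item",
        "https://example.com/",
    ):
        print(u, "->", section_folder(u))
```

A post-processing script could use such a function to move each dumped file into its section folder, provided the URL of each dumped file is known.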
