Best way to crawl through a file system and index it


I am working on a project where I need to crawl through more than 10 TB of data and index it. I need to implement incremental crawling so that re-crawls take less time than a full pass.

My question is: which tool is best suited for this, and which do large organizations use together with Java?

I was trying it out using Solr and Apache ManifoldCF, but ManifoldCF has very little documentation on the internet.


2 Answers

Answer by Shashank Raj (accepted)

We ended up using SolrJ (Java) and Apache ManifoldCF. Although the documentation for ManifoldCF was little to none, we subscribed to the mailing list and asked the developers questions, and they responded quickly. However, I would not recommend this setup to anyone, as Apache ManifoldCF is outdated and poorly built, so it is better to look for alternatives. Hope this helps somebody.
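Since the question is about incrementally crawling a file system and pushing documents into Solr, here is a minimal sketch of that loop using plain NIO and SolrJ. The Solr URL, core name (`files`), field names (`size_l`, `modified_dt`), and the checkpoint timestamp are all hypothetical placeholders; a real crawler would persist the checkpoint between runs.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Instant;

public class IncrementalCrawler {
    public static void main(String[] args) throws Exception {
        // Hypothetical values: adjust the Solr URL, core, root path,
        // and checkpoint to your environment.
        String solrUrl = "http://localhost:8983/solr/files";
        Path root = Paths.get("/data");
        Instant lastRun = Instant.parse("2017-01-01T00:00:00Z"); // checkpoint from the previous crawl

        try (SolrClient solr = new HttpSolrClient.Builder(solrUrl).build()) {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    // Incremental step: only re-index files modified since the last run.
                    if (attrs.lastModifiedTime().toInstant().isAfter(lastRun)) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", file.toString());
                        doc.addField("size_l", attrs.size());
                        doc.addField("modified_dt", attrs.lastModifiedTime().toString());
                        try {
                            solr.add(doc);
                        } catch (Exception e) {
                            throw new IOException("Failed to index " + file, e);
                        }
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
            solr.commit();
        }
    }
}
```

On that scale, batching documents before each `add` call and committing once at the end (as above, rather than per file) keeps indexing overhead down.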

Answer by Harisudhan. A

For any crawling activity in Java, it is best to go with the open-source Jsoup and SolrJ APIs; both have clear, easy-to-understand documentation.

Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
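For example, a minimal Jsoup snippet that fetches a page and extracts its title and links; the URL here is just a placeholder:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; replace with the page you want to parse.
        Document doc = Jsoup.connect("https://example.com").get();
        System.out.println("Title: " + doc.title());
        // CSS-style selector, as in jQuery: every anchor tag with an href.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}
```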

SolrJ is an API that makes it easy for Java applications to talk to Solr. SolrJ hides a lot of the details of connecting to Solr and allows your application to interact with Solr with simple high-level methods.
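As an illustration of those high-level methods, here is a small SolrJ sketch that connects to a core and runs a query. The core name (`files`) and the field names are assumptions, not part of SolrJ itself:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical core "files"; use your own collection name.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/files").build()) {
            SolrQuery query = new SolrQuery("content:report");
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument d : response.getResults()) {
                System.out.println(d.getFieldValue("id"));
            }
        }
    }
}
```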

For more options, you can also try Elasticsearch with its Java API.
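A minimal sketch of indexing one document through the Elasticsearch Java REST client (the `RestHighLevelClient` of the 7.x line, since superseded by a newer Java API client); the index name and document fields are hypothetical:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class EsIndexer {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Hypothetical index "files"; the JSON body is the document to store.
            IndexRequest request = new IndexRequest("files")
                    .id("/data/report.txt")
                    .source("{\"path\":\"/data/report.txt\",\"size\":1024}", XContentType.JSON);
            client.index(request, RequestOptions.DEFAULT);
        }
    }
}
```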