Best way to crawl through a file system and index it


I am working on a project where I need to crawl through more than 10 TB of data and index it. I need to implement incremental crawling so that re-crawls take less time than a full pass.

My question is: which tool is best suited for this, and which do large organizations use together with Java?

I was trying it out using Solr and Apache ManifoldCF, but ManifoldCF has very little documentation on the internet.


2 Answers

Answer by Shashank Raj (accepted)

We ended up using SolrJ (Java) and Apache ManifoldCF. Although the documentation for ManifoldCF was little to none, we subscribed to the mailing list and asked the developers questions, and they responded quickly. However, I would not recommend this setup to anyone, as Apache ManifoldCF is outdated and poorly built, so it is better to look for alternatives. Hope this helps somebody.
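Since the question is about incrementally crawling a file system and pushing documents into Solr, here is a minimal sketch of that loop using plain NIO and SolrJ. The Solr URL, core name (`files`), field names (`size_l`, `modified_dt`), and the checkpoint timestamp are all hypothetical placeholders; a real crawler would persist the checkpoint between runs.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Instant;

public class IncrementalCrawler {
    public static void main(String[] args) throws Exception {
        // Hypothetical values: adjust the Solr URL, core, root path,
        // and checkpoint to your environment.
        String solrUrl = "http://localhost:8983/solr/files";
        Path root = Paths.get("/data");
        Instant lastRun = Instant.parse("2017-01-01T00:00:00Z"); // checkpoint from the previous crawl

        try (SolrClient solr = new HttpSolrClient.Builder(solrUrl).build()) {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    // Incremental step: only re-index files modified since the last run.
                    if (attrs.lastModifiedTime().toInstant().isAfter(lastRun)) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", file.toString());
                        doc.addField("size_l", attrs.size());
                        doc.addField("modified_dt", attrs.lastModifiedTime().toString());
                        try {
                            solr.add(doc);
                        } catch (Exception e) {
                            throw new IOException("Failed to index " + file, e);
                        }
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
            solr.commit();
        }
    }
}
```

On that scale, batching documents before each `add` call and committing once at the end (as above, rather than per file) keeps indexing overhead down.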

Answer by Harisudhan. A

For any crawling activity in Java, it is best to go with the open-source Jsoup and SolrJ APIs; both have clear, easy-to-understand documentation.

Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
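For example, a minimal Jsoup snippet that fetches a page and extracts its title and links; the URL here is just a placeholder:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; replace with the page you want to parse.
        Document doc = Jsoup.connect("https://example.com").get();
        System.out.println("Title: " + doc.title());
        // CSS-style selector, as in jQuery: every anchor tag with an href.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}
```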

SolrJ is an API that makes it easy for Java applications to talk to Solr. SolrJ hides a lot of the details of connecting to Solr and allows your application to interact with Solr with simple high-level methods.
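As an illustration of those high-level methods, here is a small SolrJ sketch that connects to a core and runs a query. The core name (`files`) and the field names are assumptions, not part of SolrJ itself:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical core "files"; use your own collection name.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/files").build()) {
            SolrQuery query = new SolrQuery("content:report");
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument d : response.getResults()) {
                System.out.println(d.getFieldValue("id"));
            }
        }
    }
}
```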

For more options, you can also try Elasticsearch with its Java API.
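A minimal sketch of indexing one document through the Elasticsearch Java REST client (the `RestHighLevelClient` of the 7.x line, since superseded by a newer Java API client); the index name and document fields are hypothetical:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class EsIndexer {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Hypothetical index "files"; the JSON body is the document to store.
            IndexRequest request = new IndexRequest("files")
                    .id("/data/report.txt")
                    .source("{\"path\":\"/data/report.txt\",\"size\":1024}", XContentType.JSON);
            client.index(request, RequestOptions.DEFAULT);
        }
    }
}
```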