I am working on a project where I need to crawl more than 10 TB of data and index it. I need to implement incremental crawling so that subsequent crawls take less time.
My question is: which tools are best suited for this, and what do large organizations use alongside Java?
I tried Solr and Apache ManifoldCF, but ManifoldCF has very little documentation online.
We ended up using SolrJ (Java) and Apache ManifoldCF. Although the documentation for ManifoldCF was little to none, we subscribed to the mailing list and asked the developers questions, and they responded quickly. However, I would not recommend this setup to anyone: Apache ManifoldCF struck us as outdated and poorly built, so it is better to look for alternatives. Hope this helps somebody.
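Whatever connector framework you pick, the core of incremental crawling is the same: remember a change marker (such as a last-modified timestamp) for each document and only re-index what changed. Here is a minimal sketch of that idea in plain Java, assuming a filesystem source; the class and method names are illustrative, not part of any library, and in a real setup you would persist the seen-timestamps map between runs and feed the changed files to your indexer (e.g. via SolrJ):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class IncrementalCrawler {
    // Last seen modification time per path. In a real crawler this map
    // would be persisted (e.g. to disk or a database) between runs.
    private final Map<Path, Long> lastSeen = new HashMap<>();

    /** Returns only the files that are new or modified since the previous crawl. */
    public List<Path> changedFiles(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> {
                           try {
                               long mtime = Files.getLastModifiedTime(p).toMillis();
                               Long prev = lastSeen.put(p, mtime);
                               return prev == null || prev < mtime; // new or updated
                           } catch (IOException e) {
                               return false; // skip unreadable files
                           }
                       })
                       .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        IncrementalCrawler crawler = new IncrementalCrawler();
        Path root = Files.createTempDirectory("crawl-demo");
        Files.writeString(root.resolve("a.txt"), "hello");
        // First pass sees the file as new; second pass sees nothing changed.
        System.out.println(crawler.changedFiles(root).size()); // prints 1
        System.out.println(crawler.changedFiles(root).size()); // prints 0
    }
}
```

This is roughly what connector frameworks like ManifoldCF do under the hood: they track per-document version information so a repeat crawl only touches new or modified content.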