I have a large index over which I need to perform near-real-time updates and full-text search, but I also want to be able to run map-reduce jobs over that data. Is it possible to do this without having to maintain two separate copies of the data? (e.g. one copy in Solr, another in HDFS).

It looks like Solr can be configured to use HDFS for storage, but that doesn't really help with map-reduce: it just stores the index files in HDFS in a format that would be difficult to read from a Hadoop map-reduce job.

For ElasticSearch there is es-hadoop, but it is geared towards reading and writing to ElasticSearch from within Hadoop; it doesn't seem to solve the problem of getting data into HDFS in near real time, or of avoiding two copies of the data.

Has anyone faced a similar problem or possibly found other tools that might help solve the problem? Or is it standard practice to have a separate copy of your data for map-reduce jobs?

Thanks!


1 Answer

Ramzy (BEST ANSWER)

If you are talking about having the option to store the data in HDFS (to run map-reduce) and also index it with Solr, then I think you can follow the steps below.

For real-time streaming (e.g. Twitter), you need to store the data as it arrives. One option is to send the events to Kafka and consume them with Storm. From there you can write to HDFS and to Solr in parallel; Storm's bolts let you fan the same stream out to both sinks (see the sketch below). Once the data is in HDFS you can run map-reduce over it, and once it is in Solr you can search it. If you want the two copies to stay in sync, you can try some event processing that listens for data inserted into HDFS (or its stack) and triggers indexing in Solr. Please go through the Kafka and Storm documentation to get the basic idea. Alternatives could be Flume or Spark; I'm not sure about those.
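To illustrate the "fan out to two sinks" idea, here is a minimal sketch of such a topology, assuming a pre-1.0 Storm release (backtype packages) with the storm-kafka and storm-hdfs modules and SolrJ 4.x on the classpath. The host names, the "tweets" topic and Solr collection, and the SolrIndexBolt class are all placeholders I made up for the example, not anything from your setup; exact class names will vary with the Storm and SolrJ versions you use.

```java
import java.util.Map;
import java.util.UUID;

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DualWriteTopology {

    /** Hypothetical bolt that indexes each incoming tuple into Solr via SolrJ. */
    public static class SolrIndexBolt extends BaseBasicBolt {
        private transient HttpSolrServer solr;

        @Override
        public void prepare(Map stormConf, TopologyContext context) {
            // Placeholder Solr URL and collection name
            solr = new HttpSolrServer("http://solr-host:8983/solr/tweets");
        }

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            try {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", UUID.randomUUID().toString());
                doc.addField("text", tuple.getString(0));
                solr.add(doc); // near-real-time visibility relies on autoSoftCommit in Solr
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt, emits nothing downstream
        }
    }

    public static void main(String[] args) throws Exception {
        // Kafka spout reading the raw stream (e.g. tweets) as plain strings
        SpoutConfig spoutConf = new SpoutConfig(
                new ZkHosts("zk-host:2181"), "tweets", "/kafka-spout", "dual-write");
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

        // HDFS bolt appending the same records as delimited files for map-reduce
        HdfsBolt hdfsBolt = new HdfsBolt()
                .withFsUrl("hdfs://namenode:8020")
                .withFileNameFormat(new DefaultFileNameFormat().withPath("/data/tweets/"))
                .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter("\t"))
                .withRotationPolicy(new FileSizeRotationPolicy(128.0f, Units.MB))
                .withSyncPolicy(new CountSyncPolicy(1000));

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 1);
        // Both bolts subscribe to the same spout, so every record is written
        // to HDFS and indexed in Solr in parallel.
        builder.setBolt("hdfs-bolt", hdfsBolt, 2).shuffleGrouping("kafka-spout");
        builder.setBolt("solr-bolt", new SolrIndexBolt(), 2).shuffleGrouping("kafka-spout");

        StormSubmitter.submitTopology("dual-write", new Config(), builder.createTopology());
    }
}
```

The key point is only the two `shuffleGrouping("kafka-spout")` calls: the same stream feeds both sinks, so you keep a single ingestion path even though the data ends up in two stores.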