Merging Solr index stored in HDFS not working

I'm trying to merge two Solr core indexes into a new one using org/apache/lucene/misc/IndexMergeTool.

All indexes are saved on HDFS under path /apps/solr/data/collection_name/data/index.

So I've created a new collection, say col_new, and I'm trying to merge two cores of col_1 into it: core_1 and core_2.

The command I'm using is the following:

""" java -cp /usr/cloudera-hdp-solr/5.0.0.5-301/cloudera-hdp-solr/solr/server/solr-webapp/webapp/WEB-INF/lib/lucene-core-7.4.0.jar:/usr/cloudera-hdp-solr/5.0.0.5-301/cloudera-hdp-solr/solr/server/solr-webapp/webapp/WEB-INF/lib/lucene-misc-7.4.0.jar org/apache/lucene/misc/IndexMergeTool -destDir hdfs://namenode/path_to_new_core/data/index -srcDir hdfs://namenode/path_to_old_core_1/data/index hdfs://namenode/path_to_old_core_2/data/index """

The behaviour is strange: it creates a folder named hdfs: and two others named -srcDir and -destDir.

Does anyone have experience merging indexes stored on a shared file system?

Other details:

  • Solr version 7.4
  • HDP v3
  • Lucene 5.0.0

Thanks.

There is 1 answer

Egor On

The problem may be the directory type that IndexMergeTool uses to read and write index files. I am not sure about all versions, but the latest version uses FSDirectory to access the files.

FSDirectory has several implementations, but all of them work with local file systems, not with HDFS. To access HDFS, the tool would have to use HdfsDirectory.
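This would also explain the odd folders. As a rough illustration, assuming the tool simply turns each command-line argument into a local path via java.nio.file.Paths, the sketch below shows what a local path parser makes of the arguments from the question. The helper name firstLocalComponent is made up for this example. On a Unix-like system the first element of the hdfs:// argument should come out as hdfs:, which matches the folder you saw. As far as I can tell, the 7.4 tool also takes plain positional arguments (merged index first, then the source indexes), so -destDir and -srcDir are treated as directory names too.

```kotlin
import java.nio.file.Paths

// Returns the first element of `arg` when it is parsed as a local path,
// which is how a local-only Directory implementation would see it.
// (Hypothetical helper, for illustration only.)
fun firstLocalComponent(arg: String): String =
    Paths.get(arg).getName(0).toString()

fun main() {
    // Arguments shaped like the ones in the question.
    val args = listOf(
        "-destDir",
        "hdfs://namenode/path_to_new_core/data/index",
        "-srcDir"
    )
    for (arg in args) {
        // Each argument is resolved relative to the current working directory.
        println("$arg -> ${firstLocalComponent(arg)}")
    }
}
```

Nothing here talks to HDFS at all; the URI scheme is just read as part of a relative directory name.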

It looks like IndexMergeTool can't help you merge indexes stored on HDFS, but you can implement your own merger using HdfsDirectory:

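import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.lucene.index.IndexWriter
import org.apache.lucene.index.IndexWriterConfig
import org.apache.solr.store.hdfs.HdfsDirectory  // ships with solr-core

val hdfsConfig = Configuration()  // reads core-site.xml / hdfs-site.xml from the classpath
val part1 = HdfsDirectory(Path("path-to-part1"), hdfsConfig)
val part2 = HdfsDirectory(Path("path-to-part2"), hdfsConfig)
val output = HdfsDirectory(Path("output-path"), hdfsConfig)

// use {} closes the writer, which commits the merged index
IndexWriter(output, IndexWriterConfig()).use {
    it.addIndexes(part1)
    it.addIndexes(part2)
}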

It is also important to note that writing directly to HDFS may cause performance problems if your indexes contain a lot of files. Sometimes it may be faster to merge the indexes locally and then copy the resulting index to HDFS.
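A minimal sketch of that local route, under a few assumptions: Lucene and the Hadoop client are on the classpath, the local disk has room for all parts plus the merged copy, and nothing else is writing to the source indexes while they are copied. The function name mergeLocallyThenUpload and its parameters are made up for this example. It deletes the leftover write.lock before uploading because, as far as I know, Solr's HDFS lock factory treats the mere presence of that file as a held lock.

```kotlin
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.lucene.index.IndexWriter
import org.apache.lucene.index.IndexWriterConfig
import org.apache.lucene.store.Directory
import org.apache.lucene.store.FSDirectory
import java.nio.file.Files

// Hypothetical helper: `sources` and `target` are full URIs such as hdfs://namenode/...
fun mergeLocallyThenUpload(
    sources: List<String>,
    target: String,
    conf: Configuration = Configuration()
) {
    val work = Files.createTempDirectory("index-merge")

    // 1. Download every source index to local disk.
    //    The last argument selects the raw local file system, so no .crc side files are written.
    val localParts: List<Directory> = sources.mapIndexed { i, src ->
        val local = work.resolve("part$i")
        val srcPath = Path(src)
        srcPath.getFileSystem(conf).copyToLocalFile(false, srcPath, Path(local.toString()), true)
        FSDirectory.open(local)
    }

    // 2. Merge on local disk; closing the writer commits the result.
    val merged = work.resolve("merged")
    try {
        FSDirectory.open(merged).use { out ->
            IndexWriter(out, IndexWriterConfig()).use { writer ->
                writer.addIndexes(*localParts.toTypedArray())
            }
        }
    } finally {
        localParts.forEach { it.close() }
    }

    // 3. Remove the lock file Lucene leaves behind so it is not uploaded with the index.
    Files.deleteIfExists(merged.resolve(IndexWriter.WRITE_LOCK_NAME))

    // 4. Upload the merged index. If `target` already exists as a directory,
    //    Hadoop copies into a subdirectory of it, so point this at a fresh path.
    val targetPath = Path(target)
    targetPath.getFileSystem(conf).copyFromLocalFile(Path(merged.toString()), targetPath)
}
```

Whether this beats merging directly on HDFS depends on index size and network speed, so it is worth timing both on a small index first.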