I'm preparing to build a distributed search module with Lucene and Hadoop, but I'm confused about something:
As we know, HDFS is a distributed file system. When I put a file into HDFS, it is divided into several blocks that are stored on different slave machines in the cluster. But if I use Lucene to write an index to HDFS, I want to see the index on each machine. How can I achieve that?
I have read some of hadoop/contrib/index and some of Katta, but I don't understand the idea of "shards", which look like parts of the index. Is a shard stored on the local disk of one computer, or is there only one directory distributed across the cluster?
Thanks in advance.
- As for your question 1:
You can extend Lucene's "Directory" class (it is an abstract class) so that it works with Hadoop, and let Hadoop handle the files you submit to it. You could also provide your own implementation of "IndexWriter" and "IndexReader" and use your Hadoop client to write and read the index. That way you have more control over the format of the index you write. You can then "see" or access the index on each machine through your Lucene/Hadoop implementation.
- For your question 2:
A shard is a subset of the index. When you run a query, all shards are searched at the same time and the results from the individual shards are combined. Each machine in your cluster holds one part of your index: a shard. So each part of the index is stored on a local machine, but the whole thing appears to you as a single index distributed across the cluster.
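To make the shard idea concrete, here is a minimal sketch in plain Java. It does not use Lucene or Hadoop at all: each `Shard` is a tiny in-memory term index standing in for the Lucene index on one machine, and all names (`ShardedSearchSketch`, `Shard`, `Hit`, `shardFor`, `index`, `search`) as well as the hash-based routing are assumptions made for illustration, not how Katta or hadoop/contrib/index actually do it. It only shows the pattern: route each document to one shard, query all shards in parallel, then merge the partial results.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardedSearchSketch {

    // One match: a document id plus a score (here simply the term frequency).
    static final class Hit {
        final String docId;
        final int score;
        Hit(String docId, int score) { this.docId = docId; this.score = score; }
    }

    // Hypothetical stand-in for the index held by one machine.
    static final class Shard {
        // term -> (docId -> number of occurrences)
        private final Map<String, Map<String, Integer>> postings = new HashMap<>();

        void add(String docId, String text) {
            for (String term : text.toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(term, t -> new HashMap<>())
                        .merge(docId, 1, Integer::sum);
            }
        }

        List<Hit> search(String term) {
            Map<String, Integer> docs = postings.getOrDefault(
                    term.toLowerCase(), Collections.<String, Integer>emptyMap());
            List<Hit> hits = new ArrayList<>();
            for (Map.Entry<String, Integer> e : docs.entrySet()) {
                hits.add(new Hit(e.getKey(), e.getValue()));
            }
            return hits;
        }
    }

    // Assumed routing rule: a document lives in exactly one shard, chosen by hash.
    static int shardFor(String docId, int shardCount) {
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    // Builds shardCount shards from {docId, text} pairs.
    static List<Shard> index(int shardCount, String[][] docs) {
        List<Shard> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) shards.add(new Shard());
        for (String[] doc : docs) {
            shards.get(shardFor(doc[0], shardCount)).add(doc[0], doc[1]);
        }
        return shards;
    }

    // Queries every shard in parallel, then merges and ranks the partial results.
    static List<Hit> search(List<Shard> shards, String term, int topK) {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<List<Hit>>> partials = new ArrayList<>();
            for (Shard shard : shards) {
                partials.add(pool.submit(() -> shard.search(term)));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> partial : partials) {
                merged.addAll(partial.get());
            }
            // Highest score first; ties broken by docId so the order is stable.
            merged.sort((a, b) -> a.score != b.score
                    ? Integer.compare(b.score, a.score)
                    : a.docId.compareTo(b.docId));
            return new ArrayList<>(merged.subList(0, Math.min(topK, merged.size())));
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Shard> shards = index(2, new String[][] {
                {"d1", "hadoop lucene lucene"},
                {"d2", "lucene"},
                {"d3", "hadoop"}});
        for (Hit hit : search(shards, "lucene", 10)) {
            System.out.println(hit.docId + " score=" + hit.score);
        }
    }
}
```

In a real cluster each shard would be a full Lucene index on its own node and the merge step would run on whichever node receives the query, but the caller still sees one logical index.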
I can also suggest checking out SolrCloud for distributed search. It runs on Lucene as its indexing/search engine and already gives you a clustered index. It also provides an API for submitting files to index and for querying the index. Maybe that is sufficient for your use case.