I have a dictionary file which works much like an actual dictionary: it stores words and an index for each word. The number of words is quite large, say 2 million. I have a second file which stores the top words for a set of documents. That is, each line in the second file has a key which identifies a document (the document name), followed by 10 values which are the indexes of that document's top words. I would like to efficiently join these two files using Hadoop in order to obtain the top words, as actual words, for each document.
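For concreteness, here is a toy example of the layout I have in mind (I am assuming tab-separated fields here; the exact format is not important):

```
# dictionary file: word<TAB>index
apple    17
banana   42
...

# documents file: docName<TAB>ten word indexes
doc1     17 42 103 7 99 250 8 61 12 5
...
```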
One naive solution is a reduce-side join: the mapper outputs the word index as the key, and the value is either the document name or the actual word, depending on which file the record came from. But because there are so many words, the number of distinct keys shuffled to the reducers will be huge, so this solution may not scale.
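Here is a minimal sketch of what I mean, assuming the file layouts above and the `mapreduce` (not `mapred`) API; in the driver I would use `MultipleInputs` to route each input file to its own mapper. All class names are mine:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveReduceSideJoin {

    // Mapper for the dictionary file: emits (wordIndex, "W:word").
    public static class DictionaryMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t"); // word, index
            context.write(new Text(parts[1]), new Text("W:" + parts[0]));
        }
    }

    // Mapper for the documents file: emits (wordIndex, "D:docName")
    // once for each of the 10 indexes on the line.
    public static class DocumentMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t"); // docName, indexes
            for (String idx : parts[1].split(" ")) {
                context.write(new Text(idx), new Text("D:" + parts[0]));
            }
        }
    }

    // Reducer: each key is one word index; it sees exactly one "W:" value
    // (the word) plus one "D:" value per document containing that word.
    // Documents must be buffered because the word can arrive in any position.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String word = null;
            List<String> docs = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("W:")) {
                    word = s.substring(2);
                } else {
                    docs.add(s.substring(2));
                }
            }
            if (word == null) return; // index missing from the dictionary
            for (String doc : docs) {
                context.write(new Text(doc), new Text(word));
            }
        }
    }
}
```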
Another solution is to buffer the dictionary file in the mapper and do the join on the map side (sketched below), but this requires a lot of memory in each mapper. Is there a better way to do this join, without worrying about the memory issue?
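For reference, this is the buffered map-side variant I have in mind, assuming the dictionary is shipped to every mapper via the distributed cache (`job.addCacheFile(new URI("/path/to/dictionary.txt#dict.txt"))` in the driver; file names are illustrative). The `HashMap` of ~2 million entries per mapper JVM is exactly the memory cost I am worried about:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> indexToWord = new HashMap<>();

    // Load the entire dictionary into memory once per mapper.
    // "dict.txt" is the symlink created by the distributed cache.
    @Override
    protected void setup(Context context) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new FileReader("dict.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t"); // word, index
                indexToWord.put(parts[1], parts[0]);
            }
        }
    }

    // Each input line is one document; replace its 10 word indexes
    // with the actual words and emit the joined record directly.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t"); // docName, indexes
        StringBuilder words = new StringBuilder();
        for (String idx : parts[1].split(" ")) {
            String word = indexToWord.get(idx);
            if (word == null) continue; // index missing from the dictionary
            if (words.length() > 0) words.append(' ');
            words.append(word);
        }
        context.write(new Text(parts[0]), new Text(words.toString()));
    }
}
```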