I am supposed to extract representative terms from an organisation's website using Wikipedia's article-link data dump. To achieve this I have:
- Crawled and downloaded the organisation's webpages (~110,000).
- Created a dictionary of Wikipedia IDs and terms/titles (~40 million records).
Now I need to process each webpage against the dictionary to recognise terms and track their term IDs and frequencies.
To make the dictionary fit in memory, I've split it into smaller files. Based on my experiment with a small data set, processing all of the above will take around 75 days.
And this is just for one organisation. I have to do the same for more than 40 of them.
Implementation:
- A HashMap for storing the dictionary in memory.
- Looping through each map entry to search for the term in a webpage, using a Boyer-Moore search implementation.
- Repeating the above for each webpage, and storing the results in a HashMap.
I've tried optimising the code and tuning the JVM for better performance.
Can someone please advise on a more efficient way to implement the above, reducing the processing time to a few days?
Is Hadoop an option to consider?
Based on your question: how did you arrive at the 75-day estimate? There are a number of performance factors in play, and it is worth measuring which part of the pipeline dominates before optimising further.
Here is an outline of what I believe you are doing:
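If I read your implementation right, it boils down to something like this (a rough sketch with illustrative names; `String.indexOf` stands in for your Boyer-Moore search):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative reconstruction of the current approach: for every
// dictionary entry (term ID -> term), scan the whole page for that term.
public class NaiveTermScanner {
    public static Map<String, Integer> scan(String page, Map<String, String> dictionary) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            String term = entry.getValue();
            int count = 0;
            // Stand-in for the Boyer-Moore scan of the page.
            for (int i = page.indexOf(term); i >= 0; i = page.indexOf(term, i + 1)) {
                count++;
            }
            if (count > 0) {
                frequencies.put(entry.getKey(), count);  // term ID -> frequency
            }
        }
        return frequencies;
    }
}
```

The per-page cost here is proportional to the dictionary size (~40 million scans per page), which is where the 75 days goes.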
What this essentially does is break each document into tokens and then look each token up in the Wikipedia dictionary to see whether it exists there.
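The cheap way to get that effect is to invert the loop: tokenize each page once and probe the HashMap per token, which is one O(1) lookup per token instead of one full-page scan per dictionary entry. A minimal sketch, assuming a dictionary keyed by term rather than by ID, and ignoring multi-word titles (those would need n-gram lookups on top):

```java
import java.util.HashMap;
import java.util.Map;

// Tokenize the page once and probe the dictionary per token,
// instead of scanning the page once per dictionary entry.
public class TokenLookupScanner {
    public static Map<String, Integer> scan(String page, Map<String, String> termToId) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String token : page.toLowerCase().split("\\W+")) {
            String id = termToId.get(token);   // O(1) dictionary probe
            if (id != null) {
                frequencies.merge(id, 1, Integer::sum);  // term ID -> frequency
            }
        }
        return frequencies;
    }
}
```

With this shape the cost per page depends only on the page length, not on the 40-million-entry dictionary.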
This is exactly what a Lucene Analyzer does.
A Lucene Tokenizer will convert a document into tokens. This happens before the terms are indexed into Lucene. So all you have to do is implement an Analyzer that can look up the Wikipedia dictionary and decide whether or not a token is in it.
I would do it like this:
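One way to wire that up (a sketch only; package names and constructors vary between Lucene versions, and the class name here is illustrative) is a custom Analyzer whose filter chain drops every token that is not a Wikipedia title:

```java
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WikipediaTermAnalyzer extends Analyzer {
    private final Set<String> dictionary;  // lower-cased Wikipedia titles

    public WikipediaTermAnalyzer(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer tokenizer = new StandardTokenizer();
        TokenStream chain = new LowerCaseFilter(tokenizer);
        // Keep only tokens that appear in the Wikipedia dictionary;
        // everything else is dropped before indexing.
        chain = new FilteringTokenFilter(chain) {
            private final CharTermAttribute term = addAttribute(CharTermAttribute.class);
            @Override
            protected boolean accept() {
                return dictionary.contains(term.toString());
            }
        };
        return new TokenStreamComponents(tokenizer, chain);
    }
}
```

Index each crawled page with this analyzer and the resulting index contains only recognised Wikipedia terms.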
When you do this, you will have ready-made statistics from the Lucene index, such as the frequency of each recognised term within a document (term frequency) and the number of documents each term appears in (document frequency).
There is a lot more you can do to improve performance from there, for example parallelising the indexing across documents, or keeping a single shared, read-only dictionary in memory instead of splitting it into files.
I hope that helps.