How to improve the performance when working with wikipedia data and huge no. of webpages?

Question

How to improve the performance when working with wikipedia data and huge no. of webpages?

180 views Asked by user3227868 At 23 January 2014 at 14:09

I am supposed to extract representative terms from an organisation's website using wikipedia's article-link data dump. To achieve this I've -

Crawled & downloaded organisation's webpages. (~110,000)
Created a dictionary of wikipedia ID and terms/title. (~40million records)

Now, I'm supposed to process each of the webpages using the dictionary to recognise terms and track their term IDs & frequencies.

For the dictionary to fit in memory, I've splitted the dictionary into smaller files. Based on my experiment with a small data-set, the processing time for the above will be around 75 days.

And this is just for 1 organisation. I have to do the same for more than 40 of them.

Implementation -

HashMap for storing dictionary in memory.
looping through each map entry to search the term in a webpage, using Boyer-Moore search implementation.
Repeating the above for each webpage, and storing results in a HashMap.

I've tried optimizing the code and tuning the JVM for better performance.

Can someone please advise on a more efficient way to implement the above, reducing the processing time to a few days.

Is Hadoop an option to consider?

Original Q&A

There are 2 answers

Andrew On 23 January 2014 at 17:12

One of the ways that use only MR is to:

Assuming you already have N dictionaries of smaller size that fit to memory you can: Launch N "map only" jobs that will be scanning all your data (each one with only one dictionary) and output smth like {pageId, termId, occurence, etc} to folder /your_tmp_folder/N/ As a result you will have N*M files where M is amount of mappers on each stage(should be the same).

Then second job will simply analyze your {pageId, termId, occurence, etc} objects and build stats per page id.

Map only jobs should be very fast in your case. If not - please paste your code.

**tuxdna** · Accepted Answer · 2014-01-23T18:29:57+00:00

Based on your question:

Number of Documents = 110000

Dictionary => List of [TermID, Title Terms] = 40million entries

Size of documents = 11000 * 1KB per document on an average =  26.9GB
(1KB per document on an average)

Size of dictionary = 40million * 256bytes = 9.5GB of raw data
(256bytes per entry on an average)

How did you arrive at the 75 days estimate?

There are number of performance targets:

How are you storing the Documents?
How are you storing/retrieving the Dictionary? ( assuming not all of it in memory unless you can afford to)
How many machines are you running it on?
Are you performing the dictionary lookups in parallel? ( of-course assuming dictionary is immutable once you have already processed whole of wikipedia )

Here is an outline of what I believe you are doing:

dictionary = read wikipedia dictionary
document = a sequence of documents
documents.map { doc =>
  var docTermFreq = Map[String, Int]()
  for(term <- doc.terms.map if(dictionary.contains(term)) ) {
     docTermFreq = docTermFreq + (term -> docTermFreq.getOrElse(term, 0) + 1)
  }
  // store docTermFreq map
}

What this is essentially doing is breaking up each document into tokens and then performing a lookup in wikipedia dictionary for its token's existence.

This is exactly what a Lucene Analyzer does.

A Lucene Tokenizer will convert document into tokens. This happens before the terms are indexed into lucene. So all you have to do is implement a Analyzer which can lookup the Wikipedia Dictionary, for whether or not a token is in dictionary.

I would do it like this:

Take every document and prepare a token stream ( using an Analyzer described above )
Index the document terms.
At this point you will have wikipedia terms only, in the Lucene Index.

When you do this, you will have ready-made statistics from the Lucene Index such as:

Document Frequency of a Term
TermFrequencyVector ( exactly what you need )
and a ready to use inverted index! ( for a quick introduction to Inverted Index and Retrieval )

There are lot of things you can do to improve the performance. For example:

Parallelize the document stream processing.
You can store the dictionary in key-value database such as BerkeylyDB or Kyoto Cabinet, or even an in-memory key-value storage such as Redis or Memcache.

I hope that helps.

TechQA.

How to improve the performance when working with wikipedia data and huge no. of webpages?

There are 2 answers

Related Questions in JAVA

Related Questions in HADOOP

Related Questions in ONTOLOGY

Related Questions in TEXT-MINING

Related Questions in WIKIMEDIA-DUMPS

Popular Questions

Popular Tags

Trending Questions