I am using HBase as the storage for data crawled by Apache Nutch. My storage is located at /data/hbase/webpage, where I can see a lot of folders like:
64b2feb30073eec24d9dba65d421e7f
482062bc554bd45bf198d9edea971a30
7c8a6eec12d9f6926a1d912be9a0ca81
c1f682541b8d1c0559de6df14ae84e2b
083b28ee75babc718cc28e66b98c9ff5
809eb4bb5f2be087e2c84a2f51d26653
and more...
Each of these folders contains further folders like:
f h il mk mtdt ol p recovered.edits s
But that is not so important.
I am writing my own indexer for Nutch to get crawled data from HBase into Solr. I need to send it to Solr in batches, because when I index everything at once I get an OutOfMemory exception.
I would like to ask whether it is possible to get the batch ids from my HBase storage (so I know which batch ids exist and can then index them one at a time).
I don't know how you are trying to implement your solution (as a Nutch plugin, a Hadoop MapReduce job, or a standalone script), but I guess this information will be helpful:
- As indicated in nutch-src/conf/gora-hbase-mapping.xml, batchId is mapped to HBase's column `f:bid`.
- You have to read it using Gora. Instances of `WebPage` have the method `#getBatchId()`. Check the Avro `WebPage` definition and the compiled class.
- When developing a plugin, most probably you will see a `WebPage` parameter in the plugin's interface.
- If you want to access `batchId` in a raw way in HBase, just read the column `f:bid` and treat it as raw text. If I am not wrong, Gora does not write additional information into strings (unlike other serialized types).
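Whichever way you read `f:bid` (via Gora's `WebPage#getBatchId()` or a raw HBase scan), the batching itself is plain Java. Here is a minimal sketch, assuming the rows have already been read from storage; the `Row` record and the sample keys and batch ids are hypothetical stand-ins for what a real Gora scan over `WebPage` would return:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group crawled rows by their batch id so each
// batch can be sent to Solr separately instead of indexing everything
// at once. "Row" stands in for whatever record you read from Gora/HBase;
// only the batch id (HBase column f:bid) matters here.
public class BatchIndexSketch {

    record Row(String key, String batchId) {}

    // Collect rows per batch id, preserving first-seen order.
    static Map<String, List<Row>> groupByBatchId(List<Row> rows) {
        Map<String, List<Row>> batches = new LinkedHashMap<>();
        for (Row row : rows) {
            batches.computeIfAbsent(row.batchId(), id -> new ArrayList<>())
                   .add(row);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Sample data only; real keys/batch ids come from your crawl.
        List<Row> rows = List.of(
            new Row("com.example:http/a", "1421507968-1128"),
            new Row("com.example:http/b", "1421507968-1128"),
            new Row("com.example:http/c", "1421513061-2377"));

        // Index one batch at a time to keep memory bounded.
        for (Map.Entry<String, List<Row>> batch : groupByBatchId(rows).entrySet()) {
            System.out.println(batch.getKey() + " -> "
                    + batch.getValue().size() + " row(s)");
        }
    }
}
```

With each batch isolated like this, you can build and commit the Solr documents for one batch, release them, and move on to the next, which avoids loading the whole crawl into memory.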