How to loop over an Hbase table in chuncks in Python

Question

How to loop over an Hbase table in chuncks in Python

1k views Asked by Rafik Benmansour At 28 November 2017 at 13:00

I am currently writing a Python script that converts HBase tables into csv using "happybase". The problem I am having is that if the table is too big, I get the below error after reaching a little over 2 million lines:

Hbase_thrift.IOError: IOError(message='org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x8dfa2f2 closed\n\tat org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1182)\n\tat org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:305)\n\tat org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)\n\tat org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)\n\tat org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:212)\n\tat org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:314)\n\tat org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:432)\n\tat org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:358)\n\tat org.apache.hadoop.hbase.client.AbstractClientScanner.next(AbstractClientScanner.java:70)\n\tat org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.scannerGetList(ThriftServerRunner.java:1423)\n\tat sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat org.apache.hadoop.hbase.thrift.HbaseHandlerMetricsProxy.invoke(HbaseHandlerMetricsProxy.java:67)\n\tat com.sun.proxy.$Proxy10.scannerGetList(Unknown Source)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$scannerGetList.getResult(Hbase.java:4789)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$scannerGetList.getResult(Hbase.java:4773)\n\tat org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)\n\tat org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)\n\tat org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)\n\tat org.apache.hadoop.hbase.thrift.CallQueue$Call.run(CallQueue.java:64)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n')

What I though of is chopping the for loop into sub-loops (i.e. open the Hbase connection -> get the data for the first 100,000 lines -> close the connection -> reopen it again -> get the next 100,000 lines -> close it... and so on), but I can't seem the figure out how to do it. Here is a sample of my code that reads all the lines and crashes:

import happybase
connection = happybase.Connection('localhost')
table = 'some_table'
table_object = connection.table(table)
for row in table_object.scan():
    print row

Any help would be appreciated (even if you suggest another solution :))

Thanks

Original Q&A

There are 1 answers

**Rafik Benmansour** · Answer 1 · 2017-11-29T21:05:44+00:00

Actually, I figured out the way to do it, and it's as follow:

import happybase
connection = happybase.Connection('localhost')
table = 'some_table'
table_object = connection.table(table)

while True:
  try:
    for row in table_object.scan():
      print row
    break
  except Exception as e:
    if "org.apache.hadoop.hbase.DoNotRetryIOException" in e.message:
      connection.open()
    else:
      print e
      quit()

TechQA.

How to loop over an Hbase table in chuncks in Python

There are 1 answers

Related Questions in PYTHON

Related Questions in HBASE

Related Questions in HAPPYBASE

Popular Questions

Popular Tags

Trending Questions