Tika text extraction not working on HDFS

Question

649 views Asked by HHH At 23 June 2015 at 17:26

I'm trying to use Tika to extract text from a set of simple txt files stored on HDFS. I have the following code in my reducer, but surprisingly Tika does not return anything. It works fine on my local machine, but as soon as I move everything to the Hadoop cluster, the result is empty.

FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX + fileAdd);
InputStream stream = fs.open(pt);

AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();

parser.parse(stream, handler, metadata);

spaceContentBuffer.append(handler.toString());

The last line appends the extracted content to a StringBuilder, but it is always empty.

P.S. My Hadoop cluster is Azure HDInsight, so HDFS is backed by Blob Storage.

I also tried the following code:

Metadata metadata = new Metadata();
BodyContentHandler handler = new BodyContentHandler();
Parser parser = new TXTParser();
ParseContext con = new ParseContext();
parser.parse(stream, handler, metadata, con);

and I got the following error message: Failed to detect the character encoding of a document

Original Q&A

There is 1 answer

**Jason Tang - MSFT** · Answer 1 · 2015-07-14T17:45:44+00:00

If the user does not specify Content-Type when uploading a blob, it will be set to “application/octet-stream” by default.
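
Since the files in the question are plain text, one possible way to sidestep content-type and encoding detection altogether is to read the stream directly and decode it with a charset you already know. The sketch below uses only the JDK and assumes the files are UTF-8; the class name `PlainTextReader` and the method `readAll` are hypothetical names chosen for illustration, not part of Tika or Hadoop. It also works as a quick diagnostic: if the returned string is empty, the stream itself delivered no bytes and the problem lies before Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper: decodes a plain-text stream with a charset the caller supplies,
// so no content-type or encoding detection is involved.
class PlainTextReader {

    // Reads every byte from the stream and decodes it with the given charset.
    // An empty result means the stream returned no bytes at all.
    static String readAll(InputStream in, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            // Wrapped so callers are not forced to declare a checked exception.
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), charset);
    }
}
```

In the reducer from the question, this would be called as `PlainTextReader.readAll(fs.open(pt), StandardCharsets.UTF_8)` and the result appended to `spaceContentBuffer`. If the files are not UTF-8, the charset argument would need to change accordingly.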
