Tika text extraction not working on HDFS


I'm trying to use Tika to extract text from a set of simple .txt files stored on HDFS. I have the following code in my reducer, but Tika returns nothing. It works fine on my local machine, but as soon as I move everything to the Hadoop cluster, the result is empty.

FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX + fileAdd);
InputStream stream = fs.open(pt);

AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();

parser.parse(stream, handler, metadata);

spaceContentBuffer.append(handler.toString());

The last line appends the extracted content to a StringBuilder, but it is always empty.

P.S. My Hadoop cluster is Azure HDInsight, so the HDFS is backed by Blob Storage.

I also tried the following code:

Metadata metadata = new Metadata();         
BodyContentHandler handler =  new BodyContentHandler();
Parser parser = new TXTParser();            
ParseContext con = new ParseContext();          
parser.parse(stream, handler, metadata, con);

and I got the following error message: Failed to detect the character encoding of a document

There is 1 answer

Jason Tang - MSFT

If the user does not specify Content-Type when uploading a blob, it will be set to “application/octet-stream” by default.
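Since the files in the question are plain .txt, one possible workaround is to skip type and encoding detection altogether and decode the stream with an explicit charset. The following is only a minimal sketch under stated assumptions: the class name PlainTextReader and the method readAll are hypothetical, UTF-8 is assumed because the question does not say how the files are encoded, and a ByteArrayInputStream stands in for the stream returned by fs.open(pt).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or Hadoop.
final class PlainTextReader {

    // Decodes the whole stream with the given charset.
    // No content-type or encoding detection is involved.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for fs.open(pt); any InputStream is handled the same way.
        // UTF-8 is an assumption about the files, not something the question states.
        InputStream stream = new ByteArrayInputStream(
                "hello from blob storage".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(stream, StandardCharsets.UTF_8));
    }
}
```

In the reducer, the result of readAll(fs.open(pt), StandardCharsets.UTF_8) could be appended to spaceContentBuffer in place of handler.toString(). This only makes sense if every input really is plain text in a known encoding; for mixed formats Tika is still needed.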