How to parse octet-stream files using Apache Tika?

952 views · Asked by HHH on 23 June 2015 at 19:00

I have stored many different types of files (txt, doc, pdf, etc.) in Azure Blob storage. However, all of them are stored there with the content type 'octet-stream', and when I open the files to extract their text with Tika, Tika can't detect the character encoding. How can I get around this problem?

// Open the blob through the Hadoop FileSystem API.
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX + fileAdd);
InputStream stream = fs.open(pt);

// Let Tika auto-detect the format and extract the text.
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();

parser.parse(stream, handler, metadata);

spaceContentBuffer.append(handler.toString());

There is 1 answer

**Zhaoxing Lu** · Answer 1 · 2015-06-23T23:53:17+00:00

If you are calling the Azure Storage REST API directly, you can set the "x-ms-blob-content-type" header via the Set Blob Properties API.

If you are using the Azure Storage Client Library, you can write code similar to the following:

blockBlob.Properties.ContentType = "text/xml";
blockBlob.SetProperties();
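
The snippet above hard-codes "text/xml", while the question involves blobs of several types (txt, doc, pdf). As a minimal sketch, assuming the blob names still carry their file extensions, the content type could be chosen per blob from the extension before it is written back to the blob's properties. The class `ContentTypeGuesser`, its `guess` method, and the small extension table are hypothetical names introduced only for illustration; they are not part of Tika or the Azure SDK, and the table is deliberately incomplete.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika or the Azure SDK.
final class ContentTypeGuesser {

    // Illustrative, deliberately incomplete extension-to-MIME-type table.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "xml", "text/xml",
            "pdf", "application/pdf",
            "doc", "application/msword");

    // Returned when the name gives no usable hint.
    private static final String FALLBACK = "application/octet-stream";

    private ContentTypeGuesser() {
    }

    // Guesses a content type from the extension of a blob name such as "docs/report.pdf".
    static String guess(String blobName) {
        if (blobName == null) {
            return FALLBACK;
        }
        // Only look at the last path segment, so dots in folder names are ignored.
        String fileName = blobName.substring(blobName.lastIndexOf('/') + 1);
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return FALLBACK; // no extension, or the name ends with a dot
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, FALLBACK);
    }
}
```

With this sketch, `ContentTypeGuesser.guess("docs/report.PDF")` yields "application/pdf", and a name without an extension falls back to "application/octet-stream". The result could then be used as the value assigned to the blob's content type in place of the fixed "text/xml".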
