Azure resource to handle unstructured data sources


We have a requirement to extract dark data from unstructured sources such as letters, rad reports, etc. Please suggest an Azure resource to extract data from common document formats (DOC, DOCX, PDF, RTF, TXT, HTML, etc.) and then analyze the extracted data.


1 answer

Peter Pan

It sounds like you just want to extract raw text or images from these rich-format documents. If that is all you need, what you are really looking for is a set of libraries for parsing the different document types.

Here are some libraries in Java and Python for that. If you are using .NET, which I'm not familiar with, you can search Google or Bing for .NET alternatives.

  1. To parse Office documents such as DOC and DOCX: for Java, Apache POI is a good library for extracting data from MS Office files. For Python, the python-docx package reads DOCX; for the older binary DOC format there does not seem to be a pure-Python package, so the options are a COM object such as Word.Application, or IronPython on .NET (see "Reading/Writing MS Word files in Python"), both on Windows.
  2. To parse PDF files: Apache PDFBox and jPDFText for Java, and PyPDF2 for Python.
  3. To read RTF files: Java supports this natively via javax.swing.text.rtf.RTFEditorKit, for which sample code is easy to find by searching. As with the DOC case in #1, there seems to be no built-in support in Python.
  4. To parse HTML files: jsoup for Java, and BeautifulSoup or HTMLParser for Python, work well for extracting data from HTML.
  5. Reading TXT files is simple in any language. To extract valuable information from the text content, Stanford NLP for Java and NLTK for Python are useful. The Text Analytics API of Azure Cognitive Services can also help with tasks such as key phrase extraction and language detection.
  6. The Apache Tika toolkit for content analysis is a good solution, too. You can even deploy it standalone and invoke its REST APIs from Python or other languages.
  7. If you want to extract text from images, you can use the Computer Vision API of Azure Cognitive Services to extract printed or handwritten text, or use a third-party library such as Tess4J or others you can find on GitHub.
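
As a rough illustration of the HTML case in #4, here is a minimal sketch that uses only Python's standard-library html.parser module (the HTMLParser mentioned above). The names TextExtractor and html_to_text are made up for this example, and skipping the contents of script and style tags is an assumption about what counts as useful text, so adjust it to your documents.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: not useful for analysis

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text on the contents of an HTML file gives you plain text that you can pass on to NLTK or the Text Analytics API. For messy real-world HTML, BeautifulSoup is usually more forgiving than this.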

Almost all of the above depend on third-party dev kits rather than Azure resources. However, you can store the documents in Azure Storage and process them on an Azure VM or the Batch service, and then analyze the extracted data in Azure Jupyter Notebook or use Azure ML for deeper research.
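
For the processing step, one option is to run the Tika server from #6 on the VM and call it over HTTP. The following is a minimal sketch, assuming a tika-server instance is listening on its default port 9998 on localhost. The helper names build_tika_request and extract_text are hypothetical, and error handling is left out.

```python
import urllib.request

# Assumption: tika-server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document_bytes, tika_url=TIKA_URL):
    """Build a PUT request that asks Tika for the plain text of a document."""
    return urllib.request.Request(
        tika_url,
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, tika_url=TIKA_URL, timeout=60):
    """Send a local file (DOC, DOCX, PDF, RTF, HTML, ...) to Tika and return its text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), tika_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, extract_text("letter.docx") should return the document body as a string (the file name here is just a placeholder). Tika detects the format itself, so the same call can cover most of the formats in the question.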