I have a large dataset from which I would like to extract and categorize specific elements. Below is the most common example:
I would like to know whether this is possible using Amazon Comprehend, or whether there are better tools for it. I am not a developer and am looking to hire someone to program this for me, but I would like to understand conceptually whether something like this is feasible before I hire anyone.
Comprehend can extract and categorize text from your documents. For this you can use Comprehend’s Custom Entity Recognition.
To train a custom recognizer, you provide annotated training data as input. You can use Ground Truth in Amazon SageMaker to create the annotations and pass the Ground Truth output directly to a Comprehend entity recognizer training job. You can also provide your own annotations file for the training job - https://docs.aws.amazon.com/comprehend/latest/dg/API_EntityRecognizerInputDataConfig.html.
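To give a rough idea of what a developer would write for the training step, here is a minimal sketch in Python using boto3's `create_entity_recognizer` call. The recognizer name, role ARN, S3 locations, and the entity types `PRODUCT` and `PRICE` are all hypothetical placeholders, not values from your dataset, and the language is assumed to be English.

```python
def build_recognizer_request(name, role_arn, documents_s3_uri,
                             annotations_s3_uri, entity_types):
    """Assemble the keyword arguments for Comprehend's CreateEntityRecognizer."""
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,   # IAM role Comprehend uses to read from S3
        "LanguageCode": "en",            # assumption: the documents are in English
        "InputDataConfig": {
            "EntityTypes": [{"Type": t} for t in entity_types],
            "Documents": {"S3Uri": documents_s3_uri},
            "Annotations": {"S3Uri": annotations_s3_uri},
        },
    }


def start_training(request):
    """Submit the training job; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("comprehend")
    response = client.create_entity_recognizer(**request)
    return response["EntityRecognizerArn"]


# Hypothetical values for illustration only.
request = build_recognizer_request(
    name="my-custom-recognizer",
    role_arn="arn:aws:iam::123456789012:role/ComprehendDataAccess",
    documents_s3_uri="s3://my-bucket/training/doc1",
    annotations_s3_uri="s3://my-bucket/training/annotations.csv",
    entity_types=["PRODUCT", "PRICE"],
)
```

Calling `start_training(request)` would start the actual training job in your AWS account; building the request separately makes it easy to review the settings before anything is submitted.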
The relevant APIs for Amazon Comprehend would be -
Here is a detailed example of how to train custom entity recognizers with Amazon Comprehend - https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html
Annotation file example for this use case.
The file doc1 should contain the text that you want to extract entities from.
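If you prepare your own annotations file rather than using Ground Truth, it is a CSV with the columns File, Line, Begin Offset, End Offset and Type. The following is a minimal sketch of how such rows could be generated for doc1 from a list of known entity strings. The sample sentences and the entity types `PRODUCT` and `PRICE` are made up for illustration, and the sketch assumes the end offset is the position just past the last character of the entity, which should be checked against the Comprehend documentation.

```python
import csv
import io


def annotate(lines, entities, file_name="doc1"):
    """Return one annotation row per occurrence of each known entity string.

    lines:    the lines of text stored in doc1
    entities: mapping of literal entity text -> entity type
    """
    rows = []
    for line_no, line in enumerate(lines):  # line numbers start at 0
        for text, entity_type in entities.items():
            start = line.find(text)
            while start != -1:
                # Assumption: end offset = begin offset + length of the entity.
                rows.append([file_name, line_no, start,
                             start + len(text), entity_type])
                start = line.find(text, start + len(text))
    return rows


def to_csv(rows):
    """Render the rows as CSV text with the annotation header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical contents of doc1 and a hypothetical entity list.
doc1_lines = ["Acme Widget costs 25 USD", "Order two Acme Widget units"]
entities = {"Acme Widget": "PRODUCT", "25 USD": "PRICE"}

rows = annotate(doc1_lines, entities)
annotations_csv = to_csv(rows)
print(annotations_csv)
```

Exact string matching like this only works when the entities are already known; for a real dataset the annotations would usually come from manual labeling or from Ground Truth.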