Recommended annotation tool to create a Named Entity Recognition dataset


I'm new to NLP. I am looking for recommendations for an annotation tool to create a labeled NER dataset from raw text.

In detail:

I'm trying to create a labeled dataset for specific entity types in order to develop my own NER project (rule-based at first). I assumed there would be some friendly frameworks that allow you to create tagging projects, tag text data, build a labeled dataset, and even share projects so that several people can work on the same one, but I'm struggling to find one (I admit that "friendly" and "intuitive" are subjective, but this has been my experience).

So far I've tried several tools:

  • I tried LightTag. It makes the tagging itself fast and easy (i.e. marking words and assigning them labels), but the overall process of creating a useful dataset is not as intuitive as I expected (i.e. uploading the text files, splitting them into separate tagging objects, saving the tags, etc.).
  • I installed and tried LabelStudio and found it less mature than LightTag (I don't mean to judge here :)).
  • I've also read about Prodigy, the paid annotation tool from the makers of spaCy. I would consider purchasing it, but their website only offers a live demo of the tagging phase, so I can't assess whether it is superior to the two tools above.

Even on Stack Overflow, the latest question I found on this topic is over 5 years old.

Do you have any recommendation for a tool to create a labeled NER dataset from raw text?


There are 4 answers

0
Puneet Jindal On

⚠️ Disclaimer

I am one of the founders of Labellerr and LabelGPT, so take my answer with a pinch of salt. Because we have a reputation in this space, I will make sure to give my honest perspective.

Finding the right tool for labeling data in AI projects, especially for named entity recognition (NER) in text, can be tricky. I started building Labellerr because, while leading AI teams, I faced the same problems it aims to solve.

The thing is, AI projects today deal with lots of different challenges. When it comes to choosing a labeling tool, speed, accuracy, and cost are big concerns for AI teams. But one tool doesn't fit all industries. For example, labeling a conversation between a doctor and a patient is different from sorting out insurance claims in healthcare. And it gets even more specific based on where and what kind of medical field you're in.

Data labeling tools have come a long way in the last few years, especially as AI technology has grown. They try to make labeling faster by automating some parts, but how much they can automate really depends on prior experience with similar data.

The tools need to be flexible because industries and data types are so different. Labellerr, the tool I helped create, was made to be customized for different needs. It's all about making it easier for AI teams to label data the way they need to, considering the unique challenges of their specific projects. In our experience, it is rare to onboard new users and keep scaling without having to customize something, whether it is the look and feel or a button here and there that saves time.

The search for the right tool is about finding something that can adapt to different needs, fits the budget, and can keep up with the ever-changing world of AI.

I hope you have found a tool by now. The only real solution is to try them all and see which one is the easiest fit for your use case.

Hope this makes sense!

0
Murari Kumar On

You can try the Automatic Text Annotation Tool for spaCy NER, recently developed and available at https://termitexpert.in/annotation_spacy_ner . This tool converts your raw data into annotated data if you supply entities and their corresponding items. The annotated data is in a JSON format compatible with spaCy version 2, for developing a custom named entity recognition (NER) model.

For example, suppose you have the entity FRUIT and its corresponding items are (apple, mango, banana). The tool automatically finds each item in your text and annotates it as FRUIT. You can add other entities and their corresponding items as well.

Note: The above method works fine with spaCy v2. To use spaCy v3.0, you may have to convert the JSON data to the DocBin format before training; see the spaCy documentation.
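The dictionary-based matching described in this answer can be sketched in a few lines of Python. This is not the linked tool's code; the function name `annotate` and the `entity_items` mapping are hypothetical, and the output shape (a text plus a list of `(start, end, label)` character spans) is assumed to match the tuple-style training data commonly used with spaCy v2.

```python
import re


def annotate(text, entity_items):
    """Label every dictionary item found in `text`.

    `entity_items` maps a label to its items, e.g. {"FRUIT": ["apple"]}.
    Returns (text, {"entities": [(start, end, label), ...]}).
    """
    spans = []
    for label, items in entity_items.items():
        for item in items:
            # \b keeps "apple" from matching inside "pineapple".
            pattern = r"\b" + re.escape(item) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                spans.append((match.start(), match.end(), label))
    # Sort by position. Overlapping matches are NOT resolved here.
    spans.sort()
    return text, {"entities": spans}


example = annotate(
    "I ate an apple and a banana.",
    {"FRUIT": ["apple", "mango", "banana"]},
)
print(example)
```

Exact string matching like this is quick to set up but misses plurals, misspellings, and context (for instance "Apple" the company), so the resulting labels usually still need a manual review pass.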

0
Robert Alexander On

I have used both DOCCANO (https://github.com/doccano/doccano) and BRAT (https://brat.nlplab.org/).

I find the latter very good, and it supports more functions. Both are free to use.

0
Vimal Menon On

⚠️ Disclaimer

I am the author of Acharya. I would limit my answers to the points raised in the question.


Based on your question, Acharya would help you create a project, upload your raw text data, and annotate it to create a labeled dataset.

It lets you mark individual records as train or test in the dataset, and it provides data-centric reports to help identify and fix annotation/labeling errors.

It allows you to add different algorithms (bring your own algorithm) to the project and train the model regularly. Once trained, it can give annotation suggestions from the trained models on untagged data to make the labeling process faster.

If you want to train in a different setup, it allows you to export the labeled dataset in multiple supported formats.
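As a rough illustration of what moving between export formats can involve, the sketch below converts character-offset spans into token-level BIO tags, two representations that NER tools commonly use. This is not Acharya's exporter: the function name `spans_to_bio` is hypothetical, and the whitespace tokenization is a simplifying assumption.

```python
import re


def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans to (token, tag) pairs.

    Tokens are whitespace-separated chunks, which is a simplification:
    punctuation stays attached to the neighbouring word.
    """
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to an entity when it overlaps the span.
            if match.start() < end and match.end() > start:
                # The token covering the span start opens the entity.
                prefix = "B-" if match.start() <= start else "I-"
                tag = prefix + label
                break
        rows.append((match.group(), tag))
    return rows


rows = spans_to_bio("I live in New York City now", [(10, 23, "LOC")])
for token, tag in rows:
    print(token, tag)
```

Character offsets keep the original text intact, while BIO tags depend on the tokenizer, so it is worth checking which of the two a given training setup expects before exporting.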

Currently, it does not support sharing of projects.

Acharya community edition is in alpha release. GitHub page: https://github.com/astutic/Acharya . Website: https://acharya.astutic.com/

Doccano is another open-source annotation tool that you can check out: https://github.com/doccano/doccano