Detect (predefined) topics in natural text


Is there a library or database out there that can detect the topics of natural text?

I'm not talking about generating topics from extracted keywords, but about analysing the vocabulary used and matching it against predefined topics, for example by looking for words used in cooking or in certain sports (such as names of football clubs or technical terms).

Update with clarification:

Example text snippet: A sentence about football, then another sentence talking about catering at the event.

A library could then assign the categories "sports", "football", "cooking".

I'm looking for something that can assign these categories (or "topics of interest" maybe) without me having to train thousands of models with terabytes of manually classified documents. This could for example work by matching keywords instead of statistical analysis (that's why I mentioned database earlier).

I'm searching for this because I don't have the manpower to build such a big database myself.


There are 2 answers

1
Nikita Astrakhantsev On

The task you describe is classic text document classification. I recommend reading through this article and then searching with the known keywords.

In short, the most popular approach is supervised machine learning (e.g. an SVM) with tf-idf over words or, sometimes, over word n-grams.

The scikit-learn tutorial describes this task; there are also existing libraries such as LibShortText.
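To make the idea above concrete, here is a minimal sketch of tf-idf based classification using only the Python standard library. Note the substitutions: instead of an SVM it uses a much simpler nearest-centroid rule with cosine similarity, and the four training sentences, the two category names, and the function names `tokenize`, `train`, `cosine` and `classify` are all made up for illustration. With scikit-learn you would typically combine `TfidfVectorizer` with a classifier such as `LinearSVC` instead.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy training set: (text, category) pairs. Real data would be
# far larger, e.g. articles collected from topic-specific news sites.
TRAINING = [
    ("The striker scored a late goal and the club won the league match", "sports"),
    ("The goalkeeper saved a penalty in the football final", "sports"),
    ("Simmer the sauce and season the roast before serving dinner", "cooking"),
    ("The caterer will bake bread and grill vegetables for the buffet", "cooking"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Return (idf weights, one summed tf-idf vector per category)."""
    docs = [(Counter(tokenize(text)), label) for text, label in samples]
    n_docs = len(docs)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for counts, _ in docs:
        df.update(counts.keys())

    # Smoothed inverse document frequency; always strictly positive.
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1 for w, d in df.items()}

    # Sum the tf-idf vectors per category. Cosine similarity ignores scale,
    # so the sum works as well as the mean.
    centroids = defaultdict(lambda: defaultdict(float))
    for counts, label in docs:
        for word, count in counts.items():
            centroids[label][word] += count * idf[word]
    return idf, {label: dict(vec) for label, vec in centroids.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, idf, centroids):
    """Return (category, score) pairs, best match first."""
    counts = Counter(tokenize(text))
    # Words never seen during training carry no information and are dropped.
    vec = {w: c * idf[w] for w, c in counts.items() if w in idf}
    scores = {label: cosine(vec, centroid) for label, centroid in centroids.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


idf, centroids = train(TRAINING)
print(classify("The football club hired a caterer to grill vegetables", idf, centroids))
```

Because `classify` returns a score for every category, a text that mixes topics (football plus catering, as in the question) can be given several labels by keeping every category whose score exceeds a threshold you choose.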

For datasets (a more common term than 'database'), look at the Reuters-21578 Text Categorization Collection or here. In general, it isn't hard to collect texts from predefined categories. For example, go to news sites, perhaps specialized ones such as sports sites, if you want to classify texts by kind of sport.

See also the related questions on Stack Overflow or Quora.

1
Ankit Solanki On

There are multiple ways to address this problem, and the underlying theme lies in the domain of the Semantic Web.

  1. Use a knowledge base like DBpedia. DBpedia is essentially Wikipedia data in triple format (subject, predicate, object). Query DBpedia using SPARQL on the predicate rdfs:label; this returns a URI for the token if it is part of DBpedia, and the predicate dcterms:subject gives the categories related to that subject. You might need to traverse the triple store to get more abstract relationships. Similar knowledge bases: ConceptNet, Freebase, YAGO.

  2. Check out Cyc: http://www.cyc.com/
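As a rough sketch of the DBpedia lookup from item 1, the query could be built and sent as shown here. The helper names `build_category_query` and `fetch_categories` are hypothetical, and the code assumes that the public endpoint at https://dbpedia.org/sparql is reachable and returns standard SPARQL JSON results; treat it as a starting point, not a tested client.

```python
import json
import urllib.parse
import urllib.request

# Public DBpedia endpoint; its availability and rate limits are assumptions.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_category_query(label, lang="en", limit=20):
    """Build a SPARQL query that finds resources by rdfs:label and
    returns their dcterms:subject categories."""
    # Escape backslashes and double quotes so the label is a valid literal.
    safe = label.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?resource ?category WHERE {{
  ?resource rdfs:label "{safe}"@{lang} ;
            dcterms:subject ?category .
}} LIMIT {int(limit)}
""".strip()


def fetch_categories(label):
    """Run the query over HTTP and return the category URIs (needs network)."""
    params = urllib.parse.urlencode(
        {
            "query": build_category_query(label),
            "format": "application/sparql-results+json",
        }
    )
    with urllib.request.urlopen(f"{DBPEDIA_ENDPOINT}?{params}", timeout=30) as resp:
        data = json.load(resp)
    # Standard SPARQL JSON layout: results -> bindings -> variable -> value.
    return [row["category"]["value"] for row in data["results"]["bindings"]]


# Example (requires network access):
# print(fetch_categories("FC Barcelona"))
```

The categories returned this way are usually quite specific. To reach broader topics such as "sports" you would likely have to follow the category hierarchy, which DBpedia exposes through skos:broader links between categories.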

Let me know if you want me to elaborate further.

Best, Ankit