Here is the problem: When given a block of text, I want to suggest possible topics . For example, a news article about Kobe Bryant would have suggested tags like: ‘basketball’, ‘nba’, ‘sports’.
I have a fairly large training dataset (350k+) that includes bodies of text and tags that users have assigned to the text. There are about 40k, pre existing topics; however, many of the topics do not have too many entries in them. I would say only about 5k of the topics have more than 10 entries in them. Users cannot assign topics that don’t already exist in the system. I'd also like to include that
Does anyone have any suggestions for algorithms to use?
If anyone has any suggestions of python libraries as well that would be awesome.
There have been attemps on similar problem - one example is right here - stackoverflow. When you wrote your question, stackoverflow itself suggests some tags without your intervention, though you can manually add or remove them.
Out-of-the-box classification would fail as the number of tags is really huge. There're two directions you could work on this problem from.
Nearest Neighbors Easy, fast and effective. You have a labelled training set. When a new document comes, you look for closest matches, e.g. words like 'tags', ' training', 'dataset', 'labels' etc helped your question map with other similar questions on StackOverflow. In those questions, machine-learning tag was there - so this tag was suggested. Best way for implementation is index your training data (search-engine tactic). You may use Lucene, Elastic Search or something similar. When a new document appears, use that as a query and search for top 10 matching documents stored previously. Poll their tags. Sort the tags and use the scores of the documents to find how important the tags are. Done.
Probabilistic Models Idea is on the lines of classification but off-the-shelf tools won't help you with that. Check the works like Clayton Stanley, Predicting Tags for StackOverflow Posts, Darren Kuo, On Word Prediction Methods or Schuster's report on Predicting Tags for StackOverflow Questions
If you have got this problem as a part of long-term academic project or research, working on Method 2 would be better. However if you need off the shelf solution, use Method 1. Lucene is a great indexing tool used even in production. It is originally in Java but you can easily find wrappers for Python. Another alternatives are Elastic Search, Katta and many more.
p.s. A lot of experimentation is required while playing with the tag-scores.