How can I efficiently extract keywords with relevance from a string? My list of keywords is predefined. For example, in an article about Michelle Obama that also mentions Barack Obama, I want to extract Michelle Obama and Barack Obama, with the keyword Michelle Obama getting a higher relevance value (both Michelle Obama and Barack Obama are present in my keywords list).
Checking the string for the number of occurrences of each keyword doesn't seem very efficient. My application is developed in PHP, but any language is OK if I can do this efficiently.
I tried OpenCalais, but it is not detecting most of my keywords. Is it possible to extract keywords using Lucene?
The Apache Lucene package should suit you. However, if you have a title and paragraphs, you can filter out the stop words, give a higher rank to words that appear in the title, and then match them or their inflected forms in the paragraphs. You can also consult some text summarization articles if you would rather program this yourself.
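If you want to try the title-weighting idea before setting up Lucene, here is a minimal sketch in Python. It is not Lucene and not a tuned relevance model. It assumes the keyword list is small enough to compile into one regular expression, so the text is scanned once instead of once per keyword. The names `build_matcher` and `score_keywords` and the `TITLE_WEIGHT` value of 3.0 are made up for illustration, and it matches exact phrases only (case-insensitive), not inflected forms.

```python
import re
from collections import Counter

# Assumption: a title hit counts three times as much as a body hit.
TITLE_WEIGHT = 3.0


def build_matcher(keywords):
    """Compile all keywords into one alternation so each text is scanned once."""
    # Longest first, so a longer keyword wins over a shorter one it starts with.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def score_keywords(title, body, keywords, title_weight=TITLE_WEIGHT):
    """Return (keyword, score) pairs sorted by descending score.

    Score = title_weight * hits in the title + 1.0 * hits in the body.
    Keywords that never occur are left out of the result.
    """
    if not keywords:
        return []
    matcher = build_matcher(keywords)
    # Map the lowercased match back to the spelling used in the keyword list.
    canonical = {k.lower(): k for k in keywords}
    scores = Counter()
    for text, weight in ((title, title_weight), (body, 1.0)):
        for match in matcher.finditer(text):
            scores[canonical[match.group(0).lower()]] += weight
    return scores.most_common()


if __name__ == "__main__":
    keywords = ["Michelle Obama", "Barack Obama", "Angela Merkel"]
    title = "Michelle Obama visits London"
    body = ("Michelle Obama spoke on Tuesday. Barack Obama joined her later, "
            "and Michelle Obama closed the event.")
    for keyword, score in score_keywords(title, body, keywords):
        print(keyword, score)
```

With the sample article above, Michelle Obama gets one title hit and two body hits, and Barack Obama gets one body hit, so Michelle Obama ranks first. For a very large keyword list, a single regex alternation may get slow, and an inverted index such as the one Lucene builds is probably the better fit.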