I tested keyword extraction in IBM's Natural Language Understanding service with the following text:
Desarrollo PDA. Ajustes PDA. Nuevo modulo PDA. Ajustes modulo PDA. No sincroniza PDA. Error modulo PDA.
And I got the following response:
- modulo pda with 98.31% relevance
- ajustes modulo pda with 64.44% relevance
- nuevo modulo pda with 64.34% relevance
Now my question is: why does the keyword "modulo pda" get 98.31% relevance, rather than just "PDA" with a higher relevance? I've been searching everywhere for how IBM's scoring works, to no avail.
The actual algorithm used to extract and score keywords is most likely proprietary, so I wouldn't expect IBM to make it public. There are plenty of research papers on keyword extraction, but commercial products usually combine several techniques to get the best results.
You can also run the same text through NLU services from different providers, such as IBM, Google, and Amazon, and compare the results.
As for your specific question: you are trying to extract keywords or topics from a single document, and PDA occurs in every sentence of it. If we apply a simple technique like TF-IDF and treat each sentence as a document, then with the plain IDF log(N/df) the TF-IDF score for the word PDA is 0. It occurs in every sentence, so on its own it adds no information that distinguishes one sentence from another, and it becomes irrelevant to the overall topic or document importance.
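To make that concrete, here is a minimal sketch of plain TF-IDF over your text, with each sentence treated as a document. This is not IBM's algorithm, only an illustration of the effect. It assumes the classic unsmoothed IDF log(N/df), naive splitting on periods and whitespace, and summing scores across sentences; the names `sentences`, `idf`, `tf_idf` and `scores` are just ones I picked for the example.

```python
import math
from collections import Counter

text = ("Desarrollo PDA. Ajustes PDA. Nuevo modulo PDA. "
        "Ajustes modulo PDA. No sincroniza PDA. Error modulo PDA.")

# Treat each sentence as its own "document" (naive split on periods).
sentences = [s.split() for s in text.lower().split(".") if s.strip()]
n_docs = len(sentences)

# Document frequency: in how many sentences does each term appear?
df = Counter(term for sent in sentences for term in set(sent))

def idf(term):
    # Classic unsmoothed IDF (an assumption; libraries often smooth this).
    return math.log(n_docs / df[term])

def tf_idf(term, sent):
    # Term frequency within the sentence, scaled by the IDF.
    return (sent.count(term) / len(sent)) * idf(term)

# Aggregate each term's score over all sentences.
scores = {term: sum(tf_idf(term, s) for s in sentences) for term in df}

for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {score:.3f}")
```

With this formula "pda" gets a score of exactly 0, because log(6/6) = 0, while "modulo" (3 of 6 sentences) gets a positive score. A real service probably scores multi-word phrases rather than single tokens, which would be consistent with "modulo pda" ranking high, but that part is a guess. Note also that some libraries, for example scikit-learn's TfidfVectorizer with its default settings, smooth the IDF, so there the score would be low rather than exactly zero.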