How can I use Weka for terminology extraction?


I need to extract domain-specific terms, such as political terms, from a large training corpus. How can I use Weka and its filters to achieve this? Can I use the feature vector produced by Weka's StringToWordVector filter for this?


There is 1 answer

Jose Maria Gomez Hidalgo

You can, at least partly, provided you have an appropriate dataset. For instance, assume you have a dataset like this one:

@relation test

@attribute text String
@attribute politics {yes,no}
@attribute religion {yes,no}

@data
"this is a text about politics",yes,no
"this text is about religion",no,yes
"this text mixes everything",yes,yes

To get terms about politics, you can:

  1. Remove the religion attribute.
  2. Apply the StringToWordVector filter to the text attribute to get terms.
  3. Apply the AttributeSelection filter with Ranker and InfoGainAttributeEval to get the top ranked terms.

This last step gives you a list of the terms that are most predictive for the politics category. Most of them will be terms in the politics domain, although some terms may be predictive precisely because they do not belong to the politics domain (that is, they provide negative evidence).
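To make the ranking step concrete, here is a minimal Python sketch of what an information-gain ranking over word features computes. It is not Weka's implementation: it assumes plain whitespace tokenization, lowercased tokens, and binary presence/absence features, and the function names (`entropy`, `rank_terms_by_info_gain`) and the `is_positive` flag are hypothetical helpers added for illustration. The documents and labels are the three rows of the toy dataset, with the religion attribute already removed.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def rank_terms_by_info_gain(docs, labels, positive="yes"):
    """Return (term, info_gain, is_positive) tuples, highest gain first."""
    # Binary bag of words: each document becomes the set of tokens it contains.
    # Assumption: whitespace tokenization and lowercasing, not Weka's tokenizer.
    token_sets = [set(doc.lower().split()) for doc in docs]
    vocabulary = set().union(*token_sets)
    base = entropy(labels)
    prior = labels.count(positive) / len(labels)

    ranked = []
    for term in vocabulary:
        present = [lab for toks, lab in zip(token_sets, labels) if term in toks]
        absent = [lab for toks, lab in zip(token_sets, labels) if term not in toks]
        # Expected class entropy after splitting the documents on term presence.
        remainder = sum(len(part) / len(labels) * entropy(part)
                        for part in (present, absent) if part)
        # Hypothetical flag: the term is positive evidence if documents that
        # contain it belong to the target class more often than the prior.
        is_positive = present.count(positive) / len(present) > prior
        ranked.append((term, base - remainder, is_positive))
    # Sort by descending gain, breaking ties alphabetically.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


docs = [
    "this is a text about politics",
    "this text is about religion",
    "this text mixes everything",
]
labels = ["yes", "no", "yes"]  # the politics attribute

ranking = rank_terms_by_info_gain(docs, labels)
for term, gain, is_positive in ranking:
    print(f"{term:12s} {gain:.4f} {'positive' if is_positive else 'negative'}")
```

On this toy data the top-ranked term is "religion", flagged as negative evidence: it is the best predictor of the politics class only because it marks the one document that is not about politics. Filtering the ranked list by the direction of the evidence is one possible way to drop such terms.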

The quality of the terms you get depends on the dataset. The more topics it covers, the better your results will be; so instead of having only two classes (politics and religion, as in my dataset), it is much better to have many classes and many examples for each one.