Using boost::iterator_range in Natural Language Processing


My problem is related to Natural Language Processing (NLP) and the chunking of input strings into logical groups.

To simplify: I have a vector of Token data structures, each of which contains, among other things, a 'tag' string value.

class Token
{
   public:
      std::string tag;
      std::string word;
      // other stuff;
};

std::vector<Token> input_tokens;
typedef std::vector<Token>::iterator tok_iter;

I also form a concatenated string of 'tag' values taken from each Token in the vector that looks like this:

std::string pos_tags = "DT JJ NN NN IN RB JJ NN DT";

I am only interested in chunks made up of consecutive JJ (adjective) and NN (noun) tags, so for the above pos_tags example there would be two matching chunks:

"JJ NN NN", "JJ NN"

Is it possible to run a kind of regex over the pos_tags string such that each regex match identifies the range of tokens in the input token set (input_tokens) that it came from? In other words, each chunk would be represented not by a string but by a pair of start/end iterators.

Ideally I would like to store the matches I have found as a vector of boost::iterator_range, where each range represents the start/end of each chunk found, something like this:

std::vector< boost::iterator_range<tok_iter> > chunks;

I hope this makes sense. I am not necessarily looking for the complete code, but hints as to how to use regular expressions (I'm a novice) in this way.
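
To make the idea concrete, here is a minimal sketch of one way this could work. It is only an illustration under stated assumptions, not a definitive implementation. It assumes the tags are joined with exactly one space, that only the exact tags "JJ" and "NN" should be chunked, and that tags consist of word characters. The helper names join_tags and find_chunks and the tok_range typedef are hypothetical. To keep the example free of external dependencies it stores each chunk as a std::pair of iterators; with Boost available, boost::make_iterator_range(first, last) should yield the boost::iterator_range<tok_iter> described above. The trick is that the number of spaces before a match equals the index of its first token.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

struct Token {
    std::string tag;
    std::string word;
};

typedef std::vector<Token>::iterator tok_iter;
// Hypothetical stand-in for boost::iterator_range<tok_iter>.
typedef std::pair<tok_iter, tok_iter> tok_range;

// Build the tag string with exactly one space between tags (assumption).
std::string join_tags(const std::vector<Token>& tokens) {
    std::string out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i != 0) out += ' ';
        out += tokens[i].tag;
    }
    return out;
}

// Find maximal runs of JJ/NN tags and map each match back to token iterators.
std::vector<tok_range> find_chunks(std::vector<Token>& tokens) {
    const std::string pos_tags = join_tags(tokens);
    // One or more JJ/NN tags separated by single spaces; \b keeps
    // longer tags such as NNS or JJR from matching partially.
    static const std::regex chunk_re("\\b(?:JJ|NN)(?: (?:JJ|NN))*\\b");

    std::vector<tok_range> chunks;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(pos_tags.begin(), pos_tags.end(), chunk_re);
         it != end; ++it) {
        const std::smatch& m = *it;
        // Spaces before the match == index of the first token in the match.
        const std::ptrdiff_t first =
            std::count(pos_tags.begin(), pos_tags.begin() + m.position(0), ' ');
        // Spaces inside the match + 1 == number of tokens in the match.
        const std::ptrdiff_t len =
            std::count(m[0].first, m[0].second, ' ') + 1;
        assert(static_cast<std::size_t>(first + len) <= tokens.size());
        chunks.push_back(tok_range(tokens.begin() + first,
                                   tokens.begin() + first + len));
    }
    return chunks;
}
```

For the sample tag string above, this sketch would produce two ranges covering tokens [1, 4) and [6, 8). The iterators stay valid only as long as input_tokens is not resized.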
