Using boost::iterator_range in Natural Language Processing

267 views Asked by AndyUK At 31 October 2012 at 22:41

My problem is related to Natural Language Processing (NLP) and the chunking of input strings into logical groups.

To simplify, I have a vector of Token data structures, each of which contains a 'tag' string value, among other things.

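#include <string>
#include <vector>

class Token
{
   public:
      std::string tag;   // part-of-speech tag, e.g. "JJ" or "NN"
      std::string word;  // the word the tag belongs to
      // other stuff
};

std::vector<Token> input_tokens;
typedef std::vector<Token>::iterator tok_iter;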

I also form a concatenated string of 'tag' values taken from each Token in the vector that looks like this:

std::string pos_tags = "DT JJ NN NN IN RB JJ NN DT";
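For illustration, the concatenation step could look like the sketch below. The helper name join_tags is hypothetical, and the Token struct is a cut-down copy of the class above. The property worth keeping is that tags are separated by exactly one space with no leading or trailing space, so that the N-th tag in the string always corresponds to the N-th token in the vector.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Cut-down copy of the Token class from the question.
struct Token {
    std::string tag;
    std::string word;
};

// Hypothetical helper: joins the tags of all tokens with single spaces.
// No leading or trailing space is emitted, so counting spaces in the result
// is enough to recover a token index from a character position.
std::string join_tags(const std::vector<Token>& tokens) {
    std::string joined;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i > 0) joined += ' ';
        joined += tokens[i].tag;
    }
    return joined;
}
```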

I am interested in forming chunks of JJ (adjective) and NN (noun) instances only, so that for the above pos_tags example there would be two matching chunks:

"JJ NN NN", "JJ NN"

Is it possible to run a kind of regex over the pos_tags string such that each regex match represents the range of tokens in the input token set (input_tokens) from which it came? In other words, each chunk formed is not a string, but is represented by a pair of start/end iterators.

Ideally I would like to store the matches I have found as a vector of boost::iterator_range, where each range represents the start/end of each chunk found, something like this:

std::vector< boost::iterator_range<tok_iter> > chunks;

I hope this makes sense. I am not necessarily looking for the complete code, but hints as to how to use regular expressions (I'm a novice) in this way.
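One possible direction, offered as a sketch rather than a definitive answer: run a regex over pos_tags, then convert each match's character position into a token index by counting the spaces in front of it. The sketch uses std::regex and a std::pair of iterators as a stand-in for boost::iterator_range<tok_iter> so that it compiles without Boost. The names find_chunks and tok_range are hypothetical. It assumes pos_tags contains exactly one tag per token, in order, separated by single spaces, and that a chunk is any maximal run of JJ and NN tags (so a lone "NN" also counts as a chunk).

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Cut-down copy of the Token class from the question.
struct Token {
    std::string tag;
    std::string word;
};

typedef std::vector<Token>::iterator tok_iter;
// Stand-in for boost::iterator_range<tok_iter>: a half-open [first, last) pair.
typedef std::pair<tok_iter, tok_iter> tok_range;

// Hypothetical helper: finds maximal runs of JJ/NN tags in pos_tags and
// returns each run as an iterator pair into tokens.
// Assumes pos_tags was built from tokens with single-space separators.
std::vector<tok_range> find_chunks(std::vector<Token>& tokens,
                                   const std::string& pos_tags) {
    // \b keeps the pattern from matching inside longer tags such as "NNS".
    static const std::regex chunk_re(R"(\b(?:JJ|NN)(?: (?:JJ|NN))*\b)");

    std::vector<tok_range> chunks;
    std::sregex_iterator it(pos_tags.begin(), pos_tags.end(), chunk_re);
    std::sregex_iterator end;
    for (; it != end; ++it) {
        const std::smatch& m = *it;
        // Index of the first token = number of spaces before the match.
        std::ptrdiff_t first =
            std::count(pos_tags.begin(), pos_tags.begin() + m.position(), ' ');
        // Number of tokens in the match = spaces inside the match + 1.
        std::ptrdiff_t count = std::count(m[0].first, m[0].second, ' ') + 1;
        chunks.push_back(tok_range(tokens.begin() + first,
                                   tokens.begin() + first + count));
    }
    return chunks;
}
```

Each resulting pair can be turned into the desired range type with boost::make_iterator_range(r.first, r.second) before being pushed into the chunks vector. Recounting spaces from the start of the string for every match is quadratic in the worst case; keeping a running count would avoid that if the tag strings are long.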
