My problem is related to natural language processing (NLP) and the chunking of input strings into logical groups.
To simplify, I have a vector of Token data structures, each of which contains a 'tag' string value, among other things.
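#include <string>
#include <vector>

class Token
{
public:
    std::string tag;   // part-of-speech tag, e.g. "JJ" or "NN"
    std::string word;  // the word the tag belongs to
    // other members
};

std::vector<Token> input_tokens;
typedef std::vector<Token>::iterator tok_iter;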
I also build a space-separated string of the 'tag' values taken from each Token in the vector, which looks like this:
std::string pos_tags = "DT JJ NN NN IN RB JJ NN DT";
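A minimal sketch of how such a string might be built from the token vector. The helper name `join_tags` is hypothetical, and the `Token` struct here is a cut-down stand-in for the class above:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Cut-down stand-in for the Token class in the question.
struct Token {
    std::string tag;
    std::string word;
};

// Hypothetical helper: joins the tag of every token with single spaces,
// so three tokens tagged DT, JJ, NN give "DT JJ NN".
std::string join_tags(const std::vector<Token>& tokens) {
    std::string out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i > 0) out += ' ';  // separator between tags, none at the ends
        out += tokens[i].tag;
    }
    return out;
}
```

Calling `join_tags(input_tokens)` would then produce the `pos_tags` string.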
I am interested in forming chunks made up only of consecutive JJ (adjective) and NN (noun) tags, so that for the pos_tags example above there would be two matching chunks:
"JJ NN NN", "JJ NN"
Is it possible to run a kind of regex over the pos_tags string such that each regex match maps back to the range of tokens in the input token set (input_tokens) that it came from? In other words, each chunk formed is not a string, but is represented by a pair of start/end iterators.
Ideally, I would like to store the matches I find as a vector of boost::iterator_range, where each range represents the start/end of one chunk, something like this:
std::vector< boost::iterator_range<tok_iter> > chunks;
I hope this makes sense. I am a novice with regular expressions, so I am not necessarily looking for complete code, just hints on how to use them in this way.
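To make the idea concrete, here is a rough sketch of one possible direction, not a definitive solution. It assumes C++11 `std::regex` (not named in the question) and records the character offset at which each tag starts while the tag string is built, so that a match position can be translated back into token indices. The names `find_chunks` and `tok_range` are hypothetical, the pattern is only one guess at "runs of JJ/NN tags", and a plain `std::pair` of iterators stands in for `boost::iterator_range` to keep the example free of Boost:

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Cut-down stand-in for the Token class in the question.
struct Token {
    std::string tag;
    std::string word;
};

typedef std::vector<Token>::iterator tok_iter;
typedef std::pair<tok_iter, tok_iter> tok_range;  // half-open: [first, second)

// Hypothetical helper: returns one iterator range per run of JJ/NN tags.
std::vector<tok_range> find_chunks(std::vector<Token>& tokens) {
    // Build the tag string and remember where each token's tag starts in it.
    std::string pos_tags;
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (i > 0) pos_tags += ' ';
        offsets.push_back(pos_tags.size());
        pos_tags += tokens[i].tag;
    }

    // One or more JJ/NN tags separated by single spaces. The \b anchors
    // keep the pattern from matching inside longer tags such as NNS or JJR.
    static const std::regex chunk_re(R"(\b(?:JJ|NN)(?: (?:JJ|NN))*\b)");

    std::vector<tok_range> chunks;
    std::sregex_iterator it(pos_tags.begin(), pos_tags.end(), chunk_re);
    const std::sregex_iterator end;
    for (; it != end; ++it) {
        const std::size_t first_char = static_cast<std::size_t>(it->position());
        const std::size_t last_char =
            first_char + static_cast<std::size_t>(it->length());

        // Token whose tag contains the first matched character.
        const std::ptrdiff_t first_tok =
            std::upper_bound(offsets.begin(), offsets.end(), first_char) -
            offsets.begin() - 1;
        // First token that starts at or after the end of the match.
        const std::ptrdiff_t last_tok =
            std::lower_bound(offsets.begin(), offsets.end(), last_char) -
            offsets.begin();

        chunks.push_back(tok_range(tokens.begin() + first_tok,
                                   tokens.begin() + last_tok));
    }
    return chunks;
}
```

The returned iterators point into `input_tokens`, so they stay valid only while that vector is not resized. With Boost available, each pair could presumably be wrapped using `boost::make_iterator_range(first, last)` to get the `chunks` vector described above.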