Tokenize sentence into words, considering special characters

250 views Asked by At

I have a function that receive a sentence, and tokenize into words, based on space " ". Now, I want to improve the function to eliminate some special characters, for example:

I am a boy.   => {I, am, a, boy}, no period after "boy"
I said :"are you ok?"  => {I, said, are, you, ok}, no question and quotation mark 

The original function is here, how can I improve it?

void Tokenize(const string& str, vector<string>& tokens, const string& delimiters = " ")
{

    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {

        tokens.push_back(str.substr(lastPos, pos - lastPos));

        lastPos = str.find_first_not_of(delimiters, pos);

        pos = str.find_first_of(delimiters, lastPos);
    }
}
1

There are 1 answers

0
A M On

You could use a std::regex. There you could search, whatever you want and then put the result in a vector. That is rather simple.

See:

#include <iostream>
#include <string>
#include <algorithm>
#include <vector>
#include <regex>

// Our test data (raw string). So, containing also \" and so on
std::string testData(R"#(I said :"are you ok?")#");

std::regex re(R"#((\b\w+\b,?))#");

int main(void)
{
    // Define the variable id as vector of string and use the range constructor to read the test data and tokenize it
    std::vector<std::string> id{ std::sregex_token_iterator(testData.begin(), testData.end(), re, 1), std::sregex_token_iterator() };

    // For debug output. Print complete vector to std::cout
    std::copy(id.begin(), id.end(), std::ostream_iterator<std::string>(std::cout, " "));

    return 0;
}