I have a function that receives a sentence and tokenizes it into words, splitting on the space character " ". Now I want to improve the function so that it also strips some special characters, for example:
I am a boy. => {I, am, a, boy}, no period after "boy"
I said :"are you ok?" => {I, said, are, you, ok}, no colon, question mark, or quotation marks
Here is the original function. How can I improve it?
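void Tokenize(const string& str, vector<string>& tokens, const string& delimiters = " ")
{
    // Skip any delimiters at the start of the string.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find the first delimiter after the first token.
    string::size_type pos = str.find_first_of(delimiters, lastPos);
    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token: add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip the delimiters that follow it, then find the end of the next token.
        lastPos = str.find_first_not_of(delimiters, pos);
        pos = str.find_first_of(delimiters, lastPos);
    }
}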
You could use a std::regex. With it you can search for whatever you want and then put the matches in a vector. That is rather simple. See:
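A minimal sketch of that idea, assuming a "word" means a run of letters, digits, or underscores (the \w class in the default ECMAScript grammar). The function name TokenizeWords is hypothetical and not from the original post; it returns the vector instead of filling an output parameter.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (the name is an assumption for illustration):
// collects every run of word characters in `str` into a vector.
std::vector<std::string> TokenizeWords(const std::string& str)
{
    // \w+ matches one or more letters, digits, or underscores, so spaces
    // and punctuation such as . : ? " all act as separators.
    static const std::regex wordPattern(R"(\w+)");

    std::vector<std::string> tokens;
    // Walk over all non-overlapping matches; a default-constructed
    // sregex_iterator marks the end of the sequence.
    for (std::sregex_iterator it(str.begin(), str.end(), wordPattern), end; it != end; ++it)
    {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Calling TokenizeWords("I said :\"are you ok?\"") should give {I, said, are, you, ok}. Note that with this pattern an apostrophe also splits a word, so "don't" becomes "don" and "t"; if that matters, the pattern would need adjusting, for example to R"([\w']+)".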