Split a string by a vector of strings


I have an input string, and I also have a vector of separators.

I want to output a vector containing the strings between the separators and, as additional entries, each separator that was found.

So for example given

"AB<sep1>DE<sep1><sep2>" where "<sep1>" and "<sep2>" are separators

The output would be the vector

"AB", "<sep1>", "DE", "<sep1>", "<sep2>"

Is there an efficient way to implement this?

There are 2 answers

chrysante On BEST ANSWER

Here is a possible implementation.

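#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

std::vector<std::string> split(std::string const& str,
                               std::vector<std::string> const& seps) {
    // Collect the [begin, end) range of every separator occurrence.
    std::vector<std::pair<std::size_t, std::size_t>> sepPos;
    for (auto const& sep: seps) {
        if (sep.empty()) {
            continue; // find("") always matches at pos, so the loop would never advance
        }
        std::size_t pos = 0;
        while ((pos = str.find(sep, pos)) != std::string::npos) {
            sepPos.push_back({ pos, pos + sep.size() });
            pos += sep.size();
        }
    }
    // Sort by start position so the ranges can be walked left to right.
    std::sort(sepPos.begin(), sepPos.end());
    std::vector<std::string> result;
    std::size_t pos = 0;
    for (auto [begin, end]: sepPos) {
        // Skip the empty token between two adjacent separators.
        if (begin > pos) {
            result.push_back(str.substr(pos, begin - pos));
        }
        result.push_back(str.substr(begin, end - begin));
        pos = end;
    }
    // Emit the trailing token, if there is one.
    if (pos < str.size()) {
        result.push_back(str.substr(pos));
    }
    return result;
}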

And a minimal test.

#include <iostream>
#include <string>
#include <vector>

#include "Split.h"

int main() {
    auto tokens = split("AB<sep1>DE<sep2>X<sep1>Y", { "<sep1>", "<sep2>" });
    for (auto& token: tokens) {
        std::cout << token << std::endl;
    }
}

It first finds and collects the positions of all occurrences of the separators in the string. Then it sorts them so that we can iterate over the positions and extract all the substrings.

Note that this breaks down if one separator contains another separator as a substring, since the same stretch of the string is then recorded more than once, but I assume that would be a pathological case. It also breaks if a separator is specified twice.
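
If overlapping or duplicated separators do matter, one way around the limitation is a single left-to-right scan that, at each position, picks the longest separator matching there. The following is only a sketch under that assumption. The name `splitKeepSeps` and the longest-match rule are illustrative choices, not part of the answer above. It also omits empty tokens, which matches the output requested in the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: scans left to right and, at each position,
// prefers the longest separator that matches there.
std::vector<std::string> splitKeepSeps(std::string const& str,
                                       std::vector<std::string> const& seps) {
    std::vector<std::string> result;
    std::size_t tokenStart = 0; // start of the pending non-separator token
    std::size_t i = 0;
    while (i < str.size()) {
        std::size_t bestLen = 0;
        for (auto const& sep : seps) {
            // Empty separators never satisfy size() > bestLen when bestLen is 0.
            if (sep.size() > bestLen && str.compare(i, sep.size(), sep) == 0) {
                bestLen = sep.size();
            }
        }
        if (bestLen == 0) {
            ++i; // no separator starts here, keep extending the token
            continue;
        }
        if (i > tokenStart) {
            result.push_back(str.substr(tokenStart, i - tokenStart));
        }
        result.push_back(str.substr(i, bestLen));
        i += bestLen;
        tokenStart = i;
    }
    if (tokenStart < str.size()) {
        result.push_back(str.substr(tokenStart));
    }
    return result;
}
```

Each position is tested against every separator, so the cost is roughly proportional to the length of the string times the total length of the separators. That is fine for a handful of short separators. For a large separator set, a trie or an Aho-Corasick automaton would likely scale better.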

Radovm On

It depends on how fast the implementation needs to be and on the intended behavior in corner cases. Here is one example implementation:

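#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string s = "AB<sep1>DE<sep1><sep2>test";
    std::vector<std::string> dels = {"<sep1>", "<sep2>"};
    std::vector<std::string> tokens;

    while (true) {
        // Find the delimiter that occurs earliest in the remaining input.
        std::size_t pos = std::string::npos;
        std::size_t len = 0;
        for (auto const& del : dels) {
            if (del.empty()) {
                continue; // an empty delimiter would match forever
            }
            std::size_t p = s.find(del);
            if (p < pos) {
                pos = p;
                len = del.length();
            }
        }
        if (pos == std::string::npos) {
            break;
        }
        tokens.push_back(s.substr(0, pos));
        s.erase(0, pos + len);
    }
    // Whatever is left after the last delimiter is the final token.
    if (!s.empty()) {
        tokens.push_back(s);
    }

    for (auto const& token : tokens) {
        std::cout << token << std::endl;
    }
}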

The output is:

AB
DE

test

So this particular algorithm counts the empty string between <sep1> and <sep2> as a token. Unlike the output requested in the question, it does not emit the separators themselves.
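
If empty tokens are not wanted, they can be filtered out afterwards. A minimal sketch using the erase-remove idiom follows. The helper name `dropEmptyTokens` is made up for illustration and is not part of the answer above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: removes all empty strings from a token list in place.
void dropEmptyTokens(std::vector<std::string>& tokens) {
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [](std::string const& t) { return t.empty(); }),
                 tokens.end());
}
```

Calling `dropEmptyTokens(tokens)` before the printing loop would leave only AB, DE and test.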