Regular Expressions Count in Lookbehind


I have a string that can look like this: 'a, b, (c, d, (e, f), g), (h, i)', and I want to split it only at the commas on the first (outermost) level, so that I get:

a
b
(c, d, (e, f), g)
(h, i)

I can't figure out how to do this. The approach I came up with is to find the commas that have the same number of opening and closing brackets before them. How can I implement this with regular expressions?

Best Regards

3

There are 3 answers

8
gnovice On

Here are a couple of options:

Option 1: If your data has a consistent pattern of commas and parentheses across rows, you can parse it quite easily with a single regex. The downside is that if the pattern changes, you have to change the regex. The upside is that it's fast, even for very large cell arrays:

str = {'(0, 0, 1540.4, (true, (121.96, 5)), 5.7068, 1587.0)';
       '(0, 0, 1537.5, (true, (121.93, 6)), 5.7068, 1587.0)';
       '(0, 0, 1537.5, (true, (121.93, 3)), 5.7068, 1587.0)';
       '(0, 0, 1537.5, (true, (121.93, 4)), 5.7068, 1587.0)';
       '(0, 0, 1537.5, (true, (121.93, 5)), 6.0965, 1587.0)';
       '(0, 0, 1535.2, (true, (121.9, 6)), 6.0965, 1587.0)';
       '(0, 0, 1535.2, (true, (121.9, 3)), 6.0965, 1587.0)';
       '(0, 0, 1535.2, (true, (121.9, 4)), 6.0965, 1587.0)';
       '(0, 0, 1535.2, (true, (121.9, 5)), 6.3782, 1587.0)';
       '(0, 0, 1532.3, (true, (121.87, 6)), 6.3782, 1587.0)'};

tokens = regexp(str, ['^\(([-\d\.]+), ' ... % Column 1
                         '([-\d\.]+), ' ... % Column 2
                         '([-\d\.]+), ' ... % Column 3
                         '(\(\w+, \([-\d\.]+, [-\d\.]+\)\)), ' ... % Column 4 (nested group)
                         '([-\d\.]+), ' ... % Column 5
                         '([-\d\.]+)\)$'], ... % Column 6, then the closing parenthesis
                'tokens', 'once');
str = vertcat(tokens{:});
disp(str);

And the result for this example:

'0'    '0'    '1540.4'    '(true, (121.96, 5))'    '5.7068'    '1587.0'
'0'    '0'    '1537.5'    '(true, (121.93, 6))'    '5.7068'    '1587.0'
'0'    '0'    '1537.5'    '(true, (121.93, 3))'    '5.7068'    '1587.0'
'0'    '0'    '1537.5'    '(true, (121.93, 4))'    '5.7068'    '1587.0'
'0'    '0'    '1537.5'    '(true, (121.93, 5))'    '6.0965'    '1587.0'
'0'    '0'    '1535.2'    '(true, (121.9, 6))'     '6.0965'    '1587.0'
'0'    '0'    '1535.2'    '(true, (121.9, 3))'     '6.0965'    '1587.0'
'0'    '0'    '1535.2'    '(true, (121.9, 4))'     '6.0965'    '1587.0'
'0'    '0'    '1535.2'    '(true, (121.9, 5))'     '6.3782'    '1587.0'
'0'    '0'    '1532.3'    '(true, (121.87, 6))'    '6.3782'    '1587.0'

Note that I used the pattern [-\d\.]+ to match an arbitrary number which may have a negative sign or decimal point.
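
One caveat about that pattern, sketched here in JavaScript rather than MATLAB: a character class such as [-\d.]+ accepts any run of digits, dots and minus signs, so it also matches strings that are not valid numbers. The stricter pattern below is only a suggestion and is not part of the original answer; the names loose and strict are made up for this illustration.

```javascript
// Loose pattern, as in the answer: any run of '-', digits and '.'.
const loose = /^[-\d.]+$/;
// Hypothetical stricter alternative: optional sign, digits, optional fraction.
const strict = /^-?\d+(?:\.\d+)?$/;

for (const s of ['1540.4', '-5', '1.2.3', '--']) {
  console.log(s, loose.test(s), strict.test(s));
}
// Both patterns accept '1540.4' and '-5';
// only the loose one accepts '1.2.3' and '--'.
```

For clean numeric data like the example rows the two patterns behave the same, so the loose form is a reasonable shortcut there.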

Option 2: You can use regexprep to repeatedly blank out innermost pairs of parentheses (those that don't contain other parentheses), replacing each one with whitespace of the same length so that character positions are preserved. Once no parentheses are left, the only commas remaining are the top-level ones. Find their positions in the processed string and break up the original string at those positions. You won't have to change the regex for each new pattern of commas and parentheses, but this is a little slower than Option 1 (still only a second or two for arrays of up to 15,000 cells):

% Uses the original str cell array defined above (before Option 1 reassigned it):
str = cellfun(@(s) {s(2:end-1)}, str);
tempStr = str;
modStr = regexprep(tempStr, '(\([^\(\)]*\))', '${blanks(numel($0))}');
while ~isequal(modStr, tempStr)
  tempStr = modStr;
  modStr = regexprep(tempStr, '(\([^\(\)]*\))', '${blanks(numel($0))}');
end

commaIndex = regexp(tempStr, ',');
str = cellfun(@(v, s) {mat2cell(s, 1, diff([1 v numel(s)+1]))}, commaIndex, str);
str = strtrim(strip(vertcat(str{:}), ','));
disp(str);

This gives the same result as Option 1.
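
For readers who don't use MATLAB, here is a minimal JavaScript sketch of the same masking idea. It is an illustration under assumptions, not a translation of the code above: the function name splitTopLevelByMasking is made up, and it assumes balanced parentheses (commas inside an unmatched parenthesis would be treated as top-level).

```javascript
// Hypothetical helper: split at commas that lie outside all parentheses.
function splitTopLevelByMasking(input) {
  const innermost = /\([^()]*\)/g; // a pair of parentheses with no nesting inside
  let masked = input;
  let previous;
  // Blank out innermost groups until nothing changes, keeping the length fixed.
  do {
    previous = masked;
    masked = previous.replace(innermost, (m) => ' '.repeat(m.length));
  } while (masked !== previous);

  // Commas that survive in the masked string are top-level commas.
  const parts = [];
  let start = 0;
  for (let i = 0; i < masked.length; i++) {
    if (masked[i] === ',') {
      parts.push(input.slice(start, i).trim());
      start = i + 1;
    }
  }
  parts.push(input.slice(start).trim());
  return parts;
}

console.log(splitTopLevelByMasking('0, 0, 1540.4, (true, (121.96, 5)), 5.7068, 1587.0'));
// [ '0', '0', '1540.4', '(true, (121.96, 5))', '5.7068', '1587.0' ]
```

Because every replacement has the same length as the text it replaces, an index found in the masked string is valid in the original string, which is what makes the final slicing step work.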

1
cjaube On

I know the question asks how to implement this with regular expressions, but if you were to parse the string character by character, you could simply keep track of the nesting level as you go. Here is a JavaScript snippet to demonstrate how it might be done (https://jsfiddle.net/jf65k0jc/):

var str = 'a, b, (c, d, (e, f), g), (h, i)';
var arr = [];
var buffer = '';
var level = 0;
for (var i = 0; i < str.length; i++) {
  var letter = str[i];

  if (level === 0) {
    if (letter === ',') {
      arr.push(buffer.trim());
      buffer = '';
    }
    else {
      buffer += letter;
      if (letter === '(') {
        level++;
      }
    }
  }
  else {
    buffer += letter;
    if (letter === '(') {
      level++;
    }
    else if (letter === ')') {
      level--;
    }
  }
}
arr.push(buffer.trim());

var output = '';
for (var i = 0; i < arr.length; i++) {
  output += arr[i] + '<br>';
}
$('.output').html(output);

// Outputs:
// a
// b
// (c, d, (e, f), g)
// (h, i)
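
The same loop can also be wrapped in a function with no jQuery or DOM dependency, which makes it easier to reuse and test. This is only a sketch: the name splitTopLevel and the decision to throw on unbalanced parentheses are assumptions added here, not part of the snippet above.

```javascript
// Hypothetical DOM-free variant of the nesting-level loop.
function splitTopLevel(str) {
  const parts = [];
  let buffer = '';
  let level = 0; // current parenthesis nesting depth
  for (const ch of str) {
    if (ch === '(') level++;
    else if (ch === ')') level--;
    if (level < 0) throw new Error('Unbalanced ")"');
    if (ch === ',' && level === 0) {
      // Top-level comma: close the current piece.
      parts.push(buffer.trim());
      buffer = '';
    } else {
      buffer += ch;
    }
  }
  if (level !== 0) throw new Error('Unbalanced "("');
  parts.push(buffer.trim());
  return parts;
}

console.log(splitTopLevel('a, b, (c, d, (e, f), g), (h, i)'));
// [ 'a', 'b', '(c, d, (e, f), g)', '(h, i)' ]
```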
0
rahnema1 On

A solution without regex:

a = 'a, b, (c, d, (e, f), g), (h, i)';
a(cumsum((a=='(')-(a==')'))==0 & a==',')=';'
out = strsplit(a, ';')

Result:

{
  [1,1] = a
  [1,2] =  b
  [1,3] =  (c, d, (e, f), g)
  [1,4] =  (h, i)
}

How it works: we can find the nesting level of each character using

cumsum((a=='(')-(a==')'));

which gives this array of nesting levels:

0000001111111222221111000111110

For example, the first 6 characters 'a, b, ' are at level 0, and so on. Note that a closing parenthesis is assigned the level it returns to. We only need the characters that are at level 0:

cumsum((a=='(')-(a==')'))==0

and they must also be commas:

cumsum((a=='(')-(a==')'))==0 & a==','

Set all commas at level 0 to ';' (this assumes the string does not already contain a ';'):

a(cumsum((a=='(')-(a==')'))==0 & a==',')=';'

and split the string:

strsplit(a, ';')
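
To make the running-sum step concrete, here is a rough JavaScript equivalent. JavaScript has no built-in cumsum, so a running counter stands in for it; the variable names levels and parts are made up for this sketch.

```javascript
const a = 'a, b, (c, d, (e, f), g), (h, i)';

// Running nesting level per character: +1 at '(', -1 at ')'.
let depth = 0;
const levels = Array.from(a, (ch) => {
  if (ch === '(') depth++;
  else if (ch === ')') depth--;
  return depth;
});
console.log(levels.join(''));
// 0000001111111222221111000111110

// Cut at every position that is both at level 0 and a comma.
const parts = [];
let start = 0;
levels.forEach((lvl, i) => {
  if (lvl === 0 && a[i] === ',') {
    parts.push(a.slice(start, i));
    start = i + 1;
  }
});
parts.push(a.slice(start));
console.log(parts);
// [ 'a', ' b', ' (c, d, (e, f), g)', ' (h, i)' ]
```

As in the Octave output above, the pieces keep their leading spaces; trimming each piece afterwards removes them.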