Priority in regex manipulating


I wrote some Java code to split a string into an array of strings. First I split the string using the regex pattern "\\,\\,|\\,", and then I split it using the pattern "\\,|\\,\\,". Why is there a difference between the output of the first and the output of the second?

public class Test2 {
    public static void main(String[] args) {

        String regex1 = "\\,\\,|\\,"; // double comma first, then single comma
        String regex2 = "\\,|\\,\\,"; // single comma first, then double comma

        String a = "20140608,FT141590Z0LL,0608103611018634TCKJ3301000000018667,3000054789,IDR1742630000001,80507,1000,6012,TCKJ3301,6.00E+12,ID0010015,WADORI PURWANTO,,3000054789";
        String[] ss = a.split(regex1); // replace regex1 with regex2 to get the second output

        int index = 0;
        for (String m : ss) {
            System.out.println((index++) + ": " + m + "|");
        }
    }
}

Output when using regex1:

0: 20140608|
1: FT141590Z0LL|
2: 0608103611018634TCKJ3301000000018667|
3: 3000054789|
4: IDR1742630000001|
5: 80507|
6: 1000|
7: 6012|
8: TCKJ3301|
9: 6.00E+12|
10: ID0010015|
11: WADORI PURWANTO|
12: 3000054789|

And when using regex2:

0: 20140608|
1: FT141590Z0LL|
2: 0608103611018634TCKJ3301000000018667|
3: 3000054789|
4: IDR1742630000001|
5: 80507|
6: 1000|
7: 6012|
8: TCKJ3301|
9: 6.00E+12|
10: ID0010015|
11: WADORI PURWANTO|
12: |
13: 3000054789|

I need an explanation of how the regex engine handles this situation.

There are 4 answers

Unihedron On BEST ANSWER

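How regex works: the state machine always reads from left to right and, at each position, tries the alternatives in the order they are written. ",|,," is equivalent to ",", because the first alternative always matches and the second one is never reached:

(image; source: gyazo.com)

",,|," is equivalent to ",,?":

(image; source: gyazo.com)

However, you should use ",,?" instead, so that there is no backtracking:

(image; source: gyazo.com)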
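The equivalences above can be checked with a short test. This is a minimal sketch, not part of the original answer: the class name AlternationOrderDemo, the helper splitWith and the sample input "a,b,,c" are made up for illustration.

```java
import java.util.Arrays;

class AlternationOrderDemo {
    // Hypothetical helper: splits the input with the given regex and
    // returns the resulting array formatted by Arrays.toString.
    static String splitWith(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        String input = "a,b,,c"; // made-up sample containing one double comma

        // Single comma listed first: ",," is consumed as two separate delimiters,
        // so an empty string appears between them.
        System.out.println(splitWith(input, ",|,,")); // [a, b, , c]

        // Double comma listed first: ",," is consumed as one delimiter.
        System.out.println(splitWith(input, ",,|,")); // [a, b, c]

        // Same result as ",,|," but written without alternation.
        System.out.println(splitWith(input, ",,?"));  // [a, b, c]
    }
}
```

Splitting the same input with a plain "," gives the same result as ",|,,", which is what the first equivalence claims.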

vks On

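Case 1: split by ",," or else ",".
The double comma is matched as a single delimiter wherever it occurs; everything else is split by ",".

Case 2: split by "," or else ",,".
The single comma matches every time, so ",," is never matched as a whole. "word1,,word2" is first split into "word1" and ",word2".
Then ",word2" is split into "" (an empty string) and "word2".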

takeo999 On

Seeing the two results, it seems that at each position the split method tries the first expression first ("," for regex2, ",," for regex1) and falls back to the second one only if the first does not match. With regex2 the single "," always matches, so the ",," alternative is never used and the two commas are consumed as two separate delimiters. That's why an empty string is detected when ",," is read with regex2.

So for your regex to be useful, you need to write the longer, more specific expression first.

Swapnil On

The alternatives are evaluated from left to right. In regex1, \\,\\, is tried first, and \\, is tried only if that fails. That's why the string at index 12 is not empty: \\,\\, matches the double comma as a single delimiter. With regex2, every delimiter is matched by \\, alone, so the double comma becomes two delimiters with an empty string between them.
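The left-to-right evaluation can also be observed by listing the delimiter matches themselves instead of the split result. The following is a rough sketch under assumed names: DelimiterMatchDemo, matchSpans and the input "x,,y" are illustrative and do not come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterMatchDemo {
    // Hypothetical helper: returns every match of the regex in the input
    // as "start-end", where end is exclusive (as reported by Matcher.end()).
    static List<String> matchSpans(String input, String regex) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        String input = "x,,y"; // made-up sample: commas at indices 1 and 2

        // Single comma first: two one-character matches.
        System.out.println(matchSpans(input, "\\,|\\,\\,")); // [1-2, 2-3]

        // Double comma first: one two-character match.
        System.out.println(matchSpans(input, "\\,\\,|\\,")); // [1-3]
    }
}
```

Two adjacent matches leave nothing between them, which is where the empty element in the regex2 output comes from.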