I can't understand exactly how the g flag works in JavaScript's string.match(regexp) method


The book "JavaScript: The Good Parts" explains the string.match(regexp) method as follows:

The match method matches a string and a regular expression. How it does this depends on the g flag. If there is no g flag, then the result of calling string.match(regexp) is the same as calling regexp.exec(string). However, if the regexp has the g flag, then it produces an array of all the matches but excludes the capturing groups:

The book then provides this code example:

var text = '<html><body bgcolor=linen><p>This is <b>bold<\/b>!<\/p><\/body><\/html>';
var tags = /[^<>]+|<(\/?)([A-Za-z]+)([^<>]*)>/g;
var a, i;
a = text.match(tags);
for (i = 0; i < a.length; i += 1) {
    document.writeln(('// [' + i + '] ' + a[i]).entityify());
}
// The result is
// [0] <html>
// [1] <body bgcolor=linen>
// [2] <p>
// [3] This is
// [4] <b>
// [5] bold
// [6] </b>
// [7] !
// [8] </p>
// [9] </body>
// [10] </html>

My question is that I don't understand what "but excludes the capturing groups" means.

In the code example above, html in the </html> is in a capturing group. Why is it still included in the result array?

The / in the </html> is also in a capturing group. Why is it included in the result array?

Could you explain "but excludes the capturing groups" with the code example above?

Thank you very much!

There are 2 answers

T.J. Crowder On BEST ANSWER

In the code example above, html in the </html> is in a capturing group. Why is it still included in the result array?

Because it's part of the full match. When the book says "but excludes the capturing groups", it doesn't mean that the text matched by the groups is removed from the full match, just that the contents of the capture groups aren't listed again as separate entries in the array. If the capturing groups were included, you'd see

// The result is
// [0] <html>
// [1]           // From the capture group; nothing here
// [2] html      // From the capture group
// [3]           // From the capture group; nothing here
// ...

The / in the </html> is also in a capturing group. Why is it included in the result array?

For the same reason as above: It's part of the overall match, and that's what's in the result; the contents of the individual capture groups are not.
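To see what the capture groups themselves hold for the </html> match, here is a minimal sketch. It assumes the same pattern as in the question but with the g flag dropped, so that match() reports the groups; the variable name closing is only for illustration.

```javascript
// Same pattern as in the question, but WITHOUT the g flag,
// so match() returns the full match followed by the capture groups.
var tags = /[^<>]+|<(\/?)([A-Za-z]+)([^<>]*)>/;
var closing = '</html>'.match(tags);

console.log(closing[0]); // full match:               '</html>'
console.log(closing[1]); // group 1 (optional slash): '/'
console.log(closing[2]); // group 2 (tag name):       'html'
console.log(closing[3]); // group 3 (attributes):     '' (empty string)
```

With the g flag, only entries like closing[0] end up in the result array; the slash and the tag name appear there because they are part of that full match.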

This is easier to understand with a simpler example. Consider this code:

var s = "test1 test2";
var re = /(test)(.)/g;
var r = s.match(re);
var i;
for (i = 0; i < r.length; ++i) {
    console.log("[" + i + "]: '" + r[i] + "'");
}

Because the regular expression has the g flag, only the full matches are included in the array, so we see:

[0]: 'test1'
[1]: 'test2'

In each case, the entry in the array is the full match, which includes the characters that matched within capture groups making up the overall expression.
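As an aside, if you want every match and its capture groups, one common approach is to call exec() in a loop instead of match(). This is a sketch rather than part of the book's example; the names groups and m are made up for illustration.

```javascript
var s = "test1 test2";
var re = /(test)(.)/g;
var groups = [];
var m;

// With the g flag, each exec() call resumes from re.lastIndex,
// so the loop visits every match and stops when exec() returns null.
while ((m = re.exec(s)) !== null) {
    groups.push([m[0], m[1], m[2]]); // full match, group 1, group 2
}

console.log(groups);
// [ ['test1', 'test', '1'], ['test2', 'test', '2'] ]
```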

If we removed the g flag but didn't change anything else, we'd get the first full match followed by the contents of the two capture groups:

[0]: 'test1'    // The full match, including the stuff from each capture group
[1]: 'test'     // Capture group 1's contents
[2]: '1'        // Capture group 2's contents

There, the first entry is the full match; the second and third are the contents of the capture groups. Note that the contents of the capture groups

CrayonViolent On

The g modifier applies the regex globally. Without it, match() stops at the first match found and returns only that. With it, match() searches for and returns all occurrences in the string.
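A short sketch of that difference, using a made-up input string and a pattern without capture groups:

```javascript
var s = "test1 test2";

// No g flag: match() stops at the first match (same result as exec()).
var first = s.match(/test./);
// g flag: match() returns every full match in the string.
var all = s.match(/test./g);

console.log(first[0]);    // 'test1'
console.log(first.index); // 0, the position of that first match
console.log(all);         // [ 'test1', 'test2' ]
```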