I'm struggling to work with regular expressions. I think I understand the individual expressions but combining something together has me completely stumped. I don't grasp the use of something equivalent to an AND operator to connect the pieces I want together into a "full" match expression.

For example, I'd like to split a string into an array breaking on any values of <1> to <57> and </1> to </57>.

So, I thought I needed something like:

( '<' or '<\/' ) and ( [1-9] or [1-4][0-9] or [5][0-7] ) and '>'

I can get separately <[1-4][0-9]> to work or </[1-4][0-9]>, but when combined together with a '|' it returns partial matches or undefined in between full matches.

Could you please tell me what I am not understanding? Attached is my example.

If you click 'try' for the first expression, it produces empty values after each <21> or </21>; these print as undefined when I test with console.log. The second expression produces < and </ after each tag. I don't understand this, let alone how to convert the fuller expression from earlier in this question into a RegExp.

The desired output is:

'This is a', '<21>', 'test', '<\/21>', '.'

Thank you.

ADDITION: After receiving Georg's answer to this question, I became interested in finding a way to escape these tags, especially since negative lookbehind is currently supported only in Chrome. By that I mean \<21> would be treated as regular text and would not split the string at that point. If you are interested in something similar, you may find the answer to my follow-up question, provided by Revo here, quite helpful.

let b, B = document.querySelectorAll('button');

for ( b of B ) b.addEventListener( 'click', split_str, false );

function split_str( evt )
 {
   let e = evt.currentTarget,
       r = new RegExp( e.previousElementSibling.value ),
       s = e.parentNode.previousElementSibling.value;
   e.parentNode.lastElementChild.textContent = s.split(r);   
 }
div > div  { border: 1px solid rgb(150,150,150); width: 500px; height: 200px;padding: 5px; }

input { border: 1px solid rgb(150,150,150); width: 500px; margin-bottom: 20px; padding:5px; }
<input type='text' value="This is a<21>test</21>.">

<div>

<input type='text' value="(<[1-4][0-9]>)|(<\/[1-4][0-9]>)"> <button>try</button>

<input type='text' value="((<|<\/)[1-4][0-9]>)"> <button>try</button>

<div></div>

</div> 

3 Answers

1
georg On Best Solutions

Ok, let's start with the number thingy. It's fine, except there's technically no need to bracket a single symbol like [5]:

 [1-9] | [1-4][0-9] | 5[0-7]

(using spaces here and below for clarity).

For the first part, an alternation like a | ab reads better when written as ab?, that is, "a, and then, optionally, b". That gives us

 < \/ ?

Now, the "and" (or rather "and then") operator you were looking for, is very simple in the regex language - it's nothing. That is, a and then b is just ab.
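As a small illustration of "and then" being plain concatenation, here is a sketch that writes two pieces separately and joins them. The variable names are made up for this example, and building the regex from `.source` strings is just one way to make the point visible.

```javascript
// Hypothetical names, for illustration only.
const open = /<\/?/;          // "<", and then optionally "/"
const digits = /[1-4][0-9]/;  // one digit 1-4, and then one digit 0-9

// "open, and then digits, and then >" is just the pieces written one after another.
const tag = new RegExp(open.source + digits.source + '>');

console.log(tag.test('<21>'));   // true
console.log(tag.test('</21>'));  // true
console.log(tag.test('<2>'));    // false: two digits are required here
```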

However, if we combine both parts simply like this

a  x | y | z

that would be a mistake, because | has low priority, so that would be interpreted as

ax | y | z

which is not what we want. So we need to put the number part in parens. For reasons explained below, these parens also have to be non-capturing:

<\/?  (?: [1-9] | [1-4][0-9] | 5[0-7] )  >
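The precedence point can be checked directly. This is a minimal sketch using the placeholder letters a, x, y, z from the explanation above, not the real tag pattern:

```javascript
const loose = /ax|y|z/;        // read as (ax) | (y) | (z)
const grouped = /a(?:x|y|z)/;  // "a", and then one of x, y, z

console.log(loose.test('y'));     // true: "y" on its own is a complete alternative
console.log(grouped.test('y'));   // false: the leading "a" is required
console.log(grouped.test('ay'));  // true
```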

This matches our delimiters, but we also need everything in between, so we're going to split the input. split normally returns an array of strings that do not match the delimiter:

"a,b,c".split(/,/) => a b c

If we want to include the delimiter too, it has to be placed in a capturing group:

"a,b,c".split(/(,)/) => a , b , c

so we have to wrap everything in parens once again:

(  <\/?  (?: [1-9] | [1-4][0-9] | 5[0-7] )  >  )

and that's the reason for ?: - we want the whole thing to be captured, but not the number part.
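To see why the inner group has to be non-capturing, here is a sketch that runs the two expressions from the question's snippet next to a corrected one. split adds one array slot per capturing group for every delimiter it finds, and a group that did not take part in the match shows up as undefined.

```javascript
const s = 'This is a<21>test</21>.';

// Two capturing groups side by side: each split reports both,
// and the one that did not match is undefined.
const twoGroups = s.split(/(<[1-4][0-9]>)|(<\/[1-4][0-9]>)/);
console.log(twoGroups);
// [ 'This is a', '<21>', undefined, 'test', undefined, '</21>', '.' ]

// Nested capturing groups: the inner (<|<\/) is reported as well.
const nested = s.split(/((<|<\/)[1-4][0-9]>)/);
console.log(nested);
// [ 'This is a', '<21>', '<', 'test', '</21>', '</', '.' ]

// With the inner group non-capturing, only the whole tag is kept.
const fixed = s.split(/((?:<|<\/)[1-4][0-9]>)/);
console.log(fixed);
// [ 'This is a', '<21>', 'test', '</21>', '.' ]
```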

Putting it all together seems to do the trick:

s = "This is a<21>test</21>."


console.log(s.split(/(<\/?(?:[1-9]|[1-4][0-9]|5[0-7])>)/))
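A few extra inputs make it easier to check the edges of the 1 to 57 range. This is only a sketch: the name TAG and the sample strings are made up for illustration, and dropping empty strings with filter is an optional extra, not part of the original answer.

```javascript
const TAG = /(<\/?(?:[1-9]|[1-4][0-9]|5[0-7])>)/;

// Upper edge of the range is still a delimiter.
console.log('a<57>b</57>c'.split(TAG));
// [ 'a', '<57>', 'b', '</57>', 'c' ]

// Out-of-range numbers are left alone, so nothing is split.
console.log('a<58>b<0>c'.split(TAG));
// [ 'a<58>b<0>c' ]

// Adjacent or leading tags produce empty strings between them.
const parts = '<1><2>x'.split(TAG);
console.log(parts);
// [ '', '<1>', '', '<2>', 'x' ]

// Optional: drop the empty strings if they are not wanted.
const nonEmpty = parts.filter(part => part !== '');
console.log(nonEmpty);
// [ '<1>', '<2>', 'x' ]
```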

Hope this sheds some light

2
junvar On

You've almost got it. It's really as simple as replacing 'or' with | and replacing 'and' with concatenation. Then make sure your groups are non-capturing by adding ?: to the beginning of each:

(?:<|<\/)(?:[1-9]|[1-4][0-9]|[5][0-7])>

MDN has an explanation on the interaction of split and regex. But the short example-explanation is:

'hi_joe'.split('_'); // ['hi', 'joe']
'hi_joe'.split(/_/); // ['hi', 'joe']
'hi_joe'.split(/(_)/); // ['hi', '_', 'joe']
'hi_joe'.split(/(?:_)/); // ['hi', 'joe']

Update per comment: if you'd like the <##> tags in your results array as well, wrap the regex in an additional set of parens:

((?:<|<\/)(?:[1-9]|[1-4][0-9]|[5][0-7])>)

0
bdebaere On

The way I understand regex is that, unless specified otherwise intentionally e.g. an OR clause, everything you define as a regex is in the form of an AND. [a-z] will match one character, whereas [a-z][a-z] will match one character AND another character.

Depending on your use case, the regex below could be what you need. As you can see, it captures everything between <number> and </number>.

<[1-5][0-9]>([\s\S]*?)<\/[1-5][0-9]>

<[1-5][0-9]> matches <number> where number is between 10 and 59.
[\s\S]*? matches any character at all, including new lines, zero or more times, taking as few characters as possible.
<\/[1-5][0-9]> matches </number> where number is between 10 and 59.
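A quick check of which numbers the opening-tag part accepts may help here. This is a sketch with made-up sample tags; the ^ and $ anchors are added only so that each sample is tested as a whole.

```javascript
const openTag = /^<[1-5][0-9]>$/;

const samples = ['<5>', '<09>', '<10>', '<57>', '<59>'];
const accepted = samples.filter(tag => openTag.test(tag));

console.log(accepted);
// [ '<10>', '<57>', '<59>' ]
```

Note that single-digit tags such as <5> are not matched, and <58> and <59> are, so this pattern does not cover exactly the 1 to 57 range from the question.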

Here is a snippet returning everything between <number> and </number>. It converts the matches to an array and gets the first capture group of the first match. The first capture group is everything between <number> and </number>, as you can see from the parentheses in the regex itself.

let str = '<10>Hello, world!</10>';

let reg = /<[1-5][0-9]>([\s\S]*?)<\/[1-5][0-9]>/g;

let matches = Array.from( str.matchAll(reg) );

console.log(matches[0][1]);
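If the input holds more than one tagged section, the same approach can collect the first capture group of every match. A minimal sketch, assuming a made-up input string and using the mapping argument of Array.from:

```javascript
const text = '<10>Hello</10> and <21>world</21>';
const pattern = /<[1-5][0-9]>([\s\S]*?)<\/[1-5][0-9]>/g;

// matchAll needs the g flag; m[1] is the first capture group of each match.
const inner = Array.from(text.matchAll(pattern), m => m[1]);

console.log(inner);
// [ 'Hello', 'world' ]
```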