RegEx replace returns unexpected result without .*


I'm trying to create a regular expression that does the following transformations:

  1. Apple Orange > AO
  2. Load Module > LM
  3. anApple Orange > O
  4. toLoad Module > M

I found a suitable pattern, but noticed a strange behavior. Here's my initial try:

/^([A-Z])?[^ ]* ([A-Z])/

Running a replace on the third (and fourth) test case with this expression gives me a surprising result:

'anApple Orange'.replace(/^([A-Z])?[^ ]* ([A-Z])/,'$1$2')
> "Orange"

Why is that surprising? Well, the first group obviously does not match, since the string does not start with a capital letter. But the second group only selects a single capital letter, ([A-Z]), not everything after it, as ([A-Z].*) would.

To my surprise, adding .* right after the last capture group gave me the correct result:

'anApple Orange'.replace(/^([A-Z])?[^ ]* ([A-Z]).*/,'$1$2')
> "O"

Why this is happening is beyond my understanding of JS and regular expressions. I'd love to know what sort of dark magic is causing a single [A-Z] to return multiple, and even some lowercase, characters.

Here's a runnable demo:

var testCases = [
      'Apple Orange',
      'Load Module',
      'anApple Orange',
      'toLoad Module'
    ],
    badregex = /^([A-Z])?[^ ]* ([A-Z])/,
    goodregex = /^([A-Z])?[^ ]* ([A-Z]).*/;

document.onreadystatechange = function(n){
  if (document.readyState === "complete"){
      for (var i=0,l=testCases.length; i<l; i++){
        var p = document.createElement('p'),
            testCase = testCases[i];
        p.innerHTML = ""+testCase+" &gt; "+testCase.replace(badregex,'$1$2')
        document.body.appendChild(p);
      }
      document.body.appendChild(document.createElement('hr'));
      for (var i=0,l=testCases.length; i<l; i++){
        var p = document.createElement('p'),
            testCase = testCases[i];
        p.innerHTML = ""+testCase+" &gt; "+testCase.replace(goodregex,'$1$2')
        document.body.appendChild(p);
      }
  }
}


There are 3 answers

5
Avinash Raj On BEST ANSWER

I would do it like this:

> "Apple Orange".replace(/(?:^|\s)([A-Z])|./g, "$1")
'AO'

Don't overcomplicate things. Just capture every uppercase character that sits at the start of the string or right after whitespace, and let the bare . match each remaining character one at a time. Then replace every match with $1. Note that replace swaps out everything that was matched, not just the captured part: where group 1 did not participate, $1 is empty, so that character is simply dropped.
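To check the pattern against all four inputs from the question, here is a minimal sketch. The names `cases`, `initialsPattern` and `results` are only illustrative and do not come from the answer.

```javascript
// Inputs and expected outputs taken from the question.
const cases = [
  ['Apple Orange', 'AO'],
  ['Load Module', 'LM'],
  ['anApple Orange', 'O'],
  ['toLoad Module', 'M'],
];

// Either capture an uppercase letter at the start or after whitespace,
// or match any single character without capturing anything.
const initialsPattern = /(?:^|\s)([A-Z])|./g;

// Every match is replaced by $1, which is empty whenever group 1 did not participate.
const results = cases.map(([input]) => input.replace(initialsPattern, '$1'));

cases.forEach(([input, expected], i) => {
  console.log(`${input} > ${results[i]} (expected ${expected})`);
});
```

Because the pattern is global, each character of the input is consumed by exactly one match, so nothing outside a match is left over to leak into the result.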


Why?

'anApple Orange'.replace(/^([A-Z])?[^ ]* ([A-Z])/,'$1$2')
> "Orange"
  • ([A-Z])? looks for an optional uppercase letter at the start. There is none, so the group does not participate in the match and $1 expands to an empty string.
  • [^ ]* matches zero or more non-space characters, here anApple.
  • <space> matches a space.
  • ([A-Z]) captures only the first letter of Orange.
  • The whole match is therefore anApple O. Replacing it with $1$2 (an empty string followed by O) leaves the unmatched range in place, which gives you Orange.
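The steps above can be reproduced by hand with `RegExp.prototype.exec`, which exposes the full match and each group. This is only a sketch to make the mechanics visible. The variable names are illustrative.

```javascript
const input = 'anApple Orange';
const m = /^([A-Z])?[^ ]* ([A-Z])/.exec(input);

const matched = m[0]; // the full matched substring
const first = m[1];   // undefined, because the optional group did not participate
const second = m[2];  // the single captured uppercase letter

// Rebuild what replace() does: text before the match, the replacement, text after the match.
const before = input.slice(0, m.index);
const after = input.slice(m.index + matched.length);
const rebuilt = before + (first === undefined ? '' : first) + second + after;

console.log(JSON.stringify(matched));
console.log(JSON.stringify(after));
console.log(rebuilt === input.replace(/^([A-Z])?[^ ]* ([A-Z])/, '$1$2'));
```

The lowercase letters in the result never came from `[A-Z]`. They are the part of the input that the regex did not match at all.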
1
t.niese On

Your first example matches anApple O. $1 is empty because ^([A-Z])? is optional and does not match, and $2 is O. So you replace anApple O with O in the string anApple Orange, and this results in Orange.
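The same reasoning suggests why appending .* changes the output: the match then extends to the end of the string, so nothing is left outside it. A small sketch comparing the two patterns from the question follows. `untouchedRemainder` is a hypothetical helper written for this illustration.

```javascript
const subject = 'anApple Orange';
const withoutTail = /^([A-Z])?[^ ]* ([A-Z])/;
const withTail = /^([A-Z])?[^ ]* ([A-Z]).*/;

// Return the part of the string after the match, which replace() leaves untouched.
function untouchedRemainder(str, regex) {
  const m = regex.exec(str);
  return m === null ? str : str.slice(m.index + m[0].length);
}

const remainderWithout = untouchedRemainder(subject, withoutTail);
const remainderWith = untouchedRemainder(subject, withTail);

console.log(JSON.stringify(remainderWithout));
console.log(JSON.stringify(remainderWith));
```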

4
anubhava On

Instead of using replace with a complex regex, you can use a very simple regex with match and then join the results to get your desired output:

'anApple Orange'.match(/\b([A-Z])/g).join('')
//=> O

'Apple Orange'.match(/\b([A-Z])/g).join('')
//=> AO
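One caveat with this approach: `match` with a global regex returns null when nothing matches, so calling `join` directly would throw for an input with no qualifying capital letter. A hedged sketch of a guarded version follows. The function name `initials` is an assumption for illustration and is not part of the answer.

```javascript
// Hypothetical helper name; not from the answer.
function initials(str) {
  // match() returns null when nothing matches, so fall back to an empty array.
  const letters = str.match(/\b[A-Z]/g) || [];
  return letters.join('');
}

console.log(initials('Apple Orange'));
console.log(initials('anApple Orange'));
console.log(JSON.stringify(initials('no capitals here')));
```

Note that \b treats any non-word character as a boundary, not only spaces, so an input such as foo-Bar would also yield B. That may or may not be what you want.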