Ruby: break string into words by capital letters and acronyms

I need to break a string into several strings at capital letters and acronyms. I could do this:

myString.scan(/[A-Z][a-z]+/)

But it only works for capitalized words. In cases like:

QuickFoxReadingPDF

or

LazyDogASAPSleep

the all-capital acronyms are missing from the result.

What should I change the RegEx to, or are there any alternatives?

Thanks!

Update:

Later I found that some of my data has digits, like "RabbitHole3". It would be great if the solution could handle digits, i.e. ["Rabbit", "Hole3"].

There are 2 answers

Ryszard Czech (best answer)

Use

s.split(/(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})/)

Explanation

--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    \p{Ll}                 any lowercase letter
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \p{Lu}                 any uppercase letter
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
    \p{Lu}                 any uppercase letter
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \p{Lu}\p{Ll}           any uppercase letter, any lowercase letter
--------------------------------------------------------------------------------
  )                        end of look-ahead

Ruby code:

str = 'QuickFoxReadingPDF'
p str.split(/(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})/)

Results: ["Quick", "Fox", "Reading", "PDF"]
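The same split can be checked against the other inputs from the question. This is only a sketch: `WORD_BOUNDARY` and `split_words` are hypothetical names introduced here for illustration, not part of the original answer. The digit case appears to work because the only boundary in "RabbitHole3" is the lowercase-to-uppercase one, so the trailing digit stays attached to the preceding word.

```ruby
# Pattern from the answer above: split at a lowercase-to-uppercase boundary,
# or between two uppercase letters when the second one starts a new word.
WORD_BOUNDARY = /(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})/

# split_words is a hypothetical helper name, not from the original answer.
def split_words(str)
  str.split(WORD_BOUNDARY)
end

p split_words('QuickFoxReadingPDF') # => ["Quick", "Fox", "Reading", "PDF"]
p split_words('LazyDogASAPSleep')   # => ["Lazy", "Dog", "ASAP", "Sleep"]
p split_words('RabbitHole3')        # => ["Rabbit", "Hole3"]
```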

The fourth bird

The pattern [A-Z][a-z]+ matches a single uppercase char A-Z followed by one or more lowercase chars a-z, so it does not take runs of multiple uppercase chars into account.

In this case, you also want to match an uppercase char when it is not directly followed by a lowercase char a-z.

I'm not sure whether an acronym can consist of a single uppercase char, but if there should be at least 2 uppercase chars:

[A-Z][a-z]+|[A-Z]{2,}(?![a-z])
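Used with scan, the pattern above can be tried on the inputs from the question. This is a rough sketch: the constant names are made up for illustration, and `WITH_DIGITS` is an assumed extension (an optional `\d*` after each alternative) that is not part of the original pattern. It is one possible way to keep trailing digits such as the "3" in "RabbitHole3", which the original pattern drops.

```ruby
# Pattern from the answer: a capitalized word, or a run of 2+ uppercase
# letters that is not directly followed by a lowercase letter.
ACRONYM_AWARE = /[A-Z][a-z]+|[A-Z]{2,}(?![a-z])/

# Assumed variant (not in the original answer): also keep trailing digits.
WITH_DIGITS = /[A-Z][a-z]+\d*|[A-Z]{2,}(?![a-z])\d*/

p 'QuickFoxReadingPDF'.scan(ACRONYM_AWARE) # => ["Quick", "Fox", "Reading", "PDF"]
p 'LazyDogASAPSleep'.scan(ACRONYM_AWARE)   # => ["Lazy", "Dog", "ASAP", "Sleep"]
p 'RabbitHole3'.scan(ACRONYM_AWARE)        # => ["Rabbit", "Hole"] (digit dropped)
p 'RabbitHole3'.scan(WITH_DIGITS)          # => ["Rabbit", "Hole3"]
```

Note that scan discards any characters the pattern does not match, whereas a split-based approach keeps every character of the input.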
