I'm trying to extract phone numbers in all formats (international and otherwise) in R.

Example data:

phonenum_txt <- "sDlkjazsdklzdjsdasz+49 123 999dszDLJhfadslkjhds0001 123.456sL:hdLKJDHS+31 (0) 8123zsKJHSDlkhzs&^#%Q(999)9999999adlfkhjsflj(999)999-9999sDLKO*$^9999999999adf;jhklslFjafhd9999999999999zdlfjx,hafdsifgsiaUDSahj"

I'd like:

extract_vector
[1] "+49 123 999"
[2] "0001 123.456"
[3] "+31 (0) 8123"
[4] "(999)9999999"
[5] "(999)999-9999"
[6] "9999999999"
[7] "9999999999999"

I tried using:

extract_vector <- str_extract_all(phonenum_txt,"^(?:\\+\\d{1,3}|0\\d{1,3}|00\\d{1,2})?(?:\\s?\\(\\d+\\))?(?:[-\\/\\s.]|\\d)+$")

which I got from HERE, but my regex skills aren't good enough to convert it to make it work in R.

Thanks!

2 Answers

Emma On

Your sample data doesn't look realistic, but this expression might help you design one that matches your string:

(?=.+[0-9]{2,})([0-9+\.\-\(\)\s]+)

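I have added an extra boundary, the lookahead (?=.+[0-9]{2,}), which requires at least two consecutive digits somewhere further along in the input. Adding a boundary like this is usually a good idea when inputs are complex.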

You might add or remove boundaries, if you wish. For instance, this expression might work as well:

([0-9+\.\-\(\)\s]+)
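
As a minimal sketch of how this second expression could be used from R (assuming the stringr package and the sample text from the question; the variable names are only illustrative), note that every backslash in the pattern has to be doubled inside an R string literal:

```r
library(stringr)

# Sample text from the question
phonenum_txt <- "sDlkjazsdklzdjsdasz+49 123 999dszDLJhfadslkjhds0001 123.456sL:hdLKJDHS+31 (0) 8123zsKJHSDlkhzs&^#%Q(999)9999999adlfkhjsflj(999)999-9999sDLKO*$^9999999999adf;jhklslFjafhd9999999999999zdlfjx,hafdsifgsiaUDSahj"

# Same expression as above, with each backslash doubled for the R string
pattern <- "([0-9+\\.\\-\\(\\)\\s]+)"

# str_extract_all() returns a list with one element per input string,
# so [[1]] gives the character vector for our single string
extract_vector <- str_extract_all(phonenum_txt, pattern)[[1]]
extract_vector
```

On the sample string this character class only ever matches inside the phone numbers. On real text it would also pick up stray digits, spaces, or punctuation, so treat it as a starting point.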

Or you can add left and right boundaries to it, for instance if all phone numbers are wrapped in letters (use the case-insensitive flag so that [a-z] also covers uppercase letters):

[a-z]([0-9+\.\-\(\)\s]+)[a-z]

Your desired output is then in the first capturing group, which you can reference as $1.
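
R itself has no $1 variable. One way to reach the capturing group, sketched here under the assumption that stringr is available, is str_match_all(), whose result matrix holds the full match in column 1 and the first group in column 2. The sketch uses [A-Za-z] instead of [a-z] so that uppercase neighbours count as boundaries too; that is an adaptation made for this example, not part of the expression above.

```r
library(stringr)

# Sample text from the question
phonenum_txt <- "sDlkjazsdklzdjsdasz+49 123 999dszDLJhfadslkjhds0001 123.456sL:hdLKJDHS+31 (0) 8123zsKJHSDlkhzs&^#%Q(999)9999999adlfkhjsflj(999)999-9999sDLKO*$^9999999999adf;jhklslFjafhd9999999999999zdlfjx,hafdsifgsiaUDSahj"

# A letter, then the captured run of phone-number characters, then another letter
pattern <- "[A-Za-z]([0-9+\\.\\-\\(\\)\\s]+)[A-Za-z]"

# One matrix per input string; rows are matches
matches <- str_match_all(phonenum_txt, pattern)[[1]]

# Column 1 is the whole match (including the letters), column 2 is group 1
captured <- matches[, 2]
captured
```

A number that is not directly enclosed by letters, such as the 9999999999 preceded by ^ in the sample, is not matched by this stricter pattern, which illustrates why the boundaries should be chosen from real data.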

Regular expressions are easiest to design when real data is available.

Pushpesh Kumar Rajwanshi On

You can use this regex to match and extract all the phone numbers in your string:

(?: *[-+().]? *\d){6,14}

The idea behind this regex is to allow at most one character from the set [-+().] before each digit, since these characters can appear within a phone number. If your phone numbers can contain other characters, such as { or } or [ or ], you can add them to this set. The optional character may be surrounded by optional spaces, hence the space-star before and after the set, and the group ends with \d to match one digit. The whole group is quantified with {6,14}, so it must repeat at least 6 and at most 14 times; in other words, a match contains 6 to 14 digits. You can adjust these bounds to your needs. I chose 6 as the lower bound because it covers the shortest number in your sample data (although in practice I think the real minimum is 7 or 8, as for Singapore, but that's up to you).
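
To make the pieces easier to see, the pattern can be assembled from its parts. This is only a sketch: phone_pattern() is a hypothetical helper, not a stringr function, and the test string is made up for illustration.

```r
library(stringr)

# Hypothetical helper: builds the pattern with configurable digit bounds
phone_pattern <- function(min_digits = 6, max_digits = 14) {
  prefix <- " *[-+().]? *"  # optional spaces, at most one of - + ( ) ., optional spaces
  digit  <- "\\d"           # exactly one digit
  paste0("(?:", prefix, digit, "){", min_digits, ",", max_digits, "}")
}

phone_pattern()  # the same pattern as above

# Made-up input: one 6-digit run and one 8-digit run
txt <- "a123456b12345678c"
default_hits <- str_extract_all(txt, phone_pattern())[[1]]       # both runs qualify
strict_hits  <- str_extract_all(txt, phone_pattern(7, 14))[[1]]  # only the 8-digit run
```

Raising the lower bound is the simplest way to cut down on false positives such as short ID numbers, at the cost of missing genuinely short phone numbers.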

R Code Demo

library(stringr)
phonenum_txt <- "sDlkjazsdklzdjsdasz+49 123 999dszDLJhfadslkjhds0001 123.456sL:hdLKJDHS+31 (0) 8123zsKJHSDlkhzs&^#%Q(999)9999999adlfkhjsflj(999)999-9999sDLKO*$^9999999999adf;jhklslFjafhd9999999999999zdlfjx,hafdsifgsiaUDSahj"
# Each match is 6 to 14 digits, each optionally preceded by spaces and one of - + ( ) .
str_match_all(phonenum_txt, "(?: *[-+().]? *\\d){6,14}")

This prints all the required numbers:

[[1]]
     [,1]           
[1,] "+49 123 999"  
[2,] "0001 123.456" 
[3,] "+31 (0) 8123" 
[4,] "(999)9999999" 
[5,] "(999)999-9999"
[6,] "9999999999"   
[7,] "9999999999999"
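
Since the question asks for a plain character vector, a small variation is sketched below. str_extract_all() returns the matches without the matrix wrapper, and base R's regmatches() combined with gregexpr(perl = TRUE) should give the same result without stringr. The variable names are illustrative.

```r
library(stringr)

# Sample text from the question
phonenum_txt <- "sDlkjazsdklzdjsdasz+49 123 999dszDLJhfadslkjhds0001 123.456sL:hdLKJDHS+31 (0) 8123zsKJHSDlkhzs&^#%Q(999)9999999adlfkhjsflj(999)999-9999sDLKO*$^9999999999adf;jhklslFjafhd9999999999999zdlfjx,hafdsifgsiaUDSahj"
pattern <- "(?: *[-+().]? *\\d){6,14}"

# stringr: a list of character vectors, one per input string
extract_vector <- str_extract_all(phonenum_txt, pattern)[[1]]

# Base R alternative; perl = TRUE enables (?: ) groups and \d
extract_base <- regmatches(phonenum_txt,
                           gregexpr(pattern, phonenum_txt, perl = TRUE))[[1]]

extract_vector
```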