Regex to match repeated substring in Google Sheets

62 views Asked by At

I have data that always starts with a substring repeated twice without a delimiter, and then other data that I don't care about. The length of the repeated substring varies, and in the example below I'm using mostly [a-z] characters for the sake of simplicity, but the repeated substring is mostly unicode squiggles in the real dataset.

my data what I want to extract
johnjohnsajoalsas john
peterpeteraaksoskco peter
a8co.a8co.robinson a8co.
robrob7s:s7 rob
dkoisawks \[null\]

This can be done easily with a positive lookahead

^(.+)(?=\1)

or directly referencing the capture group like this

^(.+)\1

However, Google Sheets doesn't support either of these.

Any help will be greatly appreaciated.

3

There are 3 answers

1
rockinfreakshow On BEST ANSWER

Here's one approach (non-regex) you may test out in Sheets:

=let(Λ,A2, Σ,map(sequence(rounddown(len(Λ)/2)),lambda(Σ,{left(Λ,Σ),left(Λ,Σ)=mid(Λ,Σ+1,Σ)})),
    ifna(+filter(Σ,index(Σ,,2)),"-"))

enter image description here

  • lists all probabilities using left() and mid() to filter out the possible match if any...
0
Patrick Janser On

I had a try with Google Sheets and used "search and replace" with ^(.{2,})\1(.*)$ and it worked.

I tested (.*)\1 and also (.*?)\1 as you mentioned, and effectively it doesn't work.

It should also work with ^(.+?)\1. So the problem was just that * would match zero times and + would force at least to match one char.

So up to you to decide, but I think that a first name should be probably more than 2 characters, so this is why I would prefer using .{2,} or even \S{2,} to be sure to match non-spaces.

Search: ^(\S{2,})\1(.*)$

Replace: $1

0
z.. On

Another solution:

=LET(s,SEQUENCE(INT(LEN(A2)/2)),FILTER(LEFT(A2,s),LEFT(A2,s)=MID(A2,s+1,s)))

enter image description here

The idea is the same as the other user. Generate all the potentially matching strings of length at most INT(LEN(str)/2) and only keep the ones that match, if there are any.

For example, for johnjohnsajoalsas we check the following strings:

enter image description here

We see there is a match at row 4 so we keep that value and remove everything else.

We could also do this with a single formula in C2 that doesn't require to be dragged using the MAP function.

=MAP(A2:A,LAMBDA(str,IF(str="",,LET(s,SEQUENCE(INT(LEN(str)/2)),
   FILTER(LEFT(str,s),LEFT(str,s)=MID(str,s+1,s))))))

enter image description here