Regex to match repeated substring in Google Sheets

Question

Regex to match repeated substring in Google Sheets

62 views Asked by Hernán Ruiz At 31 March 2024 at 13:44

I have data that always starts with a substring repeated twice without a delimiter, and then other data that I don't care about. The length of the repeated substring varies, and in the example below I'm using mostly [a-z] characters for the sake of simplicity, but the repeated substring is mostly unicode squiggles in the real dataset.

my data	what I want to extract
`johnjohnsajoalsas`	`john`
`peterpeteraaksoskco`	`peter`
`a8co.a8co.robinson`	`a8co.`
`robrob7s:s7`	`rob`
`dkoisawks`	`\[null\]`

This can be done easily with a positive lookahead

^(.+)(?=\1)

or directly referencing the capture group like this

^(.+)\1

However, Google Sheets doesn't support either of these.

Any help will be greatly appreaciated.

Original Q&A

There are 3 answers

Patrick Janser On 31 March 2024 at 14:25

I had a try with Google Sheets and used "search and replace" with ^(.{2,})\1(.*)$ and it worked.

I tested (.*)\1 and also (.*?)\1 as you mentioned, and effectively it doesn't work.

It should also work with ^(.+?)\1. So the problem was just that * would match zero times and + would force at least to match one char.

So up to you to decide, but I think that a first name should be probably more than 2 characters, so this is why I would prefer using .{2,} or even \S{2,} to be sure to match non-spaces.

Search: ^(\S{2,})\1(.*)$

Replace: $1

z.. On 31 March 2024 at 14:35

Another solution:

=LET(s,SEQUENCE(INT(LEN(A2)/2)),FILTER(LEFT(A2,s),LEFT(A2,s)=MID(A2,s+1,s)))

The idea is the same as the other user. Generate all the potentially matching strings of length at most INT(LEN(str)/2) and only keep the ones that match, if there are any.

For example, for johnjohnsajoalsas we check the following strings:

We see there is a match at row 4 so we keep that value and remove everything else.

We could also do this with a single formula in C2 that doesn't require to be dragged using the MAP function.

=MAP(A2:A,LAMBDA(str,IF(str="",,LET(s,SEQUENCE(INT(LEN(str)/2)),
   FILTER(LEFT(str,s),LEFT(str,s)=MID(str,s+1,s))))))

**rockinfreakshow** · Accepted Answer · 2024-03-31T14:07:44+00:00

rockinfreakshow On 31 March 2024 at 14:07 BEST ANSWER

Here's one approach (non-regex) you may test out in Sheets:

=let(Λ,A2, Σ,map(sequence(rounddown(len(Λ)/2)),lambda(Σ,{left(Λ,Σ),left(Λ,Σ)=mid(Λ,Σ+1,Σ)})),
    ifna(+filter(Σ,index(Σ,,2)),"-"))

lists all probabilities using left() and mid() to filter out the possible match if any...

TechQA.

Regex to match repeated substring in Google Sheets

There are 3 answers

Related Questions in STRING

Related Questions in GOOGLE-SHEETS

Popular Questions

Trending Questions