Remove sequential duplicates using regex (pipe delimited)

I have a pipe delimited list of phrases. I would like to remove sequential duplicates using a regex replace/substitution. For example:

dog|cat|cat woman|cat woman|dog|dog 
cat|cat|catman|catman|catman|cat woman|cat woman|dog|dogman|doggy

would be transformed into

dog|cat|cat woman|dog 
cat|catman|cat woman|dog|dogman|doggy

I am stuck. So far I have the pattern ((^|\|)([^\|]+))\1+ with $1 as the substitution, but it does not work. The output is

dog|cat woman|cat woman|dog 
cat|catman|catman|cat woman|dogman|doggy
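The output above can be reproduced with a short sketch. The question does not name a regex flavor, so Python's `re` module is an assumption here, as is the helper name `attempt_dedupe`; Python writes the replacement as `\1` rather than `$1`. Two things seem to go wrong: group 1 includes the leading delimiter, so the first item on a line (matched via `^`) can never be followed by a bare copy of itself, and the backreference has no right-hand boundary, so `|cat` also matches the start of `|cat woman`.

```python
import re

# Pattern from the question. Group 1 captures the delimiter together with the item.
ATTEMPT = re.compile(r"((^|\|)([^\|]+))\1+")


def attempt_dedupe(line: str) -> str:
    """Apply the question's attempted substitution to a single line (hypothetical helper)."""
    return ATTEMPT.sub(r"\1", line)


# "|cat|cat" (the second "cat" being the start of "cat woman") collapses to "|cat",
# which swallows a field boundary instead of removing a duplicate.
print(attempt_dedupe("dog|cat|cat woman|cat woman|dog|dog"))
# prints: dog|cat woman|cat woman|dog
```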

Thanks for your help

There is 1 answer

Answered by The fourth bird

You can set boundaries on the left and the right to prevent partial matches when using the capture group and the backreference.

If a lookbehind assertion is supported:

(?<![^|\n])([^|\n]+)(?:\|\1)+(?![^|\n])

The pattern matches:

  • (?<![^|\n]) Negative lookbehind: the character directly to the left must not be anything other than | or a newline, so the match starts at the start of the string, after a |, or after a newline
  • ([^|\n]+) Capture group 1: match 1 or more characters other than | or a newline, so the match cannot cross fields or lines
  • (?:\|\1)+ Non-capturing group repeated 1 or more times: match | followed by the same text as group 1
  • (?![^|\n]) Negative lookahead: the character directly to the right must not be anything other than | or a newline, so the match ends before a |, before a newline, or at the end of the string

In the replacement you can use capture group 1.

Output

dog|cat|cat woman|dog
cat|catman|cat woman|dog|dogman|doggy
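
As a minimal sketch of applying the lookbehind pattern and the group 1 replacement, assuming Python's `re` module (the answer does not name a flavor, and the function name `remove_sequential_duplicates` is made up for illustration). Python only allows fixed-width lookbehinds, which `(?<![^|\n])` satisfies because it inspects a single character.

```python
import re

# Pattern from the answer: an item bounded on both sides, followed by
# one or more "|item" repeats of itself.
DEDUPE = re.compile(r"(?<![^|\n])([^|\n]+)(?:\|\1)+(?![^|\n])")


def remove_sequential_duplicates(text: str) -> str:
    """Collapse runs of identical adjacent pipe-delimited items (hypothetical helper)."""
    # Replace each run with a single copy of the item (capture group 1).
    return DEDUPE.sub(r"\1", text)


sample = (
    "dog|cat|cat woman|cat woman|dog|dog\n"
    "cat|cat|catman|catman|catman|cat woman|cat woman|dog|dogman|doggy"
)
print(remove_sequential_duplicates(sample))
# prints:
# dog|cat|cat woman|dog
# cat|catman|cat woman|dog|dogman|doggy
```

Because newlines are excluded from the character classes, the whole multi-line text can be passed in one call without a run spilling from one line into the next.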

With thanks to Casimir et Hippolyte for the great improvement.