Matching strings EXCEPT on lines starting with a specific tag


I am not a programmer, so I apologize if my question is a bit too basic.

I am a translator and have an XLIFF (for our purposes, plain text) document that is roughly structured like this:

<source>For workers in the rest of the state, the minimum wage will increase to $9.70 at the end of 2016, then another .70 each year after until reaching $12.50 on 12/31/2020 – after which the minimum wage will continue to increase to $15 on an indexed schedule.</source>
<target>Для работников остальной части штата минимальная ставка оплаты труда поднимется до $9,70 в конце 2016 года, а затем будет расти на $0,70 ежегодно, достигнув размера в $12,50 31 декабря 2020 года, после чего минимальная ставка будет продолжать повышаться до $15 на основании графика.</target>

I am trying to capture all instances of dollar amounts in the <target> segments, that is, a dollar sign followed by one or two digits, optionally followed by a comma and two more digits.

The purpose is to eventually replace these expressions using regex find and replace in Notepad++.

So far, I've tested the following expression (accounting for the stray period in place of the comma):

(\$\d+(\,|\.)?\d*\d*)

and it returned all dollar amounts, including those in the <source> segments. Based on my searches here, I tried to exclude these using lookbehinds but failed to get the desired results. I'll spare you my failed attempts.

What's a good way of achieving this?

Thank you!

Answer by Quixrick (best answer)

Okay, well this is tricky. It's easy to match the dollar amounts in your text with this:

(\$\d+(?:(?:\.|,)\d{2})?)
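As a quick sanity check outside Notepad++, the amount pattern can be tried in Python's re module, which handles this part of the syntax the same way. This is only a minimal sketch; the sample string is shortened from the <target> segment in the question.

```python
import re

# The amount pattern from above: "$", digits, then optionally "." or "," plus two digits.
AMOUNT = re.compile(r"(\$\d+(?:(?:\.|,)\d{2})?)")

# Shortened sample based on the <target> segment in the question.
target = (
    "поднимется до $9,70 в конце 2016 года, на $0,70 ежегодно, "
    "в $12,50 31 декабря, до $15 на основании графика."
)

# findall returns the text of the single capture group for each match.
amounts = AMOUNT.findall(target)
print(amounts)
```

Amounts with a decimal part and bare amounts such as $15 should both be picked up, while the plain year 2016 is ignored because it has no dollar sign.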

But if you only want to match after a certain point, you can match the stuff before it and then throw it away by using \K. So this will match all of the source stuff and the opening target tag:

<source>.*?</source>\s*<target>\K

Because of the \K, the reported match starts from there. Adding a .*? before the dollar-amount capture group then lets the match reach the first amount in the target. However, if you want to capture more than one amount, you need to reuse the first group's pattern. You can do that with the (?1) syntax, which runs the pattern of the first capture group again (the pattern itself, not the text it captured).
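To see what \K discards, here is a small illustration in Python. Python's built-in re module does not support \K, so this sketch imitates it by matching the prefix and then looking only at what follows; the sample text is shortened from the question.

```python
import re

# Same prefix as above, minus the \K: everything up to and including <target>.
PREFIX = re.compile(r"<source>.*?</source>\s*<target>")

sample = (
    "<source>increase to $9.70 at the end of 2016</source>\n"
    "<target>поднимется до $9,70 в конце 2016 года</target>"
)

m = PREFIX.search(sample)
# With \K, the reported match would begin here, right after <target>.
after_k = sample[m.end():]
print(after_k)
```

Everything in the <source> segment is still matched, but it is no longer part of what gets reported (or replaced).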

If you put it all together, you would end up with something like this:

<source>.*?</source>\s*<target>\K(?:.*?)(\$\d+(?:(?:\.|,)\d{2})?)|((?1))

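Keep in mind that the second alternative, ((?1)), matches a dollar amount anywhere, so this works only because the first alternative consumes each <source> segment on its way to the <target>. If a <target> contains no amount at all, amounts in a <source> can end up being matched too. Hopefully that gets you going in the right direction.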
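If a script is an option, the same goal can be reached without \K or (?1) by working in two steps: first isolate each <target> segment, then search for amounts only inside it. The following is a minimal sketch in Python, not the Notepad++ approach described above. The function names find_target_amounts and replace_in_targets are made up for illustration, the code assumes <target> elements carry no attributes and are not nested, and the replacement shown is an arbitrary placeholder because the question does not say what the amounts should become.

```python
import re

AMOUNT = re.compile(r"\$\d+(?:[.,]\d{2})?")
# DOTALL lets a <target> segment span several lines.
TARGET = re.compile(r"(<target>)(.*?)(</target>)", re.DOTALL)

def find_target_amounts(text):
    """Return every dollar amount found inside <target> segments."""
    found = []
    for segment in TARGET.finditer(text):
        found.extend(AMOUNT.findall(segment.group(2)))
    return found

def replace_in_targets(text, repl):
    """Apply repl (a string or function, as in re.sub) to amounts in <target> only."""
    def fix_segment(segment):
        body = AMOUNT.sub(repl, segment.group(2))
        return segment.group(1) + body + segment.group(3)
    return TARGET.sub(fix_segment, text)

sample = (
    "<source>increase to $9.70, then to $12.50, then to $15.</source>\n"
    "<target>до $9,70, затем до $12,50, затем до $15.</target>\n"
)

print(find_target_amounts(sample))
# Placeholder replacement (an assumption): move the dollar sign after the number.
print(replace_in_targets(sample, lambda m: m.group(0)[1:] + " $"))
```

Because the <source> text is never passed to the amount pattern, this does not depend on every <target> containing an amount.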
