Given a big file (~1,000,000 lines) with the following format:

1.xml:LINK-ID-12$LANG,LINK-ID-242$LANG,____de-DE
2.xml:LINK-ID-323$LANG,LINK-ID-122$LANG,____en-GB

After processing the result should be

1.xml:LINK-ID-12#de-DE,LINK-ID-242#de-DE
2.xml:LINK-ID-323#en-GB,LINK-ID-122#en-GB

The last element in a line always contains the language. The format of this element is free to choose; for demo purposes it is ____<LANG>.

The placeholder that gets replaced with the language ($LANG here) is also free to choose.

Removing the last entry in the line is not a big deal; I'm really looking for a solution for the replacement.

If possible, I'm looking for a solution that does not require bash to iterate over the whole file, maybe something with awk/sed/grep (for speed).

2 Answers

Ed Morton
$ awk 'BEGIN{FS=OFS=","} {sub(/^_+/,"#",$3); gsub(/\$LANG/,$3); print $1, $2}' file
1.xml:LINK-ID-12#de-DE,LINK-ID-242#de-DE
2.xml:LINK-ID-323#en-GB,LINK-ID-122#en-GB
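
The command above addresses exactly three comma-separated fields ($1, $2, $3). If lines can carry a different number of links, a variant keyed on the last field might look like the sketch below. The sample input and the variable names (input, result, lang) are made up for illustration, and it assumes the language value contains no & or backslash, since awk treats those specially in a replacement string.

```shell
# Hypothetical sample input with a varying number of fields per line.
input='1.xml:LINK-ID-12$LANG,LINK-ID-242$LANG,____de-DE
2.xml:LINK-ID-323$LANG,____en-GB'

result=$(printf '%s\n' "$input" | awk -F, '
{
    lang = $NF                 # last field holds the language marker
    sub(/^_+/, "#", lang)      # "____de-DE" -> "#de-DE"
    sub(/,[^,]*$/, "")         # strip the last field from the line
    gsub(/\$LANG/, lang)       # replace every placeholder that remains
    print
}')
printf '%s\n' "$result"
```

For the two sample lines this should print 1.xml:LINK-ID-12#de-DE,LINK-ID-242#de-DE and 2.xml:LINK-ID-323#en-GB. The last field is stripped with sub() rather than by decrementing NF, because changing NF is not handled the same way by every awk implementation.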
melpomene

If a Perl solution is acceptable:

perl -pe 's/,____([^,\n]+)$// or next; my $x = $1; s/\$LANG\b/#$x/g'

If you can change the input so it doesn't have those four underscores in the last field, it would simplify the code a bit (just remove ____ from the first regex).

Idea:

For every input line, match the last field (a comma, followed by four underscores, followed by one or more characters that are neither a comma nor a newline, followed by the end of the line) and remove it (replace it with nothing). The newline is excluded because -p leaves it attached to each line, and a plain [^,]+ would capture it along with the language. If this replacement fails, leave the line unchanged and go to the next line.

If the replacement was successful, capture the contents of the removed field (minus the four leading underscores) in $1 and copy the value into $x for the next substitution.

Then scan over the remaining line again and replace every occurrence of $LANG as a word (i.e. not $LANGS or $LANGUAGE) by a #, followed by the extracted string $x.