As a beginner, I'm trying to solve the following problem (with a Bash or Python script).
The file (~50 GB!):
marker
xxx
xxx
xxx
pattern
marker
xxx
xxx
xxx
marker
xxx
xxx
xxx
pattern
I would like to find a way to remove the lines between two markers, plus the first of the two markers (but not the second one), IF no pattern is found anywhere in those lines.
Wanted result:
marker
xxx
xxx
xxx
pattern
[empty!]
marker
xxx
xxx
xxx
pattern
I tried to solve it with regex or awk (it's a very timid start):
awk '/marker/{f=1} f; /marker/{f=1}' file
but I'm having a hard time understanding how to turn that into something that solves the entire problem. It would make me very happy if someone could help me with that!
Cheers
Here's a way to do it in Python. Treat `marker` as a separator, then remove any of the text snippets in between that don't contain `pattern`.
Edit: the `or not entry` bit in the list comprehension just handles the case where `marker` is the first line in the file.
Edit 2: Here's a streaming version (better suited for large files). It uses `islice` from `itertools` to get `n` lines of the file at a time. The rest of the algorithm is more or less the same.
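A minimal sketch of the two variants described above, not necessarily the original code. The function names, the default chunk size `n`, and the literal strings `marker` and `pattern` are placeholders chosen for illustration. Both functions assume the input starts with a marker line and that `marker` and `pattern` can be matched as plain substrings.

```python
from itertools import islice

MARKER = "marker"    # placeholder: the real separator text goes here
PATTERN = "pattern"  # placeholder: the text a block must contain to be kept


def filter_blocks_in_memory(text, marker=MARKER, pattern=PATTERN):
    """Split on the marker and keep only the snippets that contain the pattern.

    Loads everything into memory, so it only suits small files.
    """
    entries = text.split(marker)
    # `or not entry` keeps the empty string produced when the text starts
    # with the marker, so that join() puts a marker back at the very start.
    kept = [entry for entry in entries if pattern in entry or not entry]
    return marker.join(kept)


def filter_blocks_stream(lines, marker=MARKER, pattern=PATTERN, n=1000):
    """Yield the lines of every block that contains the pattern.

    Reads the input n lines at a time and buffers at most one block.
    """
    lines = iter(lines)
    buffered = []   # lines of the current block, held until the pattern shows up
    keep = False    # True once the current block is known to contain the pattern
    while True:
        chunk = list(islice(lines, n))  # next n lines (fewer at end of input)
        if not chunk:
            break
        for line in chunk:
            if marker in line:
                # A new block starts; anything still buffered had no pattern
                # and is dropped.
                buffered = [line]
                keep = False
            elif keep:
                yield line
            else:
                buffered.append(line)
            if not keep and pattern in line:
                # The block qualifies: flush what was held back.
                keep = True
                yield from buffered
                buffered = []
    # Lines still buffered here belong to a final block without the pattern.
```

For a large file, the streaming function can be fed an open file object directly and its output passed to `writelines()` on the destination file, so only one block is held in memory at a time. A block without the pattern is buffered in full until the next marker, so memory use is bounded by the largest such block rather than by `n`.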