How can I achieve the equivalent of:
tail -f file.txt | grep 'regexp'
so that only the lines matching a regular expression such as 'Result' are output,
for a file of this type:
$ file file.txt
file.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators
Example of the tail -f stream content, converted to UTF-8:
Package end.
Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.
Result: Success
Awk?
The problems in piping to grep led me to awk as a one-stop-shop solution for stripping the offending characters and also producing matched lines from a regex.
awk seems to be giving the most promising results; however, I am finding that it returns the whole stream rather than individual matching lines:
tail -f file.txt | awk '{sub("/[^\x20-\x7F]/", "");/Result/;print}'
Package end.
Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.
Result: Success
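A possible explanation for the awk behaviour above, offered as a sketch rather than a confirmed fix: inside the braces, /Result/; is evaluated and its result discarded, and the print that follows runs for every record, so every line is printed. The first argument to sub is also a quoted string, so its slashes are treated as literal characters of the pattern. The helper below (match_lines is a hypothetical name, not from the post) uses the pattern as the condition for printing instead, and assumes the text is ASCII-only UTF-16, so that deleting NUL and CR bytes is enough to recover plain lines.

```shell
# Hypothetical helper (not from the original post): print only the input
# lines that match the regular expression passed as $1.
# Assumption: the text is ASCII-only UTF-16, so deleting NUL and CR bytes
# leaves plain ASCII lines.
match_lines() {
    tr -d '\000\r' | awk -v re="$1" '$0 ~ re'
}

# Two UTF-16LE lines written byte by byte; only the matching one is printed.
printf 'T\000o\000t\000a\000l\000\r\000\n\000R\000e\000s\000u\000l\000t\000\r\000\n\000' | match_lines 'Result'
```

On a live stream, tr may hold its output in a buffer when writing to a pipe. With GNU coreutils (an assumption about the environment), something like tail -f file.txt | stdbuf -oL tr -d '\000\r' | awk '/Result/' might avoid that delay.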
What I have tried
- converting the stream and piping to grep:
tail -f file.txt | iconv -t UTF-8 | grep 'regexp'
- using luit to change the terminal encoding, as per this post:
luit -encoding UTF-8 -- tail -f file.txt | grep 'regexp'
- deleting non-ASCII characters, as described here, then piping to grep:
tail -f file.txt | tr -d '[^\x20-\x7F]' | grep 'regexp'
tail -f file.txt | sed 's/[^\x00-\x7F]//' | grep 'regexp'
- various combinations of the above using the grep flags --line-buffered and -a, as well as sed -u
- using luit -encoding UTF-8 -- prepended to the above
- using a file with the same encoding containing the regular expression for grep -f
Why they failed
- For most attempts, nothing is printed to the screen, because grep searches for 'regexp' when in fact the text is something like '\x00r\x00e\x00g\x00e\x00x\x00p'. For example, 'R' will return the line 'Result: Success' but 'Result' won't.
- If a full regular expression gets a match, such as when using grep -f, it returns the whole stream rather than just the matched lines.
- Piping through sed, tr or iconv seems to break the pipe to grep, and grep still seems able to match only individual characters.
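One likely reason the iconv attempt in the list above did not help: without -f, iconv assumes the input is in the current locale's encoding rather than UTF-16. A minimal sketch, assuming the stream is UTF-16LE and starts on a character boundary (utf16le_grep is a hypothetical helper name):

```shell
# Hypothetical helper: decode UTF-16LE from stdin, drop the CR of each CRLF,
# and print only the lines matching the regular expression in $1.
# Assumption: the input starts on a 2-byte character boundary.
utf16le_grep() {
    iconv -f UTF-16LE -t UTF-8 | tr -d '\r' | grep -- "$1"
}

# Two UTF-16LE lines written byte by byte; only the matching one is printed.
printf 'R\000e\000s\000u\000l\000t\000\r\000\n\000E\000n\000d\000\r\000\n\000' | utf16le_grep 'Result'
```

On a live tail -f stream, iconv and tr may hold output back in their buffers, so matches can appear late or only when the stream ends.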
Edit
I looked at the raw file in its UTF-16 format using xxd, with the aim of using a regex to match the encoding, which gave the following output:
$ tail file.txt | xxd
00000000: 0050 0061 0063 006b 0061 0067 0065 0020 .P.a.c.k.a.g.e.
00000010: 0065 006e 0064 002e 000d 000a 000d 000a .e.n.d..........
00000020: 0054 006f 0074 0061 006c 0020 0077 0061 .T.o.t.a.l. .w.a
00000030: 0072 006e 0069 006e 0067 0073 003a 0020 .r.n.i.n.g.s.:.
00000040: 0034 0030 000d 000a 0054 006f 0074 0061 .4.0.....T.o.t.a
00000050: 006c 0020 0065 0072 0072 006f 0072 0073 .l. .e.r.r.o.r.s
00000060: 003a 0020 0030 000d 000a 0045 006c 0061 .:. .0.....E.l.a
00000070: 0070 0073 0065 0064 0020 0074 0069 006d .p.s.e.d. .t.i.m
00000080: 0065 003a 0020 0032 0034 002e 0034 0032 .e.:. .2.4...4.2
00000090: 0036 0037 0031 0039 0032 0020 0073 0065 .6.7.1.9.2. .s.e
000000a0: 0063 0073 002e 000d 000a 002e 002e 002e .c.s............
000000b0: 0050 0061 0063 006b 0061 0067 0065 0020 .P.a.c.k.a.g.e.
000000c0: 0045 0078 0065 0063 0075 0074 0065 0064 .E.x.e.c.u.t.e.d
000000d0: 002e 000d 000a 000d 000a 0052 0065 0073 ...........R.e.s
000000e0: 0075 006c 0074 003a 0020 0053 0075 0063 .u.l.t.:. .S.u.c
000000f0: 0063 0065 0073 0073 000d 000a 000d 000a .c.e.s.s........
00000100: 00
I realised a simple regex to ignore any characters between letters in the search string might work...
This matches 'Result' whilst allowing any one character between each letter...
or, as per this answer, to avoid typing all the tedious dots...
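One detail in the xxd dump above may matter for any decoding approach: the dump starts with a lone 00 byte and has an odd length (0x101 bytes). In UTF-16LE a line feed is the byte pair 0a 00, and tail splits lines on the 0a byte, so its output begins with the second half of the preceding line feed. The sketch below reproduces this with a scratch file; the helper name and the file contents are hypothetical.

```shell
# Hypothetical helper: print the first byte of stdin as two hex digits.
first_byte_hex() {
    head -c 1 | od -An -tx1 | tr -d ' \n'
}

# Scratch file holding "A\r\nB\r\n" encoded as UTF-16LE without a BOM.
demo=$(mktemp)
printf 'A\000\r\000\n\000B\000\r\000\n\000' > "$demo"

# Line-based tail starts mid-character, on the 00 that ends the previous LF.
tail -n 2 "$demo" | first_byte_hex; echo
# Byte-based tail from offset 1 keeps the 2-byte alignment.
tail -c +1 "$demo" | first_byte_hex; echo

rm -f "$demo"
```

If this reading is right, following the file with tail -c +1 -f file.txt (or tail -c +3 -f file.txt to skip a 2-byte byte-order mark, if one is present) would hand a UTF-16LE decoder aligned input.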