grep and tail -f for a UTF-16 binary file - trying to use simple awk

Question


Asked by Alexander McFarlane on 23 June 2015 at 22:20

How can I achieve the equivalent of:

tail -f file.txt | grep 'regexp'

to only output the buffered lines that match a regular expression such as 'Result' from the file type:

$ file file.txt
file.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators

Example of the tail -f stream content below converted to utf-8:

Package end.

Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.

Result: Success

Awk?

The problems in piping to grep led me to awk as a one-stop-shop solution for stripping the offending characters and also printing the lines that match a regex.

awk seems to be giving the most promising results, however, I am finding that it returns the whole stream rather than individual matching lines:

tail -f file.txt | awk '{sub("/[^\x20-\x7F]/", "");/Result/;print}'
Package end.

Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.

Result: Success

What I have tried

converting the stream and piping to grep

tail -f file.txt | iconv -t UTF-8 | grep 'regexp'

using luit to change terminal encoding as per this post

luit -encoding UTF-8 -- tail -f file.txt | grep 'regexp'

delete non ASCII characters, described here, then piping to grep

tail -f file.txt | tr -d '[^\x20-\x7F]' | grep 'regexp'
tail -f file.txt | sed 's/[^\x00-\x7F]//' | grep 'regexp'

various combinations of the above using grep flags --line-buffered, -a as well as sed -u
using luit -encoding UTF-8 -- pre-pended to the above
using a file with the same encoding containing the regular expression for grep -f

Why they failed

In most attempts nothing is printed to the screen, because grep searches for 'regexp' when the text is actually something like '\x00r\x00e\x00g\x00e\x00x\x00p'. For example, 'R' returns the line 'Result: Success' but 'Result' does not.
When a full regular expression does match, as when using grep -f, grep returns the whole stream rather than only the matching lines.
Piping through sed, tr or iconv seems to break the pipe to grep, and grep still seems able to match only individual characters.
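The first point can be checked directly by looking at the bytes. The sketch below assumes iconv and od are available; utf16_hex is a hypothetical helper name, not part of any tool. It encodes a plain string as UTF-16LE and prints the bytes in hex, which shows the NUL byte that follows every ASCII letter.

```shell
# Hypothetical helper: encode the argument as UTF-16LE and print its bytes in hex.
utf16_hex() {
    printf '%s' "$1" \
        | iconv -f UTF-8 -t UTF-16LE \
        | od -An -tx1 \
        | tr -s ' \n' ' ' \
        | sed 's/^ //; s/ $//'
}

utf16_hex 'Result'
# prints: 52 00 65 00 73 00 75 00 6c 00 74 00
```

A byte-oriented search for the six contiguous bytes of 'Result' cannot match this sequence, while a single letter such as 'R' still can.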

Edit

I looked at the raw file in its UTF-16 format using xxd, with the aim of using a regex to match the encoding, which gave the following output:

$ tail file.txt | xxd
00000000: 0050 0061 0063 006b 0061 0067 0065 0020  .P.a.c.k.a.g.e.
00000010: 0065 006e 0064 002e 000d 000a 000d 000a  .e.n.d..........
00000020: 0054 006f 0074 0061 006c 0020 0077 0061  .T.o.t.a.l. .w.a
00000030: 0072 006e 0069 006e 0067 0073 003a 0020  .r.n.i.n.g.s.:.
00000040: 0034 0030 000d 000a 0054 006f 0074 0061  .4.0.....T.o.t.a
00000050: 006c 0020 0065 0072 0072 006f 0072 0073  .l. .e.r.r.o.r.s
00000060: 003a 0020 0030 000d 000a 0045 006c 0061  .:. .0.....E.l.a
00000070: 0070 0073 0065 0064 0020 0074 0069 006d  .p.s.e.d. .t.i.m
00000080: 0065 003a 0020 0032 0034 002e 0034 0032  .e.:. .2.4...4.2
00000090: 0036 0037 0031 0039 0032 0020 0073 0065  .6.7.1.9.2. .s.e
000000a0: 0063 0073 002e 000d 000a 002e 002e 002e  .c.s............
000000b0: 0050 0061 0063 006b 0061 0067 0065 0020  .P.a.c.k.a.g.e.
000000c0: 0045 0078 0065 0063 0075 0074 0065 0064  .E.x.e.c.u.t.e.d
000000d0: 002e 000d 000a 000d 000a 0052 0065 0073  ...........R.e.s
000000e0: 0075 006c 0074 003a 0020 0053 0075 0063  .u.l.t.:. .S.u.c
000000f0: 0063 0065 0073 0073 000d 000a 000d 000a  .c.e.s.s........
00000100: 00
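The dump shows every ASCII byte paired with a 00 byte, plus 0d 0a line endings. For a log that is known to contain only ASCII text, one possible sketch is to delete the NUL and carriage-return bytes before grep sees the stream. The helper name strip_utf16_ascii is hypothetical, and this approach would mangle any non-ASCII characters, so treat it as an assumption about the file's content rather than a general decoder.

```shell
# Hypothetical helper: drop NUL and CR bytes so ASCII-only UTF-16 reads as plain text.
# tr takes octal and backslash escapes, not regex syntax such as [^\x20-\x7F].
strip_utf16_ascii() {
    tr -d '\000\r'
}

# Finite sample standing in for the log file.
printf 'Total errors: 0\r\nResult: Success\r\n' \
    | iconv -f UTF-8 -t UTF-16LE \
    | strip_utf16_ascii \
    | grep 'Result'
# prints: Result: Success
```

Because the NUL bytes are deleted wherever they fall, it does not matter whether tail starts on an odd or even byte. With a live tail -f, tr may buffer its output when writing to a pipe, so matches can appear late; on GNU systems prefixing it with stdbuf -oL may help, but that is untested here.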


There are 3 answers

that other guy On 23 June 2015 at 23:15

The sloppiest solution that should work on Cygwin is fixing your awk statement:

tail -f file.txt | \
    LC_CTYPE=C awk '{ gsub("[^[:print:]]", ""); if($0 ~ /Result/) print; }'

This has a few bugs that cancel each other out, like tail cutting a UTF-16LE file in awkward places but awk stripping what we hope is garbage.
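One reason the original awk program printed the whole stream is that, inside an action block, a bare /Result/ is just an expression whose value is thrown away, so the print that follows runs for every line. A minimal sketch on plain ASCII input (the function names are made up for illustration):

```shell
sample='Total errors: 0
Result: Success'

# Inside the braces, /Result/ is evaluated and discarded; print runs for every line.
unconditional() {
    awk '{ /Result/; print }'
}

# Used as a pattern in front of the action, /Result/ selects which lines are printed.
conditional() {
    awk '/Result/ { print }'
}

printf '%s\n' "$sample" | unconditional   # prints both lines
printf '%s\n' "$sample" | conditional     # prints only "Result: Success"
```

The if ($0 ~ /Result/) print form used in the fixed statement has the same effect as the pattern form.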

A robust solution might be:

tail -c +1 -f file.txt | \
    script -qc 'iconv -f UTF-16LE -t UTF-8' /dev/null | grep Result

but it reads the entire file, and I don't know how well using script to convince iconv not to buffer works on Cygwin (it would work on GNU/Linux).
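To check the decoding step on its own, without tail -f or script, here is a sketch that runs the same conversion over a finite sample. The helper name match_utf16le and the sample text are assumptions made for illustration; the extra tr -d '\r' only removes the carriage return left over from the CRLF line endings.

```shell
# Hypothetical helper: decode UTF-16LE from stdin, drop CRs, keep lines matching $1.
match_utf16le() {
    iconv -f UTF-16LE -t UTF-8 | tr -d '\r' | grep -- "$1"
}

# Finite sample standing in for file.txt, read from its first byte.
sample=$(mktemp)
printf 'Total errors: 0\r\nResult: Success\r\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

match_utf16le 'Result' < "$sample"
# prints: Result: Success

rm -f "$sample"
```

Decoding only works when the input starts on a two-byte boundary, which is why the command above reads from the first byte with tail -c +1 instead of letting tail pick a starting point.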

Cyril Chaboisseau On 11 October 2022 at 08:13

You can use ripgrep instead, which handles UTF-16 nicely without you having to convert your input:

tail -f file.txt | rg regexp

Alexander McFarlane On 23 June 2015 at 23:28 (Accepted Answer)

I realised a simple regex to ignore any characters between letters in the search string might work...

This matches 'Result' whilst allowing any one character between each letter...

$ tail -f file.txt | grep -a 'R.e.s.u.l.t'
Result: Success

$ tail -f file.txt | awk '/R.e.s.u.l.t./'
Result: Success

Or, as per this answer, to avoid typing all the tedious dots:

search="Result"
tail -f file.txt | grep -a -e "$(echo "$search" | sed 's/./&./g')"
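To see what the sed substitution produces, here is a small sketch that wraps it in a function (dotted is a hypothetical name) and prints the generated pattern.

```shell
# Hypothetical helper: append a "." after every character of the search term,
# so each letter may be followed by one arbitrary byte (the NUL in UTF-16LE).
dotted() {
    printf '%s\n' "$1" | sed 's/./&./g'
}

dotted 'Result'
# prints: R.e.s.u.l.t.

# Usage with the pipeline from above (runs until interrupted):
#   tail -f file.txt | grep -a -e "$(dotted 'Result')"
```

The search term is not escaped, so regex metacharacters in it stay active in the generated pattern; this sketch assumes plain words such as 'Result'.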
