Formatting Errors with tail

How can I correctly parse this file through tail, without formatting errors?

I am using tail within cygwin to parse the last ten lines of two files. One file parses through correctly, the other contains a space between every character.

$ tail file2.txt -n 4
22/06/2015 12:28 - Decompressing and saving profile extract...
22/06/2015 12:28 - Decompressing and saving profile extract...
22/06/2015 12:38 - Decompressing and saving profile extract...
22/06/2015 12:38 - Decompressing and saving profile extract...

$ tail file1.txt -n 4
P a c k a g e   s t a r t .
E l a p s e d   t i m e :   5 0 . 1 7 5 7 5 4 8   s e c s .
. . . P a c k a g e   E x e c u t e d .

R e s u l t :   S u c c e s s

When I read the raw contents of the file in Python I get the following, which I think is a load of Unicode characters:

In [1]: open('file1.text', 'r').read()
Out[1]: '\xff\xfeP\x00a\x00c\x00k\x00a\x00g\x00e\x00 \x00s\x00t\x00a\x00r\x00t\x00.\x00\r\x00\n\x00E\x00l\x00a\x00p\x00s\x00e\x00d\x00 \x00t\x00i\x00m\x00e\x00:\x00 \x005\x000\x00.\x001\x007\x005\x007\x005\x004\x008\x00 \x00s\x00e\x00c\x00s\x00.\x00\r\x00\n\x00.\x00.\x00.\x00P\x00a\x00c\x00k\x00a\x00g\x00e\x00 \x00E\x00x\x00e\x00c\x00u\x00t\x00e\x00d\x00.\x00\r\x00\n\x00\r\x00\n\x00R\x00e\x00s\x00u\x00l\x00t\x00:\x00 \x00S\x00u\x00c\x00c\x00e\x00s\x00s\x00\r\x00\n\x00\r\x00\n\x00'
In [2]: print open('temp.txt', 'r').read()
■P a c k a g e   s t a r t .
E l a p s e d   t i m e :   5 0 . 1 7 5 7 5 4 8   s e c s .
. . . P a c k a g e   E x e c u t e d .

R e s u l t :   S u c c e s s

When I copy the entire content of file1.txt into a new file test.txt - the issue does not reoccur.

$ tail test.txt
Package start.
Elapsed time: 50.1757548 secs.
...Package Executed.

Result: Success

The file seems to have a \x00 byte after every character and \xff\xfe at the start.

There is 1 answer

Keith Thompson (best answer)

The file is in UTF-16 format, which uses two 8-bit bytes to represent most characters (and four 8-bit bytes for characters outside the Basic Multilingual Plane). Each of the 128 ASCII characters is represented as two bytes: a zero byte and a byte containing the actual character value. The \xff\xfe sequence at the start is a Byte Order Mark (BOM); it indicates whether the remaining characters are represented with the high-order or low-order byte first. Here, \xff\xfe means low-order byte first (little-endian), which matches the P\x00a\x00c\x00 pattern in your Python output.
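
To make the byte layout concrete, here is a small Python 3 sketch (the question uses Python 2, so treat this as an illustration rather than a drop-in for that session). It builds the first line of the file as little-endian UTF-16 with a BOM and checks that the bytes match the pattern in the question's output. The `text` and `raw` names are just for this example.

```python
import codecs

text = "Package start.\r\n"

# BOM for little-endian UTF-16, followed by the little-endian encoding of the text
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(raw[:8])  # b'\xff\xfeP\x00a\x00c\x00'

# The file starts with the two BOM bytes
assert raw[:2] == b"\xff\xfe"

# After the BOM, every ASCII character is followed by a zero byte
assert all(b == 0 for b in raw[3::2])

# Decoding as "utf-16" uses the BOM to pick the byte order, then drops it
assert raw.decode("utf-16") == text
```

A tool such as tail knows nothing about this encoding and simply passes the zero bytes through to the terminal.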

UTF-16 is one of several ways to represent Unicode text. It's most commonly used in Microsoft Windows.

I'm not sure why the null characters appear as spaces. That may be due to the way your terminal emulator behaves.

Use the iconv command to convert the file from UTF-16 to some other format before passing it to tail, for example: iconv -f UTF-16 -t UTF-8 file1.txt | tail -n 4
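
If iconv is not at hand, the same conversion can be sketched in Python 3 by opening the file with an explicit encoding. This is a minimal sketch, not a replacement for tail: `tail_utf16` is a hypothetical helper name, it reads the whole file into memory, and the demo file is a temporary file filled with sample text copied from the question.

```python
import os
import tempfile


def tail_utf16(path, n=10):
    """Return the last n lines of a UTF-16 text file (hypothetical helper)."""
    # encoding="utf-16" reads the BOM to detect the byte order and strips it;
    # the default newline handling translates "\r\n" to "\n" on read.
    with open(path, "r", encoding="utf-16") as f:
        lines = f.read().splitlines()
    return lines[-n:] if n > 0 else []


# Demo: write a small UTF-16 (little-endian, with BOM) file like the one in the question
sample = (
    "Package start.\r\n"
    "Elapsed time: 50.1757548 secs.\r\n"
    "...Package Executed.\r\n"
    "\r\n"
    "Result: Success\r\n"
    "\r\n"
)
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xff\xfe" + sample.encode("utf-16-le"))

for line in tail_utf16(demo_path, 4):
    print(line)

os.remove(demo_path)
```

For large log files, converting once with iconv and then using tail on the result is likely the better choice, since this sketch decodes the entire file just to get the last few lines.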