How can I correctly parse this file through tail, without formatting errors?
I am using tail within cygwin to parse the last ten lines of two files. One file parses through correctly; the other shows a space between every character.
$ tail file2.txt -n 4
22/06/2015 12:28 - Decompressing and saving profile extract...
22/06/2015 12:28 - Decompressing and saving profile extract...
22/06/2015 12:38 - Decompressing and saving profile extract...
22/06/2015 12:38 - Decompressing and saving profile extract...
$ tail file1.txt -n 4
P a c k a g e s t a r t .
E l a p s e d t i m e : 5 0 . 1 7 5 7 5 4 8 s e c s .
. . . P a c k a g e E x e c u t e d .
R e s u l t : S u c c e s s
When I read the raw contents of the file in Python I get the following, which I think is a load of Unicode characters:
In [1]: open('file1.text', 'r').read()
Out[1]: '\xff\xfeP\x00a\x00c\x00k\x00a\x00g\x00e\x00 \x00s\x00t\x00a\x00r\x00t\x00.\x00\r\x00\n\x00E\x00l\x00a\x00p\x00s\x00e\x00d\x00 \x00t\x00i\x00m\x00e\x00:\x00 \x005\x000\x00.\x001\x007\x005\x007\x005\x004\x008\x00 \x00s\x00e\x00c\x00s\x00.\x00\r\x00\n\x00.\x00.\x00.\x00P\x00a\x00c\x00k\x00a\x00g\x00e\x00 \x00E\x00x\x00e\x00c\x00u\x00t\x00e\x00d\x00.\x00\r\x00\n\x00\r\x00\n\x00R\x00e\x00s\x00u\x00l\x00t\x00:\x00 \x00S\x00u\x00c\x00c\x00e\x00s\x00s\x00\r\x00\n\x00\r\x00\n\x00'
In [2]: print open('temp.txt', 'r').read()
■P a c k a g e s t a r t .
E l a p s e d t i m e : 5 0 . 1 7 5 7 5 4 8 s e c s .
. . . P a c k a g e E x e c u t e d .
R e s u l t : S u c c e s s
When I copy the entire content of file1.txt into a new file test.txt, the issue does not reoccur.
$ tail test.txt
Package start.
Elapsed time: 50.1757548 secs.
...Package Executed.
Result: Success
The file seems to have the characters \x00 between every character and \xff at the start.
The file is in UTF-16 format, which uses two 8-bit bytes to represent most characters (and four 8-bit bytes for some characters). Each of the 128 ASCII characters is represented as two bytes: a zero byte and a byte containing the actual character value. The \xff\xfe sequence at the start is a Byte Order Mark (BOM); it indicates whether the remaining characters are represented with the high-order or low-order byte first. Here, \xff\xfe means low-order byte first (little-endian), which is why each \x00 follows its character in the dump above.
UTF-16 is one of several ways to represent Unicode text. It's most commonly used in Microsoft Windows.
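To make the byte layout concrete, here is a small Python sketch that builds the first line of the file by hand and inspects the bytes. The sample string is taken from the output in the question; building the bytes this way is only an illustration, not how the original file was produced.

```python
import codecs

text = u"Package start.\r\n"

# Prepend the little-endian BOM by hand, then encode without a BOM.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

bom = raw[:2]            # the two BOM bytes, \xff\xfe
first_chars = raw[2:10]  # 'P', 'a', 'c', 'k', each followed by a zero byte

# The generic "utf-16" codec reads the BOM to pick the byte order and drops it.
decoded = raw.decode("utf-16")

print(repr(bom))
print(repr(first_chars))
print(repr(decoded))
```

Decoding with the "utf-16" codec gives back the original text with no zero bytes and no BOM, which is what tail needs to see.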
I'm not sure why the null characters appear as spaces. That may be due to the way your terminal emulator behaves.
Use the iconv command to convert the file from UTF-16 to some other format.
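If iconv is not handy, the same conversion can be sketched in Python, which the question already uses. The helper names tail_utf16 and convert_to_utf8 are hypothetical, and the code assumes the file starts with a BOM, as file1.txt does.

```python
import io
from collections import deque


def tail_utf16(path, n=10):
    """Return the last n lines of a UTF-16 file as a list of strings."""
    # encoding="utf-16" uses the BOM to pick the byte order and strips it.
    with io.open(path, "r", encoding="utf-16") as f:
        # deque with maxlen keeps only the most recent n lines.
        return list(deque((line.rstrip("\r\n") for line in f), maxlen=n))


def convert_to_utf8(src, dst):
    """Rewrite a UTF-16 file as UTF-8, keeping the original line endings."""
    # newline="" disables newline translation on both read and write.
    with io.open(src, "r", encoding="utf-16", newline="") as fin:
        with io.open(dst, "w", encoding="utf-8", newline="") as fout:
            for line in fin:
                fout.write(line)
```

For example, tail_utf16('file1.txt', 4) would return the last four lines as ordinary strings, and after convert_to_utf8('file1.txt', 'file1.utf8.txt') the new file can be passed to tail directly.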