Writing to UTF-16-LE text file with BOM


I've read a few postings regarding Python writing to text files but I could not find a solution to my problem. Here it is in a nutshell.

The requirement: write values to a UTF-16LE text file with a BOM (FF FE), using the thorn character (U+00FE) as the text delimiter surrounding each value and the pilcrow character (U+00B6) as the field delimiter after each column.

The issue: the output file has whitespace between each column that I did not script for. It also does not display correctly in UltraEdit: only the first value ("mom") shows. I welcome any insight or advice.

The script (simplified to ease troubleshooting; the actual script uses a third-party API to obtain the list of values):

import os
import codecs
import shutil
import sys

first = u''
textdel = u'\u00FE'.encode('utf_16_le')   #thorn
fielddel = u'\u00B6'.encode('utf_16_le')  #pilcrow
list1 = ['mom', 'dad', 'son']
num = len(list1)  #pretend this is from the metadata profile

f = codecs.open('c:/myFile.txt', 'w', 'utf_16_le')
f.write(u'\uFEFF')
for item in list1:
  mytext2 = u''
  i = 0
  i = i + 1
  mytext2 = mytext2 + item + textdel
  if i < (num - 1):
    mytext2 = mytext2 + fielddel
  f.write(mytext2 + u'\n')

f.close()

There is 1 answer

Derek T. Jones (best answer)

You're double-encoding your strings. You've already opened your file as UTF-16-LE, so leave your textdel and fielddel strings unencoded. They will get encoded at write time along with every line written to the file.
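As a rough illustration of that advice, here is a minimal sketch of the script with the delimiters left as plain Unicode characters. It makes a few assumptions that are not in the question: the helper name write_delimited and the temporary output path are hypothetical, io.open is used in place of codecs.open so the same code runs on Python 2 and 3, and the layout (thorn on both sides of each value, pilcrow after every value except the last, one value per line) is a guess at the intent. It also uses enumerate, because the original loop resets i to 0 on every pass.

```python
import io
import os
import tempfile

TEXT_DELIM = u'\u00FE'   # thorn, kept as a character (not pre-encoded bytes)
FIELD_DELIM = u'\u00B6'  # pilcrow, likewise left unencoded


def write_delimited(path, values):
    """Write values as UTF-16-LE with a BOM, one value per line (assumed layout)."""
    # The 'utf-16-le' codec does not emit a BOM on its own, so it is written
    # explicitly; newline='' stops the line ending being translated on Windows.
    with io.open(path, 'w', encoding='utf-16-le', newline='') as f:
        f.write(u'\uFEFF')  # serialized by the codec as the bytes FF FE
        last = len(values) - 1
        for index, item in enumerate(values):
            line = TEXT_DELIM + item + TEXT_DELIM
            if index < last:  # assumption: no pilcrow after the final value
                line += FIELD_DELIM
            f.write(line + u'\n')


# Hypothetical output location; the question used 'c:/myFile.txt'.
demo_path = os.path.join(tempfile.mkdtemp(), 'myFile.txt')
write_delimited(demo_path, [u'mom', u'dad', u'son'])
```

Because every string handed to f.write is unencoded text, each character is serialized exactly once, by the file object.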

Or put another way, textdel = u'\u00FE' sets textdel to the "thorn" character, while textdel = u'\u00FE'.encode('utf-16-le') sets textdel to a particular serialized form of that character, a sequence of bytes according to that codec; it is no longer a sequence of characters:

textdel = u'\u00FE'
len(textdel)                      # -> 1
type(textdel)                     # -> unicode
len(textdel.encode('utf-16-le'))  # -> 2
type(textdel.encode('utf-16-le')) # -> str
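
The session above is Python 2, where the two types are named unicode and str. For readers on Python 3, a rough equivalent is sketched below. The variable names encoded and mixed_ok are only illustrative. There the types are called str and bytes, and mixing them fails immediately instead of being converted implicitly.

```python
textdel = '\u00FE'                     # a str holding one character (thorn)
encoded = textdel.encode('utf-16-le')  # a bytes object: its serialized form

print(len(textdel), type(textdel).__name__)  # prints: 1 str
print(len(encoded), type(encoded).__name__)  # prints: 2 bytes

# Python 3 refuses to concatenate text with encoded bytes.
try:
    'mom' + encoded
    mixed_ok = True
except TypeError:
    mixed_ok = False
```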