I import data from txt files that are encoded in UTF-8 and from XML files. The data from both formats are transformed and put into separate pandas dataframes (one for the imported txt file, one for the imported XML file). Each dataframe is then exported as a txt file. The problem is that these txt files must be encoded as cp1252, but they end up encoded as UTF-8.
Reading of the XML files is done with:
import xml.etree.ElementTree as ET
xml_file_path = path_files + xml_filename
tree = ET.parse(xml_file_path)
root = tree.getroot()
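The element texts are then collected into rows, roughly like this (simplified):
list_xml = []
for record in root:
    # one tuple per record, one field per sub-element's text
    list_xml.append(tuple(el.text for el in record))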
Reading of the txt files is done with:
import csv
with open(txt_filename, 'r', encoding='cp1252') as f:
    reader = csv.reader(f, delimiter='\t')
    list_results = list(tuple(line) for line in reader)
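The list of tuples is then turned into a dataframe, roughly:
import pandas as pd

df_import1 = pd.DataFrame(list_results)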
Data that cannot be encoded as cp1252 are replaced with "?" using this code:
def filter_cp1252(text):
    try:
        # if the whole string encodes cleanly, return it unchanged
        text.encode('cp1252')
        return text
    except UnicodeEncodeError:
        # otherwise rebuild the string character by character
        text2 = ''
        for letter in text:
            try:
                letter.encode('cp1252')
                text2 = text2 + letter
            except UnicodeEncodeError:
                text2 = text2 + '?'
        return text2
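This is applied to every string cell of the dataframes, roughly:
df_import2 = df_import2.applymap(
    lambda v: filter_cp1252(v) if isinstance(v, str) else v)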
Characters that are not cp1252-encodable are successfully replaced with "?". I could only test this when importing files as UTF-8, but I left the code in to stay on the safe side.
The data are collected in a list that is transformed into a pandas dataframe. After this, the export to txt is done with:
df_import2.to_csv(output, header=None, index=None, sep='\t', mode='w', encoding='cp1252')
Nevertheless, the output is UTF-8 (and that drives me crazy).
Does anyone have a suggestion for how I can solve this problem?
I have also tried exporting the dataframe to Excel first and re-importing it with pandas, hoping that this could solve the problem, but Python obviously "prefers" UTF-8 wherever possible.
The question is unclear and quite a lot of the code isn't useful.
open(..., encoding='cp1252')
would throw an error if the text were in the wrong codepage. To load characters that can't be mapped, you'd have to use the errors parameter in open. If you use errors='replace', the unmappable characters are replaced (with the Unicode replacement character when decoding).
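For example, for the text files:
with open(txt_filename, 'r', encoding='cp1252', errors='replace') as f:
    # undecodable bytes no longer raise UnicodeDecodeError
    reader = csv.reader(f, delimiter='\t')
    list_results = list(tuple(line) for line in reader)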
That means the non-Latin text comes from the XML. Instead of trying to convert the text in code, you can use the errors argument in to_csv to replace non-Latin characters with ? as well.
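Using the export line from your question (errors requires pandas >= 1.1):
df_import2.to_csv(output, header=None, index=None, sep='\t', mode='w',
                  encoding='cp1252', errors='replace')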
Finally, you can use read_csv with encoding='cp1252' and encoding_errors='replace' to read the text data directly into a dataframe. Once all frames are combined, you can save them as cp1252 with errors='replace' (note that read_csv takes encoding_errors, while to_csv takes errors), e.g.:
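import pandas as pd

# sketch; df_xml and df_combined are placeholder names
df_txt = pd.read_csv(txt_filename, sep='\t', header=None,
                     encoding='cp1252', encoding_errors='replace')  # pandas >= 1.3
df_combined = pd.concat([df_txt, df_xml], ignore_index=True)
df_combined.to_csv(output, header=None, index=None, sep='\t',
                   encoding='cp1252', errors='replace')  # unencodable chars become '?'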