Write CSV with encoding cp1252 with Python 3.12


I import data from txt files that are encoded in utf-8 and from xml files. Data from both formats are transformed and put separately into a pandas dataframe (one for the imported txt file, one for the imported xml file). Each dataframe is exported as a txt file. The problem is that these txt files must be encoded in cp1252, but they end up encoded in utf-8.

Reading of xml files is done with:

    import xml.etree.ElementTree as ET

    xml_file_path = path_files + xml_filename

    tree = ET.parse(xml_file_path)
    root = tree.getroot()
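
The question does not show how text is extracted from the parsed tree. A minimal sketch, assuming hypothetical record elements with name and value children; note that ElementTree decodes the file according to its XML declaration, so the extracted values are ordinary Python str objects:

    import pandas as pd

    rows = []
    for record in root.iter('record'):  # 'record', 'name', 'value' are hypothetical tags
        rows.append((record.findtext('name', default=''),
                     record.findtext('value', default='')))
    df_xml = pd.DataFrame(rows, columns=['name', 'value'])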

Reading of txt files is done with:

    import csv

    with open(txt_filename, 'r', encoding='cp1252') as f:
        reader = csv.reader(f, delimiter='\t')
        list_results = [tuple(line) for line in reader]
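
The list is later turned into a pandas dataframe; presumably something like this sketch (df_txt is a hypothetical name):

    import pandas as pd

    # Assumed step: the tuples from csv.reader become the rows of a dataframe.
    df_txt = pd.DataFrame(list_results)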

Data that cannot be encoded in cp1252 are replaced with "?" using this code:

    def filter_cp1252(text):
        try:
            # If the whole string encodes cleanly, return it unchanged.
            text.encode('cp1252')
            return text
        except UnicodeEncodeError:
            # Otherwise replace each non-encodable character with '?'.
            text2 = ''
            for letter in text:
                try:
                    letter.encode('cp1252')
                    text2 += letter
                except UnicodeEncodeError:
                    text2 += '?'
            return text2
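
The question does not show where the filter is applied; presumably to every string cell of the dataframe before export, along these lines (df_import2 is the frame name from the export step below):

    # Assumed usage: apply the filter to every string cell before export.
    # (DataFrame.map is pandas >= 2.1; use applymap on older versions.)
    df_import2 = df_import2.map(lambda v: filter_cp1252(v) if isinstance(v, str) else v)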

Characters that cannot be encoded in cp1252 are successfully replaced with "?". I could test this only when importing files as utf-8, but I left the function in the code to stay on the safe side.

Data are collected in a list that is transformed into a pandas dataframe. After this, export to txt is done with:


    df_import2.to_csv(output, header=None, index=None, sep='\t', mode='w', encoding='cp1252')

Nevertheless, the output is utf-8 (and that drives me crazy).
Does anyone have a suggestion for how I can solve this problem?
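
One way to verify what was actually written is to inspect the raw bytes of a character that differs between the two encodings, e.g. the euro sign:

    # cp1252 encodes '€' as the single byte 0x80; utf-8 uses the three bytes 0xE2 0x82 0xAC.
    # Note: pure-ASCII content is byte-identical in both encodings, so editors may report utf-8.
    with open(output, 'rb') as f:
        raw = f.read()
    print('cp1252 euro byte found:', b'\x80' in raw)
    print('utf-8 euro bytes found:', b'\xe2\x82\xac' in raw)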

I have also exported the dataframe to Excel first and reimported it with pandas, hoping this could solve the problem. But Python obviously "prefers" utf-8 wherever possible.


1 Answer

Answered by Panagiotis Kanavos:

The question is unclear and quite a lot of the code isn't useful. open(..., encoding='cp1252') would throw an error if the text were in the wrong codepage. To load characters that can't be mapped you'd have to use the errors parameter of open. With errors='replace', undecodable bytes are replaced by the Unicode replacement character U+FFFD ('�') while reading; the '?' replacement applies when encoding, i.e. when writing.
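
For illustration, reading with replacement handled by open itself could look like this (reusing the question's variables):

    with open(txt_filename, 'r', encoding='cp1252', errors='replace') as f:
        reader = csv.reader(f, delimiter='\t')
        # Undecodable bytes arrive as U+FFFD rather than raising an exception.
        list_results = [tuple(line) for line in reader]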

That means the non-Latin text comes from the XML. Instead of trying to convert the text in code, you can use the errors argument of to_csv to replace non-cp1252 characters with ? as well.

Finally, you can use read_csv with encoding='cp1252' and encoding_errors='replace' to read the text data directly into a dataframe:

    txt_frames = pd.read_csv(txt_filename, encoding='cp1252', encoding_errors='replace', sep='\t')

Once all frames are combined, you can save them as cp1252 with errors='replace' (note that to_csv takes errors, whereas read_csv takes encoding_errors):

    df_import2 = pd.concat(all_frames)
    df_import2.to_csv(output, encoding='cp1252', errors='replace', header=None, index=None, sep='\t')