File type from pandas.DataFrame.to_excel is "Zip archive data, at least v2.0 to extract"

I noticed that the file type reported for an Excel file generated by pandas.DataFrame.to_excel is Zip archive data, at least v2.0 to extract. Note that the content type is fine: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.

In my Django project, I am essentially validating a file type before processing the uploaded file, and although the file generated by pandas.DataFrame.to_excel is a valid Excel file, the validation module is rejecting the uploaded file because of the file type being Zip archive data, at least v2.0 to extract, instead of Microsoft Excel 2007+.

Please let me know how I can get around this validation.

The code I used to reproduce this issue (i.e., to create an Excel file with the file type Zip archive data, at least v2.0 to extract) is below.

import os

import magic  # python-magic, a wrapper around libmagic
import pandas as pd

uploaded_file_path = r'somepath'

# Build the output name next to the uploaded file, with a test suffix.
path, filename = os.path.split(uploaded_file_path)
filename_without_extension = os.path.splitext(filename)[0]
new_file_name = os.path.join(path, filename_without_extension) + '_TESTING_BLAH_1.xlsx'

# Small sample frame to write out as an .xlsx file.
df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
                   index=['row 1', 'row 2'],
                   columns=['col 1', 'col 2'])

df1.to_excel(new_file_name)

# Ask libmagic what it thinks the generated file is.
file_type = magic.from_file(new_file_name)
print(file_type)

There is 1 answer

Chris

As suspected, the behaviour seems to come from the way the Excel files are created. The xlsx files created by open source libraries do not carry the same magic signature as the xlsx files created by MS Excel. A similar issue can be found here. The default database that libmagic uses evidently does not recognize those files as Excel files.
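To see where two xlsx files might differ, you can look at the name stored in the first local file header of the ZIP container. The sketch below is only an illustration: it assumes (without checking a particular libmagic version) that the detection rules look at the first entries of the archive, and the helpers first_zip_entry_name and build_zip, as well as the entry orderings used, are made up for the example rather than taken from pandas or MS Excel.

```python
import io
import struct
import zipfile


def first_zip_entry_name(data: bytes) -> str:
    """Return the file name stored in the first local file header of a ZIP."""
    if data[:4] != b"PK\x03\x04":
        raise ValueError("data does not start with a ZIP local file header")
    # The name length is a little-endian uint16 at offset 26;
    # the name itself starts at offset 30.
    (name_len,) = struct.unpack_from("<H", data, 26)
    return data[30:30 + name_len].decode("utf-8", errors="replace")


def build_zip(names) -> bytes:
    """Build an in-memory ZIP with empty members, written in the given order."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()


# Hypothetical orderings, chosen only to show that two valid archives
# can begin with different entries.
archive_a = build_zip(["[Content_Types].xml", "xl/workbook.xml"])
archive_b = build_zip(["docProps/app.xml", "[Content_Types].xml"])

print(first_zip_entry_name(archive_a))
print(first_zip_entry_name(archive_b))
```

For a real file, read the first few hundred bytes with open(path, "rb") and pass them to first_zip_entry_name; comparing a pandas-generated file with one saved from MS Excel should show whether the leading entries differ on your system.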

The post also describes a possible solution: you can add custom definitions to the file /etc/magic, and there is a file you can copy and paste that seems to work.

So copy the contents of this msooxml file into the file /etc/magic on your computer. After doing that, the files were identified as Excel 2007 on my machine.