I have a bunch (100+) CSV files. Each of them can have blank rows, or rows I don't need (Some fuzz info like "Congrats, you all bla bla"). When reading in Pandas I need to specify which row is the header row. It's a lot of work to do with multiple files. Keep in mind that all files are of different format.
Currently I iterate over all rows and just check if all cells in a row are strings and select that one as a header.
I need a better function to compress list of strings into a single confidence score (so that I can see what row is header with the highest probability)
def detect_header(df, num_header_rows=2, threshold=0):
potential_header_rows = []
# Iterate through each row and compare it with the previous row
for i in range(1, len(df)):
row1 = df.iloc[i].apply(preprocess_string)
row2 = df.iloc[i - 1].apply(preprocess_string)
if list(row1).count('') > 0.5 * len(row1):
# skip Nones
continue
# method that needs better implementation
similarity = string_similarity(row1, row2)
print(similarity)
if similarity >= threshold:
potential_header_rows.append(i)
if len(potential_header_rows) == num_header_rows:
break
return potential_header_rows
Could you please help me out?
some time ago I was dealing with the same requirement and my approach was the below.
This function receives the dataframe already loaded and can create the headers if the first row is not numeric (if not, we left the table as it is):
To execute the function above you may need another function to check if a string represents a number, something like the below:
I hope it can help you to start playing...