Messy CSV auto header extractor

48 views Asked by At

I have a bunch (100+) CSV files. Each of them can have blank rows, or rows I don't need (Some fuzz info like "Congrats, you all bla bla"). When reading in Pandas I need to specify which row is the header row. It's a lot of work to do with multiple files. Keep in mind that all files are of different format.

Currently I iterate over all rows and just check if all cells in a row are strings and select that one as a header.

I need a better function to compress list of strings into a single confidence score (so that I can see what row is header with the highest probability)

def detect_header(df, num_header_rows=2, threshold=0):
    potential_header_rows = []
    
    # Iterate through each row and compare it with the previous row
    for i in range(1, len(df)):
        row1 = df.iloc[i].apply(preprocess_string)
        row2 = df.iloc[i - 1].apply(preprocess_string)
        if list(row1).count('') > 0.5 * len(row1):
            # skip Nones
            continue
        # method that needs better implementation
        similarity = string_similarity(row1, row2)
        print(similarity)
        if similarity >= threshold:
            potential_header_rows.append(i)
            if len(potential_header_rows) == num_header_rows:
                break 
    return potential_header_rows

Could you please help me out?

1

There are 1 answers

1
Wilson Salgado On

some time ago I was dealing with the same requirement and my approach was the below.

This function receives the dataframe already loaded and can create the headers if the first row is not numeric (if not, we left the table as it is):

def reset_headers(df):
    indexes = []
    indexes_final = []
    for index, row in df.iterrows():
        #Check if all the cells in these first rows are not numeric:
        if row.apply(lambda x: False if represents_number(x) else True).all():
            #So, this one should be considered as part of the header:
            indexes.append(df.iloc[index])
        else:
            break
    #The data of the dataframe discarding the rows that are considered header:
    df = df.iloc[index:, :].copy()
    # Concatenate the information in the header if there are more than one row considered header:
    if len(indexes) > 0:
        for row in zip(*indexes):
            temp = ""
            for i in row:
                temp += i + " "
            indexes_final.append(temp.strip())
        df.columns = indexes_final
    # At the end, rename properly if there are some duplicate column names:
    duplicated_cols = df.columns.duplicated()
    duplicated = df.columns[duplicated_cols].unique()
    rename_cols = []
    i=1
    for col in df.columns:
        if col in duplicated:
            rename_cols.extend([col + '_' + str(i)])
            i=i+1
        else:
            rename_cols.extend([col])
    df.columns = rename_cols

    return df

To execute the function above you may need another function to check if a string represents a number, something like the below:

def represents_number(s):
    try:
        float(s)
    except ValueError:
        return False
    except TypeError:
        return False
    else:
        return True

I hope it can help you to start playing...