Most time-efficient way of reconciling csv/xlsx data converted between schemas?

52 views Asked by At
Old Schema New Schema
Id AccountId
ProductId ProductCode
ProductName

So, colleagues of mine are converting data from an older system to a new one, which involves small changes in the schema. My job is to make sure it's reconciling and that no data has been corrupted or lost, and then show the data in Excel afterwards so it can be manually seen what rows failed to reconcile and why. It's mostly just that column names are changed or columns are excluded, so it's very easy to check one entry against another -- just a couple basic if statements -- but the challenge is doing this quickly.

I need to match entries across the tables (OldSchema and NewSchema) by their IDs, so I need to search the Old Schema for a matching ID (which may not exist -- I'm also checking if data is lost) each time, and then perform the if checks, for tens of thousands of rows. I've tried a few ways of doing this with both OfficeScripts and Pandas, but it's been incredibly slow each time, and I know I must be doing something stupid because I've never worked with large amounts of data before.

I need to go row by row and then compare column by column, and also there may be additional business rules (i.e. normally the values must be identical, but sometimes a value of "Premium" for Column X in the Old Schema should match "Super Premium" for Column X's equivalent in the New Schema). I couldn't figure out how to do checks like this in a Python Pandas DataFrame without using iterrows(), which was even slower than the following and came with its own problems, so I tried just keeping it all within Excel instead. The following code is a highly abridged and altered version of what I did -- converted the normal Excel "List of Lists" format into a list of objects, and then iterated through those objects and compared the keys.

function main(workbook: ExcelScript.Workbook) {
    const start = Date.now();


    const formattedOldTable = getAsListOfObjects(workbook.getTable("OldTable"));
    const formattedNewTable = getAsListOfObjects(workbook.getTable("NewTable"));

    for (const entry of formattedNewTable) {
        const oldEntry = formattedOldTable.find((oldEntry) => oldEntry["Id"] == entry["Id"]);
        const match = checkMatch(entry, oldEntry);
        if (match) {
            // mark this on the table
        }
    }
}

function checkMatch(newEntry: newSchema, oldEntry: OldSchema) {
    let match = true;
    if (!oldEntry) {
        return false;
    }
    if (newEntry["ColumnName"] !== oldEntry["SlightlyAlteredColumnName"]) {
        match = false;
    }
    // ...
    return match;
}

This is so slow with even tables of a few tens of thousands of rows that I believe I'm barking up the wrong tree entirely. I only posted it to show that I actually tried to come up with my own solutions before asking -- my question is, what language/method should I be using? How are basic reconciliations like this supposed to be done? Google hasn't been helpful, only showing me how I can compare if two tables are totally identical or find the difference between them, but I want to mark line by line if an entry has been matched and if not why.

0

There are 0 answers