Optimizing Bulk Insert Performance for Large Databases: Seeking Suggestions for Efficient Handling of Duplicate Entries


Let's consider the current state of the student table:

| ID | First Name | Last Name | Marks |
|----|------------|-----------|-------|
| 1  | John       | Doe       | 85    |
| 2  | Alice      | Smith     | 90    |
| 3  | Bob        | Johnson   | 78    |
| 4  | Emily      | Brown     | 95    |
| 5  | Michael    | Davis     | 88    |

Suppose I receive a list of records like this:

| First Name | Last Name | Marks |
|------------|-----------|-------|
| Sarah      | Lee       | 92    |
| Alice      | Smith     | 85    |
| Emma       | Harris    | 80    |
| Michael    | Davis     | 90    |
| Kevin      | White     | 85    |

I want to insert the records into the table but not update existing rows. If a row fails because it is a duplicate, that has to be noted in a separate "Status" column, so the response looks like this:

| First Name | Last Name | Marks | Status |
|------------|-----------|-------|--------|
| Sarah      | Lee       | 92    | Success (New Record Inserted) |
| Alice      | Smith     | 85    | Error: Duplicate entry for Alice Smith |
| Emma       | Harris    | 80    | Success (New Record Inserted) |
| Michael    | Davis     | 90    | Error: Duplicate entry for Michael Davis |
| Kevin      | White     | 85    | Success (New Record Inserted) |
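
For a row to fail as a duplicate at all, the database has to enforce a uniqueness rule. Here is a minimal sketch of the setup I'm assuming, living in the same package as the code further below; the students table, its column names, and the composite unique key on the name columns are illustrative placeholders, not my exact schema:

import (
    "context"
    "database/sql"
)

// Student mirrors one incoming record.
type Student struct {
    FirstName string
    LastName  string
    Marks     int
}

// ensureUniqueIndex adds the composite unique key that makes the database
// reject a second row with the same (first_name, last_name) pair.
func ensureUniqueIndex(ctx context.Context, db *sql.DB) error {
    _, err := db.ExecContext(ctx,
        `ALTER TABLE students ADD CONSTRAINT uq_student_name UNIQUE (first_name, last_name)`)
    return err
}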

I want to perform these operations in bulk, but if an error occurs I won't know which row inside the batch caused it. Therefore, if the bulk insert fails, I break the batch into smaller chunks and retry; if a chunk succeeds, every row in it was inserted, and I record that in the response list. My code for this is as follows:

// createStudentsAndUpdateStatus inserts students[start:end+1] and appends a
// per-row status cell to the corresponding rows of outputCsvData. It first
// tries one bulk insert for the whole range; if that fails, it splits the
// range in half and recurses, falling back to row-by-row inserts once the
// range is small. It returns the number of rows successfully inserted.
func createStudentsAndUpdateStatus(ctx context.Context, db *sql.DB, students []Student, outputCsvData *[][]string, start int, end int) int {
    if end < start {
        return 0
    }
    // Small ranges are inserted one by one so each failing row can be reported individually.
    if end-start < 12 {
        updated := 0
        for i := start; i <= end; i++ {
            err := insertStudent(ctx, db, students[i])
            if err == nil {
                (*outputCsvData)[i] = append((*outputCsvData)[i], "Success (New Record Inserted)")
                updated++
            } else {
                (*outputCsvData)[i] = append((*outputCsvData)[i], fmt.Sprintf("Error: %s", err.Error()))
            }
        }
        return updated
    }

    // Try to insert the whole range with a single bulk statement.
    err := insertStudentsBatch(ctx, db, students[start:end+1])
    if err == nil {
        for i := start; i <= end; i++ {
            (*outputCsvData)[i] = append((*outputCsvData)[i], "Success (New Record Inserted)")
        }
        return end - start + 1
    }

    // The bulk insert failed somewhere in this range: split it in half and retry each half.
    m := start + (end-start)/2
    updatedLeft := createStudentsAndUpdateStatus(ctx, db, students, outputCsvData, start, m)
    updatedRight := createStudentsAndUpdateStatus(ctx, db, students, outputCsvData, m+1, end)
    return updatedLeft + updatedRight
}

In this code, if the chunk has fewer than 12 rows, I insert them one by one in sequence. Otherwise, if the bulk insert for the chunk fails, I split the chunk into two halves and recurse.
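
insertStudent and insertStudentsBatch are not shown above, so here is a simplified sketch of what they could look like, assuming a MySQL-style driver with ? placeholders and the students table from the earlier sketch (same package, so Student and the imports are already in scope):

// insertStudent inserts a single row; a second row with the same
// (first_name, last_name) pair triggers a unique-constraint error.
func insertStudent(ctx context.Context, db *sql.DB, s Student) error {
    _, err := db.ExecContext(ctx,
        `INSERT INTO students (first_name, last_name, marks) VALUES (?, ?, ?)`,
        s.FirstName, s.LastName, s.Marks)
    return err
}

// insertStudentsBatch inserts all rows with one multi-row INSERT statement.
// If any row violates the unique key, the whole statement fails and nothing
// from this batch is inserted.
func insertStudentsBatch(ctx context.Context, db *sql.DB, students []Student) error {
    if len(students) == 0 {
        return nil
    }
    query := `INSERT INTO students (first_name, last_name, marks) VALUES `
    args := make([]interface{}, 0, len(students)*3)
    for i, s := range students {
        if i > 0 {
            query += ", "
        }
        query += "(?, ?, ?)"
        args = append(args, s.FirstName, s.LastName, s.Marks)
    }
    _, err := db.ExecContext(ctx, query, args...)
    return err
}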

In real-life scenarios, the database has approximately 10^8 entries, and I typically receive a list of records of size 10^5.

I've experimented with different approaches to improve performance:

  1. Inserting records one by one took approximately 60 minutes.
  2. Bulk inserting in batches of 1000 records and then retrying the failed records individually reduced the time to 30 minutes.
  3. The recursive-split approach described above currently brings the insertion time down to 15 minutes (a sketch of the driver follows this list).
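
Roughly, the driver for the recursive approach looks like the sketch below; the top-level batch size of 1000 is an assumption carried over from approach 2 and is not tuned:

// processAll walks the incoming records in fixed-size top-level batches and
// lets createStudentsAndUpdateStatus handle splitting on failure within each
// batch. It returns the total number of rows inserted.
func processAll(ctx context.Context, db *sql.DB, students []Student, outputCsvData *[][]string) int {
    const batchSize = 1000 // carried over from approach 2; not tuned
    inserted := 0
    for start := 0; start < len(students); start += batchSize {
        end := start + batchSize - 1
        if end >= len(students) {
            end = len(students) - 1
        }
        inserted += createStudentsAndUpdateStatus(ctx, db, students, outputCsvData, start, end)
    }
    return inserted
}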

Despite the improvement, I'm still looking for ways to optimize the insertion process further. Are there any alternative methods or techniques that could potentially reduce the insertion time even more?

I appreciate any suggestions or insights you can provide. Thank you!
