Optimizing Bulk Insert Performance for Large Databases: Seeking Suggestions for Efficient Handling of Duplicate Entries


Let's consider the current state of the student table:

| ID | First Name | Last Name | Marks |
|----|------------|-----------|-------|
| 1  | John       | Doe       | 85    |
| 2  | Alice      | Smith     | 90    |
| 3  | Bob        | Johnson   | 78    |
| 4  | Emily      | Brown     | 95    |
| 5  | Michael    | Davis     | 88    |

Suppose I receive a list of records like this:

| First Name | Last Name | Marks |
|------------|-----------|-------|
| Sarah      | Lee       | 92    |
| Alice      | Smith     | 85    |
| Emma       | Harris    | 80    |
| Michael    | Davis     | 90    |
| Kevin      | White     | 85    |

I want to insert the records into the table but not update existing rows. If a row fails because it is a duplicate, that has to be noted in a separate "Status" column, so the response looks like this:

| First Name | Last Name | Marks | Status |
|------------|-----------|-------|--------|
| Sarah      | Lee       | 92    | Success (New Record Inserted) |
| Alice      | Smith     | 85    | Error: Duplicate entry for Alice Smith |
| Emma       | Harris    | 80    | Success (New Record Inserted) |
| Michael    | Davis     | 90    | Error: Duplicate entry for Michael Davis |
| Kevin      | White     | 85    | Success (New Record Inserted) |
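
For a row to fail as a duplicate at all, the database has to enforce a uniqueness rule. Here is a minimal sketch of the setup I'm assuming, living in the same package as the code further below; the students table, its column names, and the composite unique key on the name columns are illustrative placeholders, not my exact schema:

import (
    "context"
    "database/sql"
)

// Student mirrors one incoming record.
type Student struct {
    FirstName string
    LastName  string
    Marks     int
}

// ensureUniqueIndex adds the composite unique key that makes the database
// reject a second row with the same (first_name, last_name) pair.
func ensureUniqueIndex(ctx context.Context, db *sql.DB) error {
    _, err := db.ExecContext(ctx,
        `ALTER TABLE students ADD CONSTRAINT uq_student_name UNIQUE (first_name, last_name)`)
    return err
}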

I want to perform these operations in bulk, but if an error occurs I won't know which row inside the batch caused it. Therefore, if the bulk insert fails, I break the batch into smaller chunks and retry; if a chunk succeeds, every row in it was inserted, and I record that in the response list. My code for this is as follows:

// createStudentsAndUpdateStatus inserts students[start:end+1] and appends a
// per-row status cell to the corresponding rows of outputCsvData. It first
// tries one bulk insert for the whole range; if that fails, it splits the
// range in half and recurses, falling back to row-by-row inserts once the
// range is small. It returns the number of rows successfully inserted.
func createStudentsAndUpdateStatus(ctx context.Context, db *sql.DB, students []Student, outputCsvData *[][]string, start int, end int) int {
    if end < start {
        return 0
    }
    // Small ranges are inserted one by one so each failing row can be reported individually.
    if end-start < 12 {
        updated := 0
        for i := start; i <= end; i++ {
            err := insertStudent(ctx, db, students[i])
            if err == nil {
                (*outputCsvData)[i] = append((*outputCsvData)[i], "Success (New Record Inserted)")
                updated++
            } else {
                (*outputCsvData)[i] = append((*outputCsvData)[i], fmt.Sprintf("Error: %s", err.Error()))
            }
        }
        return updated
    }

    // Try to insert the whole range with a single bulk statement.
    err := insertStudentsBatch(ctx, db, students[start:end+1])
    if err == nil {
        for i := start; i <= end; i++ {
            (*outputCsvData)[i] = append((*outputCsvData)[i], "Success (New Record Inserted)")
        }
        return end - start + 1
    }

    // The bulk insert failed somewhere in this range: split it in half and retry each half.
    m := start + (end-start)/2
    updatedLeft := createStudentsAndUpdateStatus(ctx, db, students, outputCsvData, start, m)
    updatedRight := createStudentsAndUpdateStatus(ctx, db, students, outputCsvData, m+1, end)
    return updatedLeft + updatedRight
}

In this code, if the chunk has fewer than 12 rows, I insert them one by one in sequence. Otherwise, if the bulk insert for the chunk fails, I split the chunk into two halves and recurse.
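
insertStudent and insertStudentsBatch are not shown above, so here is a simplified sketch of what they could look like, assuming a MySQL-style driver with ? placeholders and the students table from the earlier sketch (same package, so Student and the imports are already in scope):

// insertStudent inserts a single row; a second row with the same
// (first_name, last_name) pair triggers a unique-constraint error.
func insertStudent(ctx context.Context, db *sql.DB, s Student) error {
    _, err := db.ExecContext(ctx,
        `INSERT INTO students (first_name, last_name, marks) VALUES (?, ?, ?)`,
        s.FirstName, s.LastName, s.Marks)
    return err
}

// insertStudentsBatch inserts all rows with one multi-row INSERT statement.
// If any row violates the unique key, the whole statement fails and nothing
// from this batch is inserted.
func insertStudentsBatch(ctx context.Context, db *sql.DB, students []Student) error {
    if len(students) == 0 {
        return nil
    }
    query := `INSERT INTO students (first_name, last_name, marks) VALUES `
    args := make([]interface{}, 0, len(students)*3)
    for i, s := range students {
        if i > 0 {
            query += ", "
        }
        query += "(?, ?, ?)"
        args = append(args, s.FirstName, s.LastName, s.Marks)
    }
    _, err := db.ExecContext(ctx, query, args...)
    return err
}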

In real-life scenarios, the database has approximately 10^8 entries, and I typically receive a list of records of size 10^5.

I've experimented with different approaches to improve performance:

  1. Inserting records one by one took approximately 60 minutes.
  2. Bulk inserting in batches of 1000 records and then retrying the failed records individually reduced the time to 30 minutes.
  3. The recursive-split approach described above currently brings the insertion time down to 15 minutes (a sketch of the driver follows this list).
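
Roughly, the driver for the recursive approach looks like the sketch below; the top-level batch size of 1000 is an assumption carried over from approach 2 and is not tuned:

// processAll walks the incoming records in fixed-size top-level batches and
// lets createStudentsAndUpdateStatus handle splitting on failure within each
// batch. It returns the total number of rows inserted.
func processAll(ctx context.Context, db *sql.DB, students []Student, outputCsvData *[][]string) int {
    const batchSize = 1000 // carried over from approach 2; not tuned
    inserted := 0
    for start := 0; start < len(students); start += batchSize {
        end := start + batchSize - 1
        if end >= len(students) {
            end = len(students) - 1
        }
        inserted += createStudentsAndUpdateStatus(ctx, db, students, outputCsvData, start, end)
    }
    return inserted
}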

Despite the improvement, I'm still looking for ways to optimize the insertion process further. Are there any alternative methods or techniques that could potentially reduce the insertion time even more?

I appreciate any suggestions or insights you can provide. Thank you!
