What is the difference between inserting data using SQL insert statements and SqlBulkCopy?


I have a problem inserting a huge amount of data into SQL Server.

Previously I was using Entity Framework, but it was painfully slow for just 100K root-level records (each containing two distinct collections, each holding roughly another 200K records), i.e. roughly 500K-600K records in memory. I applied all the usual optimizations (e.g. AutoDetectChangesEnabled = false, and recreating and disposing the context after each batch).
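Roughly the pattern I was using, in case it matters (MyDbContext, Records and Record are just placeholders for my actual context and entity; EF6-style code):

//using System.Collections.Generic;
//using System.Linq;

public void SlowEfInsert(IReadOnlyList<Record> records, int batchSize = 1000)
{
    for (int i = 0; i < records.Count; i += batchSize)
    {
        //recreate and dispose the context per batch so the change tracker doesn't grow unbounded
        using (var context = new MyDbContext())
        {
            context.Configuration.AutoDetectChangesEnabled = false;
            foreach (var record in records.Skip(i).Take(batchSize))
            {
                context.Records.Add(record);
            }
            context.SaveChanges();
        }
    }
}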

I rejected that solution and used BulkInsert instead, which is substantially faster and much more efficient; it inserted the 100K records in about a minute.

But the main problem is getting the primary keys back for the newly inserted records. For this, I am thinking of writing a stored procedure that operates on a TVP (i.e. an in-memory data table holding all 100K root-level records), and inside it I would use OUTPUT INSERTED.Id to get all the primary keys back in the application.
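Roughly what I have in mind (the type / proc / table / column names below are just placeholders, and I haven't tested this yet):

//T-SQL, run once (placeholder names):
//  CREATE TYPE dbo.RootRecordTableType AS TABLE (Name nvarchar(100), CreatedOn datetime);
//  CREATE PROCEDURE dbo.InsertRootRecords @records dbo.RootRecordTableType READONLY AS
//      INSERT INTO dbo.RootRecords (Name, CreatedOn)
//      OUTPUT INSERTED.Id
//      SELECT Name, CreatedOn FROM @records;

//using System.Collections.Generic;
//using System.Data;
//using System.Data.SqlClient;

public List<long> InsertViaTvp(string connectionString, DataTable records)
{
    var ids = new List<long>();
    using (var con = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.InsertRootRecords", con))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        var p = cmd.Parameters.AddWithValue("@records", records); //the in-memory data table holding the 100K root records
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName = "dbo.RootRecordTableType";

        con.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                ids.Add(reader.GetInt64(0)); //assumes Id is bigint; use GetInt32 for an int key
            }
        }
    }
    return ids;
}

One thing I'm not sure about: as far as I know the OUTPUT rows aren't guaranteed to come back in the same order as the TVP rows, so I'd probably also OUTPUT a natural key column and match on that in the application.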

So, how does this approach (i.e. an SQL insert query inside a stored procedure) compare with the SqlBulkCopy approach?

Is there any way I can get all the primary keys back after a SqlBulkCopy operation? Or anything concrete on whether OUTPUT INSERTED.Id would return all the correct new keys to the application?

PS: I don't want to create any staging table during the process; that's just overhead.


There are 2 answers

JohnLBevan (best answer)

Here's an example based on the discussion in the comments, expanding on the idea mentioned here: Possible to get PrimaryKey IDs back after a SQL BulkCopy?

i.e.

  • Do a bulk upload from C# to a temp table in SQL
  • Use Sql to copy the data from the temp table to the actual table (at which point the IDs are generated), and return the IDs.

I've not had a chance to test this, but hopefully this will help:

//using System;
//using System.Data;
//using System.Data.SqlClient;
//using System.Collections.Generic;

public DataTable CreatePersonDataTable(IEnumerable<PersonDTO> people) 
{
    //define the table
    var table = new DataTable("People");
    table.Columns.Add(new DataColumn("Name", typeof(string)));
    table.Columns.Add(new DataColumn("DOB", typeof(DateTime)));
    //populate it
    foreach (var person in people)
    {
        table.Rows.Add(person.Name, person.DOB);
    }
    return table;
}

readonly string ConnectionString; //set this in the constructor
readonly int BulkUploadPeopleTimeoutSeconds = 600; //default; could override in constructor
public IEnumerable<long> BulkUploadPeople(IEnumerable<PersonDTO> people) //you'd want to break this up a bit; for simplicity I've bunged everything into one big method
{
    var data = CreatePersonDataTable(people);
    List<long> results; //declared outside the using block so it's still in scope for the return
    using(SqlConnection con = new SqlConnection(ConnectionString)) 
    {
        con.Open(); //keep the same connection open throughout, since #People is session scoped
        RunSqlNonQuery(con, "select top 0 Name, DOB into #People from People");
        BulkUpload(con, data, "#People");
        results = TransferFromTempToReal(con, "#People", "People", "Name, DOB", "Id");
        RunSqlNonQuery(con, "drop table #People");  //not strictly required since the temp table is removed when the connection's closed (it's session scoped); but best to keep things clean.
    }
    return results;
}
private void RunSqlNonQuery(SqlConnection con, string sql)
{
    using (SqlCommand command = con.CreateCommand())
    {
        command.CommandText = sql;
        command.ExecuteNonQuery();      
    }
}
private void BulkUpload(SqlConnection con, DataTable data, string targetTable)
{
    using(SqlBulkCopy bulkCopy = new SqlBulkCopy(con))
    {
        bulkCopy.BulkCopyTimeout = BulkUploadPeopleTimeoutSeconds; //use the configurable timeout rather than hard-coding it
        bulkCopy.DestinationTableName = targetTable; 
        bulkCopy.WriteToServer(data);         
    }
}
private List<long> TransferFromTempToReal(SqlConnection con, string tempTable, string realTable, string columnNames, string idColumnName)
{
    //materialise the ids into a list (rather than yield-returning them) so the reader is fully consumed before the temp table is dropped and the connection closed
    var ids = new List<long>();
    using (SqlCommand command = con.CreateCommand())
    {
        //specify the column list on the insert so the identity column doesn't throw the column counts out
        command.CommandText = string.Format("insert into {0} ({2}) output inserted.{1} select {2} from {3}", realTable, idColumnName, columnNames, tempTable);
        using (SqlDataReader reader = command.ExecuteReader()) 
        {
            while (reader.Read()) 
            {
                ids.Add(reader.GetInt64(0)); //assumes the id column is a bigint; use GetInt32 for int
            }
        }
    }
    return ids;
}
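
For completeness, calling it would look something like this (PersonDTO is just the shape assumed from the question, with Name and DOB properties; purely illustrative):

var people = new List<PersonDTO>
{
    new PersonDTO { Name = "Alice", DOB = new DateTime(1990, 1, 1) },
    new PersonDTO { Name = "Bob",   DOB = new DateTime(1985, 6, 15) }
};
IEnumerable<long> newIds = BulkUploadPeople(people); //the generated ids for the inserted rows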

Whilst in your question you've added that you don't want to use a staging table as it's an "overhead"... please try it. You may well find that the small overhead of creating a staging table is outweighed by the performance gained using this method.

Obviously it's not going to be as fast as inserting and ignoring the returned ids; but if that's your requirement, in the absence of other answers, this may be the best alternative.

Jonathan Magnan

Any idea if somehow, I can get all primary keys back after SqlBulkCopy operation

You cannot. There is no way to do it directly with SqlBulkCopy.

PS : I don't want to create any staging table during the process. This is just an overhead.

Unfortunately, if you want to get your primary keys back, you will need to do it, or use another approach (a TVP, like you suggested).

Disclaimer: I'm the owner of Entity Framework Extensions

One alternative solution is to use a library that already supports BulkInsert for Entity Framework. Under the hood, it uses SqlBulkCopy + staging tables.

By default, the BulkInsert method already outputs the primary key values.

The library is not free; however, it gives your company some flexibility and you will not have to code/support anything yourself.

Example:

// Easy to use
context.BulkSaveChanges();

// Easy to customize
context.BulkSaveChanges(bulk => bulk.BatchSize = 100);

// Perform Bulk Operations
context.BulkDelete(customers);
context.BulkInsert(customers);
context.BulkUpdate(customers);

// Customize Primary Key
context.BulkMerge(customers, operation => {
   operation.ColumnPrimaryKeyExpression = 
        customer => customer.Code;
});