BULK generate data SQL Server


I need to insert fake data into a fakeData table, following this pseudocode:

foreach(t1.id in table1)
   foreach(t2.id in table2)
      foreach(t3.id in table3)
        INSERT INTO fakeData (t1.id, t2.id, t3.id, random(30,80))

Where the id is the primary key of that table.

I need to do this as fast as possible, because I plan to insert billions of records. I am not sure whether generating the data in SQL, in C#, or some other way is the best option for getting it into the table.

This question really has two parts: how do I execute the pseudocode in SQL Server, and what is the fastest way to do it? (I currently have no indexes set up.)
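For reference, the set-based shape I imagine (just a sketch — the fakeData column names below are made up) would be a single cross join:

```sql
-- Set-based equivalent of the three nested loops.
-- ABS(CHECKSUM(NEWID()) % 51) + 30 gives a per-row random integer in [30, 80].
INSERT INTO fakeData (t1id, t2id, t3id, randomValue)
SELECT t1.id, t2.id, t3.id,
       ABS(CHECKSUM(NEWID()) % 51) + 30
FROM table1 t1
CROSS JOIN table2 t2
CROSS JOIN table3 t3;
```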

This may seem like a duplicate of all the other "fastest way to bulk insert" questions. I think this question is different because the data I am loading can actually be generated by SQL Server itself, so it is a BULK generate rather than a BULK INSERT.

PS: I'm on SQL Server 2012.

Edit: More data

This is a star schema. fakeData will be the fact table.

table2 is a date dimension covering 20 years, with 7,300 records. table3 is a time dimension with 96 records. table1 is another dimension with 100 million records.
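For scale, the full cross product of these dimensions works out to:

```sql
-- rows(table1) * rows(table2) * rows(table3); CAST to bigint to avoid int overflow
SELECT CAST(100000000 AS bigint) * 7300 * 96 AS TotalRows;
-- 70,080,000,000,000 rows (~70 trillion)
```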


There are 3 answers

LearningJrDev

OK, well... since no one really showed how to generate the random values, I'll contribute my solution so far. This is what I am running right now, with the recovery model set to SIMPLE:

BEGIN TRAN

declare @x int = 1
while @x <= 5000
begin
    INSERT INTO dimSpeed
    -- ABS(CHECKSUM(NEWID()) % 70) + 20 gives a per-row random integer in [20, 89]
    Select T1.id as T1ID, T2.DateValue as T2ID, T3.TIME_ID as T3ID, ABS(Checksum(NewID()) % 70) + 20
    From lines T1, dimDate T2, dimTime T3
    WHERE T1.id = @x AND T2.DateValue > '1/1/2015' AND T2.DateValue < '1/1/2016'

    -- commit every 100 iterations to keep the open transaction (and log growth) small
    if (@x % 100) = 0
    begin
        COMMIT TRAN
        BEGIN TRAN
    end

    set @x = @x + 1
end

COMMIT TRAN

Where 5000 is how many elements of TABLE1 (t1) I am inserting. Doing just 5,000 takes 5 minutes or so; at this rate it will take 70 days to insert all the data I need. A speedier option is definitely needed.
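One variant I may try next (just a sketch — it assumes the lines ids are contiguous integers) is to cover a range of T1 ids per INSERT, so each statement does more set-based work:

```sql
BEGIN TRAN

declare @x int = 1, @step int = 100
while @x <= 5000
begin
    -- one set-based INSERT per range of 100 T1 ids instead of per single id
    INSERT INTO dimSpeed
    Select T1.id, T2.DateValue, T3.TIME_ID, ABS(Checksum(NewID()) % 70) + 20
    From lines T1, dimDate T2, dimTime T3
    WHERE T1.id >= @x AND T1.id < @x + @step
      AND T2.DateValue > '1/1/2015' AND T2.DateValue < '1/1/2016'

    COMMIT TRAN
    BEGIN TRAN
    set @x = @x + @step
end

COMMIT TRAN
```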

Greg

You can insert data really fast by using BCP. If your PK column is an identity column, it will still work; if not, you can skip the BCP part and just insert with while loops.

To create your BCP file:

  1. In a loop, quickly populate the table with a fixed number of rows (say, 1 million).
  2. BCP out the data into a BCP file.
  3. BCP the data back in as many times as you want.

Do this for each table necessary (pseudo code below):

Example:

-- if this is a dev machine, I would change the recovery model to SIMPLE,
-- if it isn't already, to reduce the amount of logging; then change it back
-- to what it was previously

BEGIN TRAN

declare @x int = 1
while @x <= 1000000
begin
    -- use rand() with no seed for both columns; rand(123) would reseed to the
    -- same value on every call, producing identical rows
    insert into t1 (col1, col2) values (rand() * 100, rand() * 100)

    if (@x % 10000) = 0
    begin
        COMMIT TRAN
        BEGIN TRAN
    end

    set @x = @x + 1
end

COMMIT TRAN

Then, bcp out the data (from a command prompt) to create your bcp file. The direction keyword comes after the table name; -n uses native format, -T a trusted connection:

bcp [dbname].dbo.t1 out c:\file1.bcp -n -T

Now, bcp in (from a command prompt) the data as many times as needed:

bcp [dbname].dbo.t1 in c:\file1.bcp -n -T
bcp [dbname].dbo.t1 in c:\file1.bcp -n -T
bcp [dbname].dbo.t1 in c:\file1.bcp -n -T
bcp [dbname].dbo.t1 in c:\file1.bcp -n -T
bcp [dbname].dbo.t1 in c:\file1.bcp -n -T
Randall

The first part is easy enough:

Insert Into FakeData
Select T1.id as T1ID, T2.id as T2ID, T3.id as T3ID,
       ABS(Checksum(NewID()) % 51) + 30 -- random integer in [30, 80]
From Table1 T1, Table2 T2, Table3 T3

That will perform a Cartesian join of the three tables and return all possible combinations. However, if you are talking billions of combinations, batching is probably a good idea. I'll use OFFSET ... FETCH since it's easy to understand, though you could probably find a more performant way. I'm also going to store the results in a temp table, so you only run the join once and the rest are selects and inserts.

Declare @BatchSize int = 100000
-- count_big avoids integer overflow; the product of the three counts easily exceeds int range
Declare @RowsToBeInserted bigint =
    ((select count_big(1) from Table1) *
     (select count_big(1) from Table2) *
     (select count_big(1) from Table3))
Declare @Iterations int = 0
Declare @CompletedRowCount bigint = 0

Select T1.id as T1ID, T2.id as T2ID, T3.id as T3ID
Into #TempTable
From Table1 T1, Table2 T2, Table3 T3

While @CompletedRowCount < @RowsToBeInserted
begin

    Insert into FakeData
    Select T1ID, T2ID, T3ID, ABS(Checksum(NewID()) % 51) + 30 -- random integer in [30, 80]
    From #TempTable
    Order by T1ID, T2ID, T3ID
    Offset @CompletedRowCount Rows
    Fetch Next @BatchSize Rows Only

    set @Iterations = @Iterations + 1
    set @CompletedRowCount = @Iterations * @BatchSize

end

That should allow you to dial in the batch size to whatever works best for your server. Since you have no indexes set up, you don't need to drop and recreate them for performance. Hope that points you in the right direction.
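If OFFSET turns out to be too slow at this scale (each batch has to skip over all previously fetched rows, and #TempTable would have to hold the entire cross product), here is a sketch of a range-based alternative, assuming Table1.id is an integer key:

```sql
Declare @MinId int, @MaxId int, @Step int = 1000
Select @MinId = min(id), @MaxId = max(id) From Table1

While @MinId <= @MaxId
begin
    -- each pass joins only a slice of Table1 against the two small dimensions,
    -- so there is no giant temp table and no OFFSET skip cost
    Insert into FakeData
    Select T1.id, T2.id, T3.id, ABS(Checksum(NewID()) % 51) + 30 -- random in [30, 80]
    From Table1 T1, Table2 T2, Table3 T3
    Where T1.id >= @MinId And T1.id < @MinId + @Step

    set @MinId = @MinId + @Step
end
```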