I noticed that rx* functions (e.g. rxKmeans, rxDataStep) insert data into a SQL Server table in a row-by-row fashion when the outFile parameter is set to a table. This is obviously very slow, and something like a bulk insert would be desirable instead. Is this possible, and if so, how?
Currently I am trying to insert about 14 million rows into a table by invoking the rxKmeans function with the outFile parameter specified, and it takes about 20 minutes.
Example of my code:
clustersLogInitialPD <- rxKmeans(formula = ~LogInitialPD
,data = inDataSource
,algorithm = "Lloyd"
,centers = start_c
,maxIterations = 1
,outFile = sqlLogPDClustersDS
,outColName = "ClusterNo"
,overwrite = TRUE
,writeModelVars = TRUE
,extraVarsToWrite = c("LoadsetId", "ExposureId")
,reportProgress = 0
)
sqlLogPDClustersDS points to a table in my database.
I am working on SQL Server 2016 SP1 with R Services installed and configured (both in-database and standalone). Generally everything works fine except for this terrible performance when writing rows to database tables from an R script.
Any comments will be greatly appreciated.
I brought this up on this Microsoft R MSDN forum thread recently as well.
I ran into this problem and I'm aware of two reasonable solutions:

1. Stream the R output back to the database through sp_execute_external_script (e.g. INSERT ... EXEC into the target table).
2. Use SQL Server bcp.exe, or BULK INSERT (only if running on the SQL box itself), after first writing the data frame to a flat file.

I've written some code that does this, but it's not very polished, and I've had to leave sections with <<<VARIABLE>>> placeholders that assume connection string information (server, database, schema, login, password). If you find this useful or find any bugs, please let me know. I'd also love to see Microsoft incorporate the ability to save data from R back to SQL Server using the BCP APIs. Solution (1) above only works via sp_execute_external_script. Basic testing also leads me to believe that bcp.exe can be roughly twice as fast as option (1) for a million rows. BCP results in a minimally logged SQL operation, so I'd expect it to be faster.