Inserting millions of records into MySQL database using Python


I have a txt file with about 100 million records (numbers). I am reading this file in Python and inserting the rows into a MySQL database with simple INSERT statements. It is taking very long, and it looks like the script will never finish. What would be the optimal way to carry out this process? The script is using less than 1% of memory and 10 to 15% of CPU.

Any suggestions on handling data this large and inserting it into the database efficiently would be greatly appreciated.

Thanks.


There are 3 answers

1
emican (Best answer)

Sticking with Python, you may want to try building lists of tuples from your input and passing them to the executemany() method of the Python MySQL connector's cursor.

If the file is too big to hold in memory, you can use a generator to chunk it into something more digestible.

http://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-executemany.html
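For illustration, here is a minimal sketch of that approach (not part of the original answer): it assumes a single-column table numbers(value INT), the mysql-connector-python package, and placeholder connection details; a generator chunks the file and each batch goes to executemany().

    import mysql.connector

    def chunks(path, size=10000):
        """Yield lists of one-element tuples from the file, `size` rows at a time."""
        batch = []
        with open(path) as f:
            for line in f:
                batch.append((line.strip(),))
                if len(batch) == size:
                    yield batch
                    batch = []
        if batch:
            yield batch

    conn = mysql.connector.connect(host="localhost", user="user",
                                   password="secret", database="mydb")
    cur = conn.cursor()
    for batch in chunks("numbers.txt"):
        # executemany lets the connector rewrite this into a multi-row INSERT
        cur.executemany("INSERT INTO numbers (value) VALUES (%s)", batch)
        conn.commit()  # commit per batch to keep each transaction bounded
    cur.close()
    conn.close()

Committing per batch keeps memory use low and avoids one enormous transaction; the batch size of 10000 is just a tunable placeholder.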

1
spencer7593

The fastest way to insert rows into a table is with the LOAD DATA INFILE statement.

Reference: https://dev.mysql.com/doc/refman/5.6/en/load-data.html

Doing individual INSERT statements to insert one row at a time, RBAR (row by agonizing row), is tediously slow because of all the work the database has to go through to execute each statement: parsing the syntax and semantics, preparing an execution plan, obtaining and releasing locks, writing to the binary log, and so on.

If you have to use INSERT statements, you can make use of MySQL's multi-row insert syntax, which is faster:

  INSERT INTO mytable (fee, fi, fo, fum) VALUES
   (1,2,3,'shoe')
  ,(4,5,6,'sock')
  ,(7,8,9,'boot');

Inserting four rows per statement, for example, cuts the number of statements that need to be executed by 75%.
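For reference, here is a rough sketch of invoking LOAD DATA from Python (my addition, not code from this answer): the table and column names are placeholders, the mysql-connector-python package is assumed, and the LOCAL variant only works if local_infile is enabled on both client and server.

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="user",
                                   password="secret", database="mydb",
                                   allow_local_infile=True)  # needed for LOCAL INFILE
    cur = conn.cursor()
    # Server-side local_infile=1 is also required; otherwise drop LOCAL and
    # place the file somewhere the MySQL server itself can read.
    cur.execute("""
        LOAD DATA LOCAL INFILE 'numbers.txt'
        INTO TABLE numbers
        LINES TERMINATED BY '\\n'
        (value)
    """)
    conn.commit()
    cur.close()
    conn.close()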

0
Coffee and Code

Having tried to do this recently, I found a fast method, though that may be because I was running Python from an AWS Windows server with a fast connection to the database. Also, instead of one file with 1 million rows, I had multiple files that added up to 1 million rows. In any case, it was faster than the other direct DB methods I tested.

With this approach, I read the files sequentially and ran the MySQL LOAD DATA INFILE command for each one, and I used threading on top of that. Timing the process, it took 20 seconds to import 1 million rows into MySQL.

Disclaimer: I'm new to Python, so I was trying to see how far I could push this, but it caused my dev AWS RDS database to become unresponsive (I had to restart it), so an approach that doesn't overwhelm the server is probably best!
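As a sketch of that per-file, threaded approach (my own reconstruction, with placeholder file pattern, table name, and credentials): each worker opens its own connection and loads one file with LOAD DATA LOCAL INFILE, and the small worker count is deliberate so the database isn't overwhelmed.

    import glob
    from concurrent.futures import ThreadPoolExecutor
    import mysql.connector

    def load_one(path):
        # One connection per thread: MySQL connections must not be shared across threads.
        conn = mysql.connector.connect(host="localhost", user="user",
                                       password="secret", database="mydb",
                                       allow_local_infile=True)
        cur = conn.cursor()
        # Paths come from a trusted local glob, so plain string formatting is acceptable here.
        cur.execute(
            "LOAD DATA LOCAL INFILE '{}' INTO TABLE numbers "
            "LINES TERMINATED BY '\\n' (value)".format(path)
        )
        conn.commit()
        cur.close()
        conn.close()
        return path

    # A modest pool size keeps the load on the database manageable.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for done in pool.map(load_one, glob.glob("chunks/*.txt")):
            print("loaded", done)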