PostgreSQL connection closes unexpectedly when doing a large insert


I'm populating a PostgreSQL table with ~11,000,000 rows that were previously selected from another database. I'm using Python and psycopg2. The whole process takes an estimated 1.5 hours to complete, but after ~30 minutes I get a "connection closed unexpectedly" exception. The source code looks like this:

incursor = indb.cursor()
incursor.execute("SELECT ...")
indb.commit() # (1) close transaction
outcursor = outdb.cursor()
rows = 0
for (col1, col2, col3) in incursor: # incursor yields ~11,000,000 rows
    outcursor.execute("INSERT ...", (col1, col2, col3)) # This fails after ~30 minutes
    rows += 1
    if rows % 100 == 0: # (2) Commit every 100 rows
        outcursor.close()
        outdb.commit()
        outcursor = outdb.cursor()
incursor.close()
outcursor.close()
outdb.commit()

I added (1) and (2) after the first attempts failed, assuming that an open transaction has an upper time limit of ~30 minutes or that a cursor has an upper limit of pending inserts. It seems that neither of these assumptions is true and the error lies somewhere else.

Both databases are stored on a VirtualBox machine that I connect to via port forwarding from the host. I run the program on the host machine.

Both databases are just for testing purposes and have no other connections to manage. Maybe I have to rewrite the program to get around this, but I need very time-consuming inserts elsewhere (running for days), so I'm very concerned about hidden time limits in psycopg2 or PostgreSQL.


There are 3 answers

araqnid

I don't know of any such "hidden" timeout in PostgreSQL itself. PostgreSQL does have statement_timeout, but if you hit that you should get an "ERROR: canceling statement due to statement timeout" in the server log (and it will log the cancelled statement too). I can't speak for psycopg2. Definitely check the server log for anything that looks relevant.
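For instance, a quick way to check the session's effective value from psycopg2 (assuming an open connection object named conn):

cur = conn.cursor()
cur.execute("SHOW statement_timeout")  # '0' means no statement timeout is set
print(cur.fetchone()[0])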

Maybe it's a networking issue? A long-running statement will be a TCP connection that stays idle for a long time. Perhaps your port forwarding purges connections that are idle for more than 30 minutes? Maybe your TCP connections aren't using keepalives. PostgreSQL has some settings for tuning TCP keepalives (tcp_keepalives_interval etc.), and you may also need to do some kernel/networking configuration to make sure they are actually enabled.

e.g. I just tried connecting to my own machine here, and tcp_keepalives_idle defaults to 7200, which is 2 hours. If your port forwarding cuts off after 30 minutes, this default won't do. You can override the setting in the client connection string (assuming you can twiddle the conninfo string directly), or set the GUC variable per user/database or in postgresql.conf.
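As a minimal sketch of the client-side route, libpq accepts keepalive parameters in the conninfo string (the host/dbname/user values and the timing numbers below are just placeholders, and your libpq needs to be recent enough to support them):

import psycopg2

# Start probing after 60s of idle, probe every 15s, drop after 5 missed probes
outdb = psycopg2.connect(
    "host=localhost port=5432 dbname=outdb user=me "
    "keepalives=1 keepalives_idle=60 keepalives_interval=15 keepalives_count=5"
)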


dfichter

To insert millions of rows, I'd look through the official guide to populating a database and consider using COPY.
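A rough sketch of what that could look like with psycopg2's copy_from, reusing the question's connections; the target table and column names here are placeholders, and real data may need escaping for tabs, newlines and NULLs:

import io

incursor = indb.cursor()
incursor.execute("SELECT ...")
outcursor = outdb.cursor()

def flush(buf):
    buf.seek(0)
    # copy_from streams tab-separated text into the table via COPY
    outcursor.copy_from(buf, "target_table", columns=("col1", "col2", "col3"))

buf = io.StringIO()
for i, (col1, col2, col3) in enumerate(incursor, 1):
    buf.write("%s\t%s\t%s\n" % (col1, col2, col3))
    if i % 100000 == 0:  # flush in chunks so the buffer stays small
        flush(buf)
        buf = io.StringIO()
flush(buf)
outdb.commit()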

Georgie Porgie

I have a Django admin command that updates many thousands of rows. After some time, I see the same error. I believe the memory usage exceeds the limit. I don't know how to manually control the transaction in management commands, though.
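On the memory point, a hedged aside not taken from the answers above: psycopg2's default (unnamed) cursor fetches the entire result set into client memory, so a named (server-side) cursor that streams rows in batches is one way to keep memory bounded. A sketch using the question's connections:

incursor = indb.cursor(name="big_select")  # named => server-side cursor
incursor.itersize = 2000                   # rows fetched per network round trip
incursor.execute("SELECT ...")
# note: don't commit indb until the loop is done, or the named cursor goes away

outcursor = outdb.cursor()
rows = 0
for (col1, col2, col3) in incursor:
    outcursor.execute("INSERT ...", (col1, col2, col3))
    rows += 1
    if rows % 10000 == 0:
        outdb.commit()
incursor.close()
outdb.commit()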