Efficiently iterating over big tables


Currently I am running PostgreSQL 9.6 (Debian Stretch). My table receives about 24 million entries per day and holds roughly 7 days of data, i.e. around 170 million rows in total. It is only a temporary store, because the data cannot be processed in real time; once the data has been processed it gets deleted and only the results are kept.

My SQL is

// build sql string

m_sql = "select dt, adc0, adc1, adc2, adc3, id from " + to_string( table ) +
        " WHERE date(dt) = '" +
        boost::gregorian::to_iso_extended_string( day ) + "'" +
        " order by dt";

The table looks like this:

                                     Table "public.data"
 Column |             Type            |                       Modifiers
--------+-----------------------------+-------------------------------------------------------
 dt     | timestamp without time zone | not null
 adc0   | smallint                    |
 adc1   | smallint                    |
 adc2   | smallint                    |
 adc3   | smallint                    |
 id     | integer                     | not null default nextval('data_id_seq'::regclass)
Indexes:
    "data_pkey" PRIMARY KEY, btree (id)
    "date_idx" btree (date(dt))
    "dt_idx" btree (dt)
    "id_idx" btree (id)

My current approach is to go over the rows in batches of 1 million per step, using OFFSET/LIMIT and ordering by time. The problem is that every iteration scans and sorts the whole result set again, only to skip the rows up to the current offset. When I fetch all the data at once instead, my process crashes because it gets killed by the OOM killer.

This is inside a transaction.

// get the values
auto add_limit_offset = []( std::string sql,
                            size_t &offset ) -> std::string {
    static const size_t delta = 1000000;
    auto ret = sql + " LIMIT " + std::to_string( delta ) + " OFFSET " +
               std::to_string( offset ) + ";";
    offset += delta;
    return ret;
};

bool run = true;
size_t offset = 0;
int nr_of_sinks = static_cast<int>( m_next_sinks.size() );
while ( run )
{
    std::cout << "adc_source: at " << offset << std::endl;
    auto sql = add_limit_offset( m_sql, offset );
    auto dbvalues = T.exec( sql.c_str() );
    run = dbvalues.size() > 0;

    // push each row of the result into the sinks
    for ( pqxx::result::const_iterator d = dbvalues.begin();
          d != dbvalues.end(); ++d )
    {
        // timepoint: keep only "YYYY-MM-DD HH:MM:SS" and drop any fractional seconds
        std::string s_dt = ( *d )[0].c_str();
        s_dt.erase( s_dt.begin() + 19, s_dt.end() );
        auto dt = boost::posix_time::time_from_string( s_dt );

        for ( int adc = 0; adc < nr_of_sinks; ++adc )
        {
            std::string s_adc = ( *d )[1 + adc].c_str();
            double v = std::atoi( s_adc.c_str() );
            m_next_sinks[adc]->push( dt, v );
        }
    }
}
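
Since all of this already runs inside a single transaction, a possible alternative would be to let the database do the batching with a server-side cursor, so that the "order by dt" sort happens only once when the cursor is opened and each fetch just returns the next slice. Below is a minimal, untested sketch of that idea (it reuses the m_sql from above without LIMIT/OFFSET; the cursor name batch_cur and the free function are just for illustration):

#include <pqxx/pqxx>
#include <iostream>
#include <string>

// Sketch: iterate over one day's rows through a server-side cursor.
// PostgreSQL sorts the result once when the cursor is declared; every
// FETCH afterwards only returns the next batch of the ordered rows.
void iterate_with_cursor( pqxx::transaction_base &T, const std::string &m_sql )
{
    static const size_t batch = 1000000;

    // declare the cursor inside the surrounding transaction
    T.exec( "DECLARE batch_cur NO SCROLL CURSOR FOR " + m_sql );

    while ( true )
    {
        auto rows = T.exec( "FETCH " + std::to_string( batch ) +
                            " FROM batch_cur" );
        if ( rows.empty() )
            break;

        for ( const auto &row : rows )
        {
            // same per-row handling as in the OFFSET/LIMIT loop above:
            // parse row[0] as the timestamp, row[1..4] as the adc values
            (void)row;
        }
        std::cout << "fetched " << rows.size() << " rows" << std::endl;
    }

    T.exec( "CLOSE batch_cur" );
}

With this the client-side memory use stays bounded by the batch size, and no rows are scanned twice just to reach an offset.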
