Update on MySQL big table hangs for too long


Performance problem updating a big MySQL MyISAM table to fill a column with an ascending order based on an index of the same table.

My problem is that the server has only 4 GB of memory.
I have to do an update query like the one in this previously asked question.
Mine is this:

set @orderid = 0;  
update images im
    set im.orderid = (select @orderid := @orderid + 1) 
    ORDER BY im.hotel_id, im.idImageType;

On (im.hotel_id, im.idImageType) I have an ascending index.
On im.orderid I have an ascending index too.

The table has 21 million records and is a MyISAM table.

The table is this:

CREATE TABLE `images` (
`photo_id` int(11) NOT NULL,
`idImageType` int(11) NOT NULL,
`hotel_id` int(11) NOT NULL,
`room_id` int(11) DEFAULT NULL,
`url_original` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`url_max300` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`url_square60` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`archive` int(11) NOT NULL DEFAULT '0',
`orderid` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`photo_id`),
KEY `idImageType` (`idImageType`),
KEY `hotel_id` (`hotel_id`),
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`),
KEY `archive` (`archive`),
KEY `room_id` (`room_id`),
KEY `orderid` (`orderid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

The problem is the performance: the query hangs for several minutes!
The server disk gets very busy too.

My question is: is there a better way to achieve the same result?
Do I have to partition the table, or do something else, to improve the performance?
I cannot modify the server hardware, but I can tune the MySQL server settings.

best regards


There are 3 answers

AudioBubble (Best Answer)

Thanks to everybody, your answers helped me a lot. I think I have now found a better solution.

This problem involves two critical issues:

  • efficient pagination on a large table
  • updating a large table.

For efficient pagination on a large table I had found a solution based on a preliminary update of the table, but doing so I ran into the 51 minutes needed for that update, and consequently my Java infrastructure timed out (a spring-batch step).

Now, with your help, I have found two solutions to paginate the large table and one solution to update it.
To reach this performance the server needs memory. I tried this solution on a development server with 32 GB of memory.

Common solution step

To paginate following the tuple of fields I needed, I had created this index:

KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`) 

To achieve the new solution we have to change this index by appending the primary key to its tail, i.e. KEY hotel_id_idImageType (hotel_id, idImageType, primary key fields):

drop index hotel_id_idImageType on images;
create index hotelTypePhoto on images (hotel_id, idImageType, photo_id);  

This is needed to avoid touching the table data file and to work on the index file only (a covering index).

Suppose we want the 10 records after the 19000000th record.

Note: in the timings below the comma is used as the decimal separator (e.g. 1,5 sec means 1.5 seconds).

Solution 1

This solution is very practical: it does not need the extra orderid field and you do not have to run any update before paginating:

select * from images im inner join 
  (select photo_id from images 
  order by hotel_id, idImageType, photo_id 
  limit 19000000,10) k 
on im.photo_id = k.photo_id;  

Building the derived table k on my 21-million-record table takes only 1,5 sec, because it uses only the three fields in the hotelTypePhoto index, so it does not have to access the table data file and works on the index file only.
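A quick way to verify this (a sketch, assuming the hotelTypePhoto index above is in place) is EXPLAIN: the row for the derived table should report key hotelTypePhoto and "Using index" in the Extra column, which means only the index file is read:

explain
select im.* from images im inner join 
  (select photo_id from images 
  order by hotel_id, idImageType, photo_id 
  limit 19000000,10) k 
on im.photo_id = k.photo_id;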

The order is the same as originally required, (hotel_id, idImageType), because it is a prefix of (hotel_id, idImageType, photo_id): same subset.

The join takes no time, so even the first time a given page is requested it needs only 1,5 sec, and that is a good time if you only have to execute it in a batch once every 3 months.

On the production server, with 4 GB of memory, the same query takes 3,5 sec.

Partitioning the table does not help to improve performance.

If the server has it in its cache the time goes down, and I suppose the time goes down too if you use a parameterized (JDBC prepared) statement.
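For example, a parameterized version could use MySQL server-side prepared statements (a sketch; the statement name and variables are only illustrative, and a JDBC PreparedStatement would use the same ? placeholders):

prepare page_stmt from
  'select im.* from images im inner join
     (select photo_id from images
      order by hotel_id, idImageType, photo_id
      limit ?, ?) k
   on im.photo_id = k.photo_id';
set @offset = 19000000, @page_size = 10;
execute page_stmt using @offset, @page_size;
deallocate prepare page_stmt;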

If you have to use it often, it has the advantage that it does not care whether the data has changed.

Solution 2

This solution needs the extra orderid field and requires the orderid update to be run once per batch import; the data must not change until the next batch import.

Then you can paginate the table in 0,000 sec.

set @orderid = 0;  
update images im inner join (
  select photo_id, (@orderid := @orderid + 1) as newOrder 
  from images order by hotel_id, idImageType, photo_id
) k
on im.photo_id = k.photo_id
set im.orderid = k.newOrder;  

Building the derived table k is almost as fast as in the first solution.

The whole update takes only 150,551 sec, much better than 51 minutes! (150 s vs 3060 s)

After this update in the batch, you can paginate with:

 select * from images im where orderid between 19000000 and 19000010;

or better

 select * from images im where orderid >= 19000000 and orderid < 19000010;  

This takes 0,000 sec to execute, the first time and every other time.

Edit after Rick's comment

Solution 3

This solution avoids both the extra field and the use of an offset. But it needs to keep track of the last page read, as in this solution.

This is a fast solution and can work on the production server online, using only 4 GB of memory.

Suppose you need to read the ten records after record 20000000.
There are two scenarios to take care of:

  • you read everything from the first record up to record 20000000 (if, like me, you need all of it) and update some variables to keep track of the last page read;
  • you have to read only the 10 records after record 20000000.

In the second scenario you have to run a preliminary query to find the starting point:

select hotel_id, idImageType, photo_id 
  from images im 
  order by hotel_id, idImageType, photo_id limit 20000000,1

It gives me:

+----------+-------------+----------+
| hotel_id | idImageType | photo_id |
+----------+-------------+----------+
|  1309878 |           4 | 43259857 |
+----------+-------------+----------+

This takes 6,73 sec.
So you can store these values in variables for later use.
Suppose we set @hot=1309878, @type=4, @photo=43259857. Then you can use them in a second query like this:

select * from images im  
  where  
  hotel_id>@hot OR (
    hotel_id=@hot and (
      idImageType>@type OR (
        idImageType=@type and photo_id>@photo
      )
    )
  )  
  order by hotel_id, idImageType, photo_id limit 10;  

The first clause, hotel_id>@hot, takes all records after the current value of the first field of the scrolling index, but it misses some records. To pick them up we need the OR branch, which, for the current value of the first index field, takes all the remaining unread records.

This now takes only 0,10 sec.
But this query can be simplified (using the boolean distributive law):

select * from images im  
  where  
  hotel_id>@hot OR (
    hotel_id=@hot 
    and (idImageType>@type or idImageType=@type) 
    and (idImageType>@type or photo_id>@photo)
  )  
  order by hotel_id, idImageType, photo_id limit 10;  

which becomes:

select * from images im  
  where  
  hotel_id>@hot OR (
    hotel_id=@hot 
    and idImageType>=@type
    and (idImageType>@type or photo_id>@photo)
  )  
  order by hotel_id, idImageType, photo_id limit 10;  

which becomes:

select * from images im  
  where  
  (hotel_id>@hot OR hotel_id=@hot) and 
  (hotel_id>@hot OR
     (idImageType>=@type and (idImageType>@type or photo_id>@photo))
  )
  order by hotel_id, idImageType, photo_id limit 10;  

which becomes:

select * from images im  
  where  
  hotel_id>=@hot and 
  (hotel_id>@hot OR
     (idImageType>=@type and (idImageType>@type or photo_id>@photo))
  )
  order by hotel_id, idImageType, photo_id limit 10;  

Do we get the same data as with the LIMIT approach?

For a quick, non-exhaustive test, run:

select im.* from images im inner join (
  select photo_id from images order by hotel_id, idImageType, photo_id limit 20000000,10
) k 
on im.photo_id=k.photo_id 
order by im.hotel_id, im.idImageType, im.photo_id;

This takes 6,56 sec and the data is the same as that of the query above.
So the test is positive.
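Putting solution 3 together, this is a sketch of the whole flow (the select ... into form used to capture the pre-query result into the user variables is an assumption; the queries themselves are the ones above):

-- step 1 (only if you do not start from the first page): seek once and
-- remember the last read key in user variables
select hotel_id, idImageType, photo_id
  into @hot, @type, @photo
  from images
  order by hotel_id, idImageType, photo_id
  limit 20000000,1;

-- step 2: read the next page; after processing it, copy the last row's
-- values back into @hot, @type, @photo and run the same query again
select * from images im
  where
  hotel_id>=@hot and
  (hotel_id>@hot OR
     (idImageType>=@type and (idImageType>@type or photo_id>@photo))
  )
  order by hotel_id, idImageType, photo_id limit 10;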

With this solution you spend the 6,73 sec only the first time, to seek the first page to read (and not even that, if you read everything from the start like me).

Reading every other page takes only 0,10 sec: a very good result.

Thanks to Rick for his hint about a solution based on storing the last page read.

Conclusion

With solution 1 you have no extra field and every page takes 3,5 sec.
With solution 2 you have the extra field and need a big-memory server (32 GB tested) for the 150 sec update, but then you read a page in 0,000 sec.
With solution 3 you have no extra field, but you have to store a pointer to the last page read, and if you do not start reading from the first page you have to spend 6,73 sec for the first page. Then you spend only 0,10 sec on every other page.

Best regards

Edit 3

Solution 3 is exactly the one suggested by Rick. I'm sorry: in my previous solution 3 I had made a mistake, and when I coded the right solution and then applied some boolean rules like the distributive property, in the end I got the same solution as Rick! Regards

BaBL86

You can use some of these:

  1. Convert the engine to InnoDB; it locks only the rows being updated, not the whole table.

  2. Create a temp table with photo_id and the correct orderid, and then update your table from that temp table (a sketch of building the temp table follows this list):

    update images im, temp tp
    set im.orderid = tp.orderid
    where im.photo_id = tp.photo_id

It will be the fastest way, and while you fill your temp table there are no locks on the primary table.

  3. You can drop the indexes before the mass update; maintaining them for every single updated row takes a long time. Recreate them after the update.
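A sketch of point 2 (the temporary-table definition is one possible form; the ordering is taken from the question, and the engine conversion from point 1 is shown commented out):

-- point 1 (optional): with InnoDB the update locks rows, not the whole table
-- alter table images engine=InnoDB;

-- point 2: compute the new ordering outside the big table, then apply it
set @orderid = 0;
create temporary table temp (primary key (photo_id))
  select photo_id, (@orderid := @orderid + 1) as orderid
  from images
  order by hotel_id, idImageType;

update images im, temp tp
set im.orderid = tp.orderid
where im.photo_id = tp.photo_id;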
Rick James
KEY `hotel_id`             (`hotel_id`),
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`),

DROP the former; the latter takes care of any need for it. (This won't speed up the original query.)
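As a sketch, that is a single statement:

ALTER TABLE images DROP KEY hotel_id;   -- hotel_id_idImageType covers the same lookups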

"The problem is the performance: hang for several minutes!" What is the problem?

  • Other queries are blocked for several minutes? (InnoDB should help.)
  • You run this update often and it is annoying? (Why in the world??)
  • Something else?

This one index is costly while doing the Update:

KEY `orderid` (`orderid`)

DROP it and re-create it. (Don't bother dropping the rest.) Another reason for going with InnoDB is that these operations can be done (in 5.6) without copying the table over. (21M rows == long time if it has to copy the table!)
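As a sketch, that would look like:

ALTER TABLE images DROP KEY orderid;
-- ... run the big orderid UPDATE here ...
ALTER TABLE images ADD KEY orderid (orderid);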

Why are you building a second Unique index (orderid) in addition to photo_id, which is already Unique? I ask this because there may be another way to solve the real problem that does not involve this time-consuming Update.

I have two more concrete suggestions, but I want to hear your answers first.

Edit: Pagination, ordered by hotel_id, idImageType, photo_id:

It is possible to read the records in order by that triple. And even to "paginate" through them.

If you "left off" after ($hid, $type, $pid), here would be the 'next' 20 records:

WHERE   hotel_id >= $hid
  AND ( hotel_id >  $hid
     OR       idImageType >= $type
        AND ( idImageType >  $type
           OR      photo_id > $pid
            )
      )
ORDER BY hotel_id, idImageType, photo_id
LIMIT 20

and have

INDEX(hotel_id, idImageType, photo_id)

This avoids the need for orderid and its time consuming Update.

It would be simpler to paginate one hotel_id at a time. Would that work?
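A sketch of what paginating a single hotel_id could look like (the literal values are just the sample key from the accepted answer above):

SELECT *
FROM images
WHERE hotel_id = 1309878            -- the one hotel being paginated
  AND ( idImageType > 4             -- ($type, $pid) from the last row read
     OR (idImageType = 4 AND photo_id > 43259857) )
ORDER BY idImageType, photo_id
LIMIT 20;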

Edit 2 -- eliminate downtime

Since you are reloading the entire table periodically, do this when you reload:

  1. CREATE TABLE New with the recommended index changes.
  2. Load the data into New. (Be sure to avoid your 51-minute timeout; I don't know what is causing that.)
  3. RENAME TABLE images TO old, New TO images;
  4. DROP TABLE old;

That will avoid blocking the table for the load and for the schema changes. There will be a very short block for the atomic Step #3.

Plan on doing this procedure each time you reload your data.

Another benefit -- After step #2, you can test the New data to see if it looks OK.
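A sketch of steps 1-4 in SQL (the New table name comes from the steps above; the load source path is a placeholder, and the index changes simply combine the earlier suggestions):

-- Step 1: new table with the recommended index changes
CREATE TABLE New LIKE images;
ALTER TABLE New
  ENGINE=InnoDB,                    -- per the earlier recommendation
  DROP KEY hotel_id,                -- redundant with the composite key
  DROP KEY hotel_id_idImageType,
  ADD KEY hotelTypePhoto (hotel_id, idImageType, photo_id);

-- Step 2: load the data (the source is a placeholder)
LOAD DATA INFILE '/path/to/images.csv' INTO TABLE New;

-- Steps 3 and 4: atomic swap, then discard the old copy
RENAME TABLE images TO old, New TO images;
DROP TABLE old;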