Optimizing MySQL Query Performance for 'ARCHIVED' Status Selection

112 views Asked by At

I have a table named "article" containing 1,145,141 records with various fields including

"id," "uiqPID," "tenant," "status," "title," "body," "user_id," "category_id," "created_at," and "updated_at." 

The "status" column can have one of three values: 'PUBLISHED,' 'DRAFT,' or 'ARCHIVED,' with the following counts:

  • PUBLISHED: 2
  • DRAFT: 26,145
  • ARCHIVED: 1,118,993

I have set up the following indexes:

  1. PRIMARY index on 'id'
  2. article_abstract_unq_id index on 'uiqPID'
  3. article_abstract_unq_id index on 'tenant'
  4. article_status_idx index on 'status'
  5. idx_composite_search index on 'id'
  6. idx_composite_search index on 'uiqPID'
  7. idx_composite_search index on 'created_at'

My issue is with the performance of the following query, which took 5.7 seconds to execute:

SELECT 
    a.id AS id,
    a.created_at AS created_at
FROM 
    article a
WHERE 
    a.status = 'ARCHIVED'
ORDER BY a.created_at DESC
LIMIT 50;

However, if I remove the WHERE condition or change it to a.status = 'DRAFT', the query completes within 1 second.

Upon inspecting the query plan, I noticed the difference in execution strategies. With the 'ARCHIVED' or 'DRAFT' status filter, the plan shows:

key: article_status_idx
Extra: Using index condition; Using filesort

But without the 'ARCHIVED' filter, the plan simply states:

key:
Extra: Using filesort

My question is: How can I optimize the query performance for filtering on 'ARCHIVED' status, ensuring it executes faster than the current 5.7 seconds, similar to the queries without this condition or with 'DRAFT' status?

3

There are 3 answers

11
Lajos Arpad On

The behavior you experienced is actually what I would expect. writing 1,118,993 number-timestamp tuples into memory takes time, especially if you order the results by created_at. You need to test the following:

SELECT 
    a.id AS id,
    a.created_at AS created_at
FROM 
    article a
ORDER BY a.created_at DESC
LIMIT 0, 1118993;

and if the time to execute this query is roughly the same as executing the query with the where clause, then it's not the where clause that causes the performance drop, but ordering the query + loading the results into memory. Run the following as well:

SELECT 
    a.id AS id,
    a.created_at AS created_at
FROM 
    article a
LIMIT 0, 1118993;

Here we do not even have an order by. If this is similarly slow, then the thing you mainly wait for is to write this stuff into memory.

Anyway, do these measurements and the result will either confirm that the slowness is caused by the where or order by clause, or not. If these clauses cause the slowness, then let me know in the comment section and I will provide ways to optimize that issue. But it's highly probable that the real problem is that you are waiting for the full query to be executed.

Maybe it's better to break the query into partitions and run queries with limit 0, 10000 and then limit 10000, 10000 and so on, so you can work with the first results already while waiting for the others, maybe reducing the frustration of impatient users. But that does not change the fact that loading all this stuff takes time, even though you can make this waiting user-friendlier if the showing of partial results makes some useful sense.

EDIT

You can try creating a (status, created_at) key before you create a copy table, maybe such indexing will be useful for your performance, the idea being to have an indexed order for created_at for each status type.

If all else fails, you could create a table like this:

create table article_archive(
    id int primary key,
    created_at timestamp
);

insert into article_archive(id, created_at)
select id, created_at
from article
where `status` = 'ARCHIVE';

ALTER TABLE article_archive ADD INDEX (created_at DESC);

And then you can select from article_archive, like:

select id, created_at
from article_archive
order by created_at desc
limit 50;

You can even join that with article by id.

3
Amin Zayeromali On

Here's an alternative approach that may improve speed, try it:

SELECT 
    a.id AS id,
    a.created_at AS created_at
FROM (
    SELECT id
    FROM article
    WHERE status = 'ARCHIVED'
    ORDER BY created_at DESC
    LIMIT 50
) AS subquery
JOIN article a ON subquery.id = a.id;

Explanation: This query utilizes a subquery to first identify the id values of the 50 most recent 'ARCHIVED' records based on the created_at column. It then performs a join to retrieve the additional columns (id and created_at) from the main table article. This approach can be beneficial when you want to optimize queries with complex filtering and sorting.

Please ensure that the article_status_idx index on the status column and the idx_composite_search index on the created_at and id columns are well-maintained for efficient execution.

1
Thorsten Kettner On

You can create a composite index matching the query exactly:

create index idx on article (status, created_at desc);

Thus the DBMS can go to status = 'ARCHIVED' in the index, read the first 50 entries and be done with the task.

https://dev.mysql.com/doc/refman/8.0/en/descending-indexes.html