How to speed up the query in PostgreSQL

9.1k views Asked by At

I have DB in PostgreSQL with a big data (now it is somewhere around 46 GB and the db will keep growing). I created indexes on often used columns and adapted the config file:

shared_buffers = 1GB
temp_buffers = 256MB
work_mem = 512MB

But this query is still slow:

select distinct us_category_id as cat, count(h_user_id) as res from web_hits 
inner join users on h_user_id = us_id 
where (h_datetime)::date = ('2015-06-26')::date and us_category_id != ''
group by us_category_id

Explain Analyze:

HashAggregate (cost=2870958.72..2870958.93 rows=21 width=9) (actual time=899141.683..899141.683 rows=0 loops=1)

Group Key: users.us_category_id, count(web_hits.h_user_id)
-> HashAggregate (cost=2870958.41..2870958.62 rows=21 width=9) (actual time=899141.681..899141.681 rows=0 loops=1)

Group Key: users.us_category_id
 -> Hash Join (cost=5974.98..2869632.11 rows=265259 width=9) (actual time=899141.679..899141.679 rows=0 loops=1)

Hash Cond: ((web_hits.h_user_id)::text = (users.us_id)::text)
-> Seq Scan on web_hits (cost=0.00..2857563.80 rows=275260 width=7) (actual time=899141.676..899141.676 rows=0 loops=1)
-> Seq Scan on web_hits (cost=0.00..2857563.80 rows=275260 width=7) (actual time=899141.676..899141.676 rows=0 loops=1)
Filter: ((h_datetime)::date = '2015-06-26'::date)

Rows Removed by Filter: 55051918
-> Hash (cost=4292.99..4292.99 rows=134559 width=10) (never executed)
-> Seq Scan on users (cost=0.00..4292.99 rows=134559 width=10) (never executed)
Filter: ((us_category_id)::text <> ''::text)

"Planning time: 1.309 ms"
"Execution time: 899141.789 ms"

Date is changed. How can I speed up the query?

Table and index creating

CREATE TABLE web_hits (
  h_id integer NOT NULL DEFAULT nextval('w_h_seq'::regclass),
  h_user_id character varying,
  h_datetime timestamp without time zone,
  h_db_id character varying,
  h_voc_prefix character varying,
  ...
  h_bot_chek integer, -- 1-бот...
  CONSTRAINT w_h_pk PRIMARY KEY (h_id)
);
ALTER TABLE web_hits OWNER TO postgres;
COMMENT ON COLUMN web_hits.h_bot_chek IS '1-бот, 0-не бот';

CREATE INDEX h_datetime ON web_hits (h_datetime);
CREATE INDEX h_db_index ON web_hits (h_db_id COLLATE pg_catalog."default");
CREATE INDEX h_pref_index ON web_hits (h_voc_prefix COLLATE pg_catalog."default" text_pattern_ops);
CREATE INDEX h_user_index ON web_hits (h_user_id text_pattern_ops);

 CREATE TABLE users (
  us_id character varying NOT NULL,
  us_category_id character varying,
  ...
  CONSTRAINT user_pk PRIMARY KEY (us_id),
  CONSTRAINT cities_users_fk FOREIGN KEY (us_city_home)
      REFERENCES cities (city_id),
  CONSTRAINT countries_users_fk FOREIGN KEY (us_country_home)
      REFERENCES countries (country_id),
  CONSTRAINT organizations_users_fk FOREIGN KEY (us_institution_id)
      REFERENCES organizations (org_id),
  CONSTRAINT specialities_users_fk FOREIGN KEY (us_speciality_id)
      REFERENCES specialities (speciality_id),
  CONSTRAINT us_affiliation FOREIGN KEY (us_org_id)
      REFERENCES organizations (org_id),
  CONSTRAINT us_category FOREIGN KEY (us_category_id)
      REFERENCES categories (cat_id),
  CONSTRAINT us_reading_room FOREIGN KEY (us_reading_room_id)
      REFERENCES reading_rooms (rr_id)
);
ALTER TABLE users OWNER TO sveta;
COMMENT ON COLUMN users.us_type IS '0-аноним, 1-читатель, 2-удаленный';

CREATE INDEX us_cat_index ON users (us_category_id);
CREATE INDEX us_user_index ON users (us_id text_pattern_ops);
2

There are 2 answers

2
Erwin Brandstetter On BEST ANSWER

Essential information is missing in the question. I am going to base parts of my answer on educated guesses. web_hits.h_user_id is sometimes NULL, like you added in the comment.

Query

Basically, the query can be simplified / improved to this in any case:

SELECT u.us_category_id AS cat, count(*) AS res
FROM   users    u
JOIN   web_hits w ON w.h_user_id = u.us_id
WHERE  w.h_datetime >= '2015-06-26 0:0'::timestamp
AND    w.h_datetime <  '2015-06-27 0:0'::timestamp
AND    w.h_user_id IS NOT NULL  -- remove irrelevant rows, match index
AND    u.us_category_id <> ''
GROUP  BY 1;
  • DISTINCT is obviously unnecessary in an expensive way, since you already group by us_category_id (like @Gordon already mentioned).

  • Make the conditions sargable so that an indexes can be used:

  • Since you have joined on the column w.h_user_id, it follows logically that resulting rows are NOT NULL in this column. count(*) is equivalent in this case and a bit faster.

  • The condition h_user_id IS NOT NULL seems redundant since NULL is eliminated in the JOIN anyway, but it allows to use the partial index with matching condition (see below).

  • users.us_id (and consequently web_hits.h_user_id) should probably not have the data type varchar (character varying). That's an inefficient data type for a PK / FK column in a huge table. Use a numeric data type like int or bigint (or uuid if you must). Similar considerations for us_category_id: should be integer or related.

  • The standard SQL inequality operator is <>. Use that instead of the also supported !=.

  • Use table qualification to avoid ambiguities - and in any case to make your query clear to the readers in a public forum.

Optimize

Further assuming that:

  • users.us_category_id <> '' is true for most rows.
  • Most or all rows with web_hits.h_user_id IS NOT NULL are counted.

Then this will be faster, yet:

SELECT u.us_category_id AS cat, sum(ct) AS res
FROM   users u
JOIN  (
   SELECT h_user_id, count(*) AS ct
   FROM   web_hits
   WHERE  h_datetime >= '2015-06-26 0:0'::timestamp
   AND    h_datetime <  '2015-06-27 0:0'::timestamp
   AND    h_user_id IS NOT NULL  -- remove irrelevant rows, match index
   GROUP  BY 1
   ) w ON w.h_user_id = u.us_id
AND    u.us_category_id <> ''
GROUP  BY 1;

Indexes

Either way, partial indexes are best for your case:

1.

CREATE INDEX wh_usid_datetime_idx ON web_hits(h_user_id, h_datetime)
WHERE  h_user_id IS NOT NULL;

Eliminate rows where web_hits.h_user_id IS NULL from the index.

Columns in that order, not the other way round like has been suggested. Detailed explanation:

2.

CREATE INDEX us_usid_cat_not_empty_idx ON users(us_id)
WHERE  us_category_id <> '';

This will be considerably smaller, because we don't store the potentially lengthy varchar column us_category_id in the index - which we don't need for the case anyway. We just need to know it's <> ''. If you had integer columns this consideration would not apply.

And we also exclude rows with '' or NULL in us_category_id, making the index smaller, yet.

You have to weigh maintenance cost for special indexes against their benefits. If you run queries with a matching condition a lot, it will pay, else, it might not, and more general indexes might be better overall.


Of course, all the usual advice on performance optimization applies, too.

Frankly, there is not very much right about your query and many items are suspicious in your setup. Dealing with huge tables like you obviously are, you might consider professional help.

5
Gordon Linoff On

First, the distinct is not necessary:

select u.us_category_id as cat, count(h_user_id) as res
from web_hits h inner join
     users u
     on h.h_user_id = u.us_id 
where (h.h_datetime)::date = '2015-06-26'::date and
      u.us_category_id <> ''
group by u.us_category_id

Second, you want to remove the conversion on the column. So:

select u.us_category_id as cat, count(h_user_id) as res
from web_hits h inner join
     users u
     on h.h_user_id = u.us_id 
where (h.h_datetime >= '2015-06-26' and h.h_datetime < '2015-06-27) and
      u.us_category_id <> ''
group by u.us_category_id;

Then, the following index should help the query: web_hits(h_datetime, h_user_id). It might also be beneficial to have an index on users(us_id, us_category_id).