I have a PostgreSQL database with a large amount of data (currently around 46 GB, and it will keep growing). I created indexes on frequently used columns and adapted the config file:
shared_buffers = 1GB
temp_buffers = 256MB
work_mem = 512MB
But this query is still slow:
select distinct us_category_id as cat, count(h_user_id) as res from web_hits
inner join users on h_user_id = us_id
where (h_datetime)::date = ('2015-06-26')::date and us_category_id != ''
group by us_category_id
Explain Analyze:
HashAggregate  (cost=2870958.72..2870958.93 rows=21 width=9) (actual time=899141.683..899141.683 rows=0 loops=1)
  Group Key: users.us_category_id, count(web_hits.h_user_id)
  ->  HashAggregate  (cost=2870958.41..2870958.62 rows=21 width=9) (actual time=899141.681..899141.681 rows=0 loops=1)
        Group Key: users.us_category_id
        ->  Hash Join  (cost=5974.98..2869632.11 rows=265259 width=9) (actual time=899141.679..899141.679 rows=0 loops=1)
              Hash Cond: ((web_hits.h_user_id)::text = (users.us_id)::text)
              ->  Seq Scan on web_hits  (cost=0.00..2857563.80 rows=275260 width=7) (actual time=899141.676..899141.676 rows=0 loops=1)
                    Filter: ((h_datetime)::date = '2015-06-26'::date)
                    Rows Removed by Filter: 55051918
              ->  Hash  (cost=4292.99..4292.99 rows=134559 width=10) (never executed)
                    ->  Seq Scan on users  (cost=0.00..4292.99 rows=134559 width=10) (never executed)
                          Filter: ((us_category_id)::text <> ''::text)
Planning time: 1.309 ms
Execution time: 899141.789 ms
The date changes from query to query. How can I speed up the query?
Table and index creation:
CREATE TABLE web_hits (
h_id integer NOT NULL DEFAULT nextval('w_h_seq'::regclass),
h_user_id character varying,
h_datetime timestamp without time zone,
h_db_id character varying,
h_voc_prefix character varying,
...
h_bot_chek integer, -- 1 = bot...
CONSTRAINT w_h_pk PRIMARY KEY (h_id)
);
ALTER TABLE web_hits OWNER TO postgres;
COMMENT ON COLUMN web_hits.h_bot_chek IS '1 = bot, 0 = not a bot';
CREATE INDEX h_datetime ON web_hits (h_datetime);
CREATE INDEX h_db_index ON web_hits (h_db_id COLLATE pg_catalog."default");
CREATE INDEX h_pref_index ON web_hits (h_voc_prefix COLLATE pg_catalog."default" text_pattern_ops);
CREATE INDEX h_user_index ON web_hits (h_user_id text_pattern_ops);
CREATE TABLE users (
us_id character varying NOT NULL,
us_category_id character varying,
...
CONSTRAINT user_pk PRIMARY KEY (us_id),
CONSTRAINT cities_users_fk FOREIGN KEY (us_city_home)
REFERENCES cities (city_id),
CONSTRAINT countries_users_fk FOREIGN KEY (us_country_home)
REFERENCES countries (country_id),
CONSTRAINT organizations_users_fk FOREIGN KEY (us_institution_id)
REFERENCES organizations (org_id),
CONSTRAINT specialities_users_fk FOREIGN KEY (us_speciality_id)
REFERENCES specialities (speciality_id),
CONSTRAINT us_affiliation FOREIGN KEY (us_org_id)
REFERENCES organizations (org_id),
CONSTRAINT us_category FOREIGN KEY (us_category_id)
REFERENCES categories (cat_id),
CONSTRAINT us_reading_room FOREIGN KEY (us_reading_room_id)
REFERENCES reading_rooms (rr_id)
);
ALTER TABLE users OWNER TO sveta;
COMMENT ON COLUMN users.us_type IS '0 = anonymous, 1 = reader, 2 = deleted';
CREATE INDEX us_cat_index ON users (us_category_id);
CREATE INDEX us_user_index ON users (us_id text_pattern_ops);
Essential information is missing in the question. I am going to base parts of my answer on educated guesses.
web_hits.h_user_id is sometimes NULL, as you added in a comment.
Query
Basically, the query can be simplified / improved to this in any case:
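A sketch of such a rewrite (the table aliases w and u are mine; the date range covers the same day as the original filter):

SELECT u.us_category_id AS cat, count(*) AS res
FROM   web_hits w
JOIN   users    u ON u.us_id = w.h_user_id
WHERE  w.h_datetime >= '2015-06-26'::timestamp  -- sargable range instead of (h_datetime)::date = ...
AND    w.h_datetime <  '2015-06-27'::timestamp
AND    w.h_user_id IS NOT NULL                  -- redundant for the JOIN, but matches the partial index below
AND    u.us_category_id <> ''
GROUP  BY u.us_category_id;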
- DISTINCT is obviously unnecessary, and expensive, since you already group by us_category_id (as @Gordon already mentioned).
- Make the conditions sargable so that indexes can be used (the sketch above replaces the (h_datetime)::date cast with a range condition on h_datetime).
- Since you join on the column w.h_user_id, it follows logically that the resulting rows are NOT NULL in this column. count(*) is equivalent in this case and a bit faster.
- The condition h_user_id IS NOT NULL seems redundant, since NULL is eliminated by the JOIN anyway, but it allows the partial index with the matching condition to be used (see below).
- users.us_id (and consequently web_hits.h_user_id) should probably not have the data type varchar (character varying). That's an inefficient data type for a PK / FK column in a huge table. Use a numeric data type like int or bigint (or uuid if you must). Similar considerations apply to us_category_id: it should be integer or related. (A migration sketch follows after this list.)
- The standard SQL inequality operator is <>. Use that instead of the also supported !=.
- Use table qualification to avoid ambiguities - and in any case to make your query clear to readers in a public forum.
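If you do change those key columns, a minimal migration sketch could look like the following. This is an assumption-laden illustration: it presumes every stored value is numeric text (or NULL), that no other tables reference users.us_id, and that the existing text_pattern_ops indexes on these columns are dropped first, since that operator class only applies to text:

DROP INDEX h_user_index;   -- text_pattern_ops index on web_hits.h_user_id
DROP INDEX us_user_index;  -- text_pattern_ops index on users.us_id
ALTER TABLE web_hits ALTER COLUMN h_user_id TYPE integer USING h_user_id::integer;  -- or bigint
ALTER TABLE users    ALTER COLUMN us_id     TYPE integer USING us_id::integer;

Expect a full rewrite of the 55-million-row web_hits table for its ALTER, so plan a maintenance window.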
Optimize
Further assuming that:
- users.us_category_id <> '' is true for most rows;
- web_hits.h_user_id IS NOT NULL for the rows that are counted.
Then this will be faster still:
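One possible shape for such a query (a sketch of mine, not a definitive rewrite): aggregate the day's hits per user first, so the join touches each user once instead of once per hit, then sum per category.

SELECT u.us_category_id AS cat, sum(d.hits) AS res
FROM  (
   SELECT w.h_user_id, count(*) AS hits
   FROM   web_hits w
   WHERE  w.h_datetime >= '2015-06-26'::timestamp
   AND    w.h_datetime <  '2015-06-27'::timestamp
   AND    w.h_user_id IS NOT NULL        -- matches partial index 1 below
   GROUP  BY w.h_user_id
   ) d
JOIN   users u ON u.us_id = d.h_user_id
WHERE  u.us_category_id <> ''            -- matches partial index 2 below
GROUP  BY u.us_category_id;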
Indexes
Either way, partial indexes are best for your case:
1. A partial, multicolumn index on web_hits (a sketch follows below). Eliminate rows where web_hits.h_user_id IS NULL from the index. Columns in that order, not the other way round as has been suggested.
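A plausible definition for that index (the name is hypothetical, and I am assuming the intended column order is the range-filtered h_datetime first, then the joined h_user_id):

CREATE INDEX wh_datetime_user_idx ON web_hits (h_datetime, h_user_id)
WHERE  h_user_id IS NOT NULL;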
2. A partial index on users (a sketch follows below). This will be considerably smaller, because we don't store the potentially lengthy varchar column us_category_id in the index - which we don't need for this case anyway. We just need to know it's <> ''. If you had integer columns, this consideration would not apply. And we also exclude rows with '' or NULL in us_category_id, making the index smaller yet.
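A plausible definition (hypothetical name). The index stores only us_id, and its WHERE clause excludes both '' and NULL categories, since NULL <> '' does not evaluate to true:

CREATE INDEX users_id_nonempty_cat_idx ON users (us_id)
WHERE  us_category_id <> '';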
You have to weigh the maintenance cost of such special indexes against their benefits. If you run queries with a matching condition a lot, they will pay off; if not, more general indexes might be better overall.
Of course, all the usual advice on performance optimization applies, too.
Frankly, there is not much right about your query, and many items in your setup are suspicious. Since you are obviously dealing with huge tables, you might consider getting professional help.