If the following database (postgres) queries are executed, the second call is much faster.
I guess the first query is slow since the operating system (linux) needs to get the data from disk. The second query benefits from caching at filesystem level and in postgres.
Is there a way to optimize the database to get the results fast on the first call?
First call (slow)
foo3_bar_p@BAR-FOO3-Test:~$ psql
foo3_bar_p=# explain analyze SELECT "foo3_beleg"."id", ... FROM "foo3_beleg" WHERE
foo3_bar_p-# (("foo3_beleg"."id" IN (SELECT beleg_id FROM foo3_text where
foo3_bar_p(# content @@ 'footown'::tsquery)) AND "foo3_beleg"."belegart_id" IN
foo3_bar_p(# ('...', ...));
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=75314.58..121963.20 rows=152 width=135) (actual time=27253.451..88462.165 rows=11 loops=1)
-> HashAggregate (cost=75314.58..75366.87 rows=5229 width=4) (actual time=16087.345..16113.988 rows=17671 loops=1)
-> Bitmap Heap Scan on foo3_text (cost=273.72..75254.67 rows=23964 width=4) (actual time=327.653..16026.787 rows=27405 loops=1)
Recheck Cond: (content @@ '''footown'''::tsquery)
-> Bitmap Index Scan on foo3_text_content_idx (cost=0.00..267.73 rows=23964 width=0) (actual time=281.909..281.909 rows=27405 loops=1)
Index Cond: (content @@ '''footown'''::tsquery)
-> Index Scan using foo3_beleg_pkey on foo3_beleg (cost=0.00..8.90 rows=1 width=135) (actual time=4.092..4.092 rows=0 loops=17671)
Index Cond: (id = foo3_text.beleg_id)
Filter: ((belegart_id)::text = ANY ('{...
Rows Removed by Filter: 1
Total runtime: 88462.809 ms
(11 rows)
Second call (fast)
Nested Loop (cost=75314.58..121963.20 rows=152 width=135) (actual time=127.569..348.705 rows=11 loops=1)
-> HashAggregate (cost=75314.58..75366.87 rows=5229 width=4) (actual time=114.390..133.131 rows=17671 loops=1)
-> Bitmap Heap Scan on foo3_text (cost=273.72..75254.67 rows=23964 width=4) (actual time=11.961..97.943 rows=27405 loops=1)
Recheck Cond: (content @@ '''footown'''::tsquery)
-> Bitmap Index Scan on foo3_text_content_idx (cost=0.00..267.73 rows=23964 width=0) (actual time=9.226..9.226 rows=27405 loops=1)
Index Cond: (content @@ '''footown'''::tsquery)
-> Index Scan using foo3_beleg_pkey on foo3_beleg (cost=0.00..8.90 rows=1 width=135) (actual time=0.012..0.012 rows=0 loops=17671)
Index Cond: (id = foo3_text.beleg_id)
Filter: ((belegart_id)::text = ANY ('...
Rows Removed by Filter: 1
Total runtime: 348.833 ms
(11 rows)
Table layout of the foo3_text table (28M rows)
foo3_egs_p=# \d foo3_text
Table "public.foo3_text"
Column | Type | Modifiers
----------+-----------------------+------------------------------------------------------------
id | integer | not null default nextval('foo3_text_id_seq'::regclass)
beleg_id | integer | not null
index_id | character varying(32) | not null
value | text | not null
content | tsvector |
Indexes:
"foo3_text_pkey" PRIMARY KEY, btree (id)
"foo3_text_index_id_2685e3637668d5e7_uniq" UNIQUE CONSTRAINT, btree (index_id, beleg_id)
"foo3_text_beleg_id" btree (beleg_id)
"foo3_text_content_idx" gin (content)
"foo3_text_index_id" btree (index_id)
"foo3_text_index_id_like" btree (index_id varchar_pattern_ops)
Foreign-key constraints:
"beleg_id_refs_id_6e6d40770e71292" FOREIGN KEY (beleg_id) REFERENCES foo3_beleg(id) DEFERRABLE INITIALLY DEFERRED
"index_id_refs_name_341600137465c2f9" FOREIGN KEY (index_id) REFERENCES foo3_index(name) DEFERRABLE INITIALLY DEFERRED
Hardware changes (SSD instead of traditional disks) or RAM disks are possible. But maybe there the current hardware can do faster results, too.
Version: PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu
Please leave a comment if you need more details.
Sometimes moving an "WHERE x IN" into a JOIN can improve performance significantly. Try this:
Here's a repeatable experiment to support my claim.
I happen to have a big Postgres database handy (30 million rows) (http://juliusdavies.ca/2013/j.emse/bertillonage/), so I loaded that into postgres 9.4beta3.
The results are impressive. The sub-select approach is approximately 20 times slower:
For those interested in replicating, here are the raw SQL queries I used to test my theory.
This query uses a "SELECT IN" subquery, and it's 20 times slower (17 seconds on my laptop on the first execution):
This query moves the condition out of the subquery and into the joining criteria, and it's 20 times faster (0.8 seconds on my laptop on the first execution).
p.s. if you're curious what that database is for, you can use it to calculate how similar an arbitrary jar file is to all of the jar files in the maven repository circa 2011.