PostgreSQL does not use my indexes when the search text has 5 characters, but it does with 6. Why?


I'm using Postgresql 13.

With this query PostgreSQL is using the indexes:

SELECT *
FROM
    "players"
WHERE team_id = 3
    AND (
    code ILIKE 'lushij'
    OR
    REPLACE(lastname||firstname,' ','') ILIKE '%lushij%'
    OR REPLACE(firstname||lastname,' ','') ILIKE '%lushij%'
    OR personal_info->>'houses' ILIKE '%lushij%'
    )
LIMIT 15
Limit  (cost=333.01..385.77 rows=15 width=360)
  ->  Bitmap Heap Scan on players  (cost=333.01..4061.29 rows=1060 width=360)
        Recheck Cond: ((code ~~* 'lushij'::text) OR (replace((lastname || firstname), ' '::text, ''::text) ~~* '%lushij%'::text) OR (replace((firstname || lastname), ' '::text, ''::text) ~~* '%lushij%'::text) OR ((personal_info ->> 'houses'::text) ~~* '%lushij%'::text))
        Filter: (team_id = 3)
        ->  BitmapOr  (cost=333.01..333.01 rows=1060 width=0)
              ->  Bitmap Index Scan on players_code_trgm  (cost=0.00..116.75 rows=100 width=0)
                    Index Cond: (code ~~* 'lushij'::text)
              ->  Bitmap Index Scan on players_replace_last_first_name_trgm  (cost=0.00..66.40 rows=320 width=0)
                    Index Cond: (replace((lastname || firstname), ' '::text, ''::text) ~~* '%lushij%'::text)
              ->  Bitmap Index Scan on players_replace_first_last_name_trgm  (cost=0.00..66.40 rows=320 width=0)
                    Index Cond: (replace((firstname || lastname), ' '::text, ''::text) ~~* '%lushij%'::text)
              ->  Bitmap Index Scan on players_personal_info_houses_trgm_idx  (cost=0.00..82.40 rows=320 width=0)
                    Index Cond: ((personal_info ->> 'houses'::text) ~~* '%lushij%'::text)

With the same query but with a search text one character shorter (lushij → lushi), the indexes are NOT used:

SELECT *
FROM
    "players"
WHERE team_id = 3
    AND (
    code ILIKE 'lushi'
    OR
    REPLACE(lastname||firstname,' ','') ILIKE '%lushi%'
    OR REPLACE(firstname||lastname,' ','') ILIKE '%lushi%'
    OR personal_info->>'houses' ILIKE '%lushi%'
    )
LIMIT 15
Limit  (cost=0.00..235.65 rows=15 width=360)
  ->  Seq Scan on players  (cost=0.00..76853.53 rows=4892 width=360)
        Filter: ((team_id = 3) AND ((code ~~* 'lushi'::text) OR (replace((lastname || firstname), ' '::text, ''::text) ~~* '%lushi%'::text) OR (replace((firstname || lastname), ' '::text, ''::text) ~~* '%lushi%'::text) OR ((personal_info ->> 'houses'::text) ~~* '%lushi%'::text)))

Why?

UPDATE:

If I comment out the LIMIT 15 line, the indexes are used.
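One way to confirm that this is a cost-based choice rather than a broken index is to discourage sequential scans for the current transaction only and re-run EXPLAIN. This is a diagnostic sketch, not a fix; `enable_seqscan` should not be turned off globally:

```sql
-- Diagnostic only: if the bitmap plan reappears with seq scans penalized,
-- the planner *could* use the indexes and rejected them on estimated cost.
BEGIN;
SET LOCAL enable_seqscan = off;  -- reverts automatically at ROLLBACK
EXPLAIN
SELECT *
FROM "players"
WHERE team_id = 3
    AND (
    code ILIKE 'lushi'
    OR REPLACE(lastname||firstname,' ','') ILIKE '%lushi%'
    OR REPLACE(firstname||lastname,' ','') ILIKE '%lushi%'
    OR personal_info->>'houses' ILIKE '%lushi%'
    )
LIMIT 15;
ROLLBACK;
```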


Here are the table and index definitions:

Players table structure
-- ----------------------------
-- Table structure for players
-- ----------------------------
DROP TABLE IF EXISTS "public"."players";
CREATE TABLE "public"."players" (
  "id" int8 NOT NULL DEFAULT nextval('players_id_seq'::regclass),
  "created_at" timestamptz(6) NOT NULL DEFAULT now(),
  "updated_at" timestamptz(6),
  "team_id" int8 NOT NULL,
  "firstname" text COLLATE "pg_catalog"."default",
  "lastname" text COLLATE "pg_catalog"."default",
  "code" text COLLATE "pg_catalog"."default",
  "personal_info" jsonb
)
;

-- ----------------------------
-- Indexes structure for table players
-- ----------------------------
CREATE INDEX "players_personal_info_houses_trgm_idx" ON "public"."players" USING gin (
  (personal_info ->> 'houses'::text) COLLATE "pg_catalog"."default" "public"."gin_trgm_ops"
);
CREATE INDEX "players_code_trgm" ON "public"."players" USING gin (
  "code" COLLATE "pg_catalog"."default" "public"."gin_trgm_ops"
);
CREATE INDEX "players_lower_code" ON "public"."players" USING btree (
  lower(code) COLLATE "pg_catalog"."default" "pg_catalog"."text_ops" ASC NULLS LAST
);
CREATE INDEX "players_replace_first_last_name_trgm" ON "public"."players" USING gin (
  replace(firstname || lastname, ' '::text, ''::text) COLLATE "pg_catalog"."default" "public"."gin_trgm_ops"
);
CREATE INDEX "players_replace_last_first_name_trgm" ON "public"."players" USING gin (
  replace(lastname || firstname, ' '::text, ''::text) COLLATE "pg_catalog"."default" "public"."gin_trgm_ops"
);

-- ----------------------------
-- Primary Key structure for table players
-- ----------------------------
ALTER TABLE "public"."players" ADD CONSTRAINT "players_pkey" PRIMARY KEY ("id");

-- ----------------------------
-- Foreign Keys structure for table players
-- ----------------------------
ALTER TABLE "public"."players" ADD CONSTRAINT "players_team_id_fkey" FOREIGN KEY ("team_id") REFERENCES "public"."teams" ("id") ON DELETE NO ACTION ON UPDATE NO ACTION;

There are 2 answers

seanb

OK... this is based on my knowledge of SQL Server and SQL in general, but it probably applies here too.

To start with... because you're doing a SELECT *, it will need to go to the clustered index at some point.

The non-clustered index (if it's used) serves to identify the relevant rows; the engine then picks those rows out one by one (a nested loop join, sometimes referred to as index seek/scan + key lookup).

If there are too many matching rows, this is actually inefficient - you end up doing more reads than simply reading the entire table.

Reducing the length of the LIKE pattern increases the cardinality estimate, i.e. the number of rows the query planner/optimiser expects the filter to match.

I'm guessing the engine takes a guess (using statistics on the index/data) and determines that it's probably more efficient to simply read ALL the data from the clustered index than to identify the matching rows via the index and fetch them one by one.


Update after the OP's edit about removing the LIMIT.

Well... once again, it depends on how many rows it estimates exist based on the filter.

Imagine you were doing ILIKE '%e%' in the original query. Every second row might match. Since you have no sort, the engine only needs to read (say) the first 30 rows of the clustered index to fill the LIMIT. The query planner/optimiser may conclude that this is the most efficient way to get those rows.

However, without the limit, it needs to read all rows to produce all results. So:

  • For %e%, a full clustered index scan may be more efficient, since many rows are expected to match
  • For more selective filters, searching the secondary index first (then seeking straight to the data in the clustered index) is often more efficient
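To test whether the estimate really is the problem, you can compare the planner's guess against reality for the slow variant. This is a diagnostic sketch; in the output, compare the estimated "rows=" figure with the "actual ... rows=" figure:

```sql
-- ANALYZE actually runs the query; BUFFERS also reports pages read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM "players"
WHERE team_id = 3
    AND (
    code ILIKE 'lushi'
    OR REPLACE(lastname||firstname,' ','') ILIKE '%lushi%'
    OR REPLACE(firstname||lastname,' ','') ILIKE '%lushi%'
    OR personal_info->>'houses' ILIKE '%lushi%'
    )
LIMIT 15;
```

If the estimate (rows=4892 in the plan above) is far higher than the actual match count, the "seq scan until 15 hits" plan was chosen on bad numbers.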
Laurenz Albe

The shorter the string, the less selective your condition gets. Based on its estimates, PostgreSQL thinks that with the short string, enough rows match the condition that it is cheaper to just sequentially fetch rows and discard the ones that don't match until it has found 15 matching rows.

The many OR conditions may well make the optimizer underestimate the selectivity, because the conditions are considered uncorrelated, which may not be the case.
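Since the question notes that the indexes are used once LIMIT is removed, one possible workaround (a sketch, requires PostgreSQL 12 or later) is to hide the LIMIT behind a materialized CTE. `AS MATERIALIZED` acts as an optimization fence, so the inner query is planned as if there were no LIMIT:

```sql
-- The fence prevents the planner from pushing the LIMIT into the scan,
-- so the inner query gets the same (bitmap index) plan as the LIMIT-less version.
WITH matches AS MATERIALIZED (
    SELECT *
    FROM "players"
    WHERE team_id = 3
        AND (
        code ILIKE 'lushi'
        OR REPLACE(lastname||firstname,' ','') ILIKE '%lushi%'
        OR REPLACE(firstname||lastname,' ','') ILIKE '%lushi%'
        OR personal_info->>'houses' ILIKE '%lushi%'
        )
)
SELECT * FROM matches LIMIT 15;
```

This forces the indexes to be used even when the planner would otherwise prefer the early-exit sequential scan; it is only a win when the estimate is wrong, so verify with EXPLAIN (ANALYZE) first.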