Exact same query and database. In production more than 100x slower

137 views Asked by At

I am running PG 9.6.1 both in production (Linux) and dev (Mac)

I am testing an identical query in both development and production. In development I've tested on multiple snapshots of the production db (only minutes apart).

For some reason, the same exact query gives substantially different query plans, the the plan resulting in production has terrible performance.

When I perform the following query

SELECT DISTINCT ON (v0."email") v0."id"
  ,v0."first_seen" ,v0."last_seen" ,v0."last_heard_from" ,v0."last_contacted" ,v0."last_page_viewed" ,v0."sessions" ,v0."seen_welcome_message" ,v0."signed_up_at" ,v0."referrer" ,v0."unsubscribed" ,v0."merged" ,v0."marked_at" ,v0."first_name" ,v0."last_name" ,v0."email" ,v0."slug" ,v0."name" ,v0."type" ,v0."avatar" ,v0."user_id" ,v0."location" ,v0."app_id" ,v0."actor_id" ,v0."marked_by_id" ,v0."inserted_at" ,v0."updated_at"
FROM "visitors" AS v0
INNER JOIN "auto_messages" AS a1 ON (a1."app_id" = v0."app_id")
  AND (a1."id" = $1)
INNER JOIN LATERAL(SELECT * FROM find_matching_visitors(a1."app_id", a1."formatted_default_filters", a1."formatted_custom_filters")) AS f2 ON f2."id" = v0."id"
WHERE (
    CASE 
      WHEN a1."window_enabled"
        THEN extract(dow FROM current_timestamp at TIME zone COALESCE((v0."location" - >> 'timezone'), 'Europe/Paris')) = ANY (a1."window_days_of_week"::INT [])
          AND CURRENT_TIME at TIME zone COALESCE((v0."location" - >> 'timezone'), 'Europe/Paris') BETWEEN a1."window_start"
            AND a1."window_end"
      ELSE true
      END
    )
  AND (
    NOT EXISTS (
      SELECT *
      FROM auto_message_events
      WHERE auto_message_id = a1."id"
        AND (
          visitor_id = v0."id"
          OR visitor_user_id = v0."user_id"
          OR visitor_email = v0."email"
          )
        AND event = 'sent'
      )
    )
  AND ((v0."type" = 'user') OR (v0."type" = 'lead'))
  AND (NOT (v0."merged"))
  AND (v0."marked_at" IS NULL)
  AND (NOT (v0."email" IS NULL) AND NOT (v0."unsubscribed"))
  AND ((a1."status" = $2) AND (a1."channel" = $3))
GROUP BY v0."id"
ORDER BY v0."email"
  ,v0."last_seen" DESC LIMIT $4 [446, "live", "email", 500]

In production I get the following query plan

 Limit  (cost=1085.72..1085.73 rows=1 width=1070) (actual time=16862.966..16862.966 rows=0 loops=1)
.16862.964 rows=0 loops=1)72..1085.73 rows=1 width=1070) (actual time=16862.964.--More--
         ->  Sort  (cost=1085.72..1085.73 rows=1 width=1070) (actual time=16862.963..16862.963 rows=0 loops=1)
               Sort Key: v0.email, v0.last_seen DESC
               Sort Method: quicksort  Memory: 25kB
               ->  Group  (cost=1085.71..1085.71 rows=1 width=1070) (actual time=16862.946..16862.946 rows=0 loops=1)
                     Group Key: v0.id
                     ->  Sort  (cost=1085.71..1085.71 rows=1 width=1070) (actual time=16862.936..16862.936 rows=0 loops=1)
                           Sort Key: v0.id
                           Sort Method: quicksort  Memory: 25kB
                           ->  Nested Loop Anti Join  (cost=890.31..1085.70 rows=1 width=1070) (actual time=16862.922..16862.922 rows=0 loops=1)
                                 Join Filter: (auto_message_events.auto_message_id = a1.id)
                                 ->  Nested Loop  (cost=885.54..1072.87 rows=1 width=1074) (actual time=46.013..16856.367 rows=107 loops=1)
                                       Join Filter: ((v0.id)::text = (find_matching_visitors.id)::text)
                                       Rows Removed by Join Filter: 112824
                                       ->  Nested Loop  (cost=885.29..1050.12 rows=1 width=1162) (actual time=4.020..14.710 rows=949 loops=1)
                                             ->  Index Scan using auto_messages_pkey on auto_messages a1  (cost=0.27..8.30 rows=1 width=122) (actual time=0.019..0.021 rows=1 loops=1)
                                                   Index Cond: (id = 435)
                                                   Filter: (((status)::text = 'live'::text) AND ((channel)::text = 'email'::text))
                                             ->  Bitmap Heap Scan on visitors v0  (cost=885.02..1041.81 rows=1 width=1070) (actual time=3.995..13.180 rows=949 loops=1)
                                                   Recheck Cond: ((email IS NOT NULL) AND ((app_id)::text = (a1.app_id)::text))
                                                   Filter: ((NOT merged) AND (marked_at IS NULL) AND (NOT unsubscribed) AND (((type)::text = 'user'::text) OR ((type)::text = 'lead'::text)) AND CASE WHEN a1.window_enabled THEN ((date_part('dow'::text, timezone(COALESCE((location ->> 'timezone'::text), 'Europe/Paris'::text), now())) = ANY ((a1.window_days_of_week)::double precision[])) AND (timezone(COALESCE((location ->> 'timezone'::text), 'Europe/Paris'::text), ('now'::cstring)::time with time zone) >= (a1.window_start)::time with time zone) AND (timezone(COALESCE((location ->> 'timezone'::text), 'Europe/Paris'::text), ('now'::cstring)::time with time zone) <= (a1.window_end)::time with time zone)) ELSE true END)
                                                   Rows Removed by Filter: 56
                                                   Heap Blocks: exact=624
                                                   ->  BitmapAnd  (cost=885.02..885.02 rows=39 width=0) (actual time=3.874..3.874 rows=0 loops=1)
                                                         ->  Bitmap Index Scan on email_idx  (cost=0.00..284.01 rows=6344 width=0) (actual time=2.161..2.161 rows=9749 loops=1)
                                                               Index Cond: (email IS NOT NULL)
                                                         ->  Bitmap Index Scan on visitors_app_id_signed_up_index  (cost=0.00..600.68 rows=6950 width=0) (actual time=1.086..1.086 rows=4502 loops=1)
                                                               Index Cond: ((app_id)::text = (a1.app_id)::text)
                                       ->  Function Scan on find_matching_visitors  (cost=0.25..10.25 rows=1000 width=32) (actual time=17.681..17.706 rows=119 loops=949)
                                 ->  Bitmap Heap Scan on auto_message_events  (cost=4.77..8.79 rows=1 width=74) (actual time=0.051..0.051 rows=1 loops=107)
                                       Recheck Cond: (((auto_message_id = 435) AND ((visitor_id)::text = (v0.id)::text) AND ((event)::text = 'sent'::text)) OR ((auto_message_id = 435) AND ((event)::text = 'sent'::text) AND ((visitor_user_id)::text = (v0.user_id)::text)) OR ((auto_message_id = 435) AND ((event)::text = 'sent'::text) AND ((visitor_email)::text = (v0.email)::text)))
                                       Heap Blocks: exact=107
                                       ->  BitmapOr  (cost=4.77..4.77 rows=1 width=0) (actual time=0.044..0.044 rows=0 loops=107)
                                             ->  Bitmap Index Scan on auto_message_events_auto_message_id_visitor_id_event_index  (cost=0.00..3.56 rows=1 width=0) (actual time=0.019..0.019 rows=1 loops=107)
                                                   Index Cond: ((auto_message_id = 435) AND ((visitor_id)::text = (v0.id)::text) AND ((event)::text = 'sent'::text))
                                             ->  Bitmap Index Scan on auto_message_id_event_visitor_user_id_idx  (cost=0.00..0.59 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=107)
                                                   Index Cond: ((auto_message_id = 435) AND ((event)::text = 'sent'::text) AND ((visitor_user_id)::text = (v0.user_id)::text))
                                             ->  Bitmap Index Scan on auto_message_id_event_visitor_email_idx  (cost=0.00..0.62 rows=1 width=0) (actual time=0.008..0.008 rows=1 loops=107)
                                                   Index Cond: ((auto_message_id = 435) AND ((event)::text = 'sent'::text) AND ((visitor_email)::text = (v0.email)::text))
 Planning time: 2.566 ms
 Execution time: 16863.358 ms

The exact same query running in development (production db snapshot)

    Limit  (cost=417.94..417.95 rows=1 width=2096) (actual time=90.445..90.445 rows=0 loops=1)
   ->  Unique  (cost=417.94..417.95 rows=1 width=2096) (actual time=90.444..90.444 rows=0 loops=1)
         ->  Sort  (cost=417.94..417.95 rows=1 width=2096) (actual time=90.443..90.443 rows=0 loops=1)
               Sort Key: v0.email, v0.last_seen DESC
               Sort Method: quicksort  Memory: 25kB
               ->  Group  (cost=417.92..417.93 rows=1 width=2096) (actual time=90.435..90.435 rows=0 loops=1)
                     Group Key: v0.id
                     ->  Sort  (cost=417.92..417.93 rows=1 width=2096) (actual time=90.435..90.435 rows=0 loops=1)
                           Sort Key: v0.id
                            Sort Method: quicksort  Memory: 25kB
                            ->  Nested Loop Anti Join  (cost=388.06..417.91 rows=1 width=2096) (actual time=90.430..90.430 rows=0 loops=1)
                                  Join Filter: (auto_message_events.auto_message_id = a1.id)
                                  ->  Nested Loop  (cost=375.05..396.85 rows=1 width=2100) (actual time=87.181..87.341 rows=76 loops=1)
                                        ->  Index Scan using auto_messages_pkey on auto_messages a1  (cost=0.28..8.30 rows=1 width=120) (actual time=0.015..0.017 rows=1 loops=1)
                                              Index Cond: (id = 435)
                                              Filter: (((status)::text = 'live'::text) AND ((channel)::text = 'email'::text))
                                        ->  Hash Join  (cost=374.78..388.54 rows=1 width=2096) (actual time=87.163..87.292 rows=76 loops=1)
                                              Hash Cond: ((find_matching_visitors.id)::text = (v0.id)::text)
                                              ->  Function Scan on find_matching_visitors  (cost=0.25..10.25 rows=1000 width=32) (actual time=79.139..79.151 rows=81 loops=1)
                                              ->  Hash  (cost=374.52..374.52 rows=1 width=2096) (actual time=8.008..8.008 rows=911 loops=1)
                                                    Buckets: 1024  Batches: 1  Memory Usage: 468kB
                                                    ->  Bitmap Heap Scan on visitors v0  (cost=230.95..374.52 rows=1 width=2096) (actual time=5.024..6.986 rows=911 loops=1)
                                                          Recheck Cond: (((app_id)::text = (a1.app_id)::text) AND (email IS NOT NULL))
                                                          Filter: ((NOT merged) AND (marked_at IS NULL) AND (NOT unsubscribed) AND (((type)::text = 'user'::text) OR ((type)::text = 'lead'::text)) AND CASE WHEN a1.window_enabled THEN ((date_part('dow'::text, timezone(COALESCE((location ->> 'timezone'::text), 'Europe/Paris'::text), now())) = ANY ((a1.window_days_of_week)::double precision[])) AND (timezone(COALESCE((location ->> 'timezone'::text), 'Europe/Paris'::text), ('now'::cstring)::time with time zone) >= (a1.window_start)::time with time zone) AND (timezone(COALESCE((location ->> 'timezone'::text), 'Europe/Paris'::text), ('now'::cstring)::time with time zone) <= (a1.window_end)::time with time zone)) ELSE true END)
                                                          Rows Removed by Filter: 43
                                                          Heap Blocks: exact=1094
                                                          ->  BitmapAnd  (cost=230.95..230.95 rows=36 width=0) (actual time=4.804..4.804 rows=0 loops=1)
                                                                ->  Bitmap Index Scan on visitors_app_id_signed_up_index  (cost=0.00..101.44 rows=2802 width=0) (actual time=1.044..1.044 rows=4670 loops=1)
                                                                      Index Cond: ((app_id)::text = (a1.app_id)::text)
                                                                ->  Bitmap Index Scan on email_idx  (cost=0.00..129.26 rows=6512 width=0) (actual time=3.313..3.313 rows=20201 loops=1)
                                                                      Index Cond: (email IS NOT NULL)
                                  ->  Bitmap Heap Scan on auto_message_events  (cost=13.01..17.03 rows=1 width=73) (actual time=0.039..0.039 rows=1 loops=76)
                                        Recheck Cond: (((auto_message_id = 435) AND ((visitor_id)::text = (v0.id)::text) AND ((event)::text = 'sent'::text)) OR ((auto_message_id = 435) AND ((event)::text = 'sent'::text) AND ((visitor_user_id)::text = (v0.user_id)::text)) OR ((auto_message_id = 435) AND ((event)::text = 'sent'::text) AND ((visitor_email)::text = (v0.email)::text)))
                                        Heap Blocks: exact=76
                                        ->  BitmapOr  (cost=13.01..13.01 rows=1 width=0) (actual time=0.037..0.037 rows=0 loops=76)
                                              ->  Bitmap Index Scan on auto_message_events_auto_message_id_visitor_id_event_index  (cost=0.00..4.43 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=76)
                                                    Index Cond: ((auto_message_id = 435) AND ((visitor_id)::text = (v0.id)::text) AND ((event)::text = 'sent'::text))
                                              ->  Bitmap Index Scan on auto_message_id_event_visitor_user_id_idx  (cost=0.00..4.29 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=76)
                                                    Index Cond: ((auto_message_id = 435) AND ((event)::text = 'sent'::text) AND ((visitor_user_id)::text = (v0.user_id)::text))
                                              ->  Bitmap Index Scan on auto_message_id_event_visitor_email_idx  (cost=0.00..4.29 rows=1 width=0) (actual time=0.014..0.014 rows=0 loops=76)
                                                    Index Cond: ((auto_message_id = 435) AND ((event)::text = 'sent'::text) AND ((visitor_email)::text = (v0.email)::text))
  Planning time: 2.848 ms
  Execution time: 90.722 ms

UPDATE

It turns out that removing the expression (NOT (v0."email" IS NULL) from the query in production, the performance corrects itself.

Any ideas as to why? (the field is indexed)

0

There are 0 answers