PostgreSQL Joining Between Two Values

2.1k views Asked by At

I have the following tables and am trying to look up county codes for a list of several hundred thousand cities.

create table counties (
  zip_code_from  char(5) not null,
  zip_code_thru  char(5) not null,
  county_code    char(3) not null
);

create table cities (
  city      text    not null,
  zip_code  char(5) not null
);

My first approach was using a "between" in the join:

select
  ci.city, ci.zip_code, co.county_code
from
  cities ci
  join counties co on
    co.zip_code between ci.zip_code_from and ci.zip_code_thru

I know in the Oracle world, this was frowned upon, and indeed the performance appears to be miserable. It takes over 8 minutes to process around 16,000 cities. The zip code table has around 80,000 records. I'm guessing that this syntax is a glorified cross-join?

Both the from and thru codes are indexed, and I have control over the structures, so I can change the table if it helps.

My only other thought is to go ahead and expand the table out to all possible values -- something similar to this:

select
  generate_series (
    cast (zip_code_from as int),
    cast (zip_code_thru as int)
  ) as zip_code,
  *
from counties

This would expand the data to over 200,000 records, which isn't a big deal, but I wasn't sure if this is my only recourse to have queries that aren't horrible.

I'm guessing that even doing that on the fly and not having indexes would be preferable to the between in my join, but I was hoping there is an alternative, either in terms of my SQL and/or something I can do with the structure of the table itself.

I've seen this question posted for other DBMS platforms, but I've been able to pull off mini-miracles with PostgreSQL that weren't possible (or practical) in other databases, so I was hopeful there is something I've missed.

2

There are 2 answers

0
Hambone On BEST ANSWER

Months later, this has cropped its head again, and I decided to test some of my theories.

The original query:

select
  ci.city, ci.zip_code, co.fips_code
from
  cities ci
  join counties co on
    ci.zip_code between co.from_zip_code and co.thru_zip_code

Does in fact implement a cartesian. The query returns 34,000 rows and takes 597 seconds.

If I "pre-explode" the zip code ranges into discrete records:

with exploded_zip as (
  select
    generate_series (
      cast (from_zip_code as int),
      cast (thru_zip_code as int)
    )::text as zip_code,
    *
  from counties
)
select
  ci.city, ci.zip_code, co.fips_code
from
  cities ci
  join exploded_zip co on
    ci.zip_code = co.zip_code

The query returns the exact same rows but finishes in 2.8 seconds.

So it seems the bottom line is that using a between in a join (or any inequality) is a really bad idea.

0
Eli Johnson On

Late to the party, but I was recently suffering through this very issue with a join using a BETWEEN condition. To crystallize the problem into its simplest form, I have a pair of tables

CREATE TABLE customer_review(
  id              SERIAL PRIMARY KEY,
  review_text     TEXT,
  customer_ip_int BIGINT
);
CREATE TABLE ip_location(
  ip_int          BIGINT,
  country         TEXT,
  latitude        NUMERIC,
  longitude       NUMERIC
);

The first table has about 7.2 million records, and the second about 3 million records. The join needs to happen in such a way so that customer_ip_int is between ip_int (inclusive) and lead(ip_int) OVER(ORDER BY ip_int) (exclusive).

I tried numerous ways to rephrase the join condition, various indexes, etc. Most of this was done on a materialized view of a modified ip_location that allowed me to reference next_ip_int as a column (rather than fiddle with the lead(ip_int) ... in the query itself. None of this produced anything that was remotely workable.

The above solution (to generate a series) also was not workable for me, because the series for integer IP addresses (as opposed to zip codes) spans over billions of integers.

In the end, the thing that worked was quite simple (if a bit ugly):

SELECT * 
FROM customer_review cr
LEFT JOIN ip_location ip
  ON ip.ip_int = (SELECT MAX(ip2.ip_int) 
                  FROM ip_location ip2 
                  WHERE ip2.ip_int <= cr.customer_ip_int)

Note that this requires an index on ip_location.ip_int to be effective.