NATURAL JOIN on large tables

I am performing a simple natural join on two big tables.

  • polygons contains 68,000 rows (45 MB)
  • roadshydro contains about 2 million rows (210 MB).

Does that mean that the database engine builds a data set of 68,000*2 million rows internally while performing the natural join? If so, the amount of memory required must be 45*210 MB, which is much larger than what my system has (only 1.5 GB).

When I executed this query, my system crashed after 5 minutes (it abruptly shut down). Can't it handle 250 MB of data in the database? What good are databases then?

"I am modifying the above question to clear up readers' doubts. Today is 29-02-2012."

It seems many of my friends got confused because I used the words 'natural join' in the question above. The real spatial query I was using is:

select p.OID , r.OID
    from polygons as p , roadshydro as r
                Where st_intersects(p.the_geom , r.the_geom) ;

where the polygons and roadshydro tables each have two fields: OID and the_geom. Clearly, it is a cross product of the two tables and not a natural join on some common key.

I monitored main memory consumption while executing the above query. Nothing happens: memory consumption does not rise at all and I never get any output, yet CPU usage is almost 100%. It seems the database isn't doing any computation at all. However, if I remove the WHERE clause from the query, main memory consumption gradually climbs very high (after 5-6 minutes), resulting in a system crash and the machine abruptly shutting down. This is what I am experiencing. What is so special about removing the WHERE clause? Why is Postgres failing to execute the query? I am surprised by this behaviour.

4

There are 4 answers

0
Konerak On

It really depends on many different factors, but most of all on the DBMS you are using and its configuration.

But to clear up the biggest misunderstanding: the DBMS does not have to hold all the rows in memory. It can write to a temporary table (on the hard disk) and serve you the result... slowly... so if it's crashing, that is not normal.

Then again, why are you asking for 68k*2M rows? That is 136,000,000,000 rows! Are you sure you don't want a straight join on some key instead?
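
To put rough numbers on that: the result size scales with the product of the row counts, not with 45*210 MB. Here is a back-of-the-envelope estimate in Python, using the row counts and table sizes from the question. The 4-byte OID width is my assumption, since the question does not state the column type.

```python
# Back-of-the-envelope sizes for an unfiltered cross product.
# Row counts and table sizes come from the question; the 4-byte OID width
# is an assumption made only for this estimate.
MB = 1024 * 1024

polygons_rows, polygons_bytes = 68_000, 45 * MB
roads_rows, roads_bytes = 2_000_000, 210 * MB

# Number of rows in the cross product.
pairs = polygons_rows * roads_rows

# Average stored row width of each table, in bytes.
polygons_row_width = polygons_bytes / polygons_rows
roads_row_width = roads_bytes / roads_rows

# Every output row of "SELECT *" carries one row from each side.
full_result_bytes = pairs * (polygons_row_width + roads_row_width)

# "SELECT p.OID, r.OID" carries only two integers per output row.
ASSUMED_OID_BYTES = 4
oid_only_bytes = pairs * 2 * ASSUMED_OID_BYTES

print(f"pairs: {pairs:,}")
print(f"full rows: about {full_result_bytes / 1e12:.0f} TB")
print(f"OID pairs only: about {oid_only_bytes / 1e12:.2f} TB")
```

Under these assumptions the unfiltered result comes to roughly 100 TB for full rows and about 1 TB for the OID pairs alone. Either is far beyond 1.5 GB of RAM, so anything that tries to hold the whole result in memory will run out.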

20
Hugh Jones On

There is very little point in using the NATURAL JOIN construct. That having been said, the query you describe would only produce the product of the two tables if the join matched every record in both tables.

That would only happen if there was a field in both tables with the same name and the same value for every record (this is extremely unlikely, but not logically impossible), or if there are no fields in the two tables that match on name.
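
A minimal sketch of the second case, using SQLite only because it ships with Python. The tables a, b and c and their columns are hypothetical, and other engines may differ in details, but this follows the usual NATURAL JOIN semantics.

```python
import sqlite3

# SQLite is used here only because it is bundled with Python; the table and
# column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, a_name TEXT);
    CREATE TABLE b (id INTEGER, b_name TEXT);
    CREATE TABLE c (c_id INTEGER, c_name TEXT);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO b VALUES (1, 'b1'), (2, 'b2'), (4, 'b4');
    INSERT INTO c VALUES (1, 'c1'), (2, 'c2');
""")

# a and b share the column name "id": only rows with equal ids are paired.
shared = conn.execute("SELECT * FROM a NATURAL JOIN b").fetchall()

# a and c share no column name: every row of a is paired with every row of c.
unshared = conn.execute("SELECT * FROM a NATURAL JOIN c").fetchall()

print(len(shared), "rows with a shared column name")
print(len(unshared), "rows with no shared column name")
```

With no column name in common there is nothing to match on, so the join degenerates into the full product of the two tables.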

If I were you I would discard the NATURAL JOIN in favour of a plain JOIN, specifying the fields you want to match.

If that solves the crashing then all well and good, but it would be a surprise to me if it did.

1
Hugh Jones On

As I have been criticised for my comments on this post, I have prepared an example to illustrate my opinion on the subject.

The following Oracle script is an illustration of what I think is the danger inherent in the use of the NATURAL JOIN construct. I accept it is a contrived example but in the interests of defensive development I believe my position holds true.

DROP TABLE TABLE1;
DROP TABLE TABLE2;

CREATE TABLE TABLE1 (
FIELD1   VARCHAR2(10),
FIELD2   VARCHAR2(10),
DESCR_T1 VARCHAR2(20)
);

CREATE TABLE TABLE2 (
FIELD1   VARCHAR2(10),
FIELD2   VARCHAR2(10),
DESCR_T2 VARCHAR2(20)
);

INSERT INTO TABLE1 VALUES('AAA','AAA',    'AAA_AAA_T1'   );
INSERT INTO TABLE1 VALUES('BBB','BBB',    'BBB_BBB_T1'   );
INSERT INTO TABLE1 VALUES('CCC','T1_CCC', 'CCC_T1_CCC_T1');
INSERT INTO TABLE1 VALUES('DDD','T1_DDD', 'DDD_T1_DDD_T1');
INSERT INTO TABLE1 VALUES('EEE',NULL    , 'EEE_NULL_T1'  );

INSERT INTO TABLE2 VALUES('AAA','AAA',    'AAA_AAA_T2'   );
INSERT INTO TABLE2 VALUES('BBB','BBB',    'BBB_BBB_T2'   );
INSERT INTO TABLE2 VALUES('CCC','T2_CCC', 'CCC_T1_CCC_T2');
INSERT INTO TABLE2 VALUES('DDD','T2_DDD', 'DDD_T1_DDD_T2');
INSERT INTO TABLE2 VALUES('EEE',NULL    , 'EEE_NULL_T2'  );

COMMIT;

-- try the following queries and review the results

SELECT 
  FIELD1, DESCR_T1, DESCR_T2 
FROM 
  TABLE1 NATURAL JOIN TABLE2;

SELECT 
  * 
FROM 
  TABLE1 NATURAL JOIN TABLE2;

SELECT 
  TABLE1.FIELD1, TABLE1.DESCR_T1, TABLE2.DESCR_T2 
FROM 
  TABLE1 JOIN 
    TABLE2 ON 
      TABLE2.FIELD1 = TABLE1.FIELD1 AND 
      TABLE2.FIELD2 = TABLE1.FIELD2;

-- Issue the following statement then retry the previous 3 statements.
-- The 'NJs' silently change behaviour and produce radically different results
-- whereas the third requires hands-on attention.  I believe this third behaviour
-- is desirable.  (You could equally drop the column TABLE2.FIELD2, as dportas
-- has suggested.)

-- ALTER TABLE TABLE2 RENAME COLUMN FIELD2 TO T2_FIELD2;
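
For anyone without Oracle to hand, here is a rough Python/SQLite re-creation of the same experiment. It is a sketch, not the script above: SQLite stands in for Oracle, TEXT replaces VARCHAR2, and instead of running ALTER TABLE the hypothetical helper build() creates TABLE2 with either the original or the renamed column.

```python
import sqlite3

T1_ROWS = [
    ("AAA", "AAA", "AAA_AAA_T1"),
    ("BBB", "BBB", "BBB_BBB_T1"),
    ("CCC", "T1_CCC", "CCC_T1_CCC_T1"),
    ("DDD", "T1_DDD", "DDD_T1_DDD_T1"),
    ("EEE", None, "EEE_NULL_T1"),
]
T2_ROWS = [
    ("AAA", "AAA", "AAA_AAA_T2"),
    ("BBB", "BBB", "BBB_BBB_T2"),
    ("CCC", "T2_CCC", "CCC_T1_CCC_T2"),
    ("DDD", "T2_DDD", "DDD_T1_DDD_T2"),
    ("EEE", None, "EEE_NULL_T2"),
]

def build(table2_second_column):
    """Create both tables; the name of TABLE2's second column is a parameter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TABLE1 (FIELD1 TEXT, FIELD2 TEXT, DESCR_T1 TEXT)")
    conn.execute(
        f"CREATE TABLE TABLE2 (FIELD1 TEXT, {table2_second_column} TEXT, DESCR_T2 TEXT)"
    )
    conn.executemany("INSERT INTO TABLE1 VALUES (?, ?, ?)", T1_ROWS)
    conn.executemany("INSERT INTO TABLE2 VALUES (?, ?, ?)", T2_ROWS)
    return conn

NATURAL = "SELECT FIELD1, DESCR_T1, DESCR_T2 FROM TABLE1 NATURAL JOIN TABLE2"
EXPLICIT = """
    SELECT TABLE1.FIELD1, TABLE1.DESCR_T1, TABLE2.DESCR_T2
    FROM TABLE1 JOIN TABLE2
      ON TABLE2.FIELD1 = TABLE1.FIELD1 AND TABLE2.FIELD2 = TABLE1.FIELD2
"""

before = build("FIELD2")     # schema as originally created
after = build("T2_FIELD2")   # schema as it would be after the rename

natural_before = before.execute(NATURAL).fetchall()
natural_after = after.execute(NATURAL).fetchall()
explicit_before = before.execute(EXPLICIT).fetchall()

# The explicit join names TABLE2.FIELD2, which no longer exists after the rename.
try:
    after.execute(EXPLICIT)
    explicit_after_error = None
except sqlite3.OperationalError as exc:
    explicit_after_error = str(exc)

print("NATURAL JOIN rows before/after:", len(natural_before), len(natural_after))
print("explicit JOIN rows before:", len(explicit_before))
print("explicit JOIN after rename:", explicit_after_error)
```

The natural join keeps running after the rename but now matches on FIELD1 alone, so its row count changes without any warning. The explicit join stops with an error instead, which is the hands-on attention I am arguing for.
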
0
nvogel On

Extending Hugh's example data, here is an example of two NATURAL JOIN queries. Hopefully it can be seen that these are "safe" from the problem that Hugh described and that the NJ version is actually less verbose (and in my opinion more readable) than the INNER JOIN version.

SELECT *
FROM 
(SELECT FIELD1, DESCR_T1 FROM TABLE1) T1
NATURAL JOIN
(SELECT FIELD1, DESCR_T2 FROM TABLE2) T2;

SELECT * 
FROM 
(SELECT FIELD1, FIELD2, DESCR_T1 FROM TABLE1) T1
NATURAL JOIN
(SELECT FIELD1, FIELD2, DESCR_T2 FROM TABLE2) T2;

The problem Hugh is talking about does not exist unless you write sloppy code. If you do write sloppy code then INNER JOIN is "unsafe" too. What this exchange may illustrate is that natural joins are not always well understood. That might be a reason why some people are unreasonably suspicious of them.
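
As a rough check of that claim, here is a Python/SQLite sketch that runs the two queries above against Hugh's data, once with the original schema and once with TABLE2.FIELD2 renamed to T2_FIELD2. SQLite stands in for Oracle, and build() is a hypothetical helper written only for this illustration.

```python
import sqlite3

T1_ROWS = [
    ("AAA", "AAA", "AAA_AAA_T1"),
    ("BBB", "BBB", "BBB_BBB_T1"),
    ("CCC", "T1_CCC", "CCC_T1_CCC_T1"),
    ("DDD", "T1_DDD", "DDD_T1_DDD_T1"),
    ("EEE", None, "EEE_NULL_T1"),
]
T2_ROWS = [
    ("AAA", "AAA", "AAA_AAA_T2"),
    ("BBB", "BBB", "BBB_BBB_T2"),
    ("CCC", "T2_CCC", "CCC_T1_CCC_T2"),
    ("DDD", "T2_DDD", "DDD_T1_DDD_T2"),
    ("EEE", None, "EEE_NULL_T2"),
]

def build(table2_second_column):
    """Create both tables; the name of TABLE2's second column is a parameter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TABLE1 (FIELD1 TEXT, FIELD2 TEXT, DESCR_T1 TEXT)")
    conn.execute(
        f"CREATE TABLE TABLE2 (FIELD1 TEXT, {table2_second_column} TEXT, DESCR_T2 TEXT)"
    )
    conn.executemany("INSERT INTO TABLE1 VALUES (?, ?, ?)", T1_ROWS)
    conn.executemany("INSERT INTO TABLE2 VALUES (?, ?, ?)", T2_ROWS)
    return conn

# First query: both sides project FIELD1 only, so FIELD1 is the only join column.
NJ_ON_FIELD1 = """
    SELECT * FROM
    (SELECT FIELD1, DESCR_T1 FROM TABLE1) T1
    NATURAL JOIN
    (SELECT FIELD1, DESCR_T2 FROM TABLE2) T2
"""
# Second query: both sides project FIELD1 and FIELD2 by name.
NJ_ON_BOTH = """
    SELECT * FROM
    (SELECT FIELD1, FIELD2, DESCR_T1 FROM TABLE1) T1
    NATURAL JOIN
    (SELECT FIELD1, FIELD2, DESCR_T2 FROM TABLE2) T2
"""

before = build("FIELD2")
after = build("T2_FIELD2")

field1_before = sorted(before.execute(NJ_ON_FIELD1).fetchall())
field1_after = sorted(after.execute(NJ_ON_FIELD1).fetchall())
both_before = before.execute(NJ_ON_BOTH).fetchall()

# The second query names FIELD2 in TABLE2 explicitly, so the rename breaks it.
try:
    after.execute(NJ_ON_BOTH)
    both_after_error = None
except sqlite3.OperationalError as exc:
    both_after_error = str(exc)

print("first query unchanged by rename:", field1_before == field1_after)
print("second query rows before rename:", len(both_before))
print("second query after rename:", both_after_error)
```

Because each side of the join lists its columns explicitly, the first query returns the same rows under both schemas. The second fails with an error after the rename instead of silently changing its result.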