IN subquery's WHERE condition affects main query - Is this a feature or a bug?

5.6k views Asked by At

Assume the two tables:

Table A: A1, A2, A_Other
Table B: B1, B2, B_Other

In the following examples, is something is a condition checked against a fixed value, e.g. = 'ABC' or < 45.

I wrote a query like the following (1):

Select * from A
Where A1 IN (
    Select Distinct B1 from B
    Where B2 is something
    And A2 is something
);

What I really meant to write was (2):

Select * from A
Where A1 IN (
    Select Distinct B1 from B
    Where B2 is something
)
And A2 is something;

Strangely, both queries returned the same result. When looking at the explain plan of query 1, it looked like when the subquery was executed, because the condition A2 is something was not applicable to the subquery, it was deferred for use as a filter on the main query results.

I would normally expect query 1 to fail because the subquery by itself would fail:

Select Distinct B1 from B
Where B2 is something
And A2 is something; --- ERROR:  column "A2" does not exist

But I find this is not the case, and Postgres defers inapplicable subquery conditions to the main query.

Is this standard behaviour or a Postgres anomaly? Where is this documented, and what is this feature called?

Also, I find that if I add a column A2 in table B, only query 2 works as originally intended. In this case the reference A2 in query 2 would still refer to A.A2, but the reference in query 1 would refer to the new column B.A2 because it is now applicable directly in the subquery.

4

There are 4 answers

0
Rachcha On BEST ANSWER

Excellent question here, something that a lot of people come across but don't bother to stop and look.

What you are doing is writing a subquery in the WHERE clause; not an inline view in the FROM clause. There's the difference.

When you write a subquery in SELECT or WHERE clauses, you can access the tables that are in the FROM clause of the main query. This doesn't happen only in Postgres, but it is a standard behaviour and can be observed in all the leading RDBMSes, including Oracle, SQL Server and MySQL.

When you run the first query, the optimizer takes a look at your entire query and determines when to check for which conditions. It is this behaviour of the optimizer that you see the condition is deferred to the main query because the optimizer figures out that it is faster to evaluate this condition in main query itself without affecting the end result.

If you run just the subquery, commenting out the main query, it is bound to return an error at the position that you have mentioned as the column that is being referred to is not found.

In your last paragraph, you have mentioned that you added a column A2 to table tableB. What you have observed is right. That happens because of the implicit reference phenomenon. If you don't mention the table alias for a column, the database engine looks for the column first in the tables in FROM clause of the subquery. Only if the column is not found there, a reference is made to the tables in main query. If you use the following query, it would still return the same result:

Select * from A aa -- Check the alias
Where A1 IN (
    Select Distinct B1 from B bb
    Where B2 is something
    And aa.A2 is something -- Check the reference
);

Perhaps you can find more information in Korth's book on relational database, but I'm not sure. I have just answered your question based on my observations. I know this happens and why. I just don't know how I can provide you with further references.

0
Airan On

Correlated Subquery:- If the outcome of a subquery is depends on the value of a column of its parent query table then the Sub query is called Correlated Subquery. It is standard behavior, not an error.

It is not necessary that the column on which the correlated query is depended is included in the selected columns list of the parent query.

Select * from A
Where A1 IN (
    Select Distinct B1 from B
    Where B2 is something
    And A2 is something
);

A2 is a column of table A and parent query is on table A. That means A2 can be referenced in subquery. The above query might work slower than the following one.

Select * from A
Where A2 is something And A1 IN (
    Select Distinct B1 from B
    Where B2 is something
);

That is because A2 from parent query is referenced in loop. It depends upon the condition for the data to be fetched. If subquery is something like

Select Distinct B1 from B
Where B2 is A2

we have to reference parent query column. Alternatively, we can use joins.

0
Erwin Brandstetter On

You already have your explanation why correlated subqueries in the WHERE clause can reference all columns from tables in the FROM list.

Aside from that, using a JOIN or an EXISTS semi-join is often substantially faster than correlated subqueries. I would rewrite to this 100% equivalent query:

SELECT a.*
FROM   a
JOIN   (
   SELECT DISTINCT b1
   FROM   b
   WHERE  b2 is something
   ) b ON b.b1 = a.a1 
WHERE  a.a2 is something

Or, better yet:

SELECT *
FROM   a
WHERE  EXISTS (
   SELECT 1 
   FROM   b
   WHERE  b.b1 = a.a1 
   AND    b.b2 is something
   )
AND    a.a2 is something;
0
CodeCowboyOrg On

The results are not strange, the subquery CAN referernce the PARENT query. This is called a Correlated SubQuery and is very common. In your example you used the IN operator, but usually to OPTIMIZE a query with the IN operation is to replace IN with the EXISTS operator using a Correlated SubQuery.

To elaborate on Erwin's comment about EXISTS being faster, this is because when you use IN "sometimes" requires the Query to find all the values of the set. Whereas using EXISTS simply requires the First occurrence to be found to satisfy the condition. However it maybe the case that the Query Plan optimizes both to be the same. But using EXISTS explicitly assist the Optimizer in constructing the intended Query Plan faster.