Understanding cartesian product in SQL

8.4k views Asked by At

I am not able to understand how Cartesian product works. Consider the simple schema:

mysql> select * from account;
+----------------+-------------+---------+
| account_number | branch_name | balance |
+----------------+-------------+---------+
| A101           | Downtown    |     500 |
| A102           | Perryridge  |     400 |
| A201           | Brighton    |     900 |
| A215           | Mianus      |     700 |
| A217           | Brighton    |     750 |
| A222           | Redwood     |     700 |
| A305           | Round Hill  |     350 |
+----------------+-------------+---------+
7 rows in set (0.00 sec)

Now, when I pose the query

select a.balance from account a, account b where a.balance<b.balance;

I get a series of values except the maximum value 900. Then using the not in operator I determine the maximum value. Before that in the above query, when the join takes place based on the condition a.balance<b.balance, the first tuple in the relation must be 500. Theoretically, the first 5 values must be:

500
500
500
500
400

But I get :

+---------+
| balance |
+---------+
|     400 |
|     350 |
|     350 |
|     500 |
|     400 |

How is it working? I am using MySQL database.

2

There are 2 answers

2
Mureinik On BEST ANSWER

A Cartesian join joins every record in the first table with every record in the second table, so since your table has 7 rows and it's joined with itself, it should return 49 records had you not had a where clause. Your where clause only allows records where a's balance is smaller than b's balance. Since 900 is, as you said, the maximal balance in the table, it will never be smaller than any other balance, and therefore it will never be returned.

With regard to the first five rows, the normal rules of SQL apply to joins too. Since SQL tables have no intrinsic order, it's completely up to the database to decide how to return them, unless you explicitly state an order in the order by clause. The values you listed are perfectly valid values you'd expect the query to return.

0
Vlad Mihalcea On

The Cartesian Product generates all possible combinations of records from two given sets of data.

In your case, to generate a Cartesian Product, you'd have to either use CROSS JOIN:

SELECT 
  a.branch_name AS first_branch,
  b.branch_name AS second_branch,
  a.balance + b.balance AS total_balance
FROM account a
CROSS JOIN account b 

or, use the SQL:89 theta-style join:

SELECT 
  a.branch_name AS first_branch,
  b.branch_name AS second_branch,
  a.balance + b.balance AS total_balance
FROM account a, account b 

Anyway, the goal of the Cartesian product would be to associate all rows of two sets.

The moment you apply some filtering criteria to the Cartesian product generated by a CROSS JOIN, the result will no longer be a Cartesian product, but a subset of it, that matches to the given filtering conditions.

So, in your case, this query:

SELECT 
  a.balance 
FROM account a, account b 
WHERE a.balance < b.balance

does not generate a Cartesian Product.

In fact, a much better alternative to your query is this one:

SELECT 
  a.balance 
FROM account a
WHERE a.balance < (
  SELECT MAX(balance) FROM account 
)

If you want to get all rows that have a lower balance than the maximum one.

Anyway, the use of the self CROSS JOIN here looks suspicious. That's why you are probably better off using a subquery instead.