Spark: Equivalent to not in

525 views Asked by At

I have a where clause in Spark SQL that for some reason doesn't return any records. I think it doesn't work and so I would like to ask what would be equivalent to it?

SELECT
  c.client_id,
  current_date() as insert_date
FROM
  CLIENT_SUB c
WHERE
  (c.client_id, insert_date) not in (SELECT client_id, insert_date from CLIENT_SUBSCRIBER_CONTRACT)

I heard I could do it with a join

2

There are 2 answers

1
GMB On BEST ANSWER

I would recommend not exists: it is null-safe, while not it isn't - and it usually scales better, too.

I am also suspiscious about the reference to insert_date: do you really mean that, or you actually want current_date()?

select cs.client_id, current_date() as insert_date
from client_sub cs
where not exists (
    select 1 
    from client_subscriber_contract csc
    where 
        csc.client_id = c.client_id 
        and csc.insert_date = cs.insert_date
        -- or, maybe: csc.insert_date = current_date()
)

For performance, consider an index on client_subscriber_contract(client_id, insert_date).

0
Kathmandude On

I suspect you have null values in the output of your subquery because not in doesn't output anything when matched against values containing nulls. Try

not in (select client_id, insert_date 
        from CLIENT_SUBSCRIBER_CONTRACT
        where coalesce(client_id, insert_date) is not null)

I recommend looking into not exists for your purpose though