troubles with 'WHERE...IN' clause

56 views Asked by At

I'm trying to run the following query through pandasql, but the output I get is not what I was expecting. I was expecting to get a table with exactly 800 rows as I am selecting the only employee_day_transmitters of the table employee_days_transmitters, but what I get is a table with more than 800 rows. What's wrong? How can I get exactly 800 rows related to the employee_day_transmitters selected in the table employee_days_transmitters?


  query_text = '''WITH employee_days_transmitters AS (
                   SELECT DISTINCT
                   employeeId
                   , theDate
                   , transmitterId
                   , employeeId || '-' || CAST(theDate AS STRING) || '-' || transmitterId AS employee_day_transmitter
                   FROM
                   table1
                   WHERE variable='rpv'
                   ORDER BY
                   RANDOM()
                   LIMIT
                   800
                   )
                     SELECT
                     * 
                     FROM
                     table1
                     WHERE
                     (employeeId || '-' || CAST(theDate AS STRING) || '-' || transmitterId) IN (SELECT employee_day_transmitter FROM employee_days_transmitters) AND variable = 'rpv'
                     '''
table2=pandasql.sqldf(query_text,globals())
1

There are 1 answers

0
forpas On

You are using DISTINCT in the CTE, so I suspect you have duplicates for the combination of the columns employeeId, theDate, transmitterId and this why you get more than 800 rows.
You select 800 rows in the CTE but when you use the operator IN in your main query, all the rows that satisfy your conditions are returned, which are more than 800.

But why do you use the CTE?
You could apply the conditions directly in the main query:

SELECT DISTINCT employeeId, theDate, transmitterId
FROM table1
WHERE variable='rpv'
ORDER BY RANDOM()
LIMIT 800

Or maybe with ROW_NUMBER() window function:

WITH cte AS (
  SELECT id 
  FROM (
    SELECT rowid id,
           ROW_NUMBER() OVER (PARTITION BY employeeId, theDate, transmitterId ORDER BY RANDOM()) rn
    FROM table1
    WHERE variable='rpv'
  )
  WHERE rn = 1
  ORDER BY RANDOM()
  LIMIT 800
)
SELECT *
FROM table1
WHERE rowid IN cte