I'm trying to run the following query through pandasql, but the output I get is not what I was expecting. I was expecting to get a table with exactly 800 rows as I am selecting the only employee_day_transmitters of the table employee_days_transmitters, but what I get is a table with more than 800 rows. What's wrong? How can I get exactly 800 rows related to the employee_day_transmitters selected in the table employee_days_transmitters?
query_text = '''WITH employee_days_transmitters AS (
SELECT DISTINCT
employeeId
, theDate
, transmitterId
, employeeId || '-' || CAST(theDate AS STRING) || '-' || transmitterId AS employee_day_transmitter
FROM
table1
WHERE variable='rpv'
ORDER BY
RANDOM()
LIMIT
800
)
SELECT
*
FROM
table1
WHERE
(employeeId || '-' || CAST(theDate AS STRING) || '-' || transmitterId) IN (SELECT employee_day_transmitter FROM employee_days_transmitters) AND variable = 'rpv'
'''
table2=pandasql.sqldf(query_text,globals())
You are using
DISTINCT
in the CTE, so I suspect you have duplicates for the combination of the columnsemployeeId, theDate, transmitterId
and this why you get more than 800 rows.You select 800 rows in the CTE but when you use the operator
IN
in your main query, all the rows that satisfy your conditions are returned, which are more than 800.But why do you use the CTE?
You could apply the conditions directly in the main query:
Or maybe with
ROW_NUMBER()
window function: