I have written a query in databricks sql and the output isn't what I am expecting. Have tried a few methods to determine the cause and am stuck, hoping you can help.
The output is currently producing multiple PeriodStartDate rows per EmployeeID. Where it should only be a single PeriodStartDate row per EmployeeID.
When I update the query to only run for a single EmployeeID, it works fine. It is just when I run the query on the full employee list that it is duplicating data.
Image 1 shows the duplication
Image 2 shows it when run just on EmployeeID 100085, no duplicates.
It's a long query, apologies, hoping someone can spot what is going on or offer suggestions to de-bug.
WITH EmploymentPositionHistory AS (
SELECT
Trim(t104f005_employee_no) as EmployeeID,
Trim(t104f015_occupancy_type) as OccupancyType,
Trim(t104f020_movement_code) as MovementCode,
Cast(Trim(t104f025_date_effective) as DATE) as MovementStartDate,
Cast(Trim(t104f035_connect_date_to) as DATE) as MovementEndDate,
Trim(t104f040_position_no) as PositionID,
Trim(t104f050_classification) as Classification,
Trim(t104f055_increment_point) as PayPoint,
Trim(t104f085_employment_type) as EmploymentType,
Trim(t104f095_attendance_type) as AttendanceType,
Trim(t104f120_comments) as PayrollComments,
Trim(t104f125_job_code) as JobCode,
Trim(t001f010_wami_key) as WAMIKey,
Trim(t001f015_employee_status) as EmploymentStatus,
CAST(Trim(t001f250_date_commenced) as DATE) as EmployeeIDStartDate,
CAST(Trim(t001f251_original_date_comm) as DATE) as MaterStartDate,
CAST(Trim(t001f265_date_terminated) as DATE) as EmployeeIDTerminationDate,
Trim(t000f030_salutation) as Salutation,
Concat(Trim(t000f025_preferred_name),' ',INITCAP(Trim(t000f015_surname))) as EmployeeName,
Trim(t000f025_preferred_name) as PreferredName,
Trim(t000f020_given_names) as GivenName,
INITCAP(Trim(t000f015_surname)) as Surname,
Trim(t000f085_initials) as Initials,
Trim(t000f110_date_of_birth) as DateOfBirth,
Trim(t000f105_gender) as Gender,
Trim(t000f285_postal_postcode) as PostalPostCode,
Trim(t000f280_postal_state) as PostalState,
Trim(t000f295_postal_country) as PostalCountry,
Trim(t000f160_email_address) as PersonalEmail
FROM data_hub_prod.data_hub_raw_history.aurion_t104_employment_history t104
--Join the aurion_t001_basic_details to grab employee specific info
LEFT JOIN data_hub_prod.data_hub_raw_history.aurion_t001_basic_details t001
ON Trim(t001f005_employee_no) = Trim(t104f005_employee_no)
AND t001.dh_current_flag = true
AND t001.dh_record_deleted_flag = false
--Join the aurion_t000_wami to grab employee specific info
LEFT JOIN data_hub_prod.data_hub_raw_history.aurion_t000_wami t000
ON Trim(t000f005_wami_key) = Trim(t001f010_wami_key)
AND t000.dh_current_flag = true
AND t000.dh_record_deleted_flag = false
--Join the org unit table to exclude the other entities
LEFT JOIN data_hub_prod.data_hub_raw_history.aurion_t106_organisation_units t106
ON Trim(t104f045_organisation_unit_no) = Trim(t106f005_organisation_unit_no)
and t106.dh_current_flag = 'true'
and t106.dh_record_deleted_flag = 'false'
WHERE Trim(t106f075_org_unit_level_02) in ('Mater Foundation','Mater Misericordiae Ltd','Mater Research')
AND t104.dh_current_flag = true
AND t104.dh_record_deleted_flag = false
--AND trim(t104f005_employee_no) = 100085
),
EmploymentContractHistoryInitital as
( SELECT
trim(t050f015_employee_no) as EmployeeID,
CAST(trim(t050f030_effective_date) as DATE) as EffectiveDate,
trim(t050f045_award_code) as AwardCode,
trim(t050f050_pay_entity) as PayEntity,
trim(t050f055_pay_branch) as PayBranch,
CAST(trim(t050f070_contract_expiry_date) as DATE) as ContractExpiryDate,
trim(t050f075_hours_per_week) as HoursPerWeek,
if(Trim(t050f215_employment_type) in ('CASUAL','STUDNT','VOL','BOARD') , 0 ,
(substring(t050f075_hours_per_week,1,2) * 60 * 60
+ substring(t050f075_hours_per_week,4,2) * 60
+ substring(t050f075_hours_per_week,7,2)) * 1.0 / 3600 ) as ContractedHoursPerWeek,
if(Trim(t050f215_employment_type) in ('CASUAL','STUDNT','VOL','BOARD') , 0 ,
((substring(t050f075_hours_per_week,1,2) * 60 * 60
+ substring(t050f075_hours_per_week,4,2) * 60
+ substring(t050f075_hours_per_week,7,2)) * 1.0 / 3600 )
/
(CAST(SUBSTRING_INDEX(t206f050_hours_per_week, ':', 1) AS INT) +
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(t206f050_hours_per_week, ':', -2), ':', 1) AS INT) / 60)) as ContractedFTE,
trim(t050f180_position_number) as PositionID,
RANK () OVER (
PARTITION BY trim(t050f015_employee_no)
ORDER BY CAST(trim(t050f030_effective_date) as DATE) ASC
) AS ContractRank
FROM data_hub_prod.data_hub_raw_history.aurion_t050_change_schedule t050
--Join the position table to retrieve the award hours for contracted FTE calculation
LEFT JOIN data_hub_prod.data_hub_raw_history.aurion_t101_position t101
ON Trim(t050f180_position_number) = Trim(t101f005_position_no)
AND t101.dh_current_flag = true
AND t101.dh_record_deleted_flag = false
--Join the classification details to retrieve the award hours for contracted FTE calculation
LEFT JOIN data_hub_prod.data_hub_raw_history.aurion_t206_classification t206
ON Trim(t101.t101f020_classification) = Trim(t206.t206f005_classification_code)
and t206.dh_current_flag = 'true'
and t206.dh_record_deleted_flag = 'false'
--Join the org unit table to exclude the other entities
LEFT JOIN data_hub_prod.data_hub_raw_history.aurion_t106_organisation_units t106
ON Trim(t050f205_organisation_unit_no) = Trim(t106f005_organisation_unit_no)
and t106.dh_current_flag = 'true'
and t106.dh_record_deleted_flag = 'false'
--Join the aurion_t050_change_schedule table back onto itself to identify the when an effective date record was last modified per employee
LEFT JOIN (
SELECT Trim(t050f015_employee_no) as EmpID
,CAST(Trim(t050f030_effective_date) as DATE) as EffectiveDate
,MAX(Trim(t050f990_modify_date_time))as ModifiedDateTime
FROM data_hub_prod.data_hub_raw_history.aurion_t050_change_schedule
WHERE dh_current_flag = true
AND dh_record_deleted_flag = false
AND t050f075_hours_per_week is not null
AND (NOT t050f045_award_code LIKE ('%TRAN%') OR t050f045_award_code is null)
--AND Trim(t050f015_employee_no) = 100320
GROUP BY Trim(t050f015_employee_no),CAST(Trim(t050f030_effective_date) as DATE)
) t2
--Where statement for main part of the query
WHERE Trim(t106f075_org_unit_level_02) in ('Mater Foundation','Mater Misericordiae Ltd','Mater Research')
AND t050f075_hours_per_week is not null
AND (NOT t050f045_award_code LIKE ('%TRAN%') OR t050f045_award_code is null)
AND CAST(t050.t050f030_effective_date as DATE) = t2.EffectiveDate
AND Trim(t050.t050f990_modify_date_time) = t2.ModifiedDateTime
AND t050.dh_current_flag = true
AND t050.dh_record_deleted_flag = false
--AND trim(t050f015_employee_no) = 100085
),
EmploymentContractHistoryToDate as
(
SELECT
trim(t050f015_employee_no) as EmployeeID,
CAST(trim(t050f030_effective_date) as DATE) - 1 as EffectiveDateTo,
RANK () OVER (
PARTITION BY trim(t050f015_employee_no)
ORDER BY CAST(trim(t050f030_effective_date) as DATE) ASC
) -1 AS PrevContractRank
FROM data_hub_prod.data_hub_raw_history.aurion_t050_change_schedule t050
--Join the org unit table to exclude the other entities
LEFT JOIN data_hub_prod.data_hub_raw_history.aurion_t106_organisation_units t106
ON Trim(t050f205_organisation_unit_no) = Trim(t106f005_organisation_unit_no)
and t106.dh_current_flag = 'true'
and t106.dh_record_deleted_flag = 'false'
--Join the aurion_t050_change_schedule table back onto itself to identify the when an effective date record was last modified per employee
LEFT JOIN (
SELECT Trim(t050f015_employee_no) as EmpID
,CAST(Trim(t050f030_effective_date) as DATE) as EffectiveDate
,MAX(Trim(t050f990_modify_date_time))as ModifiedDateTime
FROM data_hub_prod.data_hub_raw_history.aurion_t050_change_schedule
WHERE dh_current_flag = true
AND dh_record_deleted_flag = false
AND t050f075_hours_per_week is not null
AND (NOT t050f045_award_code LIKE ('%TRAN%') OR t050f045_award_code is null)
--AND Trim(t050f015_employee_no) = 100320
GROUP BY Trim(t050f015_employee_no),CAST(Trim(t050f030_effective_date) as DATE)
) t2
--Where statement for main part of the query
WHERE Trim(t106f075_org_unit_level_02) in ('Mater Foundation','Mater Misericordiae Ltd','Mater Research')
AND t050f075_hours_per_week is not null
AND (NOT t050f045_award_code LIKE ('%TRAN%') OR t050f045_award_code is null)
AND CAST(t050.t050f030_effective_date as DATE) = t2.EffectiveDate
AND Trim(t050.t050f990_modify_date_time) = t2.ModifiedDateTime
AND t050.dh_current_flag = true
AND t050.dh_record_deleted_flag = false
--AND trim(t050f015_employee_no) = 100085
),
EmploymentContractHistory as
(
SELECT t1.*,
CAST(coalesce(t2.EffectiveDateTo,'2099-12-31') as DATE) as EffectiveDateTo
FROM EmploymentContractHistoryInitital t1
LEFT JOIN EmploymentContractHistoryToDate t2
ON t1.EmployeeID = t2.EmployeeID
and t1.ContractRank = t2.PrevContractRank
)
--this is the start of the final output
SELECT
CAST(trim(t916f015_period_start_date) AS DATE) AS PeriodStartDate,
CAST(trim(t916f010_period_end_date) AS DATE) AS PeriodEndDate,
Trim(t916f018_financial_year) AS FinancialYear,
Trim(t916f020_pay_no) AS PayNo,
if(nonsub.OccupancyType is null, sub.OccupancyType, nonsub.OccupancyType) as OccupancyType,
if(nonsub.OccupancyType is null, sub.MovementCode, nonsub.MovementCode) as MovementCode,
if(nonsub.EmployeeID is null, sub.EmployeeID, nonsub.EmployeeID) as EmployeeID,
if(nonsub.EmploymentType is null, sub.EmploymentType, nonsub.EmploymentType) as EmploymentType,
if(nonsub.AttendanceType is null, sub.AttendanceType, nonsub.AttendanceType) as AttendanceType,
if(nonsub.PositionID is null, sub.PositionID, nonsub.PositionID) as PositionID,
if(nonsub.Classification is null, sub.Classification, nonsub.Classification) as Classification,
if(nonsub.PayPoint is null, sub.PayPoint, nonsub.PayPoint) as PayPoint,
if(nonsub.JobCode is null, sub.JobCode, nonsub.JobCode) as JobCode,
if(nonsub.MovementStartDate is null, sub.MovementStartDate, nonsub.MovementStartDate) as MovementStartDate,
if(nonsub.MovementEndDate is null, sub.MovementEndDate, nonsub.MovementEndDate) as MovementEndDate,
if(nonsub.PayrollComments is null, sub.PayrollComments, nonsub.PayrollComments) as PayrollComments,
ech.EffectiveDate,
ech.EffectiveDateTo,
ech.ContractExpiryDate,
ech.AwardCode,
ech.PayEntity,
ech.PayBranch,
ech.ContractedHoursPerWeek,
ech.ContractedFTE,
ech.ContractRank as ContractNumber
FROM data_hub_prod.data_hub_raw_history.aurion_t916_pay_period
LEFT JOIN EmploymentPositionHistory nonsub
ON nonsub.OccupancyType = 'NONSUBS'
AND CAST(trim(t916f015_period_start_date) AS DATE) BETWEEN nonsub.MovementStartDate and nonsub.MovementEndDate
LEFT JOIN EmploymentPositionHistory sub
ON sub.OccupancyType = 'SUBS'
AND CAST(trim(t916f015_period_start_date) AS DATE) BETWEEN sub.MovementStartDate and sub.MovementEndDate
LEFT JOIN EmploymentContractHistory ech
ON if(nonsub.EmployeeID is null, sub.EmployeeID, nonsub.EmployeeID) = ech.EmployeeID
AND CAST(trim(t916f015_period_start_date) AS DATE) BETWEEN ech.EffectiveDate and ech.EffectiveDateTo
WHERE t916f005_pay_cycle_id = 1
AND dh_current_flag = true
AND dh_record_deleted_flag = false
AND CAST(SUBSTRING(t916f018_financial_year, 4, 2) AS INT) >= CAST(SUBSTRING((YEAR(CURRENT_DATE) - 2), 3, 2) as INT)
--AND if(nonsub.EmployeeID is null, sub.EmployeeID, nonsub.EmployeeID) = 50125
The issue was in how I was joining the EmploymentPositionHistory table. Initially it was just joining the table on twice, both based on date ranges, which was causing the individual employee records to duplicate.
The solution was for the second join to also join based on an employee ID match.
AND sub.EmployeeID = nonsub.EmployeeID
I also changed the order of the joins to ensure accuracy.