I have a large dataset of GPS coordinates for vessels. The dataset has over 8MM records and to try and simplify, I've focused on a specific vessel to test out the logic if this would work.
If I run a quick summary of records for a given vessel using the following MySQL query in a subset table of the whole dataset:
SELECT equipment_id,equipment_name,geolocation,unlocode,
status,when_created AS event_date,COUNT(*)
FROM gpsdata_20231205_AD
WHERE equipment_name = 'CMA CGM AMBER'
GROUP BY equipment_name,geolocation,status
ORDER BY when_created,status;
I get the following output:
equipment_id | equipment_name | geolocation | unlocode | status | event_date | COUNT(*) |
---|---|---|---|---|---|---|
215148000 | CMA CGM AMBER | MANILA | PHMNL | Arrival | 10/4/2023 2:03 | 5 |
215148000 | CMA CGM AMBER | MANILA | PHMNL | Departure | 10/5/2023 5:16 | 1 |
215148000 | CMA CGM AMBER | SHANGHAI | CNSHA | Arrival | 10/9/2023 11:06 | 4 |
215148000 | CMA CGM AMBER | SHANGHAI | CNSHA | Departure | 10/10/2023 0:48 | 5 |
215148000 | CMA CGM AMBER | CAI MEP | NULL | Arrival | 11/30/2023 20:06 | 4 |
215148000 | CMA CGM AMBER | CAI MEP | NULL | Departure | 12/1/2023 15:13 | 3 |
215148000 | CMA CGM AMBER | PORT KLANG ANCH | NULL | Arrival | 12/3/2023 15:29 | 3 |
215148000 | CMA CGM AMBER | PORT KLANG ANCH | NULL | Departure | 12/4/2023 5:15 | 3 |
215148000 | CMA CGM AMBER | PORT KLANG | NULL | Arrival | 12/4/2023 5:43 | 3 |
215148000 | CMA CGM AMBER | PORT KLANG | NULL | Departure | 12/5/2023 6:04 | 2 |
The above query result should give some insight on what the dataset looks like.
The desired output I'm attempting to achieve is to match equipment_id,equipment_name,geolocation and get arrival and departure status & event date for a given geolocation.
So my desired output, using the first 2 geolocations as examples would look something like:
equipment_id | equipment_name | geolocation | unlocode | a_status | a_event_date | d_status | d_event_date |
---|---|---|---|---|---|---|---|
215148000 | CMA CGM AMBER | MANILA | PHMNL | Arrival | 10/4/2023 2:03 | Departure | 10/5/2023 5:16 |
215148000 | CMA CGM AMBER | SHANGHAI | CNSHA | Arrival | 10/9/2023 11:06 | Departure | 10/10/2023 0:48 |
I believe that I have the general concept on how to build the query, I'm having issues on how to build the logic that ensures I'm getting the correct event dates (given, that the dataset has multiple records for the same equipment_id,equipment_name,geolocation & status.) ** The closest to my desired output was:**
SELECT g1.equipment_id,g1.equipment_name,g1.geolocation,
g1.unlocode,g1.status,g1.when_created AS a_event_date,
g2.status,g2.when_created AS d_event_date,
DATEDIFF(g2.when_created,g1.when_created) AS event_delta
FROM gpsdata_20231205_AD AS g1
JOIN gpsdata_20231205_AD AS g2 ON(g1.equipment_id = g2.equipment_id AND g2.status = 'Departure')
WHERE g1.status = 'Arrival'
AND g1.geolocation NOT LIKE '%ANCH%'
AND g1.equipment_name ='CMA CGM AMBER'
AND g1.when_created < g2.when_created
GROUP BY g1.equipment_id,g1.equipment_name,g1.geolocation
ORDER BY g1.when_created,g2.when_created;
This is the output for the above query:
equipment_id | equipment_name | geolocation | unlocode | status | a_event_date | status | d_event_date | event_delta | next_geoloc | next_unloc |
---|---|---|---|---|---|---|---|---|---|---|
215148000 | CMA CGM AMBER | MANILA | PHMNL | Arrival | 10/4/2023 2:03 | Departure | 10/5/2023 5:16 | 1 | SHANGHAI | CNSHA |
215148000 | CMA CGM AMBER | SHANGHAI | CNSHA | Arrival | 10/9/2023 11:06 | Departure | 10/10/2023 0:48 | 1 | CJK | NULL |
215148000 | CMA CGM AMBER | CJK | NULL | Arrival | 10/10/2023 6:03 | Departure | 10/16/2023 2:47 | 6 | CAI MEP | NULL |
215148000 | CMA CGM AMBER | CAI MEP | NULL | Arrival | 11/30/2023 20:06 | Departure | 12/1/2023 15:13 | 1 | PORT KLANG | NULL |
215148000 | CMA CGM AMBER | PORT KLANG | NULL | Arrival | 12/4/2023 5:43 | Departure | 12/5/2023 6:04 | 1 | TBD | TBD |
Given the time spent working with this data, I believe at this points I have tunnel vision where I can't see if my approach is correct. Not quite sure how I can ensure I'll get the "MIN" when created date for a_event_date (earliest/ 1st date of arrival) and "MAX" for d_event_date (latest/last date for departure).
If the above query is correct and there isn't a flaw, how would I incorporate the two last columns, next_geoloc & next_unloc? I have no idea how to approach that.
Any feedback is welcome.
Preemptively, thank you for your time.
TB.