I've got over 500million rows stored in BigQuery which basically represent the exact position of a device at a certain time (irregular).
I'm trying to find a fast and efficient way to determine the first and the last seen position of the device.
So far, I have it working with a join but it's taking over 10 mins to complete (unless I just a limit in the query). I also tried with a dense_rank query but I can't sort the count out (and I don't fully understand it tbh).
I have a client_id, device_id (which is fixed and represents the location within the building) and a timestamp.
First I did a group by client_id and device_id to validate what I should expect. Then I tried joining the table using the min and max timestamp:
SELECT count(FirstSet.device_id), FirstSet.device_id
FROM
(
SELECT client_id, device_id, created_at
FROM [mytable.visitsv3]
WHERE secret = 'xxx'
GROUP each BY client_id, device_id, created_at
ORDER BY client_id, created_at asc
LIMIT 1000
) as FirstSet
inner join
(
SELECT client_id, device_id, min(created_at)
FROM [mytable.visitsv3]
WHERE secret = 'xxx'
GROUP each BY client_id, device_id, created_at
LIMIT 1000
) SecondSet
on FirstSet.device_id = SecondSet.device_id
GROUP BY FirstSet.device_id
ORDER BY 1 DESC
limit 25
I'm new to this world so would appreciate some advice. As I said, it's more about performance as we need to run the queries live.
There's loads of tutorials about the same concept - I'm just hoping someone can help optimise things.
--- EDIT ---
Output format should look like so:
|count|device_id|
-----------------
|10000| 123|
| 9000| 345|
| 800| 234|
--- IN CONTEXT ---
We're trying to achieve a number of things with our dataset (not in a single query of course). Including:
- Create a heatmap of the most popular entry and exit points.
Initially this just needs to be a totals but long-term, we've got to understand which entry / exit points are busy at certain times of the day. And then use the correlation functions for some predictive analysis.
- Create paths of the clients as the move through the buildings.
Once we have the entry points, we need to map the paths the clients take. Again, we need to predict behaviour per day / hour etc.
Initially, we just need the most popular entry / exit points. For this reason, we grouped the clients together and did a count of the devices they appeared on.
The data looks like this:
client_id,device_id,created_at,start,end,duration,lat,lng
F047CA72E,0013BA30,2015-06-22 10:00:32 UTC,2015-06-22 09:30:31 UTC,2015-06-22 09:30:32 UTC,1,XX,YY
F40D8632F,00A30E00,2015-06-22 10:00:29 UTC,2015-06-22 09:30:26 UTC,2015-06-22 09:30:26 UTC,0,XX,YY
B808AA3E0,00138B20,2015-06-22 10:00:27 UTC,2015-06-22 09:30:25 UTC,2015-06-22 09:30:25 UTC,0,XX,YY
A0E532E96,00A33600,2015-06-22 10:00:34 UTC,2015-06-22 09:30:24 UTC,2015-06-22 09:30:25 UTC,1,XX,YY
D4F6F8D50,00149150,2015-06-22 10:00:34 UTC,2015-06-22 09:30:22 UTC,2015-06-22 09:30:22 UTC,0,XX,YY
70124EB7E,00A350A0,2015-06-22 10:00:33 UTC,2015-06-22 09:30:14 UTC,2015-06-22 09:30:14 UTC,0,XX,YY
200AF2A9E,00149090,2015-06-22 10:00:33 UTC,2015-06-22 09:30:13 UTC,2015-06-22 09:30:12 UTC,0,XX,YY
20AE4884A,0013EFC0,2015-06-22 10:00:30 UTC,2015-06-22 09:30:13 UTC,2015-06-22 09:30:13 UTC,0,XX,YY
CCD9BB01C,0013EFC0,2015-06-22 10:00:30 UTC,2015-06-22 09:30:12 UTC,2015-06-22 09:30:06 UTC,0,XX,YY
8CCE1F24E,0004165E,2015-06-22 10:00:12 UTC,2015-06-22 09:30:12 UTC,2015-06-22 09:30:12 UTC,0,XX,YY
The count of the clients per device is in someways irrelevant as it will be normalised for the heatmap regardless. We just need them ranked so we can see most popular AND least popular.
Let me rephrase how I understand the setup: - Devices are installed at fixed locations throughout the buildings - Clients (people) move through the building at when they pass nearby the device, this event is recorded - The time when client with client_id passes device with device_id is recorded in timestamp created_at
Therefore, the very first created_at timestamp for given client is when she entered the building, and the corresponding device_id will be an entry point. To find it, we can use
In order to find 10 most popular entries, we can use TOP with COUNT(*), i.e.
Putting it all together:
Same things with exit points, but this time we will use last_value window function: