I am using Hive 0.14.0 on a Hortonworks Data Platform, with a big file similar to this input data:
| tpep_pickup_datetime | pulocationid |
|---|---|
| 2022-01-28 23:32:52.0 | 100 |
| 2022-02-28 23:02:40.0 | 202 |
| 2022-02-28 17:22:45.0 | 102 |
| 2022-02-28 23:19:37.0 | 102 |
| 2022-03-29 17:32:02.0 | 102 |
| 2022-01-28 23:32:40.0 | 101 |
| 2022-02-28 17:28:09.0 | 201 |
| 2022-03-28 23:59:54.0 | 100 |
| 2022-02-28 21:02:40.0 | 100 |
I want to find the most common pickup hour for each locationid, with this as the result:
| locationid | hour |
|---|---|
| 100 | 23 |
| 101 | 17 |
| 102 | 17 |
| 201 | 17 |
| 202 | 23 |
I was thinking of using a partition clause like this:
select * from (
select hour(tpep_pickup_datetime), pulocationid
(max (hour(tpep_pickup_datetime))) over (partition by pulocationid) as max_hour,
row_number() over (partition by pulocationid) as row_no
from yellowtaxi22
) res
where res.row_no = 1;
But it gives me this error: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: Invalid function pulocationid
Is there any other way of doing this?
You were halfway there!
The idea was in the right direction, but the syntax is a little bit off:
1. First, find the count for each hour.
2. Then add the `row_number`, but you need to order it by the total count in descending order: `select pulocationid, hour, cnt, row_number() over (partition by pulocationid order by cnt desc) as row_no from`
3. Last but not least, take only the rows with the highest count (this can also be done with the `max` function rather than the `row_number` one, by the way).

Or in total:
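Putting the three steps together, a minimal sketch of the combined query might look like the following. The table and column names (`yellowtaxi22`, `tpep_pickup_datetime`, `pulocationid`) come from the question; the aliases `pickup_hour`, `cnt`, `row_no`, `counts` and `ranked` are illustrative choices, not names from the original post. Treat it as a sketch to adapt rather than a tested query for your cluster.

```sql
-- Sketch only: aliases (pickup_hour, cnt, row_no, counts, ranked) are assumptions.
select pulocationid, pickup_hour
from (
    -- step 2: rank the hours within each location, busiest hour first
    select pulocationid,
           pickup_hour,
           cnt,
           row_number() over (partition by pulocationid order by cnt desc) as row_no
    from (
        -- step 1: number of pickups per location and hour
        select pulocationid,
               hour(tpep_pickup_datetime) as pickup_hour,
               count(*) as cnt
        from yellowtaxi22
        group by pulocationid, hour(tpep_pickup_datetime)
    ) counts
) ranked
-- step 3: keep only the top-ranked hour per location
where row_no = 1;
```

If two hours are tied for the highest count in a location, `row_number` keeps just one of them, and which one is not defined unless you add a tie-breaker such as `order by cnt desc, pickup_hour`. Using `rank()` instead would return all tied hours.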