I''m new to MongoDB and I need to do kind of a subquery, in which I want:
- Return the count of records for top3 cariers (the array in $in) for each week.
My collection is as follows:
"MONTH","DAY_OF_MONTH","DAY_OF_WEEK","FL_DATE","OP_UNIQUE_CARRIER","ORIGIN_CITY_NAME","DEST_CITY_NAME","ARR_DELAY","DISTANCE",
3,28,3,2018-03-28,"B6","Newark, NJ","Fort Myers, FL",4.00,1068.00,
3,29,4,2018-03-29,"B6","Newark, NJ","Fort Myers, FL",6.00,1068.00,
3,30,5,2018-03-30,"B6","Newark, NJ","Fort Myers, FL",-4.00,1068.00,
3,31,6,2018-03-31,"B6","Newark, NJ","Fort Myers, FL",-3.00,1068.00,
3,1,4,2018-03-01,"B6","New York, NY","Long Beach, CA",39.00,2465.00,
3,2,5,2018-03-02,"B6","New York, NY","Long Beach, CA",0.00,2465.00,
3,3,6,2018-03-03,"B6","New York, NY","Long Beach, CA",2.00,2465.00,
3,4,7,2018-03-04,"B6","New York, NY","Long Beach, CA",-2.00,2465.00,
3,5,1,2018-03-05,"B6","New York, NY","Long Beach, CA",25.00,2465.00,
3,6,2,2018-03-06,"B6","New York, NY","Long Beach, CA",-21.00,2465.00,
3,7,3,2018-03-07,"B6","New York, NY","Long Beach, CA",20.00,2465.00,
3,8,4,2018-03-08,"B6","New York, NY","Long Beach, CA",48.00,2465.00,
3,9,5,2018-03-09,"B6","New York, NY","Long Beach, CA",16.00,2465.00,
3,10,6,2018-03-10,"B6","New York, NY","Long Beach, CA",6.00,2465.00,
3,11,7,2018-03-11,"B6","New York, NY","Long Beach, CA",-13.00,2465.00,
and my current query only returns the #count for each of the top3 but with harcoded values inside $in stage:
total_per_week = [
{"$match" :{
"OP_UNIQUE_CARRIER": { "$in": ["DL","9E","B6"]}
,"ORIGIN_CITY_NAME": {"$regex": "^New York"}
}
},
{"$group" : {
"_id" : {
"week": { "$week": "$FL_DATE" },
"carrier": "$OP_UNIQUE_CARRIER"
},
"total" : { "$sum": 1 }
}
},
#// Maybe project a prettier "flatter" output
{"$project": {
"_id": 0,
"carrier": "$_id.carrier",
"week": "$_id.week",
"total": "$total"
}}
,
{"$sort": SON([("carrier", 1), ("week", 1)])}
]
This is the output:
carrier week total
0 9E 0 597
1 9E 1 964
2 9E 2 917
3 9E 3 968
4 9E 4 975
5 9E 5 927
6 9E 6 933
7 9E 7 917
8 9E 8 978
9 9E 9 1025
10 9E 10 1036
11 9E 11 1036
12 9E 12 1036
13 B6 0 797
14 B6 1 880
(...)
So, I've tried to mimic this SQL approach:
SELECT COUNT(*) , week, carrier where carrier in () group by carrier,week
by caming up wit this query:
lookQuery = [
# Returns df with carrier column
{"$match" : {"ORIGIN_CITY_NAME": {"$regex": "^New York"}} },
{"$sortByCount": "$OP_UNIQUE_CARRIER" },
{"$limit": 3},
#adds a new array (inner),top3, field whose elements are the matching documents from the “joined” collection
{ "$lookup":
{
"from": "Flights",
"let": {"the_carrier": "$_id" },
"pipeline": [
{"$match" :{
"$expr": {"$eq": ["$OP_UNIQUE_CARRIER","$$the_carrier"]}
,"ORIGIN_CITY_NAME": {"$regex": "^New York"}
}
},
{"$group" : {
"_id" : {
"week": { "$week": "$FL_DATE" },
"carrier": "$OP_UNIQUE_CARRIER"
},
"total_sem" : { "$sum": 1 }
}
}
],
"as": "top3" #output of inner-array field
}
} #closes lookup
,{"$unwind": "$top3"},
{"$project": {
"_id": 0,
"Carrier": "$top3._id.carrier",
"Week": "$top3._id.week",
"Total": "$top3.total_sem"
}
}
#adds a new outter array to math the carrier code in Carriers ref Table, providing the Description
,{"$lookup":
{
"from": "Carriers",
"localField": "Carrier",
"foreignField": "Code",
"as": "carriers"
}
},
{"$project": {
"_id": 0,
"Carrier": "$Carrier",
"Week": "$Week",
"Total": "$Total",
"CarrierName": "$carriers.Description"
}
}
,{"$sort": SON([("Carrier", 1), ("Week", 1)])}
]
which produces the desired output:
Carrier Week Total CarrierName
0 9E 0 597 [Endeavor Air Inc.]
1 9E 1 964 [Endeavor Air Inc.]
2 9E 2 917 [Endeavor Air Inc.]
3 9E 3 968 [Endeavor Air Inc.]
4 9E 4 975 [Endeavor Air Inc.]
5 9E 5 927 [Endeavor Air Inc.]
6 9E 6 933 [Endeavor Air Inc.]
7 9E 7 917 [Endeavor Air Inc.]
8 9E 8 978 [Endeavor Air Inc.]
9 9E 9 1025 [Endeavor Air Inc.]
10 9E 10 1036 [Endeavor Air Inc.]
11 9E 11 1036 [Endeavor Air Inc.]
12 9E 12 1036 [Endeavor Air Inc.]
13 B6 0 797 [JetBlue Airways]
14 B6 1 880 [JetBlue Airways]
15 B6 2 873 [JetBlue Airways]
My question is: Is this "query" efficient or is it overcomplicated? Is there a simpler way to achieve this with similar or better performance?