Pymongo: Subqueries with $lookup for the same collection and $in for certain ids. Is this the right way?

154 views Asked by At

I''m new to MongoDB and I need to do kind of a subquery, in which I want:

  • Return the count of records for top3 cariers (the array in $in) for each week.

My collection is as follows:

"MONTH","DAY_OF_MONTH","DAY_OF_WEEK","FL_DATE","OP_UNIQUE_CARRIER","ORIGIN_CITY_NAME","DEST_CITY_NAME","ARR_DELAY","DISTANCE",
3,28,3,2018-03-28,"B6","Newark, NJ","Fort Myers, FL",4.00,1068.00,
3,29,4,2018-03-29,"B6","Newark, NJ","Fort Myers, FL",6.00,1068.00,
3,30,5,2018-03-30,"B6","Newark, NJ","Fort Myers, FL",-4.00,1068.00,
3,31,6,2018-03-31,"B6","Newark, NJ","Fort Myers, FL",-3.00,1068.00,
3,1,4,2018-03-01,"B6","New York, NY","Long Beach, CA",39.00,2465.00,
3,2,5,2018-03-02,"B6","New York, NY","Long Beach, CA",0.00,2465.00,
3,3,6,2018-03-03,"B6","New York, NY","Long Beach, CA",2.00,2465.00,
3,4,7,2018-03-04,"B6","New York, NY","Long Beach, CA",-2.00,2465.00,
3,5,1,2018-03-05,"B6","New York, NY","Long Beach, CA",25.00,2465.00,
3,6,2,2018-03-06,"B6","New York, NY","Long Beach, CA",-21.00,2465.00,
3,7,3,2018-03-07,"B6","New York, NY","Long Beach, CA",20.00,2465.00,
3,8,4,2018-03-08,"B6","New York, NY","Long Beach, CA",48.00,2465.00,
3,9,5,2018-03-09,"B6","New York, NY","Long Beach, CA",16.00,2465.00,
3,10,6,2018-03-10,"B6","New York, NY","Long Beach, CA",6.00,2465.00,
3,11,7,2018-03-11,"B6","New York, NY","Long Beach, CA",-13.00,2465.00,

and my current query only returns the #count for each of the top3 but with harcoded values inside $in stage:

total_per_week = [
    {"$match" :{
        "OP_UNIQUE_CARRIER": { "$in": ["DL","9E","B6"]}
        ,"ORIGIN_CITY_NAME": {"$regex": "^New York"}
        }
    },    
    {"$group" : {
        "_id" : {                   
            "week": { "$week": "$FL_DATE" },
            "carrier": "$OP_UNIQUE_CARRIER"
        },                
        "total" : { "$sum": 1 } 
        }
    },
    #// Maybe project a prettier "flatter" output
    {"$project": {
        "_id": 0,
        "carrier": "$_id.carrier",
        "week": "$_id.week",
        "total": "$total"
    }}
    ,
    {"$sort": SON([("carrier", 1), ("week", 1)])}

]

This is the output:

    carrier  week  total
0       9E     0    597
1       9E     1    964
2       9E     2    917
3       9E     3    968
4       9E     4    975
5       9E     5    927
6       9E     6    933
7       9E     7    917
8       9E     8    978
9       9E     9   1025
10      9E    10   1036
11      9E    11   1036
12      9E    12   1036
13      B6     0    797
14      B6     1    880

(...)

So, I've tried to mimic this SQL approach:

SELECT COUNT(*) , week, carrier where carrier in () group by carrier,week

by caming up wit this query:

    lookQuery = [
    # Returns df with carrier column
    {"$match" : {"ORIGIN_CITY_NAME": {"$regex": "^New York"}} },
    {"$sortByCount":  "$OP_UNIQUE_CARRIER" },
    {"$limit": 3},
    #adds a new array (inner),top3, field whose elements are the matching documents from the “joined” collection
    { "$lookup": 
        {
        "from": "Flights",
        "let": {"the_carrier": "$_id" },
        "pipeline": [
        {"$match" :{
            "$expr": {"$eq": ["$OP_UNIQUE_CARRIER","$$the_carrier"]}
             ,"ORIGIN_CITY_NAME": {"$regex": "^New York"}
            }
        }, 
        {"$group" : {
            "_id" : {                   
            "week": { "$week": "$FL_DATE" },
            "carrier": "$OP_UNIQUE_CARRIER"
            },                
            "total_sem" : { "$sum": 1 } 
            }
        }         
      ],
      "as": "top3" #output of inner-array field
      }
    } #closes lookup
    ,{"$unwind": "$top3"},
        {"$project": {
            "_id": 0,
            "Carrier": "$top3._id.carrier",
            "Week": "$top3._id.week",
            "Total": "$top3.total_sem"
            }
        }
        #adds a new outter array to math the carrier code in Carriers ref Table, providing the Description
        ,{"$lookup":
          {
          "from": "Carriers",
          "localField": "Carrier",
          "foreignField": "Code",
          "as": "carriers"
          }  
        },
        {"$project": {
            "_id": 0,
            "Carrier": "$Carrier",
            "Week": "$Week",
            "Total": "$Total",
            "CarrierName": "$carriers.Description"
            }
        }
        ,{"$sort": SON([("Carrier", 1), ("Week", 1)])}      

]

which produces the desired output:

    Carrier  Week  Total             CarrierName
0       9E     0    597     [Endeavor Air Inc.]
1       9E     1    964     [Endeavor Air Inc.]
2       9E     2    917     [Endeavor Air Inc.]
3       9E     3    968     [Endeavor Air Inc.]
4       9E     4    975     [Endeavor Air Inc.]
5       9E     5    927     [Endeavor Air Inc.]
6       9E     6    933     [Endeavor Air Inc.]
7       9E     7    917     [Endeavor Air Inc.]
8       9E     8    978     [Endeavor Air Inc.]
9       9E     9   1025     [Endeavor Air Inc.]
10      9E    10   1036     [Endeavor Air Inc.]
11      9E    11   1036     [Endeavor Air Inc.]
12      9E    12   1036     [Endeavor Air Inc.]
13      B6     0    797       [JetBlue Airways]
14      B6     1    880       [JetBlue Airways]
15      B6     2    873       [JetBlue Airways]

My question is: Is this "query" efficient or is it overcomplicated? Is there a simpler way to achieve this with similar or better performance?

0

There are 0 answers