MongoDB, slow query with big data

I'm trying to execute a query on a big collection in MongoDB. The query is actually composed of two parts and takes a total of about 900 ms to execute; I need it to be much faster.

These are the collections. First, stoptimes:

> db.stoptimes.find().limit(1);
{
    "trip_id": "24893A459B661",
    "arrival_time": "22:30:00",
    "departure_time": "22:30:00",
    "stop_id": "1904",
    "stop_sequence": 2,
    "stop_headsign": "",
    "pickup_type": "0",
    "drop_off_type": "0",
    "shape_dist_traveled": "0.88659123054",
    "agency_key": "alamedaoakland-ferry",
    "_id": ObjectId("52b394c680052ea30918fd62")
}
> db.stoptimes.count();
5959551

and trips:

> db.trips.find().limit(1);
{
    "route_id": "60",
    "service_id": "180A536",
    "trip_id": "23736A180B536",
    "trip_short_name": "",
    "trip_headsign": "San Francisco via Pier 41",
    "direction_id": "",
    "block_id": "282",
    "shape_id": "30",
    "trip_bikes_allowed": "2",
    "agency_key": "alamedaoakland-ferry",
    "_id": ObjectId("52b394c780052ea30918ff34")
}
> db.trips.count();
204884

I'm trying to find every distinct route_id in the trips collection whose trip_id equals one of the trip_ids that match a given stop_id in stoptimes.

------ stoptimes --- -> ---------- trips -----------------
stop_id1 -> trip_id1 -> trip_id1 -> route_id1 -> route_id1 
         -> trip_id2 -> trip_id2 -> route_id2 -> route_id2
         -> trip_id3 -> trip_id3 -> route_id2
         -> trip_id4 -> trip_id4 -> route_id2
         -> trip_id5 -> trip_id5 -> route_id3 -> route_id3

This is the query in mongodb shell:

> var tripids = db.stoptimes.aggregate([
... {$match : { 'stop_id' : '1904' }},
... {$project : { '_id' : 0, 'trip_id' : 1 }}
... ]);
> var arr = [];
> for(var i=0; i<tripids.result.length; i++)
... { arr.push(tripids.result[i].trip_id); }
> db.trips.aggregate([
... {$match : { 'trip_id' : {$in : arr}}},
... {$group : {
...  _id : "$route_id", 
...  direction_id : { $first: '$direction_id'}, 
...  shape_id : {$first : '$shape_id'}}}
... ])

This is the piece of JavaScript I'm using. Note that it's Node.js + Mongoose, but it should be as easy to read as plain JavaScript:

StopTime
    .aggregate([
        {$match : {
            'stop_id' : stop_id
        }},
        {$project : {
            '_id' : 0,
            'trip_id' : 1
        }}
    ], function (err, trip_ids){
        var arr = [];
        for(var i=0;i<trip_ids.length;i++) {
            arr.push(trip_ids[i].trip_id);
        }
        Trip
            .aggregate([
                {$match : {
                    'trip_id' : {$in : arr}
                }},
                {$group : {
                    _id : "$route_id",
                    direction_id : { $first: '$direction_id'},
                    shape_id : { $first: '$shape_id'}
                }}
        ], function (err, route_ids){
            cb(err, route_ids);
        });
    });

What can I do to improve performance?

EDIT:

This is the only query that is taking so long:

> db.trips.aggregate([
... {$match : { 'trip_id' : {$in : arr}}},
... {$group : {
...  _id : "$route_id", 
...  direction_id : { $first: '$direction_id'}, 
...  shape_id : {$first : '$shape_id'}}}
... ])

There are 2 answers

Answered by Eric Herlitz

It looks like you are running the aggregate method over all 204884 trips and matching each against the array. If that's true, you are processing 204884 records in about 900 ms, i.e. roughly 228 records per millisecond, and that is pretty good.

There are some obvious optimizations you can make in your code.

Never use i++ unless you have a specific reason for it; always write it as ++i, and keep your loop bounds in separate variables:

var trip_ids_length = trip_ids.length;
for(var i=0;i<trip_ids_length;++i) {
    arr.push(trip_ids[i].trip_id);
}

Your trip_id is a pretty complex string, e.g. 24893A459B661, and string comparisons are always slower than integer comparisons. The match also has to pull the full JSON document for each candidate it tests.

Some options

  • Reconsider your schemas for stoptimes and trips; the best shortcut would be to replace the string trip_id with an integer value (a sketch follows this list)
  • Create a separate, smaller index list of all trip_ids that is faster to match against; store the index of the associated object in trips and/or stoptimes, e.g. as t_index and s_index
  • Create a web service that holds the trips and stoptimes in memory and do your matching there
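
A minimal sketch of the first option, run once in the shell. The trip_num field name is my own invention, and it assumes trip_id is unique within trips:

var next = 0;
// Walk every trip once, hand it a sequential integer, and copy that
// integer onto all of its stoptimes rows so both sides can join on it.
db.trips.find({}, { trip_id: 1 }).forEach(function (t) {
    var num = next++;
    db.trips.update({ _id: t._id }, { $set: { trip_num: num } });
    db.stoptimes.update(
        { trip_id: t.trip_id },
        { $set: { trip_num: num } },
        false,  // upsert
        true    // multi: one trip has many stoptimes rows
    );
});
db.trips.ensureIndex({ trip_num: 1 });
db.stoptimes.ensureIndex({ trip_num: 1 });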

My very personal opinion is that MongoDB and similar engines aren't really there yet for this kind of operation compared to a regular relational database engine, e.g. SQL Server, MySQL, or PostgreSQL.

Answered by Daniel Coupal

Ensure that you have an index on 'trip_id' in the 'trips' collection. Even with the index, if you provide a long list of values in 'arr', you will not get the best performance: the '$in' operator is difficult to optimize because each value must be looked up separately. If 'arr' has 10 values, the index is searched once per value, so it basically behaves like 10 sub-queries.
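
For reference, assuming neither index exists yet, they are created in the shell like this (the stoptimes index is my own addition, covering the first pipeline stage):

// Index the key the $in match runs against; without it every value
// in arr scans the whole trips collection.
db.trips.ensureIndex({ trip_id: 1 });

// The first pipeline's $match on the big collection benefits from
// the same treatment.
db.stoptimes.ensureIndex({ stop_id: 1 });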

You can design your schema to avoid the '$in' operator, the lookup across 2 collections, and the aggregation framework altogether.

I will assume "trip_id+stop_id" is unique in the "stoptimes" collection, and "route_id" is unique in the "trips" collection.

Let's denormalize the data. Keep the "stoptimes" collection to carry the detailed info on the stops, but let's add some of that info into the "trips" collection:

{
    "route_id": "60",
    "service_id": "180A536",
    "trip_id": "23736A180B536",
    "stop_id" : [ 1800, 1830, 1904 ],   <==============
    "trip_short_name": "",
    "trip_headsign": "San Francisco via Pier 41",
    "direction_id": "",
    "block_id": "282",
    "shape_id": "30",
    "trip_bikes_allowed": "2",
    "agency_key": "alamedaoakland-ferry",
    "_id": ObjectId("52b394c780052ea30918ff34")
}
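
A possible one-time backfill for that array, sketched in the shell. It assumes trip_id identifies a single document in trips; also note that stoptimes stores stop_id as a string, so keep the type consistent with how you plan to query:

// Fold every stoptimes row into its trip, deduplicating with
// $addToSet. With ~6M stoptimes documents this is slow, but it
// only has to run once.
db.stoptimes.find({}, { trip_id: 1, stop_id: 1 }).forEach(function (st) {
    db.trips.update(
        { trip_id: st.trip_id },
        { $addToSet: { stop_id: st.stop_id } }
    );
});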

Then your query becomes:

db.trips.find({"stop_id":1904}, {"_id":0, "route_id":1, "direction_id":1, "shape_id":1})

With an index on "stop_id", your queries should be very fast.
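
Creating that index in the shell:

// stop_id is now an array, so this becomes a multikey index:
// MongoDB indexes each element, and an equality match on a single
// stop_id value can use it.
db.trips.ensureIndex({ stop_id: 1 });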

In summary, design your schema so it is optimized for your most important queries. If the query above is the most important one, you will profit from the new schema design. If it is an isolated query and you are already optimized for your common cases, then Eric's suggestions may do what you need. If you decide to keep the aggregation framework solution, you can evaluate the performance of the first step of the pipeline. Run the following command to ensure the $match step is using an index:

db.collection.runCommand("aggregate", {pipeline: YOUR_PIPELINE, explain: true})
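
For example, with the pipeline from the question (the exact explain output varies by server version, so treat this as a sketch):

// In the output, check that the $match stage walks an index
// (e.g. a BtreeCursor) instead of scanning the whole collection
// (a BasicCursor).
db.trips.runCommand("aggregate", {
    pipeline: [
        { $match: { trip_id: { $in: arr } } },
        { $group: {
            _id: "$route_id",
            direction_id: { $first: "$direction_id" },
            shape_id: { $first: "$shape_id" }
        }}
    ],
    explain: true
})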