I'm trying to execute a query on a big collection in MongoDB. The query is actually composed of two parts and takes a total of about 900 ms to execute; I need it to be much faster.
These are the collections. First, stoptimes:
> db.stoptimes.find().limit(1);
{
"trip_id": "24893A459B661",
"arrival_time": "22:30:00",
"departure_time": "22:30:00",
"stop_id": "1904",
"stop_sequence": 2,
"stop_headsign": "",
"pickup_type": "0",
"drop_off_type": "0",
"shape_dist_traveled": "0.88659123054",
"agency_key": "alamedaoakland-ferry",
"_id": ObjectId("52b394c680052ea30918fd62")
}
> db.stoptimes.count();
5959551
and trips:
> db.trips.find().limit(1);
{
"route_id": "60",
"service_id": "180A536",
"trip_id": "23736A180B536",
"trip_short_name": "",
"trip_headsign": "San Francisco via Pier 41",
"direction_id": "",
"block_id": "282",
"shape_id": "30",
"trip_bikes_allowed": "2",
"agency_key": "alamedaoakland-ferry",
"_id": ObjectId("52b394c780052ea30918ff34")
}
> db.trips.count();
204884
I'm trying to find every distinct route_id in the trips collection whose trip_id matches one of the trip IDs associated with a given stop_id in stoptimes:
------ stoptimes ---  ->  ---------- trips -----------------
stop_id1 -> trip_id1  ->  trip_id1 -> route_id1  ->  route_id1
         -> trip_id2  ->  trip_id2 -> route_id2  ->  route_id2
         -> trip_id3  ->  trip_id3 -> route_id2
         -> trip_id4  ->  trip_id4 -> route_id2
         -> trip_id5  ->  trip_id5 -> route_id3  ->  route_id3
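Put another way, the lookup is logically equivalent to this two-step sketch in the shell (using distinct() and the sample stop_id from above; my actual code uses aggregate() instead, as shown below):

// first collect the trip IDs serving the stop, then the distinct routes for those trips
var tripIds = db.stoptimes.distinct('trip_id', { 'stop_id' : '1904' });
var routeIds = db.trips.distinct('route_id', { 'trip_id' : { $in : tripIds } });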
This is the query in the MongoDB shell:
> var tripids = db.stoptimes.aggregate([
... {$match : { 'stop_id' : '1904' }},
... {$project : { '_id' : 0, 'trip_id' : 1 }}
... ]);
> var arr = [];
> for(var i=0; i<tripids.result.length; i++)
... { arr.push(tripids.result[i].trip_id); }
> db.trips.aggregate([
... {$match : { 'trip_id' : {$in : arr}}},
... {$group : {
... _id : "$route_id",
... direction_id : { $first: '$direction_id'},
... shape_id : {$first : '$shape_id'}}}
... ])
This is the piece of JavaScript I'm using. Note that it's Node.js + Mongoose, but it should be easy to read as if it were plain JavaScript:
StopTime
    .aggregate([
        // step 1: find every stop time at the given stop,
        // keeping only its trip_id
        {$match : {
            'stop_id' : stop_id
        }},
        {$project : {
            '_id' : 0,
            'trip_id' : 1
        }}
    ], function (err, trip_ids){
        if (err) return cb(err); // don't silently drop a first-stage error
        var arr = [];
        for(var i = 0; i < trip_ids.length; i++) {
            arr.push(trip_ids[i].trip_id);
        }
        Trip
            .aggregate([
                // step 2: grouping by route_id yields the distinct routes
                {$match : {
                    'trip_id' : {$in : arr}
                }},
                {$group : {
                    _id : "$route_id",
                    direction_id : { $first: '$direction_id'},
                    shape_id : { $first: '$shape_id'}
                }}
            ], function (err, route_ids){
                cb(err, route_ids);
            });
    });
What can I do to improve performance?
EDIT:
This is the only query that is taking so long:
> db.trips.aggregate([
... {$match : { 'trip_id' : {$in : arr}}},
... {$group : {
... _id : "$route_id",
... direction_id : { $first: '$direction_id'},
... shape_id : {$first : '$shape_id'}}}
... ])
It looks like you are running the aggregate over all trips (204884 of them), matching each against the array. If that's true, you are processing 204884 records in about 900 ms, i.e. roughly 228 records per millisecond, and that is pretty good.
There are some obvious optimizations you can do in your code:
Never use i++ unless you have a specific reason for it; always write it as ++i, and cache your loop counts in separate variables rather than re-evaluating them on every iteration.
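Applied to the loop in your code, that would look like this:

var arr = [];
// read the length once instead of on every iteration
for (var i = 0, len = trip_ids.length; i < len; ++i) {
    arr.push(trip_ids[i].trip_id);
}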
Your trip_id is a pretty complex string, e.g. 24893A459B661, and string comparisons are always slower than integer comparisons. Also, the match will have to pull the full JSON document for each candidate it tests.
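Along the same lines (an assumption on my part, since you haven't shown which indexes exist), make sure both match fields are indexed, so the comparisons run against an index instead of against each pulled document:

// assumes these indexes don't exist yet; ensureIndex does nothing if they already do
db.stoptimes.ensureIndex({ 'stop_id' : 1 });
db.trips.ensureIndex({ 'trip_id' : 1 });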
Some options:
My very personal opinion is that MongoDB and similar engines aren't really there yet for handling these kinds of operations, compared to a regular relational database engine such as SQL Server, MySQL, or PostgreSQL.