I am using MongoDB as a temporary log store. The collection receives ~400,000 new rows an hour. Each row contains a UNIX timestamp and a JSON string.
Periodically I would like to copy the contents of the collection to a file on S3, creating one file for each hour containing ~400,000 rows (e.g. today_10_11.log contains all the rows received between 10am and 11am). I need to do this copy while the collection is still receiving inserts.
My question: what is the performance impact of an index on the timestamp field on the 400,000 hourly inserts, versus the additional time it would take to query an hour's worth of rows without one?
The application in question is written in Ruby, running on Heroku and using the MongoHQ plugin.
Mongo indexes the _id field by default, and the ObjectId already starts with a timestamp, so basically, Mongo is already indexing your collection by insertion time for you. So if you're using the Mongo defaults, you don't need to index a second timestamp field (or even add one).
To get the creation time of an object id in Ruby:
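A minimal sketch, assuming the bson gem that ships with the Ruby driver:

```ruby
require 'bson'

id = BSON::ObjectId.new
id.generation_time        # => e.g. 2011-02-09 03:11:41 UTC (a Time in UTC)

# The timestamp is literally the id's prefix: the first 8 hex characters
# are the creation time in seconds since the epoch.
id.to_s[0, 8].to_i(16)    # => id.generation_time.to_i
```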
To generate an object id for a given time:
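Again a sketch with the bson gem; from_time zeroes the non-timestamp bytes, so the resulting id sorts at the very start of the given second:

```ruby
require 'bson'

# An ObjectId whose embedded timestamp is one week ago.
# All other bytes are zero, so it compares lower than any id
# actually generated at or after that moment.
past_id = BSON::ObjectId.from_time(Time.now - 7 * 24 * 60 * 60)
```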
For example, if you wanted to load all docs inserted in the past week, you'd simply search for _ids greater than past_id and less than id. Through the Ruby driver:
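A rough version of that query, assuming the 2.x Ruby driver (Mongo::Client) and placeholder connection details and collection name:

```ruby
require 'mongo'

client     = Mongo::Client.new('mongodb://localhost:27017/logs')   # placeholder URI
collection = client[:entries]                                      # placeholder collection

past_id = BSON::ObjectId.from_time(Time.now - 7 * 24 * 60 * 60)    # one week ago
id      = BSON::ObjectId.new                                       # "now"

# Range scan on the default _id index -- no extra timestamp index needed.
docs = collection.find(_id: { '$gte' => past_id, '$lte' => id }).to_a
```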
You can, of course, also add a separate field for timestamps, and index it, but there's no point in taking that performance hit when Mongo's already doing the necessary work for you with its default _id field.
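For comparison, the alternative the question asks about would look roughly like this, again with the same hypothetical client and collection and a field named ts holding the UNIX timestamp:

```ruby
require 'mongo'

client     = Mongo::Client.new('mongodb://localhost:27017/logs')   # placeholder URI
collection = client[:entries]                                      # placeholder collection

# Secondary index on a dedicated timestamp field -- this is the index whose
# per-insert cost the question is weighing against faster hourly queries.
collection.indexes.create_one({ ts: 1 })

# Pulling one hour's worth of rows by the ts field instead of _id:
hour_start = Time.utc(2011, 2, 9, 10).to_i
docs = collection.find(ts: { '$gte' => hour_start, '$lt' => hour_start + 3600 }).to_a
```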
More information on object ids is available in the MongoDB documentation.