I am storing a large amount of Twitter data and would like to retrieve about 500k records at a time for data processing. I have a TwitterTweet mongo document that contains basic tweet data, and I try to retrieve it as follows:
weekly_tweets = TwitterTweet.all(:created_at.gt => 1.week.ago, :fields => [:created_at, :text, :from_user])
Trouble is, this takes up a LOT of time and memory. Is there any way to make it more scalable and efficient? I have thought of using map reduce, but it looks very complicated for what I want to do: text processing and regexp work on the tweets.
Do not call all, as it builds an object for every one of your 500k entries in mongo and will, as you noticed, use a ton of memory and time. Use find_each instead and iterate through the records: it works off a cursor, which is far more efficient.
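As a minimal sketch, assuming MongoMapper-style find_each that accepts the same query options as all (the exact call may differ slightly depending on your ODM version), it would look something like this:

    # Yields one document at a time from a MongoDB cursor instead of
    # materializing all 500k records in memory at once.
    TwitterTweet.find_each(
      :created_at.gt => 1.week.ago,
      :fields        => [:created_at, :text, :from_user]
    ) do |tweet|
      # Do your text processing / regexp matching here, one tweet at a time.
      # `handle_tweet` is a placeholder for your own processing method.
      handle_tweet(tweet.text)
    end

Because only the current document (plus the cursor's batch) is held in memory, memory use stays roughly constant regardless of how many tweets match the query.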