Loading large data sets into a Rails application


I'm dealing with millions of rows of data that I want to load into my Rails application as Models. I'm using MySQL as a database, and I'm on Rails 2.3.14.

One of my co-workers says that it's inadvisable to add records directly to MySQL, bypassing the Rails ActiveRecord system. He's short on specifics, but the gist of it is that Rails does a lot of "magic" when you use its ActiveRecord system, and that entering data outside of this system will confuse Rails. Can someone elaborate on whether this is accurate?

If I should be loading data into Rails through ActiveRecord, I've read that the activerecord-import plugin is the way to go for this type of job.

Any feedback on the best approach for loading in massive amounts of data into Rails would be welcomed.


There are 3 answers

Michael Durrant (best answer)

I can think of six main items to consider. The last five relate to Rails 'magic':

  1. Speed. This is huge. Inserting rows one at a time through ActiveRecord is slow. If each insert took as long as a second, a million rows would take a million seconds, which is about 11.5 days. That is why many people consider ActiveRecord a poor fit for bulk loads.

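  2. Validation. Model validations don't run when you bypass ActiveRecord, so you'll need to make sure the data you load satisfies the same rules your models enforce, either through database constraints or through checks in your load script.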

  3. Timestamps. You need to set created_at / updated_at yourself if you want them populated the same way Rails would populate them.

  4. Counter Caches. You'll need to update counts manually.

  5. ActiveRecord gems. For example, if you use acts_as_audited, which keeps an audit trail of changes to model records, you won't have that functionality if you're outside ActiveRecord.

  6. Business Logic at the Model Layer. Good programmers try to put functionality at the model (or higher) level when they can. This might include items like updating other data, sending emails, writing to logs, etc. This would not happen if ActiveRecord was not invoked.
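To make items 1 and 3 concrete, here is a minimal sketch of what a hand-rolled load script has to do: group rows into multi-row INSERT statements and fill in created_at / updated_at itself. The helper names (sql_literal, batched_inserts), the table and column names in the usage note, and the batch size are all hypothetical, and the quoting is deliberately naive. In a real script you would more likely rely on the database adapter's own quoting.

```ruby
# Hypothetical helper: naive SQL literal quoting, for illustration only.
# A real import should use the database adapter's quoting instead.
def sql_literal(value)
  case value
  when nil     then "NULL"
  when Numeric then value.to_s
  when Time    then "'#{value.getutc.strftime('%Y-%m-%d %H:%M:%S')}'"
  else
    # Escape backslashes first, then double up single quotes.
    "'" + value.to_s.gsub("\\") { "\\\\" }.gsub("'") { "''" } + "'"
  end
end

# Builds one multi-row INSERT statement per batch and appends the
# created_at / updated_at values that ActiveRecord would normally set.
def batched_inserts(table, columns, rows, batch_size = 1000, now = Time.now)
  all_columns = columns + %w[created_at updated_at]
  rows.each_slice(batch_size).map do |batch|
    tuples = batch.map do |row|
      values = row + [now, now]
      "(" + values.map { |v| sql_literal(v) }.join(", ") + ")"
    end
    "INSERT INTO #{table} (#{all_columns.join(', ')}) VALUES #{tuples.join(', ')}"
  end
end
```

Calling batched_inserts("students", %w[name teacher_id], rows, 500) would return an array of statements, each inserting up to 500 rows in one round trip instead of one statement per row.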

Noah Clark

There are a few reasons why you shouldn't load it directly. Some of these may or may not apply to you.

Data Validations -- You're loading data that hasn't been validated. Your rails app probably has certain assumptions being made about the data that is loaded in. Also, unvalidated data could raise some interesting issues as it works its way through your app.

Data Manipulation -- This is somewhat related to Data Validations, but if you're doing any sort of Data Manipulation (between data input on the web and insertion into the db) you'd want to at a minimum recreate this manipulation when you upload it.

Overall, it's probably not the best idea, but that's not because of "magic" in Rails. It's because your data has assumptions built into it that a direct dump doesn't recreate.
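As a rough illustration of recreating validation and manipulation in a load script, the sketch below normalizes each raw row and then splits the rows into accepted and rejected sets before anything reaches the database. The rules shown (stripping whitespace, downcasing email, requiring a name, a loose email pattern) are hypothetical stand-ins for whatever your models actually do.

```ruby
# Hypothetical format rule standing in for the real model's validation.
EMAIL_FORMAT = /\A[^@\s]+@[^@\s]+\z/

# Recreates the manipulation the app would apply before saving.
def normalize(row)
  { :name => row[:name].to_s.strip, :email => row[:email].to_s.strip.downcase }
end

# Recreates the checks the model would run; returns a list of messages.
def errors_for(row)
  errors = []
  errors << "name can't be blank" if row[:name].empty?
  errors << "email is invalid" unless row[:email] =~ EMAIL_FORMAT
  errors
end

# Splits raw rows into rows that are safe to load and rows to review.
def partition_rows(raw_rows)
  accepted = []
  rejected = []
  raw_rows.each do |raw|
    row = normalize(raw)
    errors = errors_for(row)
    if errors.empty?
      accepted << row
    else
      rejected << [raw, errors]
    end
  end
  [accepted, rejected]
end
```

Keeping the rejected rows together with their error messages makes it easier to fix the source data and re-run the load than to discover bad records later inside the app.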

Carl Zulauf

Inserting directly into MySQL may bypass model observers, counter caches, and other functionality your app depends on ActiveRecord to handle for you. If you decide to insert data directly into MySQL, be aware of this and make sure you account for all of the changes and validations ActiveRecord would make. Whatever insert script you write should make the same changes.

Example: You have students and teachers tables. Inserting a record into students might require you to update the teachers.students_count counter cache column, which ActiveRecord normally increments for you.
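A minimal sketch of that bookkeeping, assuming the imported student rows carry a teacher_id foreign key (that column name is an assumption, as is the helper name): tally the new students per teacher and emit one UPDATE per teacher.

```ruby
# Tallies imported students per teacher and builds the UPDATE statements
# that bring teachers.students_count back in line.
# Assumes each row is a hash with a :teacher_id key (hypothetical name).
def counter_cache_updates(student_rows)
  counts = Hash.new(0)
  student_rows.each do |row|
    teacher_id = row[:teacher_id]
    counts[teacher_id] += 1 unless teacher_id.nil?
  end
  counts.sort.map do |teacher_id, added|
    # Integer() rejects anything that is not a plain integer value.
    "UPDATE teachers SET students_count = COALESCE(students_count, 0) + " \
      "#{Integer(added)} WHERE id = #{Integer(teacher_id)}"
  end
end
```

An alternative under the same assumptions is to recompute every counter after the load with a single statement such as UPDATE teachers SET students_count = (SELECT COUNT(*) FROM students WHERE students.teacher_id = teachers.id), which is simpler but touches every teacher row.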

Beyond that, there is no reason you can't insert data directly. Any other concerns are unfounded FUD.

The real bottleneck with ActiveRecord is instantiating model objects, which are very complex. You might want to write your import script as a rake task and use Arel (the low-level query interface that powers ActiveRecord from Rails 3 onward, so it isn't part of Rails 2.3.14) or a gem like activerecord-import. Keep in mind that both of these approaches will (or at least can) skip the normal validations, observers, counter caches, etc., so you'll still need custom logic to account for that.
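One way to structure such a script, sketched here in plain Ruby, is a driver that streams the input and hands fixed-size batches to whatever bulk writer you choose, so no per-row model objects are built and the whole file never sits in memory. The method name, the tab-separated input format, and the batch size are assumptions for illustration.

```ruby
# Streams tab-separated lines from an IO and yields them in batches.
# The block is where the bulk write happens; nothing here builds
# ActiveRecord objects. Returns the number of rows handed to the block.
def import_in_batches(io, batch_size = 1000)
  total = 0
  io.each_line.each_slice(batch_size) do |lines|
    # -1 keeps trailing empty fields instead of dropping them.
    batch = lines.map { |line| line.chomp.split("\t", -1) }
    yield batch
    total += batch.length
  end
  total
end

# Inside a rake task the block might call a bulk writer, for example
# (assuming activerecord-import is installed and a Student model exists):
#   Student.import(%w[name email], batch, :validate => false)
```

Because the writer is just a block, the same driver works whether the batches go to activerecord-import, to hand-built multi-row INSERT statements, or to a dry run that only counts rows.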