I want to add a column to my users table and generate a slug for every user. The problem is that I have over 1 million users in my DB.

I have seen various blogs explaining different methods, but I do not want to take any risks with my production DB.

Methods that I found:

  1. The method below suggests putting the code that populates the new column in the migration file itself.

    class AddStatusToUser < ActiveRecord::Migration
      # A bare model defined inside the migration, so the backfill does not
      # depend on the app's User class (validations, callbacks, etc.).
      class User < ActiveRecord::Base
      end

      def up
        add_column :users, :status, :string
        # Make sure the freshly added column is visible to the model.
        User.reset_column_information
        User.find_each do |user|
          user.status = 'active'
          user.save!
        end
      end

      def down
        remove_column :users, :status
      end
    end
    
  2. I have written this method, which is run by a rake task. The problem with it is that it has been running for 4 days, and only 400,000 slugs have been generated so far. I want to do this quickly, but I do not know how.

From the Rails documentation for find_in_batches:

Yields each batch of records that was found by the find options as an array. The size of each batch is set by the :batch_size option; the default is 1000.

You can control the starting point for the batch processing by supplying the :start option. This is especially useful if you want multiple workers dealing with the same processing queue. You can make worker 1 handle all the records between id 0 and 10,000 and worker 2 handle from 10,000 and beyond (by setting the :start option on that worker).

It’s not possible to set the order. That is automatically set to ascending on the primary key (“id ASC”) to make the batch ordering work. This also means that this method only works with integer-based primary keys. You can’t set the limit either; that’s used to control the batch sizes.
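For illustration, the worker split described above could look like this (a sketch only; the 10,000 boundary is the example from the documentation, and the save! body stands in for whatever per-record work is needed):

# Worker 1: handles the records with ids up to 10,000.
User.where("id <= 10000").find_in_batches(batch_size: 1000) do |batch|
  batch.each(&:save!)
end

# Worker 2: starts batching at id 10,000 via the :start option.
User.find_in_batches(start: 10_000, batch_size: 1000) do |batch|
  batch.each(&:save!)
end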

In order to avoid DB performance issues, I sleep for 2 seconds after generating slugs for each batch of 1000 users. Should I remove the sleep call? Should I just run User.find_each(&:save), or use method 1?

task :add_slug_to_all_users => :environment do
  i = 0
  batch_size = 1000
  puts "started at: #{Time.now}"
  # find_in_batches yields the users in batches of 1000, so that the update is
  # not triggered for all the rows at once, which could lock the table completely.
  User.where("slug is null and email is not null").find_in_batches(batch_size: batch_size) do |users|
    sleep(2)
    users.each { |u| u.save!; i += 1 }
    puts "updated #{i} records at: #{Time.now}"
  end
  puts "Completed the task at: #{Time.now}\n"
end

Update 1: I am using the friendly_id gem to generate slugs.

Update 2: I have run SHOW CREATE TABLE users and I got this:

CREATE TABLE `users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `first_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `last_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `p_views` int(11) DEFAULT '0',
  `p_desc` text COLLATE utf8_unicode_ci,
  `p_title` text COLLATE utf8_unicode_ci,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `t_zone` varchar(255) COLLATE utf8_unicode_ci DEFAULT 'UTC',
  `college` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL,
  `degree` text COLLATE utf8_unicode_ci,
  `p_no` varchar(15) COLLATE utf8_unicode_ci DEFAULT NULL,
  `slug` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unique_phone_number` (`p_no`),
  UNIQUE KEY `index_users_on_phone_no` (`p_no`),
  UNIQUE KEY `index_users_on_slug` (`slug`),
  KEY `use_index_on_college` (`college`(255))
) ENGINE=InnoDB AUTO_INCREMENT=2194 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Please note that I have removed most of the fields from the above result. The slug column stores a combination of first_name and last_name in a URL-friendly manner.

For example, if two users have the same name:

id first_name   last_name
1  Arun         Kumar
2  Arun         Kumar

The slugs generated would look like this:

id slug
1  arun-kumar
2  arun-kumar1
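For reference, a typical friendly_id configuration that produces name-based slugs like these looks roughly as follows (a sketch only: the full_name method is an assumption, and the exact collision suffix, such as the 1 in arun-kumar1, depends on the friendly_id version and configuration):

class User < ActiveRecord::Base
  extend FriendlyId
  # Build the slug from the name; :slugged resolves collisions by appending
  # a suffix to the candidate slug.
  friendly_id :full_name, use: :slugged

  def full_name
    "#{first_name} #{last_name}"
  end
end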

1 Answer

Rick James (accepted answer, 6 votes):

In a situation like this, all-purpose 3rd-party software can only get in the way. You would be better off going down to SQL to do the work.

If the "slug" is a simple sequence number, then adding an AUTO_INCREMENT would be the obvious solution, and a permanent solution. That is, all future additions would have the slug automatically generated. That could be done with one statement:

ALTER TABLE t
    ADD COLUMN slug INT UNSIGNED AUTO_INCREMENT,
    INDEX(slug);

It would possibly be better for slug to be the PRIMARY KEY (please provide SHOW CREATE TABLE), but that would probably require a serious lock on the table, so a plain index is better. Test it; it may be "fast enough".

The next thought is pt-online-schema-change (see Percona.com), a special tool for doing ALTERs with nearly zero impact. It involves adding a TRIGGER to capture writes, and chunking the copying. The slight impact comes when the "last little bit" needs to be copied. The final RENAME TABLE real TO old, new TO real; is atomic and essentially instantaneous. It even dynamically tunes its "sleep". It is an excellent tool with many years of experience put into it.
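For reference, a bare-bones invocation might look like this (a sketch only: mydb is a placeholder database name, and you would normally verify with --dry-run before using --execute):

pt-online-schema-change --alter "ADD COLUMN slug VARCHAR(255)" D=mydb,t=users --execute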

But, ptosc may not work for adding something as critical as the PRIMARY KEY, hence my suggestion (above) of a plain INDEX.

Setting the values (via UPDATE), one chunk at a time, is the right way to go. I have written some tips on chunking; they were aimed at DELETE, but can be adapted to UPDATE.

Without knowing what is "under the covers" in find_in_batches(), I cannot say whether it is good or bad. I do know that OFFSET is almost always bad; "remembering where you left off" is usually much better. But that is hard to do if you don't already have a UNIQUE or PRIMARY key. PRIMARY is better because of its clustering. (Please provide SHOW CREATE TABLE, so I don't have to make guesses.)

If your sample code starts over at the beginning of the table each time, then it is as bad as using OFFSET -- each iteration will be slower than the previous because it is skipping over more and more rows.
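In the question's own Rails terms, the contrast looks roughly like this (illustrative only; processed and last_id are counters the surrounding loop would maintain):

# OFFSET pattern: every pass re-skips the rows already handled, so each
# iteration gets slower.
User.where("slug IS NULL").order(:id).offset(processed).limit(1000)

# "Remember where you left off": seek straight to the next chunk by PRIMARY KEY.
User.where("id > ?", last_id).order(:id).limit(1000)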

After adding a column, be sure to check all references to the table -- SELECT * will now return one more column (one reason for not using *). UPDATEs and INSERTs that do not mention the new column may still work, but you need to check.

Update

There were two steps -- add the slug column, and populate it. You have done the first step.

To do the second step, I recommend stepping through the table 100 rows at a time, using the AUTO_INCREMENT PRIMARY KEY. 100 is low enough to be not too invasive. The AI PK will cover the entire table and is efficient, so you don't need a slow OFFSET or a search for the un-slugged set; a sketch follows below. I discuss efficient chunking here. It is written with DELETE in mind, but the techniques apply to UPDATE.
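A minimal sketch of that walk in the question's Rails/friendly_id setup (the 100-row chunk is the recommendation above; the task name and the slug.nil? guard are assumptions):

task :generate_slugs_in_chunks => :environment do
  chunk_size = 100
  last_id = 0
  loop do
    # Seek to the next chunk via the AUTO_INCREMENT PRIMARY KEY; no OFFSET,
    # no re-scanning for un-slugged rows.
    users = User.where("id > ?", last_id).order(:id).limit(chunk_size).to_a
    break if users.empty?
    # Saving lets friendly_id generate the slug; skip rows that already have one.
    users.each { |u| u.save! if u.slug.nil? }
    last_id = users.last.id
  end
end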