I want to add a column to my users table and generate a slug for every user. The problem is that I have over 1 million users in my DB.
I have seen various blogs explaining different methods, but I do not want to take any risks by doing this directly in my production DB.
Methods that I found:
The method below suggests adding the code that populates the new column to the migration file itself:
class AddStatusToUser < ActiveRecord::Migration
  class User < ActiveRecord::Base
  end

  def up
    add_column :users, :status, :string
    User.find_each do |user|
      user.status = 'active'
      user.save!
    end
  end

  def down
    remove_column :users, :status
  end
end
The second method is one I have written myself, run by a rake task (shown below). The problem with it is that it has been running for 4 days, and only about 400,000 slugs have been generated so far. I wanted to do this quickly, but do not know how.
From the Rails API documentation for find_in_batches:
Yields each batch of records that was found by the find options as an array. The size of each batch is set by the :batch_size option; the default is 1000.
You can control the starting point for the batch processing by supplying the :start option. This is especially useful if you want multiple workers dealing with the same processing queue. You can make worker 1 handle all the records between id 0 and 10,000 and worker 2 handle from 10,000 and beyond (by setting the :start option on that worker).
It's not possible to set the order. That is automatically set to ascending on the primary key ("id ASC") to make the batch ordering work. This also means that this method only works with integer-based primary keys. You can't set the limit either; that's used to control the batch sizes.
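For illustration, the multi-worker split described above might look roughly like this (a sketch only, not code from my task; the :finish option needs Rails 5+, older versions accept only :start, and the 500_000 boundary is an arbitrary example):

# Worker 1
User.where("slug is null and email is not null")
    .find_each(start: 1, finish: 500_000, batch_size: 1000) { |u| u.save! }

# Worker 2
User.where("slug is null and email is not null")
    .find_each(start: 500_001, batch_size: 1000) { |u| u.save! }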
To avoid DB performance issues, I have added a sleep of 2 seconds after generating slugs for every 1000 users. Should I remove the sleep? Should I just run User.find_each(&:save), or use method 1?
task :add_slug_to_all_users => :environment do
  i = 0
  batch_size = 1000
  puts "started at: #{Time.now}"
  # find_in_batches provides the users in batches of 1000
  # so that the update is not triggered for all rows at once, which could lock the table completely.
  User.where("slug is null and email is not null").find_in_batches(batch_size: batch_size) do |users|
    sleep(2)
    users.each { |u| u.save!; i += 1 }
    puts "updated #{i} records at: #{Time.now}"
  end
  puts "Completed the task at: #{Time.now}\n"
end
Update 1: I am using the friendly_id gem to generate slugs.
Update 2: I have run SHOW CREATE TABLE users
and I got this:
CREATE TABLE `users` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`first_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`last_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`p_views` int(11) DEFAULT '0',
`p_desc` text COLLATE utf8_unicode_ci,
`p_title` text COLLATE utf8_unicode_ci,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`t_zone` varchar(255) COLLATE utf8_unicode_ci DEFAULT 'UTC',
`college` varchar(500) COLLATE utf8_unicode_ci DEFAULT NULL,
`degree` text COLLATE utf8_unicode_ci,
`p_no` varchar(15) COLLATE utf8_unicode_ci DEFAULT NULL,
`slug` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_phone_number` (`p_no`),
UNIQUE KEY `index_users_on_phone_no` (`p_no`),
UNIQUE KEY `index_users_on_slug` (`slug`),
KEY `use_index_on_college` (`college`(255))
) ENGINE=InnoDB AUTO_INCREMENT=2194 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Please note that I have removed most of the fields from the above result. The slug column
stores the combination of first_name and last_name in a URL-friendly manner.
For example, if two users have the same name:
id first_name last_name
1 Arun Kumar
2 Arun Kumar
The slug generated would look like this:
id slug
1 arun-kumar
2 arun-kumar1
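For context, a friendly_id configuration that produces slugs of this shape might look roughly like the sketch below; the actual model is not shown here, and the exact suffix used for duplicate names depends on the friendly_id version and configuration:

class User < ActiveRecord::Base
  extend FriendlyId
  # build the slug from the user's full name; duplicate names get a suffix
  friendly_id :full_name, use: :slugged

  def full_name
    "#{first_name} #{last_name}"
  end
end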
In a situation like this, all-purpose 3rd party software can only get in the way. You would be better off going down to the SQL to do the work.
If the "slug" is a simple sequence number, then adding an
AUTO_INCREMENT
would be the obvious solution, and a permanent solution. That is, all future additions would have the slug automatically generated. That could be done with one statement:It would possibly be better for
slug
to be thePRIMARY KEY
(Please provideSHOW CREATE TABLE
.) But that would probably require a serious lock on the table; so a plain index is better. Test it. It may be "fast enough".The next thought is
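A sketch of the kind of single statement meant here (an illustration on a generic table, not a statement from this answer; note that MySQL allows only one AUTO_INCREMENT column per table, so this would not work as-is on a table whose id is already AUTO_INCREMENT):

-- add a sequence column and the index an AUTO_INCREMENT column requires, in one ALTER
ALTER TABLE t
    ADD COLUMN seq INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ADD INDEX idx_t_seq (seq);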
It would possibly be better for slug to be the PRIMARY KEY (please provide SHOW CREATE TABLE), but that would probably require a serious lock on the table, so a plain index is better. Test it; it may be "fast enough".

The next thought is pt-online-schema-change (see Percona.com), which is a special tool for effectively doing ALTERs with nearly zero impact. It involves adding a TRIGGER to capture writes, plus chunking of the copying. The slight impact comes when the "last little bit" needs to be copied. The final RENAME TABLE real TO old, new TO real; is atomic and essentially instantaneous. It even dynamically tunes its "sleep". It is an excellent tool with many years of experience put into it. But ptosc may not work for adding something as critical as the PRIMARY KEY, hence my suggestion (above) of a plain INDEX.
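For reference, a pt-online-schema-change run for an ALTER like this might look roughly as follows (an assumed invocation; the database name, column definition, and chunk size are placeholders, and the tool has many more options worth reading about):

# dry-run first; replace --dry-run with --execute to actually apply it
pt-online-schema-change \
  --alter "ADD COLUMN slug VARCHAR(255) NULL" \
  --chunk-size 1000 \
  --dry-run \
  D=myapp_production,t=users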
Setting the values (via UPDATE), one chunk at a time, is the right way to go. I have written up chunking tips; they were aimed at DELETE, but can be adapted to UPDATE.
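A sketch of the chunking pattern (illustration only: it walks the PRIMARY KEY in fixed id ranges and remembers where it left off; the SET expression is a simplistic stand-in, since the real slugs come from friendly_id and must handle duplicate names):

-- one chunk; repeat with the next id range (1000..2000, 2000..3000, ...) until the max id is reached
UPDATE users
   SET slug = LOWER(CONCAT(first_name, '-', last_name))
 WHERE slug IS NULL
   AND id > 0          -- last id handled by the previous chunk
   AND id <= 1000;     -- previous id + chunk size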
Without knowing what is "under the covers" in find_in_batches(), I cannot say whether it is good or bad. I do know that OFFSET is almost always bad; "remembering where you left off" is usually much better. But that is hard to do if you don't already have a UNIQUE or PRIMARY key. PRIMARY is better because of its clustering. (Please provide SHOW CREATE TABLE, so I don't have to make guesses.)

If your sample code starts over at the beginning of the table each time, then it is as bad as using OFFSET -- each iteration will be slower than the previous one because it is skipping over more and more rows.

After adding a column, be sure to check all references to the table -- SELECT * will now have one more column (one reason for not using *). UPDATEs and INSERTs may work with the missing column, but you need to check.

Update

There were two steps -- add the slug column, and populate it. You have done the first step.

To do the second step, I recommend stepping through the table 100 rows at a time, using the AUTO_INCREMENT PRIMARY KEY. 100 is low enough to be not-too-invasive. The AI PK will cover the entire table, and is efficient, so you don't need a slow OFFSET or a search for the un-slugged set. I discuss efficient chunking here; it is written with DELETE in mind, but the techniques apply to UPDATE.
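In Rails terms, that recommendation might look roughly like the hypothetical rake task below (not part of this answer): it walks the id range in chunks of 100 and relies on friendly_id to fill in the slug on save, instead of rescanning for "slug is null" from the start of the table each time:

task :backfill_slugs => :environment do
  chunk = 100
  last_id = 0
  max_id = User.maximum(:id).to_i
  while last_id < max_id
    # only touch rows in this id range that still need a slug
    User.where("slug is null and email is not null")
        .where(id: (last_id + 1)..(last_id + chunk))
        .each(&:save!)
    last_id += chunk
  end
end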