Data Update Scripts
We’re currently making rapid changes to the product so our docs may be out of date. If you need help, please email yo@forem.com.
What are Data Update Scripts?
Data Update Scripts were introduced in
this PR and allow us to run any data
updates we might need. For example, if we added a column to the database and
then wanted to backfill that column with data, rather than going and manually
doing it in a console, we would use a DataUpdateScript
.
How it works
First off, we added a DataUpdateScript model to Rails and a corresponding database table. This table is what we use to keep track of what scripts have been run and which ones have not/still need to be.
To create a script you can use our custom Rails generator:
rails generate data_update BackfillColumnForArticles
This will create a simple Ruby class like below and all you have to do is fill in the code it will run.
module DataUpdateScripts
class BackfillColumnForArticles
def run
# Place your data update logic here
# Make sure your code is idempotent and can be run safely
# multiple times at any time
end
end
end
The generator will also automatically create the corresponding spec file.
require "rails_helper"
require Rails.root.join(
"lib/data_updates/20201103042915_backfill_column_for_articles.rb",
)
describe DataUpdateScripts::BackfillColumnForArticles do
pending "add some examples to (or delete) #{__FILE__}"
end
While we encourage adding tests for data update scripts, you can skip spec
creation by adding the --no-spec
option to the rails generate
command:
rails generate data_update BackfillColumnForArticles --no-spec
Once your script is in place then you can either run rails data_updates:run
manually or you can let our setup script handle it. In our local
bin/setup script you will
see we have added an additional task to update data. This kicks off the rake
task data_updates:run
for you.
The rake task itself will check the lib/data_update_scripts
folder to see if
there are any new scripts that need to be run. It does this by reading all of
the files and then checking to see if they have a corresponding database entry.
If they do not, then we create a new one and run the script. If a database entry
already exists and it indicates the script has been run, then we skip that
script.
In production
DataUpdateScripts
are also run automatically when a production deploy goes out.
However, to ensure the new code they need to use has been deployed we use a
DataUpdateWorker
via Sidekiq and set it to run 10 minutes after the deploy script has completed.
Best practices
Working with large collections of rows
From time to time, scripts need to operate on a large amount of rows; in those cases we encourage:
- adding explicit logging to the script.
- reversing the order, to start processing the most recent records first.
For example:
def run
Article.find_each(order: :desc).with_index do |article, index|
Rails.logging.info("...") if index % 1000 == 0 # this will log every 1000 articles
article.save
end
end