
Change byte_size in ActiveStorage::Blob #2621

Closed
pgwillia opened this issue Nov 15, 2021 · 3 comments

Describe the bug
We had a request from a researcher to attach several 2.8 GB zip files to an existing object. We couldn't fulfill this request because of the way we store metadata about the file: we try to store the byte_size of the file (2,968,164,521) in a 4-byte integer column, whose maximum value is 2,147,483,647, and it doesn't fit.
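
For concreteness, the arithmetic behind the failure (a quick irb check; the numbers come straight from the error below):

int4_max = 2**31 - 1        # => 2_147_483_647, the ceiling of a 4-byte signed integer
byte_size = 2_968_164_521   # the 2.8 GB zip file
byte_size > int4_max        # => true, hence the ActiveModel::RangeError
2**63 - 1                   # => 9_223_372_036_854_775_807, what a bigint can hold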

To Reproduce
Steps to reproduce the behavior:

  1. Go to the rails console
  2. Create a variable item holding the Item the zip files should be attached to.
  3. Attempt to add and ingest one file:
item.add_and_ingest_files([File.open('/home/pjenkins/Downloads/ym2ym_absement-004.zip')])
  ActiveStorage::Blob Load (1.5ms)  SELECT "active_storage_blobs".* FROM "active_storage_blobs" INNER JOIN "active_storage_attachments" ON "active_storage_blobs"."id" = "active_storage_attachments"."blob_id" WHERE "active_storage_attachments"."record_id" = $1 AND "active_storage_attachments"."record_type" = $2 AND "active_storage_attachments"."name" = $3  [["record_id", "b470bdac-540b-4356-a796-eba1de754e6c"], ["record_type", "Item"], ["name", "files"]]
  TRANSACTION (0.9ms)  BEGIN
  User Load (0.7ms)  SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
  Community Load (0.5ms)  SELECT "communities".* FROM "communities" WHERE "communities"."id" = $1 LIMIT $2  [["id", "46972899-2299-4913-b288-745889a8ee17"], ["LIMIT", 1]]
  Collection Load (0.6ms)  SELECT "collections".* FROM "collections" WHERE "collections"."id" = $1 LIMIT $2  [["id", "1ac2a45a-d684-463a-94f1-674112489536"], ["LIMIT", 1]]
  ActiveStorage::Attachment Load (0.8ms)  SELECT "active_storage_attachments".* FROM "active_storage_attachments" WHERE "active_storage_attachments"."record_id" = $1 AND "active_storage_attachments"."record_type" = $2 AND "active_storage_attachments"."name" = $3  [["record_id", "b470bdac-540b-4356-a796-eba1de754e6c"], ["record_type", "Item"], ["name", "files"]]
  TRANSACTION (0.5ms)  ROLLBACK
ActiveModel::RangeError: 2968164521 is out of range for ActiveModel::Type::Integer with limit 4 bytes
from /home/pjenkins/.asdf/installs/ruby/2.6.8/lib/ruby/gems/2.6.0/gems/activemodel-6.1.4.1/lib/active_model/type/integer.rb:49:in `ensure_in_range'

Expected behavior
Metadata attributes should not be a limitation for adding and ingesting files.

Additional context
#2617
https://github.com/ualbertalib/digital-preservation/issues/22


pgwillia commented Nov 15, 2021

Strong Migrations

Strong Migrations will catch unsafe migrations in development

✓ Detects potentially dangerous operations
✓ Prevents them from running by default
✓ Provides instructions on safer ways to do what you want

  • Supports PostgreSQL, MySQL, and MariaDB
  • Battle tested at Instacart

This is what they say about changing the type of a column:

Changing the type of a column causes the entire table to be rewritten. During this time, reads and writes are blocked in Postgres, and writes are blocked in MySQL and MariaDB.

A safer approach is to:
  - Create a new column
  - Write to both columns
  - Backfill data from the old column to the new column
  - Move reads from the old column to the new column
  - Stop writing to the old column
  - Drop the old column

pgwillia commented

First pass

My initial response to this issue was to just change the column type.

#  db/migrate/20211110162254_change_byte_size_in_active_storage_blob.rb
class ChangeByteSizeInActiveStorageBlob < ActiveRecord::Migration[6.1]
  def change
    change_column :active_storage_blobs, :byte_size, :bigint
  end
end

The reviewer of my proposed change kindly pointed out that

This kind of migration will cause a "lock" on the table, which could prevent any reads/writes for a few seconds. Think we are okay since we deploy after hours and the number of users we have is low but it's something to be aware of.

The management team wants to get this out the door and can tolerate small outages, so this version of the change is what we're going to use.
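
Worth noting: with strong_migrations installed, this change_column is one of the operations it blocks by default, so shipping it as-is means wrapping it in safety_assured { ... } to acknowledge the table rewrite.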


pgwillia commented Nov 17, 2021

Is there a better place for this than in an issue comment?

Zero downtime

From Zero Downtime Migrations in Rails

The rule we have to adhere to is that all migrations need to be compatible with the code that is already running. This involves deploying your changes in multiple steps:

  1. Make your code compatible with the migration you need to run
  2. Run the migration
  3. Remove code specifically written to allow the migration to run

This means that instead of using a single deployment to make our code and database changes at once, we will use multiple deployments to ensure each change is compatible with the code that is already running.

Let's work through an example

This is what strong_migrations says about changing the type of a column:

Changing the type of a column causes the entire table to be rewritten. During this time, reads and writes are blocked in Postgres, and writes are blocked in MySQL and MariaDB.

A safer approach is to:
  - Create a new column
  - Write to both columns
  - Backfill data from the old column to the new column
  - Move reads from the old column to the new column
  - Stop writing to the old column
  - Drop the old column

Create a new column

#  db/migrate/20211110162254_add_byte_size_bigint_to_active_storage_blob.rb
class AddByteSizeBigintToActiveStorageBlob < ActiveRecord::Migration[6.1]
  def change
    add_column :active_storage_blobs, :byte_size_bigint, :bigint
  end
end

Write to both columns

# app/models/active_storage/blob.rb
def byte_size=(value)
  self[:byte_size] = value
  self[:byte_size_bigint] = value
end

def byte_size
  self[:byte_size]
end

This could also be done with a database trigger in SQL.
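
For illustration, the trigger version might look something like this — an untested sketch assuming Postgres 11+ (the function and trigger names are made up):

class AddByteSizeSyncTrigger < ActiveRecord::Migration[6.1]
  def up
    execute <<~SQL
      CREATE OR REPLACE FUNCTION sync_byte_size_bigint() RETURNS trigger AS $$
      BEGIN
        NEW.byte_size_bigint := NEW.byte_size;
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER sync_byte_size_bigint
        BEFORE INSERT OR UPDATE ON active_storage_blobs
        FOR EACH ROW EXECUTE FUNCTION sync_byte_size_bigint();
    SQL
  end

  def down
    execute <<~SQL
      DROP TRIGGER IF EXISTS sync_byte_size_bigint ON active_storage_blobs;
      DROP FUNCTION IF EXISTS sync_byte_size_bigint();
    SQL
  end
end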

Backfill data from the old column to the new column

There seem to be a few ways people have approached backfilling the data.

There are three keys to backfilling safely: batching, throttling, and running it outside a transaction. Use the Rails console or a separate migration with disable_ddl_transaction!.

In the migration

class BackfillByteSize < ActiveRecord::Migration[6.1]
  disable_ddl_transaction!

  def up
    ActiveStorage::Blob.find_each do |blob|
      blob.update_column(:byte_size_bigint, blob.byte_size) # skips validations and callbacks
      sleep(0.01) # throttle
    end
  end
end

In the migration with SQL

class BackfillByteSize < ActiveRecord::Migration[6.1]
  def up
    execute "UPDATE active_storage_blobs SET byte_size_bigint = byte_size;"
  end
end
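
One caveat with this version: a single UPDATE touches every row in one transaction, which is exactly what the batching and throttling advice above is trying to avoid, so it's probably only appropriate for smaller tables.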

In the migration but with a gem

Change data in migrations like a boss cautions:

  • against using model class names in migrations, because naming is hard and classes drift; if old migrations aren't updated when a class is renamed, those migrations will fail
  • you can redefine the class inside the migration, but this won't work with polymorphic associations
  • you can use raw SQL, but that requires strong SQL-fu

Their solution is migration_data, which lets you put data migrations in a def data; end block within the database migration and then write tests to ensure the data migrations work. I probably wouldn't go down this route, because the way it couples db and data migrations doesn't solve the zero-downtime problem, but it is an option. Its real value is in discussing the downsides of the other methods.

Another possibly better solution from a gem

data_migrate adds rake tasks to migrate data alongside schema changes. Data migrations are stored in db/data. They act like schema migrations, except they should be reserved for data migrations. For instance, if you realize you need to titleize all your titles, this is the place to do it.

rails g data_migration add_this_to_that creates a migration in db/data. rails db:migrate is unchanged and won't run data migrations, but rails db:migrate:with_data runs them alongside the schema migrations. The timestamps ensure that migrations happen in a particular order. This gem might be a good option in combination with the in-migration advice above.
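
As a sketch of what that might look like here (following data_migrate's db/data convention; the timestamp and class name are illustrative):

# db/data/20211117000000_backfill_byte_size_bigint.rb
class BackfillByteSizeBigint < ActiveRecord::Migration[6.1]
  def up
    ActiveStorage::Blob.in_batches do |batch|
      batch.update_all("byte_size_bigint = byte_size")
      sleep(0.01) # throttle
    end
  end

  def down
    # backfill only; nothing to undo
  end
end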

Short Lived Task

If a rake task fails it doesn't matter, but if a migration fails it must be undone manually in order to resume production deploys.
For complicated stuff we usually add a dry run to the rake task.
It can also take a couple hours and it’s not a big deal.

# lib/tasks/data_migration.rake
namespace :data_migration do
  desc 'backfill the byte_size_bigint column'
  task backfill_byte_size: :environment do
    ActiveStorage::Blob.in_batches.update_all("byte_size_bigint = byte_size")
  end
end
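
With that in place, the backfill runs as bundle exec rake data_migration:backfill_byte_size.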

Move reads from the old column to the new column

# app/models/active_storage/blob.rb
def byte_size
  self[:byte_size_bigint]
end

Stop writing to the old column

# app/models/active_storage/blob.rb
self.ignored_columns = ["byte_size"] # strings, not symbols, in Rails 6.1

def byte_size=(value)
  self[:byte_size_bigint] = value
end

Another way I've seen this done:

# app/models/active_storage/blob.rb
def self.columns
  super.reject { |column| column.name == "byte_size" }
end
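
If I remember right, ignored_columns has been the supported API for this since Rails 5, so the self.columns override is mostly a relic from before it existed.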

Drop the old column

class RemoveByteSizeFromActiveStorageBlob < ActiveRecord::Migration[6.1]
  def change
    safety_assured { remove_column :active_storage_blobs, :byte_size, :integer }
  end
end

Fewer deployments

This is probably my bias showing (deployments are hard for us right now), but I think this could be done in three deployments:

  • first deployment
    • Create a new column
    • Write to both columns
    • Backfill data from the old column to the new column
  • second deployment
    • Move reads from the old column to the new column
    • Stop writing to the old column
  • third deployment
    • Drop the old column

Renaming the Column

If your end goal is to restore the column's original name, you would then have to repeat the same set of steps (step one is sketched after the list):

  • Create a new column with the original column name
  • Write to both columns
  • Backfill data from the temporary column to the new column
  • Move reads from the temporary column to the new column
  • Stop writing to the temporary column
  • Drop the temporary column
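
For instance, step one of that second pass might look like this (a sketch, assuming the old integer column has already been dropped):

class AddByteSizeBackToActiveStorageBlob < ActiveRecord::Migration[6.1]
  def change
    add_column :active_storage_blobs, :byte_size, :bigint
  end
end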
