
Change byte_size in ActiveStorage::Blob #2621

Closed
pgwillia opened this issue Nov 15, 2021 · 3 comments

Describe the bug
We had a request from a researcher to attach several 2.8 GB zip files to an existing object. We couldn't fulfill this request because of the way we store metadata about the file: we try to store the byte_size of the file (2,968,164,521) in a 4-byte integer column, whose maximum value is 2,147,483,647, and it doesn't fit.
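
For concreteness, the arithmetic behind the failure (a quick irb check; the numbers come straight from the error below):

int4_max = 2**31 - 1        # => 2_147_483_647, the ceiling of a 4-byte signed integer
byte_size = 2_968_164_521   # the 2.8 GB zip file
byte_size > int4_max        # => true, hence the ActiveModel::RangeError
2**63 - 1                   # => 9_223_372_036_854_775_807, what a bigint can hold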

To Reproduce
Steps to reproduce the behavior:

  1. Go to the rails console
  2. Create a variable item holding the Item the zip files should be attached to.
  3. Attempt to add and ingest one file:
item.add_and_ingest_files([File.open('/home/pjenkins/Downloads/ym2ym_absement-004.zip')])
  ActiveStorage::Blob Load (1.5ms)  SELECT "active_storage_blobs".* FROM "active_storage_blobs" INNER JOIN "active_storage_attachments" ON "active_storage_blobs"."id" = "active_storage_attachments"."blob_id" WHERE "active_storage_attachments"."record_id" = $1 AND "active_storage_attachments"."record_type" = $2 AND "active_storage_attachments"."name" = $3  [["record_id", "b470bdac-540b-4356-a796-eba1de754e6c"], ["record_type", "Item"], ["name", "files"]]
  TRANSACTION (0.9ms)  BEGIN
  User Load (0.7ms)  SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT $2  [["id", 1], ["LIMIT", 1]]
  Community Load (0.5ms)  SELECT "communities".* FROM "communities" WHERE "communities"."id" = $1 LIMIT $2  [["id", "46972899-2299-4913-b288-745889a8ee17"], ["LIMIT", 1]]
  Collection Load (0.6ms)  SELECT "collections".* FROM "collections" WHERE "collections"."id" = $1 LIMIT $2  [["id", "1ac2a45a-d684-463a-94f1-674112489536"], ["LIMIT", 1]]
  ActiveStorage::Attachment Load (0.8ms)  SELECT "active_storage_attachments".* FROM "active_storage_attachments" WHERE "active_storage_attachments"."record_id" = $1 AND "active_storage_attachments"."record_type" = $2 AND "active_storage_attachments"."name" = $3  [["record_id", "b470bdac-540b-4356-a796-eba1de754e6c"], ["record_type", "Item"], ["name", "files"]]
  TRANSACTION (0.5ms)  ROLLBACK
ActiveModel::RangeError: 2968164521 is out of range for ActiveModel::Type::Integer with limit 4 bytes
from /home/pjenkins/.asdf/installs/ruby/2.6.8/lib/ruby/gems/2.6.0/gems/activemodel-6.1.4.1/lib/active_model/type/integer.rb:49:in `ensure_in_range'

Expected behavior
Metadata attributes should not be a limitation for adding and ingesting files.

Additional context
#2617
https://github.com/ualbertalib/digital-preservation/issues/22


pgwillia commented Nov 15, 2021

Strong Migrations

Strong Migrations will catch unsafe migrations in development

✓ Detects potentially dangerous operations
✓ Prevents them from running by default
✓ Provides instructions on safer ways to do what you want

  • Supports PostgreSQL, MySQL, and MariaDB
  • Battle tested at Instacart

This is what they say about changing the type of a column:

Changing the type of a column causes the entire table to be rewritten. During this time, reads and writes are blocked in Postgres, and writes are blocked in MySQL and MariaDB.

A safer approach is to:
  - Create a new column
  - Write to both columns
  - Backfill data from the old column to the new column
  - Move reads from the old column to the new column
  - Stop writing to the old column
  - Drop the old column

pgwillia commented

First pass

My initial response to this issue was to just change the column type.

#  db/migrate/20211110162254_change_byte_size_in_active_storage_blob.rb
class ChangeByteSizeInActiveStorageBlob < ActiveRecord::Migration[6.1]
  def change
    change_column :active_storage_blobs, :byte_size, :bigint
  end
end

The reviewer of my proposed change kindly pointed out that

This kind of migration will cause a "lock" on the table, which could prevent any reads/writes for a few seconds. Think we are okay since we deploy after hours and the number of users we have is low but it's something to be aware of.

The management team wants to get this out the door and can tolerate small outages, so this version of the change is what we're going to use.
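
Worth noting: with strong_migrations installed, this change_column is one of the operations it blocks by default, so shipping it as-is means wrapping it in safety_assured { ... } to acknowledge the table rewrite.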


pgwillia commented Nov 17, 2021

Is there a better place for this than in an issue comment?

Zero downtime

From Zero Downtime Migrations in Rails

The rule we have to adhere to is that all migrations need to be compatible with the code that is already running. This involves deploying your changes in multiple steps:

  1. Make your code compatible with the migration you need to run
  2. Run the migration
  3. Remove code specifically written to allow the migration to run

This means that instead of using a single deployment to make our code and database changes at once, we will use multiple deployments to ensure each change is compatible with the code that is already running.

Let's work through an example

This is what strong_migrations says about changing the type of a column:

Changing the type of a column causes the entire table to be rewritten. During this time, reads and writes are blocked in Postgres, and writes are blocked in MySQL and MariaDB.

A safer approach is to:
  - Create a new column
  - Write to both columns
  - Backfill data from the old column to the new column
  - Move reads from the old column to the new column
  - Stop writing to the old column
  - Drop the old column

Create a new column

#  db/migrate/20211110162254_add_byte_size_bigint_to_active_storage_blob.rb
class AddByteSizeBigintToActiveStorageBlob < ActiveRecord::Migration[6.1]
  def change
    add_column :active_storage_blobs, :byte_size_bigint, :bigint
  end
end

Write to both columns

# app/models/active_storage/blob.rb
def byte_size=(value)
  self[:byte_size] = value
  self[:byte_size_bigint] = value
end

def byte_size
  self[:byte_size]
end

This could also be done with a database trigger in SQL.
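
For illustration, the trigger version might look something like this — an untested sketch assuming Postgres 11+ (the function and trigger names are made up):

class AddByteSizeSyncTrigger < ActiveRecord::Migration[6.1]
  def up
    execute <<~SQL
      CREATE OR REPLACE FUNCTION sync_byte_size_bigint() RETURNS trigger AS $$
      BEGIN
        NEW.byte_size_bigint := NEW.byte_size;
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER sync_byte_size_bigint
        BEFORE INSERT OR UPDATE ON active_storage_blobs
        FOR EACH ROW EXECUTE FUNCTION sync_byte_size_bigint();
    SQL
  end

  def down
    execute <<~SQL
      DROP TRIGGER IF EXISTS sync_byte_size_bigint ON active_storage_blobs;
      DROP FUNCTION IF EXISTS sync_byte_size_bigint();
    SQL
  end
end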

Backfill data from the old column to the new column

There seem to be a few ways people have approached backfilling the data.

There are three keys to backfilling safely: batching, throttling, and running it outside a transaction. Use the Rails console or a separate migration with disable_ddl_transaction!.

In the migration

class BackfillByteSize < ActiveRecord::Migration[6.1]
  disable_ddl_transaction!

  def up
    ActiveStorage::Blob.find_each do |blob|
      blob.update_column(:byte_size_bigint, blob.byte_size) # skips validations and callbacks
      sleep(0.01) # throttle
    end
  end
end

In the migration with SQL

class BackfillByteSize < ActiveRecord::Migration[6.1]
  def up
    execute "UPDATE active_storage_blobs SET byte_size_bigint = byte_size;"
  end
end
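
One caveat with this version: a single UPDATE touches every row in one transaction, which is exactly what the batching and throttling advice above is trying to avoid, so it's probably only appropriate for smaller tables.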

In the migration but with a gem

Change data in migrations like a boss cautions:

  • against using model class names in migrations, because naming is hard and classes drift; if old migrations aren't updated when a class is renamed, those migrations will fail
  • you can redefine the class inside the migration, but this won't work with polymorphic associations
  • you can use raw SQL, but that requires strong SQL-fu

Their solution is migration_data, which lets you put data migrations in a def data; end block within the database migration and then write tests to ensure the data migrations work. I probably wouldn't go down this route, because the way it couples db and data migrations doesn't solve the zero-downtime problem, but it is an option. Its real value is in discussing the downsides of the other methods.

Another possibly better solution from a gem

data_migrate adds rake tasks to migrate data alongside schema changes. Data migrations are stored in db/data. They act like schema migrations, except they should be reserved for data migrations. For instance, if you realize you need to titleize all your titles, this is the place to do it.

rails g data_migration add_this_to_that creates a migration in db/data. rails db:migrate is unchanged and won't run data migrations, but rails db:migrate:with_data runs them alongside the schema migrations. The timestamps ensure that migrations happen in a particular order. This gem might be a good option in combination with the in-migration advice above.
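
As a sketch of what that might look like here (following data_migrate's db/data convention; the timestamp and class name are illustrative):

# db/data/20211117000000_backfill_byte_size_bigint.rb
class BackfillByteSizeBigint < ActiveRecord::Migration[6.1]
  def up
    ActiveStorage::Blob.in_batches do |batch|
      batch.update_all("byte_size_bigint = byte_size")
      sleep(0.01) # throttle
    end
  end

  def down
    # backfill only; nothing to undo
  end
end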

Short Lived Task

If a rake task fails it doesn't matter, but if a migration fails it must be undone manually in order to resume production deploys.
For complicated stuff we usually add a dry run to the rake task.
It can also take a couple hours and it’s not a big deal.

# lib/tasks/data_migration.rake
namespace :data_migration do
  desc 'backfill the byte_size_bigint column'
  task backfill_byte_size: :environment do
    ActiveStorage::Blob.in_batches.update_all("byte_size_bigint = byte_size")
  end
end
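
With that in place, the backfill runs as bundle exec rake data_migration:backfill_byte_size.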

Move reads from the old column to the new column

# app/models/active_storage/blob.rb
def byte_size
  self[:byte_size_bigint]
end

Stop writing to the old column

# app/models/active_storage/blob.rb
self.ignored_columns = ["byte_size"] # strings, not symbols, in Rails 6.1

def byte_size=(value)
  self[:byte_size_bigint] = value
end

Another way I've seen this done:

# app/models/active_storage/blob.rb
def self.columns
  super.reject { |column| column.name == "byte_size" }
end
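
If I remember right, ignored_columns has been the supported API for this since Rails 5, so the self.columns override is mostly a relic from before it existed.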

Drop the old column

class RemoveByteSizeFromActiveStorageBlob < ActiveRecord::Migration[6.1]
  def change
    safety_assured { remove_column :active_storage_blobs, :byte_size, :integer }
  end
end

Fewer deployments

This is probably my bias showing (deployments are hard for us right now), but I think this could be done in three deployments:

  • first deployment
    • Create a new column
    • Write to both columns
    • Backfill data from the old column to the new column
  • second deployment
    • Move reads from the old column to the new column
    • Stop writing to the old column
  • third deployment
    • Drop the old column

Renaming the Column

If your end goal is to restore the column's original name, you would then have to repeat the same set of steps (step one is sketched after the list):

  • Create a new column with the original column name
  • Write to both columns
  • Backfill data from the temporary column to the new column
  • Move reads from the temporary column to the new column
  • Stop writing to the temporary column
  • Drop the temporary column
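
For instance, step one of that second pass might look like this (a sketch, assuming the old integer column has already been dropped):

class AddByteSizeBackToActiveStorageBlob < ActiveRecord::Migration[6.1]
  def change
    add_column :active_storage_blobs, :byte_size, :bigint
  end
end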
