Convert longer media `varchar` fields to `text` in the catalog db #4357
Conversation
The migrations look good; lgtm
Full-stack documentation: https://docs.openverse.org/_preview/4357

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub Pages deployment action list to see the current status of the deployments.
Nice catch!
LGTM! Manually running the migrations locally works just fine 👍 I really can't wait for #1836; manually running destructive commands without any real way to automatically test that they actually work is... a nightmare 🙂
Anyway, this does work, as far as I can tell, so LGTM!
> Once the above migration is reviewed and approved, we can run it on the catalog database and merge this PR.
@AetherUnbound will we need to pause running DAGs while this happens? I don't know whether any of the DAGs create long transactions, but if they do, is there any risk we need to consider when applying this migration to prevent a deadlock or some other timing issue?
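For reference, a minimal sketch of how one could check for long-running transactions and guard the migration (my addition, not from the thread; assumes direct psql access to the catalog database): the `ALTER TABLE` statements need an `ACCESS EXCLUSIVE` lock, so they will queue behind any open transaction touching these tables, and everything else will queue behind them in turn.

```sql
-- Sketch: list transactions that have been open a while; the migration's
-- ALTER TABLE will block behind any of these on the affected tables.
SELECT pid, state, xact_start, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;

-- Sketch: fail fast instead of silently queueing if the lock is contended.
SET lock_timeout = '5s';
```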
```sql
(audio.audio_set ->> 'url'::text)::character varying(1000) AS url,
(audio.audio_set ->> 'filesize'::text)::integer AS filesize,
(audio.audio_set ->> 'filetype'::text)::character varying(80) AS filetype,
(audio.audio_set ->> 'thumbnail'::text)::character varying(1000) AS thumbnail,
```
This is a nice simplification 😀
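Presumably the simplification is that, once the underlying columns are `text`, expressions like the ones quoted above no longer need their `character varying(n)` casts. A sketch of what that might look like (my reconstruction, not the PR's actual diff):

```sql
-- Sketch: with text columns, only the non-text casts remain necessary.
(audio.audio_set ->> 'url') AS url,
(audio.audio_set ->> 'filesize')::integer AS filesize,
(audio.audio_set ->> 'filetype') AS filetype,
(audio.audio_set ->> 'thumbnail') AS thumbnail,
```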
We should definitely wait for an opportunity to apply this when there are no running DAGs or data refresh. Currently we're running a staging data refresh for testing purposes, so we'll want to wait until that's done copying data from the upstream database as well before running this migration. I'll wait to merge this until it's applied.
I applied the steps above in order to the production database, with one exception:

```
"audio_provider_fid_idx" UNIQUE, btree (provider, md5(foreign_identifier::text))
```

I theorized that we could drop the index, apply the alter, then re-add the index without the type coercion, and that ended up working! So for each table, here are the additional steps I took:

```sql
-- Audio update
DROP INDEX IF EXISTS audio_provider_fid_idx;
ALTER TABLE audio ALTER COLUMN foreign_identifier TYPE TEXT USING foreign_identifier::TEXT;
CREATE UNIQUE INDEX audio_provider_fid_idx
    ON public.audio
    USING btree (provider, md5(foreign_identifier));

-- Image update
DROP INDEX IF EXISTS image_provider_fid_idx;
ALTER TABLE image ALTER COLUMN foreign_identifier TYPE TEXT USING foreign_identifier::TEXT;
CREATE UNIQUE INDEX image_provider_fid_idx
    ON public.image
    USING btree (provider, md5(foreign_identifier));
```

These are now applied, and the catalog database tables look as expected:

```
deploy@localhost:openledger> \d audio;
+------------------------------+--------------------------+--------------------------------------+
| Column | Type | Modifiers |
|------------------------------+--------------------------+--------------------------------------|
| identifier | uuid | not null default uuid_generate_v4() |
| created_on | timestamp with time zone | not null |
| updated_on | timestamp with time zone | not null |
| ingestion_type | character varying(80) | |
| provider | character varying(80) | |
| source | character varying(80) | |
| foreign_identifier | text | |
| foreign_landing_url | text | |
| url | text | not null |
| thumbnail | text | |
| filetype | character varying(5) | |
| duration | integer | |
| bit_rate | integer | |
| sample_rate | integer | |
| category | character varying(80) | |
| genres | character varying(80)[] | |
| audio_set | jsonb | |
| set_position | integer | |
| alt_files | jsonb | |
| filesize | integer | |
| license | character varying(50) | not null |
| license_version | character varying(25) | |
| creator | text | |
| creator_url | text | |
| title | text | |
| meta_data | jsonb | |
| tags | jsonb | |
| watermarked | boolean | |
| last_synced_with_source | timestamp with time zone | |
| removed_from_source | boolean | not null |
| standardized_popularity | double precision | |
| audio_set_foreign_identifier | text | |
+------------------------------+--------------------------+--------------------------------------+
Indexes:
"audio_pkey" PRIMARY KEY, btree (identifier)
"audio_provider_fid_idx" UNIQUE, btree (provider, md5(foreign_identifier))
"audio_url_key" UNIQUE, btree (url)
Time: 0.518s
deploy@localhost:openledger> \d image;
+-------------------------+--------------------------+--------------------------------------+
| Column | Type | Modifiers |
|-------------------------+--------------------------+--------------------------------------|
| identifier | uuid | not null default uuid_generate_v4() |
| created_on | timestamp with time zone | not null |
| updated_on | timestamp with time zone | not null |
| ingestion_type | character varying(80) | |
| provider | character varying(80) | |
| source | character varying(80) | |
| foreign_identifier | text | |
| foreign_landing_url | text | |
| url | text | not null |
| thumbnail | text | |
| width | integer | |
| height | integer | |
| filesize | integer | |
| license | character varying(50) | not null |
| license_version | character varying(25) | |
| creator | text | |
| creator_url | text | |
| title | text | |
| meta_data | jsonb | |
| tags | jsonb | |
| watermarked | boolean | |
| last_synced_with_source | timestamp with time zone | |
| removed_from_source | boolean | not null |
| filetype | character varying(5) | |
| category | character varying(80) | |
| standardized_popularity | double precision | |
+-------------------------+--------------------------+--------------------------------------+
Indexes:
"image_pkey" PRIMARY KEY, btree (identifier)
"image_provider_fid_idx" UNIQUE, btree (provider, md5(foreign_identifier))
"image_url_key" UNIQUE, btree (url)
Time: 0.515s
```
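A note on why the index hashes `foreign_identifier` (my addition, not from the thread): btree index entries are size-limited in PostgreSQL (roughly a third of a page, about 2704 bytes), so indexing `md5(foreign_identifier)` keeps arbitrarily long text identifiers indexable. Lookups still use the index as long as the predicate matches the indexed expression:

```sql
-- Sketch: a lookup that can use audio_provider_fid_idx, because the WHERE
-- clause matches the indexed expression (provider, md5(foreign_identifier)).
-- The provider and identifier values here are hypothetical examples.
SELECT identifier
FROM audio
WHERE provider = 'jamendo'
  AND md5(foreign_identifier) = md5('track/12345');
```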
Fixes
Fixes #4312 by @AetherUnbound
Description
This PR alters the SQL DDL files for the catalog database to reflect the changes that will be made to the existing catalog database with the following migration:
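(The migration block itself isn't reproduced in this capture. Judging from the schema output above, it presumably consisted of `ALTER COLUMN ... TYPE TEXT` statements along these lines — a sketch, not the PR's exact migration; `foreign_identifier` needed the separate index handling described in the comment above, and the `audioset_view` changes are omitted:)

```sql
-- Sketch only: convert the long varchar columns on both media tables to text.
-- varchar -> text is binary-coercible, so no USING clause is needed.
ALTER TABLE audio
    ALTER COLUMN foreign_landing_url TYPE TEXT,
    ALTER COLUMN url TYPE TEXT,
    ALTER COLUMN thumbnail TYPE TEXT,
    ALTER COLUMN creator TYPE TEXT,
    ALTER COLUMN creator_url TYPE TEXT,
    ALTER COLUMN title TYPE TEXT;

ALTER TABLE image
    ALTER COLUMN foreign_landing_url TYPE TEXT,
    ALTER COLUMN url TYPE TEXT,
    ALTER COLUMN thumbnail TYPE TEXT,
    ALTER COLUMN creator TYPE TEXT,
    ALTER COLUMN creator_url TYPE TEXT,
    ALTER COLUMN title TYPE TEXT;
```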
Once the above migration is reviewed and approved, we can run it on the catalog database and merge this PR.
I've also updated some of the mock schemas for the ingestion server testing. We won't need these in perpetuity, fortunately, thanks to #3925.
Testing Instructions

1. Check out `main`.
2. Run `just down -v && just api/init`.
3. Run `just catalog/pgcli`, then run the above migrations in order (you'll need to copy out each block below the comments; pgcli doesn't seem to play nicely with comments).
4. Run `\d image`, `\d audio`, and `\d audioset_view` in the shell to verify the tables are now using `text`.
5. Run `just c` to start the catalog and start the Jamendo and SMK DAGs to pull in new data.
6. Run `just down -v && just catalog/pgcli` and verify that the freshly created tables/views now use `text` (a query-based alternative to `\d` is sketched below).
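As an alternative to eyeballing the `\d` output, a quick check against `information_schema` works too (a sketch, not part of the PR's instructions; the column list is taken from the schemas shown above):

```sql
-- Sketch: confirm the long fields on the image table are now text.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'image'
  AND column_name IN ('foreign_identifier', 'foreign_landing_url', 'url',
                      'thumbnail', 'creator', 'creator_url', 'title');
```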
Checklist

- [ ] My pull request has a descriptive title (not a vague title like `Update index.md`).
- [ ] My pull request targets the default branch of the repository (`main`) or a parent feature branch.
- [ ] I ran the DAG documentation generator (`just catalog/generate-docs` for catalog PRs) or the media properties generator (`just catalog/generate-docs media-props` for the catalog or `just api/generate-docs` for the API) where applicable.

Developer Certificate of Origin