Convert longer media varchar fields to text in the API #4315

Merged
AetherUnbound merged 2 commits into main from feature/api-text-fields
May 13, 2024

AetherUnbound (Collaborator) commented May 10, 2024

Fixes

Fixes #4311 by @AetherUnbound

Description

This PR converts the listed fields from an underlying character varying type to the text type. Some of these fields used URL validation, so I created a custom field type that mirrors URLField except that it inherits from TextField rather than CharField.

Below is the SQL migration for this change. There are several indices that are dropped, but note that these are only the _like indices that optimize for varchar_pattern_ops. Since we perform all our searching with Elasticsearch and still have the standard indices, I think it's totally fine to have them be removed (and may even speed up certain operations).

BEGIN;
--
-- Alter field audio_set_foreign_identifier on audio
--
ALTER TABLE "audio" ALTER COLUMN "audio_set_foreign_identifier" TYPE text USING "audio_set_foreign_identifier"::text;
--
-- Alter field creator on audio
--
ALTER TABLE "audio" ALTER COLUMN "creator" TYPE text USING "creator"::text;
--
-- Alter field creator_url on audio
--
ALTER TABLE "audio" ALTER COLUMN "creator_url" TYPE text USING "creator_url"::text;
--
-- Alter field foreign_identifier on audio
--
DROP INDEX IF EXISTS "audio_foreign_identifier_617f66ad_like";
ALTER TABLE "audio" ALTER COLUMN "foreign_identifier" TYPE text USING "foreign_identifier"::text;
--
-- Alter field foreign_landing_url on audio
--
ALTER TABLE "audio" ALTER COLUMN "foreign_landing_url" TYPE text USING "foreign_landing_url"::text;
--
-- Alter field thumbnail on audio
--
ALTER TABLE "audio" ALTER COLUMN "thumbnail" TYPE text USING "thumbnail"::text;
--
-- Alter field title on audio
--
ALTER TABLE "audio" ALTER COLUMN "title" TYPE text USING "title"::text;
--
-- Alter field url on audio
--
DROP INDEX IF EXISTS "audio_url_b6a832d3_like";
ALTER TABLE "audio" ALTER COLUMN "url" TYPE text USING "url"::text;
--
-- Alter field creator on audioset
--
ALTER TABLE "audioset" ALTER COLUMN "creator" TYPE text USING "creator"::text;
--
-- Alter field creator_url on audioset
--
ALTER TABLE "audioset" ALTER COLUMN "creator_url" TYPE text USING "creator_url"::text;
--
-- Alter field foreign_identifier on audioset
--
DROP INDEX IF EXISTS "audioset_foreign_identifier_ef0c8e77_like";
ALTER TABLE "audioset" ALTER COLUMN "foreign_identifier" TYPE text USING "foreign_identifier"::text;
--
-- Alter field foreign_landing_url on audioset
--
ALTER TABLE "audioset" ALTER COLUMN "foreign_landing_url" TYPE text USING "foreign_landing_url"::text;
--
-- Alter field thumbnail on audioset
--
ALTER TABLE "audioset" ALTER COLUMN "thumbnail" TYPE text USING "thumbnail"::text;
--
-- Alter field title on audioset
--
ALTER TABLE "audioset" ALTER COLUMN "title" TYPE text USING "title"::text;
--
-- Alter field url on audioset
--
DROP INDEX IF EXISTS "audioset_url_144aed53_like";
ALTER TABLE "audioset" ALTER COLUMN "url" TYPE text USING "url"::text;
--
-- Alter field creator on image
--
ALTER TABLE "image" ALTER COLUMN "creator" TYPE text USING "creator"::text;
--
-- Alter field creator_url on image
--
ALTER TABLE "image" ALTER COLUMN "creator_url" TYPE text USING "creator_url"::text;
--
-- Alter field foreign_identifier on image
--
DROP INDEX IF EXISTS "image_foreign_identifier_4c72d3ee_like";
ALTER TABLE "image" ALTER COLUMN "foreign_identifier" TYPE text USING "foreign_identifier"::text;
--
-- Alter field foreign_landing_url on image
--
ALTER TABLE "image" ALTER COLUMN "foreign_landing_url" TYPE text USING "foreign_landing_url"::text;
--
-- Alter field thumbnail on image
--
ALTER TABLE "image" ALTER COLUMN "thumbnail" TYPE text USING "thumbnail"::text;
--
-- Alter field title on image
--
ALTER TABLE "image" ALTER COLUMN "title" TYPE text USING "title"::text;
--
-- Alter field url on image
--
DROP INDEX IF EXISTS "image_url_c6aabda2_like";
ALTER TABLE "image" ALTER COLUMN "url" TYPE text USING "url"::text;
COMMIT;
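
Every statement in the migration above has the same shape: an optional DROP INDEX for the column's _like index, followed by an ALTER COLUMN ... TYPE text. As a rough illustration only, that shape can be sketched in Python. The helper name `alter_to_text` is hypothetical, and this is not how Django generates the SQL; it only reproduces the pattern visible in the output.

```python
def alter_to_text(table, column, like_index=None):
    """Return the SQL statements that convert one varchar column to text.

    `like_index` is the name of the column's varchar_pattern_ops index,
    if it has one; that index is dropped before the type change.
    """
    statements = []
    if like_index is not None:
        statements.append(f'DROP INDEX IF EXISTS "{like_index}";')
    statements.append(
        f'ALTER TABLE "{table}" ALTER COLUMN "{column}" '
        f'TYPE text USING "{column}"::text;'
    )
    return statements


# Reproduce the statements for the url column on the audio table.
for statement in alter_to_text("audio", "url", like_index="audio_url_b6a832d3_like"):
    print(statement)
```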

For instance, here are the indices on the audio table:

Indexes:
    "audio_pkey" PRIMARY KEY, btree (id)
    "audio_identifier_key" UNIQUE, btree (identifier)
    "audio_url_key" UNIQUE, btree (url)
    "unique_provider_audio" UNIQUE, btree (foreign_identifier, provider)
    "audio_category_ceb7d386" btree (category)
    "audio_category_ceb7d386_like" btree (category varchar_pattern_ops)
    "audio_foreign_identifier_617f66ad" btree (foreign_identifier)
    "audio_foreign_identifier_617f66ad_like" btree (foreign_identifier varchar_pattern_ops)
    "audio_genres_e34cc474" btree (genres)
    "audio_last_synced_with_source_94c4a383" btree (last_synced_with_source)
    "audio_provider_8fe1eb54" btree (provider)
    "audio_provider_8fe1eb54_like" btree (provider varchar_pattern_ops)
    "audio_source_e9ccc813" btree (source)
    "audio_source_e9ccc813_like" btree (source varchar_pattern_ops)
    "audio_url_b6a832d3_like" btree (url varchar_pattern_ops)

Only the following indices would be removed:

  • audio_foreign_identifier_617f66ad_like
  • audio_url_b6a832d3_like

Each of these has a standard btree index which would remain after the operation.
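
The claim above can be checked mechanically against the index listing. The following is a minimal sketch, assuming the listing is pasted in as text and that an index "covers" a column when that column is its leading column. The function names `parse_indexes` and `split_indexes` are hypothetical and not part of the PR.

```python
import re

# Index listing copied from the audio table output above.
INDEX_LISTING = """
"audio_pkey" PRIMARY KEY, btree (id)
"audio_identifier_key" UNIQUE, btree (identifier)
"audio_url_key" UNIQUE, btree (url)
"unique_provider_audio" UNIQUE, btree (foreign_identifier, provider)
"audio_category_ceb7d386" btree (category)
"audio_category_ceb7d386_like" btree (category varchar_pattern_ops)
"audio_foreign_identifier_617f66ad" btree (foreign_identifier)
"audio_foreign_identifier_617f66ad_like" btree (foreign_identifier varchar_pattern_ops)
"audio_genres_e34cc474" btree (genres)
"audio_last_synced_with_source_94c4a383" btree (last_synced_with_source)
"audio_provider_8fe1eb54" btree (provider)
"audio_provider_8fe1eb54_like" btree (provider varchar_pattern_ops)
"audio_source_e9ccc813" btree (source)
"audio_source_e9ccc813_like" btree (source varchar_pattern_ops)
"audio_url_b6a832d3_like" btree (url varchar_pattern_ops)
"""

# Audio columns that the migration converts to text.
CONVERTED_COLUMNS = {
    "audio_set_foreign_identifier", "creator", "creator_url",
    "foreign_identifier", "foreign_landing_url", "thumbnail", "title", "url",
}

INDEX_RE = re.compile(r'"(?P<name>[^"]+)"\s+.*?btree \((?P<cols>[^)]*)\)')


def parse_indexes(listing):
    """Return (name, leading_column, uses_pattern_ops) for each index line."""
    parsed = []
    for line in listing.splitlines():
        match = INDEX_RE.search(line)
        if match is None:
            continue
        leading = match.group("cols").split(",")[0].strip()
        uses_pattern_ops = leading.endswith("varchar_pattern_ops")
        parsed.append((match.group("name"), leading.split()[0], uses_pattern_ops))
    return parsed


def split_indexes(listing, converted):
    """Split indexes on converted columns into dropped and kept ones."""
    indexes = parse_indexes(listing)
    # Pattern-ops indexes on converted columns are the ones that go away.
    dropped = [name for name, column, ops in indexes if ops and column in converted]
    # Every other index led by a converted column stays in place.
    kept = {}
    for name, column, ops in indexes:
        if column in converted and not ops:
            kept.setdefault(column, []).append(name)
    return dropped, kept


dropped, kept = split_indexes(INDEX_LISTING, CONVERTED_COLUMNS)
print("dropped:", dropped)
print("kept:", kept)
```

Run against the listing above, `dropped` contains the same two index names given in the list, and `kept` shows at least one remaining btree index for both url and foreign_identifier, including the unique constraints.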

Testing Instructions

CI should pass, since this is essentially only a change to the underlying database column types.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (just catalog/generate-docs for catalog
    PRs) or the media properties generator (just catalog/generate-docs media-props
    for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label May 10, 2024
@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents 🧰 goal: internal improvement Improvement that benefits maintainers, not users 💻 aspect: code Concerns the software code in the repository labels May 10, 2024
@github-actions github-actions bot added the migrations Modifications to Django migrations label May 10, 2024
@AetherUnbound AetherUnbound force-pushed the feature/api-text-fields branch from 8bc9fe8 to b0695a0 Compare May 10, 2024 23:20
@WordPress WordPress deleted a comment from github-actions bot May 10, 2024

This PR has migrations. Please rebase it before merging to ensure that conflicting migrations are not introduced.

@AetherUnbound AetherUnbound marked this pull request as ready for review May 10, 2024 23:26
@AetherUnbound AetherUnbound requested review from a team as code owners May 10, 2024 23:26

Full-stack documentation: https://docs.openverse.org/_preview/4315

Please note that GitHub pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

sarayourfriend (Collaborator) left a comment

Since we perform all our searching with Elasticsearch and still have the standard indices, I think it's totally fine to have them be removed (and may even speed up certain operations).

💯. This was my thought exactly when I read those changes in the migration.

LGTM. I've left one remark regarding the URL fields. As clarified there, it can be a fast follow or a separate issue, if you prefer to avoid bike shedding the decision here. In my opinion it isn't controversial to say we don't need write-time validators on models that are never written to using Django ORM, but I also respect that could be seen as a significant change (however much I don't see it to be, given the usage of the models). All of that to say, just want to reiterate it is not a blocker.

class URLTextField(models.TextField):
    """URL field which uses the underlying Postgres TEXT column type."""

    default_validators = [validators.URLValidator()]

Collaborator

(Up front context) I know we currently use a URL field for some of these (though this would be a new detail for creator_url, I think), so this is only relevant feedback for this PR due to the need to add new code to continue supporting it. Just wanting to clarify up front I know this isn't a decision you're making about the model, and that my intention is to ask whether we need this at all, and if not, then to give feedback that this code is unnecessary and can be removed.


This validator is the only thing that differentiates the URL field from a regular text field. I question whether we need this. Not blocking because it's a trivial thing to remove in the future or as a fast-follow if we like (would be a no-SQL migration). I'd prefer we didn't add this code, however. These write-time Django validators are irrelevant for our domain and usage of the ORM because we never write to these tables with the Django ORM (except in tests). Data validation either has to happen in the catalogue, data refresh, or not at all, but definitely not in Django write-time validators.

The change requested would be to use a TextField and forgo write-time validators as a code-quality improvement and clarification of the intention of these models and the domain they're actually concerned with (i.e. reading).

Member

That's an interesting point. I can't think of a use case right now, but I also think it doesn't hurt to leave the validation for this kind of field. I have no strong opinion on either side.

Collaborator Author

I am very happy to remove this code as a fast-follow! I wanted to make as functionally minimal a change as possible here, even if it meant adding more code. I'll follow up with an issue and a PR later this week.

Collaborator

@AetherUnbound Just checking, were you able to create the issue?

@krysal I guess generally there is a rationale for removing code that isn't used. All code is a liability, either as a vulnerability or as increased maintenance cost (the latter being most relevant and in fact exemplified here), so it's a good idea to remove as much of it as we don't need.

Collaborator Author

Created in #4320

krysal (Member) left a comment

LGTM. It's surprisingly nice how seamless the change is, and good to know the unique constraints are not affected.

Comment on lines +14 to +15
# As with CharField, this will cause URL validation to be performed
# twice.

Member

Twice meaning it would happen in the form and at the database level?

@krysal krysal mentioned this pull request May 13, 2024
@AetherUnbound AetherUnbound merged commit 125c65e into main May 13, 2024
55 checks passed
@AetherUnbound AetherUnbound deleted the feature/api-text-fields branch May 13, 2024 15:15