Save cleaned up data during the cleanup step #904

Merged
merged 15 commits into from
Mar 28, 2023

Conversation

obulat
Contributor

@obulat obulat commented Mar 13, 2023

Fixes

Fixes #861 by @krysal
Fixes #654 by @obulat

Description

This PR adds more logging for the data refresh, but its main goal is to be a proof of concept for saving the cleaned data during the weekly data refresh, as a preparation step for data normalization.

Data refresh image cleanup steps:

  • Add http or https protocol to URLs that don't have a scheme in "url", "creator_url", "foreign_landing_url" fields
  • Clean up tags by removing:
    • tags that contain anything from the TAG_CONTAINS_DENYLIST set
    • AI-generated tags ("provider": "clarifai") with confidence level below TAG_MIN_CONFIDENCE = 0.90
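The two tag filters above can be sketched roughly as follows. This is a minimal illustration, not the code in `cleanup.py`: the denylist entries, the `name` and `accuracy` key names, and the `clean_tags` helper are assumptions made for the example. Only `TAG_CONTAINS_DENYLIST`, `TAG_MIN_CONFIDENCE = 0.90`, and the `"provider": "clarifai"` marker come from the description.

```python
TAG_MIN_CONFIDENCE = 0.90
# Hypothetical entries; the real denylist is not shown in this PR description.
TAG_CONTAINS_DENYLIST = {"flickriosapp", "uploaded"}


def clean_tags(tags):
    """Return the surviving tags, or None when nothing had to be removed."""
    if not tags:
        return None
    kept = []
    for tag in tags:
        name = str(tag.get("name", "")).lower()
        # Drop tags whose name contains any denylisted fragment
        if any(denied in name for denied in TAG_CONTAINS_DENYLIST):
            continue
        # Drop machine-generated tags below the confidence threshold
        # ("accuracy" as the key name is an assumption)
        accuracy = tag.get("accuracy")
        if (
            tag.get("provider") == "clarifai"
            and accuracy is not None
            and accuracy < TAG_MIN_CONFIDENCE
        ):
            continue
        kept.append(tag)
    return kept if len(kept) != len(tags) else None
```

Returning `None` for unchanged rows is one way to let the caller skip both the database update and the TSV row for records that were already clean.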

This PR also adds a Wikimedia title cleanup step that removes the File: prefix and the file extension suffix from the image title. This step was added because in the Openverse Inserter PR it was specifically pointed out that those titles are bad for UX. The Wikimedia title cleanup step can be added later, during the second run of this PR in prod.
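As an illustration of the title cleanup, here is a minimal sketch. The `clean_wikimedia_title` name and the set of recognized extensions are assumptions for the example; the PR's actual implementation may detect extensions differently.

```python
import os

# Assumed set of image extensions; not taken from the PR.
KNOWN_EXTENSIONS = {"jpg", "jpeg", "png", "gif", "svg", "tif", "tiff", "webp"}


def clean_wikimedia_title(title):
    """Strip a leading 'File:' prefix and a trailing known file extension."""
    cleaned = title
    if cleaned.startswith("File:"):
        cleaned = cleaned[len("File:"):]
    stem, ext = os.path.splitext(cleaned)
    # Only strip suffixes that look like real image extensions, so that
    # titles such as "Version 1.25" are left alone
    if ext.lstrip(".").lower() in KNOWN_EXTENSIONS:
        cleaned = stem
    return cleaned.strip()
```

Checking against a known set is a deliberately conservative choice: it avoids truncating titles that merely contain a dot, at the cost of missing unusual extensions.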

There is also a step that we need to add to the cleanup process for incorrect utf-8 tags, but I think we should add it in a later refresh (gist with the implementation) so that the cleanup step does not become much longer.

This PR saves one file per cleaned field in a tsv format. The files contain the image identifier and the cleaned data. I don't know where the best place to save them is - all suggestions welcome!
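A minimal sketch of the one-file-per-field layout, assuming the cleaned values arrive as a mapping from field name to `(identifier, cleaned_value)` pairs. The `save_cleaned_data` name, the `<field>.tsv` file naming, and the append-per-batch behavior are assumptions for illustration.

```python
import csv
from pathlib import Path


def save_cleaned_data(cleaned, out_dir):
    """Write one TSV per field; `cleaned` maps field -> [(identifier, value), ...]."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for field, rows in cleaned.items():
        path = out_dir / f"{field}.tsv"
        # Append so that successive batches accumulate in the same file
        with path.open("a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerows(rows)
        written.append(path)
    return written
```

Using the `csv` module rather than joining strings by hand means values that contain tabs or newlines are quoted instead of silently breaking the row structure.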

Testing Instructions

Replace sample_data/sample_images.csv with the file in this gist (https://gist.github.com/obulat/b31e43b131352b8f6cd66a2dd87061d8), and run just recreate (or just start -> just init, if you haven't run the API before). You should see the tsv files recreated, along with logging about the cleaned fields:

2023-02-05 10:11:52 2023-02-05 07:11:52,223 INFO cleanup.py:276 - Finished saving cleaned data in 0.059366464614868164
2023-02-05 10:11:52 2023-02-05 07:11:52,223 INFO cleanup.py:353 - Batch finished, records/s: cleanup_rate=229.1038788216563
2023-02-05 10:11:52 2023-02-05 07:11:52,223 INFO cleanup.py:354 - Fetching next batch. Records cleaned so far: 1899, counts: {'tags': 320, 'url': 2, 'creator_url': 404, 'foreign_landing_url': 294, 'title': 224}
2023-02-05 10:11:52 2023-02-05 07:11:52,238 INFO cleanup.py:362 - Cleaned all records in 8.332297325134277 seconds, counts: {'tags': 320, 'url': 2, 'creator_url': 404, 'foreign_landing_url': 294, 'title': 224}

Also, check the logs about updated fields/values and TLS_CACHE.

Checklist

  • My pull request has a descriptive title (not a vague title like
    Update index.md).
  • My pull request targets the default branch of the repository (main) or
    a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible
    errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@obulat obulat requested a review from a team as a code owner March 13, 2023 16:58
@obulat obulat requested review from krysal and sarayourfriend March 13, 2023 16:58
@obulat obulat self-assigned this Mar 13, 2023
@obulat obulat added 🟧 priority: high Stalls work on the project or its dependents 🤖 aspect: dx Concerns developers' experience with the codebase 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels Mar 13, 2023
@github-actions github-actions bot added the 🧱 stack: ingestion server Related to the ingestion/data refresh server label Mar 13, 2023
@github-actions

github-actions bot commented Mar 13, 2023

Full-stack documentation: Ready

https://WordPress.github.io/openverse/_preview/904

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

Collaborator

@sarayourfriend sarayourfriend left a comment

This appears to have worked perfectly for me locally using the sample images file you shared.

Should we switch to using that sample images file permanently to ensure this feature is tested on a regular basis?

Here are the sample output logs I found to show it working:

openverse-ingestion_server-1  | 2023-03-13 22:32:11,922 INFO cleanup.py:258 - TLS cache: {'www.flickr.com': True, 'commons.wikimedia.org': True, 'https://www.eol.org/': True, '.geograph.org.uk': True, '.eol.org': True, '.digitaltmuseum.org': True, 'www.geograph.org.uk': True, 'www.eol.org': True}
openverse-ingestion_server-1  | 2023-03-13 22:32:11,922 INFO cleanup.py:259 - Worker committing changes...
openverse-ingestion_server-1  | 2023-03-13 22:32:11,923 INFO cleanup.py:265 - Worker finished batch in 3.2522239685058594
openverse-ingestion_server-1  | 2023-03-13 22:32:14,006 INFO cleanup.py:200 - https://musee-mccord.qc.ca/ObjView/M965.199.10008.jpg:403
openverse-ingestion_server-1  | 2023-03-13 22:32:14,006 INFO cleanup.py:103 - Tested domain .musee-mccord.qc.ca
openverse-ingestion_server-1  | 2023-03-13 22:32:14,006 INFO cleanup.py:243 - Updated url for 74454cfd-489d-4c7a-bdda-d7eef06d6d2b from '{dirty_value}' to '{clean}'
openverse-ingestion_server-1  | 2023-03-13 22:32:14,007 INFO cleanup.py:200 - https://musee-mccord.qc.ca/ObjView/5344.jpg:403
openverse-ingestion_server-1  | 2023-03-13 22:32:14,007 INFO cleanup.py:103 - Tested domain .musee-mccord.qc.ca

Are there any useful unit tests for us to add for this change? I'm not requesting changes for it because I'm not certain that testing log output is 100% necessary. However, if we're going to rely on it for analysis or something else like that, it might be good to add a unit test to reinforce the expected format.
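For reference, a test of that kind could look roughly like the sketch below. It is not taken from the PR: the `cleanup_sketch` logger name and the `log_update` helper are hypothetical stand-ins for whatever `cleanup.py` actually does. The sample output above includes a line with literal '{dirty_value}' and '{clean}' placeholders, which is the sort of formatting slip such a test would surface.

```python
import logging
import unittest

# Hypothetical logger name, used only for this sketch
log = logging.getLogger("cleanup_sketch")


def log_update(field, identifier, dirty_value, clean):
    """Stand-in for the logging call that reports an updated field."""
    log.info(f"Updated {field} for {identifier} from '{dirty_value}' to '{clean}'")


class LogFormatTest(unittest.TestCase):
    def test_values_are_interpolated(self):
        with self.assertLogs("cleanup_sketch", level="INFO") as captured:
            log_update("url", "abc-123", "example.com/a.jpg", "https://example.com/a.jpg")
        # assertLogs renders records as "LEVEL:logger_name:message"
        self.assertEqual(
            captured.output,
            [
                "INFO:cleanup_sketch:Updated url for abc-123 "
                "from 'example.com/a.jpg' to 'https://example.com/a.jpg'"
            ],
        )
```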

Comment on lines 53 to 63
# We know that flickr and wikimedia support TLS, so we can add them here
TLS_CACHE = {
    "www.flickr.com": True,
    "commons.wikimedia.org": True,
    "https://www.eol.org/": True,
    ".geograph.org.uk": True,
    ".eol.org": True,
    ".digitaltmuseum.org": True,
    "www.geograph.org.uk": True,
}
Collaborator

How did the others get added? Are they similar to Flickr and Wikimedia in that we just know that they do support TLS?

If that's the case, would it be worth manually testing providers for this and adding them to the list (understanding how tedious that is)? Or, is it something we need to monitor/update over time due to the potential for this status to change (I suppose, most likely, that someone starts to support it that previously didn't)?

Would moving this into Redis make sense at all (as an entirely separate issue) or is there a different future change that would persist this TLS support status?

Contributor Author

I've added these manually, by looking through the logs and adding the ones that were being tested. Your logs suggest that .musee-mccord.qc.ca should have also been added :)

> If that's the case, would it be worth manually testing providers for this and adding them to the list (understanding how tedious that is)? Or, is it something we need to monitor/update over time due to the potential for this status to change (I suppose, most likely, that someone starts to support it that previously didn't)?

In short, I think that the TLS support check, the way it's done right now, should go away after we finish the cleanup step (1-2 refreshes to get the updated TSV and 1-2 update DAG runs could be enough).

We do not test URLs that already have the insecure http scheme for TLS support. The main reason we were testing for TLS support was to add the best scheme to URLs that don't have one (to convert URLs like www.flickr.com/image/path to https://www.flickr.com/image/path). There are not many such rows in the database, mainly the ones that were ingested before the ImageStore improvements in the catalog and that were also not re-ingested. When we use the TSV from the cleanup run to update the catalog, all of the URLs will have a scheme, whether it is http or https, so the URL cleanup function will only parse the URL and will not run the TLS support checks, because the scheme will not be "".

https://github.com/WordPress/openverse/blob/2646c5ead465603b42c70f58a190f7b50861d698/ingestion_server/ingestion_server/cleanup.py#L83-84
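The behavior described above can be sketched as follows. This is an illustration under stated assumptions rather than the linked code: `clean_url` and the injected `supports_tls` callable are hypothetical, and the real function performs its own network check.

```python
from urllib.parse import urlparse

# Two of the pre-seeded entries from the cache discussed in this thread
TLS_CACHE = {"www.flickr.com": True, "commons.wikimedia.org": True}


def clean_url(url, tls_cache, supports_tls):
    """Return the URL with a scheme added, or None if no change is needed."""
    if not url:
        return None
    # URLs that already carry a scheme are left alone: no TLS check is run
    if urlparse(url).scheme != "":
        return None
    # Without a scheme, urlparse treats the host as part of the path,
    # so prefix "//" to make it parse as a network location
    domain = urlparse(f"//{url}").netloc
    if domain not in tls_cache:
        # supports_tls is a caller-supplied check (hypothetical here)
        tls_cache[domain] = supports_tls(domain)
    scheme = "https" if tls_cache[domain] else "http"
    return f"{scheme}://{url}"
```

Passing the TLS check in as a callable keeps the sketch free of network access and makes the caching behavior easy to exercise in a test.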

Do you think we should monitor for TLS support status? If so, I think this should be a separate issue. We could add a cleanup function to test domains with http for TLS support and report all of them, and then test them and update the URLs if the support changes.

Collaborator

Thanks for the explanation and motivation for this feature.

> Do you think we should monitor for TLS support status?

If we're hand-maintaining the list then yes, I think we should revisit it periodically or else the cleanup step here will apply the incorrect transformations.

I agree that that is out of scope of this issue, sort of, but if we're expanding the list of sites we're automatically applying the transformation to, then it does make the matter slightly more pressing as the area of effect is slightly wider. The added providers are small though, I think, so it's negligible. In any case, I agree it's a separate issue. I wanted to mention it in case it needs to be explicitly documented as such in a GitHub issue.

If it's something that will go away soon though, due to some other mechanism that will render this step unnecessary, then we can ignore it altogether and just document in the code that it's a temporary hold-over.

@sarayourfriend
Collaborator

Oh, one requested change. I just noticed my git state was messy after reviewing this PR. Can we add the files to gitignore so they don't appear locally as changes?

@obulat
Contributor Author

obulat commented Mar 15, 2023

> Should we switch to using that sample images file permanently to ensure this feature is tested on a regular basis?

I'm not sure. The end goal for these changes is to remove the cleanup step. Or at least to remove the functions that we can remove. So, optimally, we would not have any data in the catalog that does not have a scheme in the URLs, and tags that are denylisted or badly-formed. Then, this sample data would be wrong. Does it make sense to add these rows to sample data until we update the catalog, and remove it after we're done?

Member

@krysal krysal left a comment

Perfect! Then, do you have a plan for the produced files here?

@obulat
Contributor Author

obulat commented Mar 16, 2023

> Perfect! Then, do you have a plan for the produced files here?

No, actually that's what I need help with. What's the best way of getting these TSV files from the Ingestion server to somewhere where the catalog can use them? I assume we should upload them to S3. Would it be more practical to somehow do it manually to avoid managing secrets here? @krysal ? @AetherUnbound ?

@sarayourfriend
Collaborator

> Would it be more practical to somehow do it manually to avoid managing secrets here?

We might be able to set the permissions of the EC2 boxes to allow them to upload to a specific S3 bucket without needing credentials (I think). https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-access-s3-bucket/

@github-actions github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: frontend Related to the Nuxt frontend labels Mar 20, 2023
@obulat obulat force-pushed the add/save_cleanup_info branch from 46078ce to 2e1dce9 Compare March 20, 2023 15:41
@obulat obulat requested review from a team as code owners March 20, 2023 15:41
@obulat obulat requested a review from zackkrida March 20, 2023 15:41
@obulat obulat force-pushed the add/save_cleanup_info branch from 2e1dce9 to 80798be Compare March 20, 2023 15:42
@obulat
Contributor Author

obulat commented Mar 20, 2023

I am going to merge this PR as is, and we can download the files from the box and upload them to S3 manually.

Hopefully, we will remove this process soon after we clean up the data, so it is an acceptable solution for that.

@sarayourfriend
Collaborator

@obulat Sounds good. The unit tests for the ingestion server failed; I've re-run them to see if they pass.

Is there an issue for tracking the follow-up work?

@obulat obulat removed 🧱 stack: api Related to the Django API 🧱 stack: frontend Related to the Nuxt frontend labels Mar 21, 2023
@obulat obulat force-pushed the add/save_cleanup_info branch from 97d57f3 to ccdd6bb Compare March 21, 2023 12:42
@github-actions

Size Change: 0 B

Total Size: 882 kB


@WordPress WordPress deleted a comment from github-actions bot Mar 21, 2023
@AetherUnbound
Collaborator

This can certainly be merged as-is, but I think it would be pretty straightforward to upload the files to S3! As Sara mentions, I don't think it'd require any explicit permissions management, perhaps besides some IAM/role changes. We could even make a new data-refresh bucket specifically for this kind of thing.

@obulat obulat force-pushed the add/save_cleanup_info branch 2 times, most recently from edfed85 to d00abf9 Compare March 27, 2023 09:00
@obulat obulat force-pushed the add/save_cleanup_info branch from c434b80 to bb277b8 Compare March 28, 2023 18:24
@obulat obulat merged commit fd199b9 into main Mar 28, 2023
@obulat obulat deleted the add/save_cleanup_info branch March 28, 2023 18:37
dhruvkb pushed a commit that referenced this pull request Apr 14, 2023
* Add first pass a db snapshot rotation DAG

* Add unit tests

* Fix DAG documentation

* Add db snapshots DAG to parsing test

* Add missing attributes to DAG

* Fix DAG_ID

Co-authored-by: Madison Swain-Bowden <[email protected]>

* Fix template variable

* Remove redundant parameter

* Update openverse_catalog/dags/maintenance/rotate_db_snapshots.py

Co-authored-by: Madison Swain-Bowden <[email protected]>

* Use Airflow template strings to get variables

Co-authored-by: Madison Swain-Bowden <[email protected]>

* Fix dag name

* Sort describe snapshots return value (just to make sure)

Also fixes the usage of `describe_db_snapshots` to retrieve the actual
list of snapshots on the pagination object.

* Lint generated DAG file

Co-authored-by: Madison Swain-Bowden <[email protected]>
@obulat obulat mentioned this pull request Apr 28, 2023
@krysal krysal mentioned this pull request Feb 22, 2024
Labels
🤖 aspect: dx Concerns developers' experience with the codebase 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API 🧱 stack: frontend Related to the Nuxt frontend 🧱 stack: ingestion server Related to the ingestion/data refresh server
Projects
None yet
4 participants