Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add media properties generation #3620

Merged
merged 30 commits into from
Apr 12, 2024
Merged

Conversation

obulat
Copy link
Contributor

@obulat obulat commented Jan 3, 2024

Fixes

Fixes #2186 by @obulat

Description

This PR adds the page with documentation on the media properties: a table with all of the audio and image properties with their database and Python characteristics.

The main part of this PR, though, is the script that parses the SQL definition files and the Python column definitions to create the media property tables, and the catalog/utilities/media_props_gen/media_props.md file that holds the more detailed descriptions of the media properties. This latter markdown file in this PR only contains the list of properties and the headings for descriptions. It will be filled in in the follow-up PR.

A new just catalog/generate-media-props script runs the parsers and markdown generators. If the generated markdown file is different from the file that was saved previously, a warning is issued to run the script again and commit the updated markdown file.

To parse the Python column definitions, this PR uses the ast module.

Testing Instructions

Change some property in the SQL file, the catalog/dags/common/storage/columns.py file or the catalog/utilities/media_props_gen/media_props.md file, and run just catalog/generate-media-props. The documentation/meta/media_properties.md file should reflect the changes, and a warning should appear in the console.

Also, review the media properties table and the full document from the comment below, and see if anything needs to be changed.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@github-actions github-actions bot added 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation labels Jan 3, 2024
@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Jan 3, 2024
@obulat obulat added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Jan 3, 2024
@krysal
Copy link
Member

krysal commented Feb 8, 2024

@obulat, Would you be working on this PR in the following days? Or should it be closed?

@obulat
Copy link
Contributor Author

obulat commented Feb 8, 2024

@obulat, Would you be working on this PR in the following days? Or should it be closed?

Yes, @krysal , I'll be working on it

@obulat obulat force-pushed the to-reopen/add/media_properties_generation branch from d7bbf69 to dd317c9 Compare February 22, 2024 12:42
@obulat obulat force-pushed the to-reopen/add/media_properties_generation branch 5 times, most recently from bebff77 to 8fa0f3a Compare March 4, 2024 07:11
@WordPress WordPress deleted a comment from github-actions bot Mar 4, 2024
@obulat obulat force-pushed the to-reopen/add/media_properties_generation branch 3 times, most recently from 2e30a65 to 6222001 Compare March 6, 2024 11:53
@WordPress WordPress deleted a comment from github-actions bot Mar 6, 2024
@obulat obulat self-assigned this Mar 7, 2024
@obulat obulat marked this pull request as ready for review March 7, 2024 07:41
@obulat obulat requested a review from a team as a code owner March 7, 2024 07:41
@obulat obulat requested review from AetherUnbound and stacimc March 7, 2024 07:41
Copy link
Collaborator

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent start!! I'm really excited to see us make progress on this and I'm loving the final output of the generation. I have a number of structure/code suggestions, as well as potential formatting improvements. I think we will also need tests for most of the utilities that are added here.

Additionally, the media properties themselves feel pretty squished together and difficult to delineate. I think adding a horizontal bar between each property can help visually differentiate the various properties, without affecting the table of contents in any way. Below is some code to accomplish that, and some screenshots of the difference:

diff --git a/catalog/utilities/media_props_gen/generate_media_properties.py b/catalog/utilities/media_props_gen/generate_media_properties.py
index 0505af318..106cb037d 100644
--- a/catalog/utilities/media_props_gen/generate_media_properties.py
+++ b/catalog/utilities/media_props_gen/generate_media_properties.py
@@ -98,7 +98,7 @@ def generate_long_form_doc(markdown_descriptions: dict, media_properties: dict)
         prop_doc = "".join(
             [f"{Md.heading(4, k)}{Md.line(v)}" for k, v in description.items()]
         )
-        media_docs += prop_heading + prop_doc
+        media_docs += prop_heading + prop_doc + Md.horizontal_line
 
     return media_docs
 
diff --git a/catalog/utilities/media_props_gen/md.py b/catalog/utilities/media_props_gen/md.py
index 8010fb011..2fb3a7a09 100644
--- a/catalog/utilities/media_props_gen/md.py
+++ b/catalog/utilities/media_props_gen/md.py
@@ -2,6 +2,9 @@ from pathlib import Path
 
 
 class Md:
+
+    horizontal_line = "\n---\n\n"
+
     @staticmethod
     def heading(level: int, text: str) -> str:
         """Add a heading to a markdown string."""

Before
image
After
image

catalog/justfile Outdated
Comment on lines 146 to 147
# Generate the DAG documentation
generate-media-props fail_on_diff="true":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the workflow is almost exactly the same with only the files differing, I think we should create a _generate_markdown recipe with the more generic steps that then runs the shared logic for this and the above recipe.

Copy link
Contributor Author

@obulat obulat Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried moving this workflow into docker, but then realized that we don't mount the local .sql files that we parse to generate the db docs into the container. Here's the error I got:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/airflow/docker/upstream_db/0006_openledger_audio_schema.sql'

Is it possible to open them from the host, @AetherUnbound ? Seems like this would make the setup unnecessarily complex.

If I keep the code running locally as it is here now, then I won't be able to import media type constants. Even if I do manipulate the path somehow to make the constant importable, those files contain imports from airflow, which throw new errors.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohhhh dang, I didn't even think about having to mount those files 😮 We can actually mount those files relatively easily in the recipe at least, it would just require modifying the just ../run command here:

openverse/catalog/justfile

Lines 127 to 129 in 78d799c

just ../run \
{{ SERVICE }} \
"bash -c 'python catalog/utilities/dag_doc_gen/dag_doc_generation.py && chmod 666 /opt/airflow/catalog/utilities/dag_doc_gen/DAGs.md'"

Something similar to the way the tests are mounted (note the --volume here):

openverse/catalog/justfile

Lines 101 to 108 in 78d799c

_mount-test command: up-deps
env DC_USER="airflow" just ../run \
--rm \
-e AIRFLOW_VAR_INGESTION_LIMIT=1000000 \
-w /opt/airflow/catalog \
--volume {{ justfile_directory() }}/../docker:/opt/airflow/docker/ \
{{ SERVICE }} \
{{ command }}

Even though it's not needed for the DAG markdown generation, it can be done for all the generation recipes anyway!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I'll try that.
Not related to this PR, but is the process of keeping those files up-to-date automated?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not! I believe that's the purview of #416 😄

catalog/utilities/media_props_gen/preamble.md Outdated Show resolved Hide resolved
catalog/utilities/media_props_gen/column_parser.py Outdated Show resolved Hide resolved
catalog/utilities/media_props_gen/column_parser.py Outdated Show resolved Hide resolved
catalog/utilities/media_props_gen/column_parser.py Outdated Show resolved Hide resolved

PREAMBLE = open(Path(__file__).parent / "preamble.md").read()

MEDIA_TYPES: list[MediaType] = ["audio", "image"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note about constants here:

MEDIA_TYPES = [AUDIO, IMAGE]

catalog/utilities/media_props_gen/md.py Outdated Show resolved Hide resolved
catalog/utilities/media_props_gen/md.py Outdated Show resolved Hide resolved
@obulat obulat marked this pull request as draft March 11, 2024 15:37
@obulat obulat force-pushed the to-reopen/add/media_properties_generation branch from 6222001 to 8d685d2 Compare March 12, 2024 03:33
@WordPress WordPress deleted a comment from github-actions bot Mar 12, 2024
Copy link

github-actions bot commented Mar 12, 2024

Full-stack documentation: https://docs.openverse.org/_preview/3620

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

New files ➕:

Changed files 🔄:

obulat and others added 26 commits April 12, 2024 08:25
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
Co-authored-by: Madison Swain-Bowden <[email protected]>
Signed-off-by: Olga Bulat <[email protected]>
@obulat obulat force-pushed the to-reopen/add/media_properties_generation branch from 1fca405 to 91bb537 Compare April 12, 2024 05:26
@obulat obulat merged commit 5041280 into main Apr 12, 2024
46 checks passed
@obulat obulat deleted the to-reopen/add/media_properties_generation branch April 12, 2024 05:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 📄 aspect: text Concerns the textual material in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: documentation Related to Sphinx documentation
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Create a script to automatically generate media properties
5 participants