Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document the use of each column and guideline for selection from sources #1410

Closed
2 tasks
krysal opened this issue Oct 7, 2022 · 5 comments
Closed
2 tasks
Labels
📄 aspect: text Concerns the textual material in the repository 🌟 goal: addition Addition of new feature 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@krysal
Copy link
Member

krysal commented Oct 7, 2022

Current Situation

As part of the Data Normalization project and per recent feedback from new contributors, we need to have a clear understanding of what is the meaning and use of each database column. This indeed to add a broad description given that each provider has particular particularities that should be documented in its provider script or doc describing the particular script.

Image table
+-------------------------+--------------------------+--------------------------------------+
| Column                  | Type                     | Modifiers                            |
|-------------------------+--------------------------+--------------------------------------|
| identifier              | uuid                     |  not null default uuid_generate_v4() |
| created_on              | timestamp with time zone |  not null                            |
| updated_on              | timestamp with time zone |  not null                            |
| ingestion_type          | character varying(80)    |                                      |
| provider                | character varying(80)    |                                      |
| source                  | character varying(80)    |                                      |
| foreign_identifier      | character varying(3000)  |                                      |
| foreign_landing_url     | character varying(1000)  |                                      |
| url                     | character varying(3000)  |  not null                            |
| thumbnail               | character varying(3000)  |                                      |
| width                   | integer                  |                                      |
| height                  | integer                  |                                      |
| filesize                | integer                  |                                      |
| license                 | character varying(50)    |  not null                            |
| license_version         | character varying(25)    |                                      |
| creator                 | character varying(2000)  |                                      |
| creator_url             | character varying(2000)  |                                      |
| title                   | character varying(5000)  |                                      |
| meta_data               | jsonb                    |                                      |
| tags                    | jsonb                    |                                      |
| watermarked             | boolean                  |                                      |
| last_synced_with_source | timestamp with time zone |                                      |
| removed_from_source     | boolean                  |  not null                            |
| filetype                | character varying(5)     |                                      |
| category                | character varying(80)    |                                      |
+-------------------------+--------------------------+--------------------------------------+
Audio table
+-------------------------+--------------------------+--------------------------------------+
| Column                  | Type                     | Modifiers                            |
|-------------------------+--------------------------+--------------------------------------|
| identifier              | uuid                     |  not null default uuid_generate_v4() |
| created_on              | timestamp with time zone |  not null                            |
| updated_on              | timestamp with time zone |  not null                            |
| ingestion_type          | character varying(80)    |                                      |
| provider                | character varying(80)    |                                      |
| source                  | character varying(80)    |                                      |
| foreign_identifier      | character varying(3000)  |                                      |
| foreign_landing_url     | character varying(1000)  |                                      |
| url                     | character varying(3000)  |  not null                            |
| thumbnail               | character varying(3000)  |                                      |
| filetype                | character varying(5)     |                                      |
| duration                | integer                  |                                      |
| bit_rate                | integer                  |                                      |
| sample_rate             | integer                  |                                      |
| category                | character varying(80)    |                                      |
| genres                  | character varying(80)[]  |                                      |
| audio_set               | jsonb                    |                                      |
| set_position            | integer                  |                                      |
| alt_files               | jsonb                    |                                      |
| filesize                | integer                  |                                      |
| license                 | character varying(50)    |  not null                            |
| license_version         | character varying(25)    |                                      |
| creator                 | character varying(2000)  |                                      |
| creator_url             | character varying(2000)  |                                      |
| title                   | character varying(5000)  |                                      |
| meta_data               | jsonb                    |                                      |
| tags                    | jsonb                    |                                      |
| watermarked             | boolean                  |                                      |
| last_synced_with_source | timestamp with time zone |                                      |
| removed_from_source     | boolean                  |  not null                            |
+-------------------------+--------------------------+--------------------------------------+

Suggested Improvement

Document in markdown tables the following information for each media db table (currently image and audio).

Column Is required? Use
... ... ...

Benefit

After this should be easy to point to newcomers and recall what is the meaning of each table.

Alternatives

We could keep the notes directly on each database table. PostgreSQL supports a COMMENT SQL command that can be applied to columns, constraints, functions, etc. It's not in the SQL standard but many DBMSs support something similar. The con of this approach is that it is more hidden than a simple markdown document, there is no public access to the database and therefore people will have to reach the code to see them anyway. I wonder if this is a common/popular way to document a database.

Additional context

Depends on:

@krysal krysal added 🟧 priority: high Stalls work on the project or its dependents 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository labels Oct 7, 2022
@krysal krysal mentioned this issue Oct 7, 2022
29 tasks
@stacimc
Copy link
Collaborator

stacimc commented Oct 21, 2022

Just adding that a messy first pass of this was introduced in data_models.md in WordPress/openverse-catalog#790 which should be updated for this issue, and that it will be very helpful for whoever takes this ticket to look at some of the comments @obulat made there!

@stacimc
Copy link
Collaborator

stacimc commented Oct 25, 2022

@AetherUnbound also had an excellent reminder that @rwidom has created some really incredible documentation here which should be very helpful!

@rwidom
Copy link
Collaborator

rwidom commented Jan 2, 2023

I was just remembering that @AetherUnbound and @stacimc and I had some conversations about how the popularity metrics work that would have been useful for WordPress/openverse-catalog#796, but I can't seem to find the notes now. So I ended up guessing at what metadata field names would be most useful, and feeling a little guilty that I hadn't looked more closely for the field names that the popularity metrics actually use. But now I'm leaving this comment to say "I looked!" But I couldn't find it. :/ Sorry!

@AetherUnbound
Copy link
Collaborator

No worries, thank you for looking! 😄

@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023
@obulat
Copy link
Contributor

obulat commented Apr 4, 2023

Closing this issue as @wordpress/openverse#412 (#412) supercedes it. The progress on this will be followed in 412.

@obulat obulat closed this as completed Apr 4, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@obulat obulat moved this from 📋 Backlog to ✅ Done in Openverse Backlog Apr 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📄 aspect: text Concerns the textual material in the repository 🌟 goal: addition Addition of new feature 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

No branches or pull requests

5 participants