Evaluate tools to assess data quality #1331

Open
krysal opened this issue Nov 30, 2022 · 1 comment
Labels
💻 aspect: code Concerns the software code in the repository 🌟 goal: addition Addition of new feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@krysal
Member

krysal commented Nov 30, 2022

Current Situation

In #292, two tools were mentioned that could help us gain insight into the catalog database. Part of the data we keep was ingested before the MediaStore class and its subclasses were created, and those classes perform most of the validation of the data we get from provider APIs.

Suggested Improvement

Potentially start using one of these tools or even both:

Benefit

  • Gain insights into the different fields to make future adjustments to the structure and calibrate early validations
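To make this benefit concrete, here is a minimal sketch of the kind of per-field profiling such tools automate. It is not taken from either tool: it uses only `sqlite3` from the Python standard library, and the `image` table, its columns, and the `profile_columns` helper are hypothetical stand-ins rather than the real catalog schema or code.

```python
import sqlite3


def profile_columns(conn, table, columns):
    """Return row, null, and distinct counts for each listed column."""
    report = {}
    for col in columns:
        # Identifiers cannot be bound as query parameters, so this helper
        # must only be called with trusted table and column names.
        total, nulls, distinct = conn.execute(
            f'SELECT COUNT(*), SUM("{col}" IS NULL), COUNT(DISTINCT "{col}") '
            f'FROM "{table}"'
        ).fetchone()
        # SUM returns NULL on an empty table, so fall back to 0.
        report[col] = {"rows": total, "nulls": nulls or 0, "distinct": distinct}
    return report


# Hypothetical sample data standing in for catalog records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (identifier TEXT, license TEXT, width INTEGER)")
conn.executemany(
    "INSERT INTO image VALUES (?, ?, ?)",
    [("a", "by", 640), ("b", None, 800), ("c", "by", None), ("d", "cc0", None)],
)

report = profile_columns(conn, "image", ["identifier", "license", "width"])
for col, stats in report.items():
    print(col, stats)
```

Counts like these would show which fields in the older data are often empty or have few distinct values, which is the sort of signal that could inform early validations.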
@krysal krysal added 🟩 priority: low Low priority and doesn't need to be rushed 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository data normalization labels Nov 30, 2022
@krysal krysal changed the title from "Evaluate tool to assesst data quality" to "Evaluate tools to assess data quality" Nov 30, 2022
rwidom referenced this issue in WordPress/openverse-catalog Dec 12, 2022
rwidom referenced this issue in WordPress/openverse-catalog Jan 13, 2023
* cleaning and temp table in pg

* sketch of full dag NOT TESTED

* inaturalist dag without tests or reporting (yet)

* complete dag, 25 mill recs in 5.5 hours local test

* Add passwords for s3 testing with new docker

* make temp loading table UNLOGGED to load it faster

* inat with translation 75 million recs in 8 hrs

* using OUTPUT_DIR for API files

* clarify delayed requester vs requester

* DRYer approach to tags TO DO

* comments on taxa transformation

* scientific names not ids for manual translation

* TO DO comment clean-up

* fix name insert syntax

* Merge 'main' into feature/inaturalist-performance

* add clarity on batch limit override

* missing piece of merge from main

* limit to 20 tags per photo

* add option to use alternate dag creation for sql

* adjust tests see issue #898

* slightly faster way to pull medium test sample

* Note another data source for vernacular names

* remove unnecessary test code

* clean and upsert one batch at a time

* log parsing resource doc

* use common.constants.IMAGE instead of MEDIA_TYPE

* add explanation of ancestry joins and taxa tags

* use existing clean_intermediate_table_data

* remove unnecessary env vars from load_to_s3

* declarative doc string for file update check

* update iNaturalist description

* remove message to Staci :)

* use dynamically generated load subtasks

* clarify taxa comments and include languages

* consolidate consolidation code

* add testing for consolidated metrics

* separate ti_mock instances per test

* test get batches

* shorter titles to save space

* add better testing instructions

* dag parameter to manage post-ingestion deletions

* Add kwargs to get_response_json call

* get_media_type can be static method

Co-authored-by: Krystle Salazar <[email protected]>

* link to original inaturalist photo, rather than medium

Co-authored-by: Krystle Salazar <[email protected]>

* prefer creator name over login

* remove unused constants

* add to do for extension cleanup

Co-authored-by: Madison Swain-Bowden <[email protected]>
Co-authored-by: Krystle Salazar <[email protected]>
@rwidom
Collaborator

rwidom commented Feb 4, 2023

Just noticed that Airflow has some native tools for this: https://docs.astronomer.io/learn/airflow-sql-data-quality

@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@dhruvkb dhruvkb added this to the Data normalization milestone Dec 2, 2023
Projects
Status: 📋 Backlog