Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move boilerplate required field None checks into ProviderDataIngester #1378

Open
1 task
stacimc opened this issue Oct 25, 2022 · 0 comments
Open
1 task

Move boilerplate required field None checks into ProviderDataIngester #1378

stacimc opened this issue Oct 25, 2022 · 0 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@stacimc
Copy link
Collaborator

stacimc commented Oct 25, 2022

Current Situation

ProviderDataIngester child classes currently all must implement logic in get_record_data to return early when one of the required fields (foreign_landing_url, image/audio_url, foreign_identifier, license_info) is None. Example:

        if (foreign_identifier := data.get("foreign_id")) is None:
            return None

        if (foreign_landing_url := data.get("url")) is None:
            return None

        # ...and so on

This logic is repeated in each of our provider scripts, causing code duplication. There's also a risk that a new contributor might forget to implement this logic in a new provider script.

Suggested Improvement

We should try to move this logic into the base class. One idea is, instead of having a single abstract get_record_data method, to have an abstract method for each required field as well as a get_optional_fields method. Then get_record_data could contain the shared logic:

@abstractmethod
def get_foreign_identifier(self, data:dict) -> str | int:
    pass

@abstractmethod 
def get_optional_fields(self, data:dict) -> dict:
    pass

def get_record_data(self, data: dict) -> dict | list[dict] | None:
    if (foreign_identifier := self.get_foreign_identifier(data)) is None:
        return None

    # and so on for other required fields...

    optional_fields = self.get_optional_fields(data)

    return {
       "foreign_identifier": foreign_identifier,
       # etc...
       **optional_fields
    }

^ Just a starting point. For example, this does not support returning lists yet.

See the implementation of checks in Europeana for another idea: WordPress/openverse-catalog#821 (comment) using decorators. This approach is a bit cleaner especially in the event that the required fields list expands, and it also adds extra logging.

Some things we need to keep in mind:

  • We still need to support returning a list of data from get_record_data. See SMK for an example of a complex use case
  • Changes should ideally be backwards compatible with currently refactored providers so they don't need to be updated all at once. Consider refactoring Europeana as part of this ticket.
  • We should update the template as part of this work

Benefit

  • Reduce code duplication
  • Makes it clearer which fields are required without relying on docs
  • Ensures new providers will not miss this functionality

Additional context

This was discussed on WordPress/openverse-catalog#821 in this comment thread.

Implementation

  • 🙋 I would be interested in implementing this feature.
@stacimc stacimc added 🟩 priority: low Low priority and doesn't need to be rushed ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Oct 25, 2022
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Status: 📋 Backlog
Development

No branches or pull requests

2 participants