Science museum script needs to implement get_should_continue to stop ingestion #1363

Closed
1 task
obulat opened this issue Nov 10, 2022 · 2 comments · Fixed by WordPress/openverse-catalog#868 or WordPress/openverse-catalog#905
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix good first issue New-contributor friendly help wanted Open to participation from the community 🟧 priority: high Stalls work on the project or its dependents

obulat commented Nov 10, 2022

Description

Ingestion for the Science Museum provider does not stop when the last page is reached. The script goes on to request the next, non-existent page, the API answers with a 400 response, and the ingestion fails with an error.

Reproduction

  1. Run an API request for page 50:
curl -X GET -H 'Accept: application/json' -v -i 'https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page%5Bsize%5D=100&page%5Bnumber%5D=50&date%5Bfrom%5D=1750&date%5Bto%5D=1825'
  2. Run a new API request, updating the URL's `page[number]` parameter to 51:
curl -X GET -H 'Accept: application/json' -v -i 'https://collection.sciencemuseumgroup.org.uk/search/?has_image=1&image_license=CC&page%5Bsize%5D=100&page%5Bnumber%5D=51&date%5Bfrom%5D=1750&date%5Bto%5D=1825'
  3. See the error: the second request returns a 400 response.
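For anyone who prefers to script the reproduction, here is a minimal sketch of how the two URLs above can be rebuilt in Python. The helper name `build_search_url` is hypothetical and not part of the provider script; only the base URL and the query parameters are taken from the curl commands in this issue.

```python
from urllib.parse import urlencode

# Base URL copied from the curl commands in the reproduction steps.
BASE_URL = "https://collection.sciencemuseumgroup.org.uk/search/"


def build_search_url(page_number: int) -> str:
    """Hypothetical helper: rebuild the search URL for a given page number."""
    params = {
        "has_image": 1,
        "image_license": "CC",
        "page[size]": 100,
        "page[number]": page_number,
        "date[from]": 1750,
        "date[to]": 1825,
    }
    # urlencode percent-encodes the square brackets as %5B and %5D,
    # which matches the encoded form used in the curl commands.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Page 50 is the last page in this report; page 51 is the failing one.
    print(build_search_url(50))
    print(build_search_url(51))
```

Only the `page[number]` value differs between the two URLs, so the 400 comes from the page number alone.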

Solution

Each JSON response contains a `links` dictionary with a `next` property. We should stop the ingestion when `next` is null by overriding the `get_should_continue` method:

def get_should_continue(self, response_json):
    # Keep ingesting only while the API reports a next page.
    return response_json.get('links', {}).get('next') is not None
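To illustrate the intended behavior, here is a standalone sketch that runs the same check against simulated responses. Everything apart from the `links`/`next` check is an assumption made for the demo: `fake_fetch` and `ingest` are hypothetical stand-ins rather than the provider script's real request loop, the response shape is simplified, and numbering pages from 1 is a guess. The check is written as a plain function and also tolerates `links` being null, which the proposal above does not cover.

```python
from typing import Any, List


def get_should_continue(response_json: dict) -> bool:
    """Standalone version of the proposed check (no class, for the demo)."""
    # `or {}` also covers a response where "links" is present but null.
    links = response_json.get("links") or {}
    return links.get("next") is not None


def fake_fetch(page: int, last_page: int) -> dict:
    """Hypothetical stand-in for the API: "next" is None on the last page."""
    next_link: Any = f"?page[number]={page + 1}" if page < last_page else None
    return {"data": [f"record-{page}"], "links": {"next": next_link}}


def ingest(last_page: int) -> List[int]:
    """Request pages until get_should_continue returns False."""
    requested = []
    page = 1  # assumption: the first page is numbered 1
    while True:
        response_json = fake_fetch(page, last_page)
        requested.append(page)
        if not get_should_continue(response_json):
            break
        page += 1
    return requested


if __name__ == "__main__":
    # With 3 simulated pages, page 4 is never requested.
    print(ingest(last_page=3))
```

In this simulation the loop stops after the last existing page, so the request that currently produces the 400 is never sent.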

Resolution

  • 🙋 I would be interested in resolving this bug.
@obulat obulat added good first issue New-contributor friendly help wanted Open to participation from the community 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository labels Nov 10, 2022

aqeelat commented Nov 10, 2022

@obulat On it.

@AetherUnbound

We recently re-encountered this issue, so unfortunately it doesn't look like it's resolved 😞
https://airflow.openverse.engineering/log?execution_date=2022-11-01T00%3A00%3A00%2B00%3A00&task_id=ingest_data.pull_image_data&dag_id=science_museum_workflow&map_index=-1
We're looking into why this is happening now, since (preliminarily) it seems like WordPress/openverse-catalog#868 should have addressed it.
