Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Europeana data ingester can return None in URL #2784

Closed
obulat opened this issue Aug 6, 2023 · 2 comments · Fixed by #3845
Closed

Europeana data ingester can return None in URL #2784

obulat opened this issue Aug 6, 2023 · 2 comments · Fixed by #3845
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@obulat
Copy link
Contributor

obulat commented Aug 6, 2023

Description

Europeana ingestion workflow failed because a None was saved in the URL field, and then the MediaStore class tried to extract the filetype from it.

Reproduction

See the error in Airflow logs:

[2023-08-06, 01:11:48 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: 'NoneType' object has no attribute 'split'
query_params: {"wskey": "[redacted]", "profile": "rich", "reusability": ["open", "restricted"], "sort": ["europeana_id+desc", "timestamp_created+desc"], "rows": "100", "media": "true", "start": 1, "qf": ["TYPE:IMAGE", "provider_aggregation_edm_isShownBy:*"], "query": "timestamp_created:[2016-03-01T00:00:00Z TO 2016-03-02T00:00:00Z]", "cursor": "*"}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/catalog/dags/providers/factory_utils.py", line 55, in pull_media_wrapper
    data = ingester.ingest_records()
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 269, in ingest_records
    raise error from ingestion_error
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 234, in ingest_records
    self.record_count += self.process_batch(batch)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 465, in process_batch
    store.add_item(**record)
  File "/opt/airflow/catalog/dags/common/storage/image.py", line 146, in add_item
    image = self._get_image(**image_data)
  File "/opt/airflow/catalog/dags/common/storage/image.py", line 153, in _get_image
    image_metadata = self.clean_media_metadata(**kwargs)
  File "/opt/airflow/catalog/dags/common/storage/media.py", line 145, in clean_media_metadata
    media_data["filetype"] = self._validate_filetype(
  File "/opt/airflow/catalog/dags/common/storage/media.py", line 315, in _validate_filetype
    filetype = extract_filetype(url, self.media_type)
  File "/opt/airflow/catalog/dags/common/extensions.py", line 9, in extract_filetype
    possible_filetype = url.split(".")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

Additional context

Europeana script uses a raise_if_empty decorator to ensure that only items with valid required data are saved.

@obulat obulat added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Aug 6, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Aug 6, 2023
@obulat obulat added good first issue New-contributor friendly help wanted Open to participation from the community labels Sep 26, 2023
@obulat obulat self-assigned this Sep 26, 2023
@obulat obulat moved this from 📋 Backlog to 📅 To do in Openverse Backlog Sep 26, 2023
@AetherUnbound AetherUnbound moved this from 📅 To do to 📋 Backlog in Openverse Backlog Oct 4, 2023
@obulat obulat removed good first issue New-contributor friendly help wanted Open to participation from the community labels Oct 6, 2023
@obulat
Copy link
Contributor Author

obulat commented Oct 6, 2023

The plan was to update this issue to make it a good first issue. However, I looked into the code in Europeana, and I couldn't find how the url could be None - it should have raised an error and return None as the item.

I think we should refactor the Europeana script to use the pattern used in other scripts: it will be easier to debug.

@obulat obulat removed their assignment Nov 7, 2023
@AetherUnbound AetherUnbound self-assigned this Feb 12, 2024
@AetherUnbound AetherUnbound moved this from 📋 Backlog to 📅 To Do in Openverse Backlog Feb 12, 2024
@AetherUnbound
Copy link
Collaborator

AetherUnbound commented Feb 28, 2024

I'm able to reproduce this locally by using the following additional_query_params:

{"query": "timestamp_created:[2016-03-01T00:00:00Z TO 2016-03-02T00:00:00Z]"}

I added a log for the URL processing step and came across this:

[2024-02-28, 20:58:50 UTC] {logging_mixin.py:188} INFO - group=['https://www.dropbox.com/s/rpv24ftoz01e9s9/Belgium_MAR-BVM-1-384.jpg?raw=1']
[2024-02-28, 20:58:51 UTC] {logging_mixin.py:188} INFO - group=['https://www.dropbox.com/s/rpv24ftoz01e9s9/Belgium_MAR-BVM-1-384.jpg?raw=1']
[2024-02-28, 20:58:54 UTC] {logging_mixin.py:188} INFO - group=['https://www.dropbox.com/s/3tz4bsrmw5jqskj/Belgium_ELB-ULG-BGPHL-Ms0363-015.jpg?raw=1']
[2024-02-28, 20:58:54 UTC] {logging_mixin.py:188} INFO - group=['https://www.dropbox.com/s/3tz4bsrmw5jqskj/Belgium_ELB-ULG-BGPHL-Ms0363-015.jpg?raw=1']
[2024-02-28, 20:58:56 UTC] {logging_mixin.py:188} INFO - group=['L-APC248-https://www.dropbox.com/s/i1pqizm1joof8y1/Belgium_Diptyque%20_MAR-SGP-CO1.jpg?raw=1']
[2024-02-28, 20:58:57 UTC] {logging_mixin.py:188} INFO - group=['L-APC248-https://www.dropbox.com/s/i1pqizm1joof8y1/Belgium_Diptyque%20_MAR-SGP-CO1.jpg?raw=1']
[2024-02-28, 20:58:57 UTC] {media.py:233} INFO - Writing 2 lines from buffer to disk.

Two things to note:

  1. The URLs appear to be prefixed in the last two examples
  2. The URLs appear to be to Dropbox links (!!) that no longer work

So to me it seems like there might be two separate approaches here:

  1. Make Europeana/ingestion robust to having prefixes (or just reject them entirely)
  2. Reject any Europeana URLs that include "dropbox.com" in the domain

@AetherUnbound AetherUnbound moved this from 📅 To Do to 📋 Backlog in Openverse Backlog Feb 28, 2024
@AetherUnbound AetherUnbound moved this from 📋 Backlog to 🏗 In Progress in Openverse Backlog Feb 28, 2024
@openverse-bot openverse-bot moved this from 🏗 In Progress to ✅ Done in Openverse Backlog Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants