Europeana script should collect the creator #2834

Open
obulat opened this issue Aug 15, 2023 · 2 comments · May be fixed by #5057
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature good first issue New-contributor friendly help wanted Open to participation from the community 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@obulat
Contributor

obulat commented Aug 15, 2023

Problem

The Europeana script does not collect the creator data.

Description

Europeana is an aggregator of high-quality media from many European GLAM institutions, and it is important for search relevancy to collect all of the relevant data. The creator is currently not collected at all.

The creators are available in the dcCreator field within the returned data, here's an example with our sample data:

Here's what it looks like within real returned data:

{'completeness': 8,
 'country': ['Hungary'],
 'dataProvider': ['Mezőkövesd Public Treasures Collection Nonprofit Ltd.'],
 'dcCreator': ['Hajdu Ráfis János Mezőgazdasági Gépmúzeum Mezőkövesd'],
 'dcCreatorLangAware': {'def': ['Hajdu Ráfis János Mezőgazdasági Gépmúzeum Mezőkövesd']},
 'dcDescription': ['MÁV Gépgyár (Budapest)                             Állapota: keresztfeje, vízszivattyúja, hajtórúdjai hiányoznak. Füstkamrája elrozsdásodott. Kiállításra alkalmassá tehető.'],
 'dcDescriptionLangAware': {'hu': ['MÁV Gépgyár (Budapest)                             Állapota: keresztfeje, vízszivattyúja, hajtórúdjai hiányoznak. Füstkamrája elrozsdásodott. Kiállításra alkalmassá tehető.']},
 'dcLanguage': ['zxx', 'ZXX'],
 'dcLanguageLangAware': {'def': ['zxx', 'ZXX']},
 'dcSubjectLangAware': {'hu': ['Agráltörténet', 'színes kép']},
 'dcTitleLangAware': {'hu': ['MÁV Gépgyár  által készített gőzüzemű lokomobil']},
 'dcTypeLangAware': {'def': ['kép']},
 'dctermsSpatial': ['Mezőkövesd'],
 'edmDatasetName': ['2048128_Ag_HU_MaNDA_OAI'],
 'edmIsShownAt': ['https://mandadb.hu/tetel/99947/MAV_Gepgyar__altal_keszitett_gozuzemu_lokomobil'],
 'edmIsShownBy': ['https://mandadb.hu/common/file-servlet/document/175231/default/doc_url/200021_3.JPG'],
 'edmPreview': ['https://api.europeana.eu/thumbnail/v2/url.json?uri=https%3A%2F%2Fmandadb.hu%2Fmandadb%2Fwebimage%2F9%2F5%2F6%2F5%2F3%2Fwimage%2F200021_2.jpg&type=IMAGE'],
 'europeanaCollectionName': ['2048128_Ag_HU_MaNDA_OAI'],
 'europeanaCompleteness': 8,
 'guid': 'https://www.europeana.eu/item/2048128/99947?utm_source=api&utm_medium=api&utm_campaign=dialialika',
 'id': '/2048128/99947',
 'index': 0,
 'language': ['hu'],
 'link': 'https://api.europeana.eu/record/2048128/99947.json?wskey=dialialika',
 'organizations': ['http://data.europeana.eu/organization/1482250000004509149',
  'http://data.europeana.eu/organization/1482250000003772998'],
 'previewNoDistribute': False,
 'provider': ['Forum Hungaricum Non-profit Ltd.'],
 'rights': ['http://creativecommons.org/licenses/by-nc-nd/4.0/'],
 'score': 1.0,
 'timestamp': 1716914248926,
 'timestamp_created': '2024-05-28T07:50:08.374Z',
 'timestamp_created_epoch': 1716882608374,
 'timestamp_update': '2024-05-28T07:50:08.374Z',
 'timestamp_update_epoch': 1716882608374,
 'title': ['MÁV Gépgyár  által készített gőzüzemű lokomobil'],
 'type': 'IMAGE',
 'ugc': [False]}

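It looks like there could be a list of creators, so we should probably join them together with commas (e.g. ", ".join(item_data.get("dcCreator", []))). The default should be an empty list rather than an empty string, so that the fallback has the same type as the field itself.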
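To make the joining behaviour concrete, here is a small sketch of what a comma join produces for the shapes dcCreator might take. The "Creator A" / "Creator B" values are made up for illustration, and the bare-string case is an assumption: the sample response above only shows a list.

```python
# Value copied from the sample response above: dcCreator is a list of strings.
sample = {"dcCreator": ["Hajdu Ráfis János Mezőgazdasági Gépmúzeum Mezőkövesd"]}

# A single-item list comes back unchanged; several items are comma-separated.
joined_single = ", ".join(sample.get("dcCreator", []))
joined_many = ", ".join(["Creator A", "Creator B"])  # illustrative names only

# A missing key falls back to the empty list and yields "", not None.
joined_missing = ", ".join({}.get("dcCreator", []))

# Pitfall (hypothetical input): str is iterable, so joining a bare string
# inserts the separator between its characters.
joined_string = ", ".join("Ann")

print(joined_single)
print(joined_many)
print(repr(joined_missing))
print(joined_string)
```

The last two cases suggest the new function may want to return None for empty values and pass a bare string through untouched, rather than calling join unconditionally.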

In order to accomplish this, we'll need to modify the Europeana provider ingestion script. We should add a new function to the EuropeanaRecordBuilder class to retrieve this value. A good example to use would be _get_filesize. We'll then need to capture this information in get_record_data, by adding it to the dictionary with the "creator" key here:

"filesize": self._get_filesize(item_data),
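As a rough sketch of the wiring described above, the change could look something like the following. EuropeanaRecordBuilderSketch is a simplified stand-in and not the real EuropeanaRecordBuilder: its _get_filesize is a placeholder, and the real get_record_data returns many more keys than the two shown here.

```python
from typing import Optional


class EuropeanaRecordBuilderSketch:
    """Simplified stand-in used only to illustrate the wiring."""

    def _get_filesize(self, item_data: dict) -> Optional[int]:
        # Placeholder: the real method's logic is not reproduced here.
        return None

    def _get_creator(self, item_data: dict) -> Optional[str]:
        # dcCreator is expected to be a list of names; treat missing, None,
        # and empty values the same way.
        creators = item_data.get("dcCreator")
        if not creators:
            return None
        # Assumption: guard against a bare string so it is not split
        # character by character.
        if isinstance(creators, str):
            return creators
        return ", ".join(creators)

    def get_record_data(self, item_data: dict) -> dict:
        # Only the two keys relevant to this issue are shown.
        return {
            "filesize": self._get_filesize(item_data),
            "creator": self._get_creator(item_data),
        }


builder = EuropeanaRecordBuilderSketch()
record = builder.get_record_data({"dcCreator": ["Creator A", "Creator B"]})
print(record["creator"])
```

Keeping the extraction in its own small method, in the same style as _get_filesize, means it can be unit tested without building a full record.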

We'll also need to add tests for this function and update any other Europeana tests that might be affected. An example test for _get_filesize can be found here.

@obulat obulat added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Aug 15, 2023
@AetherUnbound AetherUnbound added good first issue New-contributor friendly help wanted Open to participation from the community labels May 29, 2024
@dryruffian
Contributor

dryruffian commented Oct 20, 2024

Hi @obulat,

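def _get_creator(self, item_data: dict) -> str | None:
    # dcCreator is normally a list of names; missing, None, and empty values
    # all mean there is no creator to record.
    creators = item_data.get("dcCreator", [])
    if not creators:
        return None
    # Pass a bare string through unchanged; otherwise join the list with commas.
    return creators if isinstance(creators, str) else ", ".join(creators)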

Would adding this function to the EuropeanaRecordBuilder class in the europeana.py file work? If so, could you assign this issue to me? I am also working on writing tests.

@dryruffian
Contributor

@pytest.mark.parametrize(
    "item_data, expected",
    [
        # Single creator in a list
        pytest.param({"dcCreator": ["Chandler"]}, "Chandler", id="single_creator"),
        # Multiple creators in a list
        pytest.param(
            {"dcCreator": ["Chandler", "Joey"]},
            "Chandler, Joey",
            id="multiple_creators",
        ),
        # Empty creator list
        pytest.param({"dcCreator": []}, None, id="empty_creator_list"),
        # Missing dcCreator key
        pytest.param({}, None, id="no_dcCreator"),
        # dcCreator is a string
        pytest.param({"dcCreator": "Chandler"}, "Chandler", id="dcCreator_string"),
        # dcCreator is None
        pytest.param({"dcCreator": None}, None, id="dcCreator_none"),
        # Empty string in creator list
        pytest.param({"dcCreator": [""]}, "", id="empty_string_in_list"),
    ],
)
def test_get_creator(item_data, expected, record_builder):
    assert record_builder._get_creator(item_data) == expected

This is a test for the Europeana script. Please let me know if this is fine.

dryruffian added a commit to dryruffian/openverse that referenced this issue Oct 20, 2024
@dryruffian dryruffian linked a pull request Oct 20, 2024 that will close this issue
8 tasks
dryruffian added a commit to dryruffian/openverse that referenced this issue Oct 21, 2024
dryruffian added a commit to dryruffian/openverse that referenced this issue Oct 21, 2024
Projects
Status: 🏗 In Progress

3 participants