Add more fields to harvester 2.0 CKAN API payload to maintain metadata links and collection relationship #4847

Closed
FuhuXia opened this issue Aug 15, 2024 · 3 comments
Labels
bug Software defect or bug H2.0/Harvest-General General Harvesting 2.0 Issues

Comments

@FuhuXia
Member

FuhuXia commented Aug 15, 2024

With harvester 2.0, packages are added to CKAN via API calls and no longer via ckanext-harvest's harvesting activity. We need to reconsider how to maintain the metadata links and collection relationships that were provided by the legacy ckanext-harvest and other related extensions.

Metadata link

This is a block that displays harvest object and harvest source info for each dataset. For it to show up in catalog-next, the API call payload needs to include these three keys and their values in the extras field:

harvest_object_id
harvest_source_id
harvest_source_title

With these keys in place, no change is needed on the CKAN (catalog-next) side to keep the metadata link block and the harvest-source-related facet search. When a user clicks to show the harvest object's original metadata or the harvest source details, we can redirect to the harvester 2.0 Flask app.
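A minimal sketch of what such a payload could look like, assuming the standard CKAN convention that extras is a list of key/value objects. The function name, the dataset fields, and all example values below are hypothetical and only illustrate where the three keys go.

```python
import json


def build_package_payload(name, title, owner_org,
                          harvest_object_id, harvest_source_id,
                          harvest_source_title):
    """Return a package_create-style dict with harvest info in extras.

    Hypothetical helper: only the three extras keys come from this issue.
    """
    return {
        "name": name,
        "title": title,
        "owner_org": owner_org,
        # CKAN stores free-form extras as a list of {"key", "value"} dicts.
        "extras": [
            {"key": "harvest_object_id", "value": harvest_object_id},
            {"key": "harvest_source_id", "value": harvest_source_id},
            {"key": "harvest_source_title", "value": harvest_source_title},
        ],
    }


# Example values are made up for illustration.
payload = build_package_payload(
    name="example-dataset",
    title="Example Dataset",
    owner_org="example-org",
    harvest_object_id="record-0001",
    harvest_source_id="source-0001",
    harvest_source_title="Example Harvest Source",
)
print(json.dumps(payload, indent=2))
```

The resulting dict would be sent as the JSON body of the package create or update API call; the transport is left out here on purpose.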

Collection

There has never been a complete solution for handling DCAT collection relationships in CKAN. For example, harvest errors may occur, and it may take multiple attempts to finish harvesting a datajson with collections. During the initial harvest, the parent check is enforced before a child dataset can be harvested, but in any subsequent reharvest the parent dataset can be deleted, leaving all previously harvested child datasets orphaned.

My suggestions for collection_package_id:
1. Do not use the parent's CKAN id as the collection id. Use the combination of harvest-source-id+identifier (more on this later). This way children can be harvested regardless of whether the parent dataset is present.
2. The parent dataset is not aware of its parenthood. We detect a dataset's parenthood with a Solr query when the dataset detail page is loaded. This means there won't be a collection icon on the dataset listing page; it only shows up on the detail page. This behavior is roughly in sync with DCAT: the parent record is not aware of its parenthood. When all child datasets are gone, the parent record is just a regular record.
3. Use the combination of harvest-source-id+identifier as the collection id, not harvest-source-name+identifier or a hash value of it, making the collection id permanent and searchable. We can split the id into harvest-source-id and identifier and locate the dataset in CKAN search. We can't use the identifier alone since an identifier is only guaranteed to be unique at the harvest source level.
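To make the composite id idea concrete, here is a rough sketch of building and splitting such an id, plus the kind of Solr filter a detail page could use to look for children. The separator, the function names, and the exact Solr field name are assumptions for illustration; the issue only specifies that the id combines harvest-source-id and identifier.

```python
SEPARATOR = "|"  # hypothetical delimiter; the issue does not specify one


def make_collection_id(harvest_source_id: str, identifier: str) -> str:
    """Join harvest source id and dataset identifier into one collection id."""
    if SEPARATOR in harvest_source_id:
        # The source id must not contain the separator, or splitting breaks.
        raise ValueError("harvest_source_id must not contain the separator")
    return f"{harvest_source_id}{SEPARATOR}{identifier}"


def split_collection_id(collection_id: str) -> tuple[str, str]:
    """Split on the first separator only; identifiers may contain it."""
    source_id, sep, identifier = collection_id.partition(SEPARATOR)
    if not sep:
        raise ValueError("not a composite collection id")
    return source_id, identifier


def children_query(collection_id: str) -> str:
    """Build a Solr filter matching datasets that point at this parent.

    Assumes the id is indexed under a field named collection_package_id.
    """
    escaped = collection_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'collection_package_id:"{escaped}"'


cid = make_collection_id("source-0001", "https://example.gov/data|v1")
print(split_collection_id(cid))
print(children_query(cid))
```

Splitting on the first separator keeps the scheme reversible even when the identifier is a URL or contains the delimiter itself, which is why the check is placed on the source id rather than on the identifier.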

10/21/2024 Update: Based on the team discussion, we will not set and pass the ids from the harvesting process. Instead, all information will be handled on the CKAN side, as it is already available there. (Details in #4969)

@FuhuXia FuhuXia added the bug Software defect or bug label Aug 15, 2024
@jbrown-xentity jbrown-xentity added H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 H2.0/Harvest-General General Harvesting 2.0 Issues and removed H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 labels Aug 15, 2024
@Bagesary Bagesary moved this to H2.0 Backlog in data.gov team board Aug 15, 2024
@FuhuXia
Member Author

FuhuXia commented Aug 26, 2024

For the Metadata link field names, we can go with what is defined in ticket #4856:

record_id
harvest_source_id
harvset_source_name

@Jin-Sun-tts
Contributor

Included three keys and values in the extras field,

record_id => harvest_object_id
harvest_source_id => harvest_source_id
harvset_source_name => harvest_source_title

The following metadata source block now shows up and displays the harvest object and harvest source info:
Image
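The mapping above could be expressed as a small translation table. This sketch reads each arrow as "harvester 2.0 field name => legacy CKAN extras key", which is an interpretation of the comment rather than something it states outright. The field names are copied as spelled in this thread; the function name and sample record are hypothetical.

```python
# Left: field name on the harvester 2.0 side (as spelled in the thread).
# Right: extras key expected by the existing catalog-next templates.
FIELD_TO_EXTRAS_KEY = {
    "record_id": "harvest_object_id",
    "harvest_source_id": "harvest_source_id",
    "harvset_source_name": "harvest_source_title",
}


def to_legacy_extras(record: dict) -> list[dict]:
    """Translate harvester fields into CKAN extras key/value entries."""
    missing = [field for field in FIELD_TO_EXTRAS_KEY if field not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return [
        {"key": extras_key, "value": record[field]}
        for field, extras_key in FIELD_TO_EXTRAS_KEY.items()
    ]


# Sample values are made up for illustration.
sample_record = {
    "record_id": "record-0001",
    "harvest_source_id": "source-0001",
    "harvset_source_name": "Example Harvest Source",
}
print(to_legacy_extras(sample_record))
```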

@jbrown-xentity
Contributor

For context, here are the notes from our session planning on how to handle collections: https://docs.google.com/document/d/1xaWeIOaqgL1Qo6kmWm_S7QwOcoD4i19kw4IwxNLpGa4/edit


3 participants