10982 10909 Allow using OAI-PMH identifiers as persistent ids of harvested datasets #11049

Merged
merged 17 commits into from
Nov 27, 2024
Changes from 15 commits
Commits
4449679
quick draft implementation of addressing issue 1. from #10909.
landreev Oct 21, 2024
2656ccd
Adding the new client options to the json printer and parser #10909
landreev Oct 29, 2024
5c043cd
we DO want to include the persistent id in the search cards for all h…
landreev Nov 1, 2024
b7efee0
a flyway script for the "use the oai id as the pid" harvesting client…
landreev Nov 23, 2024
eca0389
removed the part of the cherry-picked commit that I'm not going to ne…
landreev Nov 23, 2024
2911417
removed pieces of another cherry-picked commit not needed in this bra…
landreev Nov 23, 2024
0967b7a
A "hybrid" implementation of the support for using OAI identifiers fo…
landreev Nov 23, 2024
00943e1
guide entry
landreev Nov 25, 2024
cc7fb45
release note.
landreev Nov 25, 2024
d6fc240
json files for the new tests (from PR #11010 by @stevenferey)
landreev Nov 25, 2024
86b2260
tests for selecting persistent ids in the GenericImportService (from …
landreev Nov 25, 2024
115c88e
Update doc/release-notes/11049-oai-identifiers-as-pids.md
landreev Nov 25, 2024
a295cc4
Update doc/sphinx-guides/source/api/native-api.rst
landreev Nov 25, 2024
3c46287
reverted the flyway script back to its original state (a newline was …
landreev Nov 25, 2024
8a361be
another cherry-picked commit not needed in this branch.
landreev Nov 25, 2024
321de7c
there's no need to slap the "incomplete metadata" label on harvested …
landreev Nov 26, 2024
40fe665
a typo in search include fragment
landreev Nov 26, 2024
5 changes: 5 additions & 0 deletions doc/release-notes/11049-oai-identifiers-as-pids.md
@@ -0,0 +1,5 @@
## When harvesting, Dataverse can now use the identifier from the OAI-PMH record header as the persistent id for the harvested dataset.

This will allow harvesting from sources that do not include a persistent id in their oai_dc metadata records, but use valid DOIs or Handles as the OAI-PMH record header identifiers.

It is also possible to optionally configure a harvesting client to use this OAI-PMH identifier as the **preferred** choice for the persistent id. See the [Harvesting Clients API](https://guides.dataverse.org/en/6.5/api/native-api.html#create-a-harvesting-client) section of the Guides, #11049 and #10982 for more information.
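
As a rough illustration of how the new option might be set when creating a client, the sketch below builds (but does not send) an HTTP request with `useOaiIdentifiersAsPids` enabled. The endpoint path, the `X-Dataverse-key` header, and every JSON value other than the option name are assumptions made for this example and should be checked against the Harvesting Clients API guide linked above; the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class HarvestingClientRequestSketch {

    // Builds a POST request that would create a harvesting client that
    // prefers the OAI-PMH header identifier as the persistent id.
    // ASSUMPTIONS: the "/api/harvest/clients/{nickName}" path, the header
    // names, and all JSON values below are placeholders for illustration.
    public static HttpRequest buildCreateRequest(String serverUrl, String nickName, String apiToken) {
        String body = "{\n"
                + "  \"dataverseAlias\": \"fooData\",\n"
                + "  \"harvestUrl\": \"https://oai.example.org/oai\",\n"
                + "  \"archiveUrl\": \"https://example.org\",\n"
                + "  \"metadataFormat\": \"oai_dc\",\n"
                + "  \"useOaiIdentifiersAsPids\": true\n"
                + "}";
        return HttpRequest.newBuilder(URI.create(serverUrl + "/api/harvest/clients/" + nickName))
                .header("Content-Type", "application/json")
                .header("X-Dataverse-key", apiToken)
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildCreateRequest("http://localhost:8080", "myClient", "xxxxxxxx");
        // Only prints the request line; sending it is left to the caller.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Leaving the flag out of the JSON body would keep the default (`false`), in which case the OAI-PMH identifier is only a fallback.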
2 changes: 2 additions & 0 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -5246,6 +5246,7 @@ Shows a Harvesting Client with a defined nickname::
"dataverseAlias": "fooData",
"nickName": "myClient",
"set": "fooSet",
"useOaiIdentifiersAsPids": false,
"schedule": "none",
"status": "inActive",
"lastHarvest": "Thu Oct 13 14:48:57 EDT 2022",
@@ -5280,6 +5281,7 @@ The following optional fields are supported:
- style: Defaults to "default" - a generic OAI archive. (Make sure to use "dataverse" when configuring harvesting from another Dataverse installation).
- customHeaders: This can be used to configure this client with a specific HTTP header that will be added to every OAI request. This is to accommodate a use case where the remote server requires this header to supply some form of a token in order to offer some content not available to other clients. See the example below. Multiple headers can be supplied separated by `\\n` - actual "backslash" and "n" characters, not a single "new line" character.
- allowHarvestingMissingCVV: Flag to allow datasets to be harvested with Controlled Vocabulary Values that existed in the originating Dataverse Project but are not in the harvesting Dataverse Project. (Default is false). Currently only settable using API.
- useOaiIdentifiersAsPids: Defaults to false; if set to true, the harvester will attempt to use the identifier from the OAI-PMH record header as the **first choice** for the persistent id of the harvested dataset. When set to false, Dataverse will still attempt to use this identifier, but only if none of the `<dc:identifier>` entries in the OAI_DC record contain a valid persistent id (this is new as of v6.5).

Generally, the API will accept the output of the GET version of the API for an existing client as valid input, but some fields will be ignored. For example, as of writing this there is no way to configure a harvesting schedule via this API.
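
The ordering described for `useOaiIdentifiersAsPids` can be sketched as a small candidate list: the OAI-PMH header identifier goes first when the flag is true and last when it is false, and the first candidate that parses as a persistent id wins. This is a simplified illustration, not the Dataverse implementation: `PidPreferenceSketch`, `choosePid` and `looksLikePid` are hypothetical names, and the prefix check is a stand-in for the real persistent id parsing.

```java
import java.util.ArrayList;
import java.util.List;

class PidPreferenceSketch {

    // ASSUMPTION: a deliberately simplified validity check that accepts only
    // "doi:" and "hdl:" prefixed strings; the real parser recognizes more forms.
    public static boolean looksLikePid(String candidate) {
        return candidate != null
                && (candidate.startsWith("doi:") || candidate.startsWith("hdl:"));
    }

    // Returns the first candidate that looks like a persistent id, or null.
    public static String choosePid(List<String> dcIdentifiers,
                                   String oaiIdentifier,
                                   boolean useOaiIdentifiersAsPids) {
        List<String> candidates = new ArrayList<>();
        if (oaiIdentifier != null && useOaiIdentifiersAsPids) {
            candidates.add(oaiIdentifier);   // first choice
        }
        candidates.addAll(dcIdentifiers);    // the <dc:identifier> entries
        if (oaiIdentifier != null && !useOaiIdentifiersAsPids) {
            candidates.add(oaiIdentifier);   // last resort
        }
        for (String candidate : candidates) {
            if (looksLikePid(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> dcIds = List.of("http://www.example.com", "doi:10.5072/FK2/DC");
        System.out.println(choosePid(dcIds, "hdl:1902.1/OAI", false));
        System.out.println(choosePid(dcIds, "hdl:1902.1/OAI", true));
    }
}
```

With the flag set to false, the header identifier is only reached when none of the `<dc:identifier>` values qualifies, which matches the fallback behavior noted as new in v6.5.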

1 change: 1 addition & 0 deletions src/main/java/edu/harvard/iq/dataverse/DataCitation.java
@@ -792,6 +792,7 @@ private GlobalId getPIDFrom(DatasetVersion dsv, DvObject dv) {
if (!dsv.getDataset().isHarvested()
|| HarvestingClient.HARVEST_STYLE_VDC.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())
|| HarvestingClient.HARVEST_STYLE_ICPSR.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())
|| HarvestingClient.HARVEST_STYLE_DEFAULT.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())
|| HarvestingClient.HARVEST_STYLE_DATAVERSE
.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())) {
if(!isDirect()) {
@@ -150,12 +150,16 @@ public DatasetDTO processXML( XMLStreamReader xmlr, ForeignMetadataFormatMapping

}

// Helper method for importing harvested Dublin Core xml.
// Helper methods for importing harvested Dublin Core xml.
// Dublin Core is considered a mandatory, built in metadata format mapping.
// It is distributed as required content, in reference_data.sql.
// Note that arbitrary formatting tags are supported for the outer xml
// wrapper. -- L.A. 4.5
public DatasetDTO processOAIDCxml(String DcXmlToParse) throws XMLStreamException {
return processOAIDCxml(DcXmlToParse, null, false);
}

public DatasetDTO processOAIDCxml(String DcXmlToParse, String oaiIdentifier, boolean preferSuppliedIdentifier) throws XMLStreamException {
// look up DC metadata mapping:

ForeignMetadataFormatMapping dublinCoreMapping = findFormatMappingByName(DCTERMS);
@@ -185,18 +189,37 @@ public DatasetDTO processOAIDCxml(String DcXmlToParse) throws XMLStreamException

datasetDTO.getDatasetVersion().setVersionState(DatasetVersion.VersionState.RELEASED);

// Our DC import handles the contents of the dc:identifier field
// as an "other id". In the context of OAI harvesting, we expect
// the identifier to be a global id, so we need to rearrange that:
// In some cases, the identifier that we want to use for the dataset is
// already supplied to the method explicitly. For example, in some
// harvesting cases we'll want to use the OAI identifier (the identifier
// from the <header> section of the OAI record) for that purpose, without
// expecting to find a valid persistent id in the body of the DC record:

String identifier = getOtherIdFromDTO(datasetDTO.getDatasetVersion());
logger.fine("Imported identifier: "+identifier);
String globalIdentifier;

String globalIdentifier = reassignIdentifierAsGlobalId(identifier, datasetDTO);
logger.fine("Detected global identifier: "+globalIdentifier);
if (oaiIdentifier != null) {
logger.fine("Attempting to use " + oaiIdentifier + " as the persistentId of the imported dataset");

globalIdentifier = reassignIdentifierAsGlobalId(oaiIdentifier, datasetDTO);
} else {
// Our DC import handles the contents of the dc:identifier field
// as an "other id". Unless we are using an externally supplied
// global id, we will be using the first such "other id" that we
// can parse and recognize as the global id for the imported dataset
// (note that this is the default behavior during harvesting),
// so we need to reassign it accordingly:
String identifier = selectIdentifier(datasetDTO.getDatasetVersion(), oaiIdentifier, preferSuppliedIdentifier);
logger.fine("Imported identifier: " + identifier);

globalIdentifier = reassignIdentifierAsGlobalId(identifier, datasetDTO);
logger.fine("Detected global identifier: " + globalIdentifier);
}

if (globalIdentifier == null) {
throw new EJBException("Failed to find a global identifier in the OAI_DC XML record.");
String exceptionMsg = oaiIdentifier == null ?
"Failed to find a global identifier in the OAI_DC XML record." :
"Failed to parse the supplied identifier as a valid Persistent Id";
throw new EJBException(exceptionMsg);
}

return datasetDTO;
@@ -344,8 +367,20 @@ private FieldDTO makeDTO(DatasetFieldType dataverseFieldType, FieldDTO value, St
return value;
}

private String getOtherIdFromDTO(DatasetVersionDTO datasetVersionDTO) {
public String selectIdentifier(DatasetVersionDTO datasetVersionDTO, String suppliedIdentifier) {
return selectIdentifier(datasetVersionDTO, suppliedIdentifier, false);
}

private String selectIdentifier(DatasetVersionDTO datasetVersionDTO, String suppliedIdentifier, boolean preferSuppliedIdentifier) {
List<String> otherIds = new ArrayList<>();

if (suppliedIdentifier != null && preferSuppliedIdentifier) {
// This supplied identifier (in practice, this is likely the OAI-PMH
// identifier from the <record> <header> section) will be our first
// choice candidate for the pid of the imported dataset:
otherIds.add(suppliedIdentifier);
}

for (Map.Entry<String, MetadataBlockDTO> entry : datasetVersionDTO.getMetadataBlocks().entrySet()) {
String key = entry.getKey();
MetadataBlockDTO value = entry.getValue();
@@ -363,6 +398,16 @@ private String getOtherIdFromDTO(DatasetVersionDTO datasetVersionDTO) {
}
}
}

if (suppliedIdentifier != null && !preferSuppliedIdentifier) {
// Unless specifically instructed to prefer this extra identifier
// (in practice, this is likely the OAI-PMH identifier from the
// <record> <header> section), we will try to use it as the *last*
// possible candidate for the pid, so, adding it to the end of the
// list:
otherIds.add(suppliedIdentifier);
}

if (!otherIds.isEmpty()) {
// We prefer doi or hdl identifiers like "doi:10.7910/DVN/1HE30F"
for (String otherId : otherIds) {
@@ -208,7 +208,13 @@ public JsonObjectBuilder handleFile(DataverseRequest dataverseRequest, Dataverse
}

@TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
public Dataset doImportHarvestedDataset(DataverseRequest dataverseRequest, HarvestingClient harvestingClient, String harvestIdentifier, String metadataFormat, File metadataFile, Date oaiDateStamp, PrintWriter cleanupLog) throws ImportException, IOException {
public Dataset doImportHarvestedDataset(DataverseRequest dataverseRequest,
HarvestingClient harvestingClient,
String harvestIdentifier,
String metadataFormat,
File metadataFile,
Date oaiDateStamp,
PrintWriter cleanupLog) throws ImportException, IOException {
if (harvestingClient == null || harvestingClient.getDataverse() == null) {
throw new ImportException("importHarvestedDataset called with a null harvestingClient, or an invalid harvestingClient.");
}
@@ -244,8 +250,8 @@ public Dataset doImportHarvestedDataset(DataverseRequest dataverseRequest, Harve
} else if ("dc".equalsIgnoreCase(metadataFormat) || "oai_dc".equals(metadataFormat)) {
logger.fine("importing DC "+metadataFile.getAbsolutePath());
try {
String xmlToParse = new String(Files.readAllBytes(metadataFile.toPath()));
dsDTO = importGenericService.processOAIDCxml(xmlToParse);
String xmlToParse = new String(Files.readAllBytes(metadataFile.toPath()));
dsDTO = importGenericService.processOAIDCxml(xmlToParse, harvestIdentifier, harvestingClient.isUseOaiIdentifiersAsPids());
} catch (IOException | XMLStreamException e) {
throw new ImportException("Failed to process Dublin Core XML record: "+ e.getClass() + " (" + e.getMessage() + ")");
}
@@ -252,8 +252,16 @@ public void setAllowHarvestingMissingCVV(boolean allowHarvestingMissingCVV) {
this.allowHarvestingMissingCVV = allowHarvestingMissingCVV;
}

// TODO: do we need "orphanRemoval=true"? -- L.A. 4.4
// TODO: should it be @OrderBy("startTime")? -- L.A. 4.4
private boolean useOaiIdAsPid;

public boolean isUseOaiIdentifiersAsPids() {
return useOaiIdAsPid;
}

public void setUseOaiIdentifiersAsPids(boolean useOaiIdAsPid) {
this.useOaiIdAsPid = useOaiIdAsPid;
}

@OneToMany(mappedBy="harvestingClient", cascade={CascadeType.REMOVE, CascadeType.MERGE, CascadeType.PERSIST})
@OrderBy("id")
private List<ClientHarvestRun> harvestHistory;
@@ -1052,6 +1052,7 @@ public String parseHarvestingClient(JsonObject obj, HarvestingClient harvestingC
harvestingClient.setHarvestingSet(obj.getString("set",null));
harvestingClient.setCustomHttpHeaders(obj.getString("customHeaders", null));
harvestingClient.setAllowHarvestingMissingCVV(obj.getBoolean("allowHarvestingMissingCVV", false));
harvestingClient.setUseOaiIdentifiersAsPids(obj.getBoolean("useOaiIdentifiersAsPids", false));

return dataverseAlias;
}
@@ -1013,6 +1013,7 @@ public static JsonObjectBuilder json(HarvestingClient harvestingClient) {
add("status", harvestingClient.isHarvestingNow() ? "inProgress" : "inActive").
add("customHeaders", harvestingClient.getCustomHttpHeaders()).
add("allowHarvestingMissingCVV", harvestingClient.getAllowHarvestingMissingCVV()).
add("useOaiIdentifiersAsPids", harvestingClient.isUseOaiIdentifiersAsPids()).
add("lastHarvest", harvestingClient.getLastHarvestTime() == null ? null : harvestingClient.getLastHarvestTime().toString()).
add("lastResult", harvestingClient.getLastResult()).
add("lastSuccessful", harvestingClient.getLastSuccessfulHarvestTime() == null ? null : harvestingClient.getLastSuccessfulHarvestTime().toString()).
2 changes: 2 additions & 0 deletions src/main/resources/db/migration/V6.4.0.3.sql
@@ -0,0 +1,2 @@
-- Add this boolean flag to accommodate a new harvesting client feature
ALTER TABLE harvestingclient ADD COLUMN IF NOT EXISTS useOaiIdAsPid BOOLEAN DEFAULT FALSE;
@@ -1,21 +1,70 @@
package edu.harvard.iq.dataverse.api.imports;

import edu.harvard.iq.dataverse.api.dto.DatasetDTO;
import edu.harvard.iq.dataverse.api.dto.DatasetVersionDTO;

import org.apache.commons.io.FileUtils;
import com.google.gson.Gson;
import java.io.File;
import java.io.IOException;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.junit.jupiter.MockitoExtension;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNull;

import java.nio.charset.StandardCharsets;

@ExtendWith(MockitoExtension.class)
public class ImportGenericServiceBeanTest {

@InjectMocks
private ImportGenericServiceBean importGenericService;

@Test
public void testReassignIdentifierAsGlobalId() {
void testIdentifierHarvestableWithOtherID() throws IOException {
// "otherIdValue" containing the value : doi:10.7910/DVN/TJCLKP
File file = new File("src/test/resources/json/importGenericWithOtherId.json");
String text = FileUtils.readFileToString(file, StandardCharsets.UTF_8);
DatasetVersionDTO dto = new Gson().fromJson(text, DatasetVersionDTO.class);

assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "https://doi.org/10.7910/DVN/TJCLKP"));
// junk or null
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "junk"));
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, null));
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "http://www.example.com"));
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "https://dataverse.org"));
}

@Test
void testIdentifierHarvestableWithoutOtherID() throws IOException {
// Does not contain data of type "otherIdValue"
File file = new File("src/test/resources/json/importGenericWithoutOtherId.json");
String text = FileUtils.readFileToString(file, StandardCharsets.UTF_8);
DatasetVersionDTO dto = new Gson().fromJson(text, DatasetVersionDTO.class);

// non-URL
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "doi:10.7910/DVN/TJCLKP"));
assertEquals("hdl:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "hdl:10.7910/DVN/TJCLKP"));
// HTTPS
assertEquals("https://doi.org/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "https://doi.org/10.7910/DVN/TJCLKP"));
assertEquals("https://dx.doi.org/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "https://dx.doi.org/10.7910/DVN/TJCLKP"));
assertEquals("https://hdl.handle.net/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "https://hdl.handle.net/10.7910/DVN/TJCLKP"));
// HTTP (no S)
assertEquals("http://doi.org/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "http://doi.org/10.7910/DVN/TJCLKP"));
assertEquals("http://dx.doi.org/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "http://dx.doi.org/10.7910/DVN/TJCLKP"));
assertEquals("http://hdl.handle.net/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "http://hdl.handle.net/10.7910/DVN/TJCLKP"));
// junk or null
assertNull(importGenericService.selectIdentifier(dto, "junk"));
assertNull(importGenericService.selectIdentifier(dto, null));
assertNull(importGenericService.selectIdentifier(dto, "http://www.example.com"));
assertNull(importGenericService.selectIdentifier(dto, "https://dataverse.org"));
}

@Test
void testReassignIdentifierAsGlobalId() {
// non-URL
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.reassignIdentifierAsGlobalId("doi:10.7910/DVN/TJCLKP", new DatasetDTO()));
assertEquals("hdl:10.7910/DVN/TJCLKP", importGenericService.reassignIdentifierAsGlobalId("hdl:10.7910/DVN/TJCLKP", new DatasetDTO()));
@@ -29,6 +78,8 @@ public void testReassignIdentifierAsGlobalId() {
assertEquals("hdl:10.7910/DVN/TJCLKP", importGenericService.reassignIdentifierAsGlobalId("http://hdl.handle.net/10.7910/DVN/TJCLKP", new DatasetDTO()));
// junk
assertNull(importGenericService.reassignIdentifierAsGlobalId("junk", new DatasetDTO()));
assertNull(importGenericService.reassignIdentifierAsGlobalId("http://www.example.com", new DatasetDTO()));
assertNull(importGenericService.reassignIdentifierAsGlobalId("https://dataverse.org", new DatasetDTO()));
}

}
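
The assertions above suggest that `reassignIdentifierAsGlobalId` maps resolver URLs onto the compact `doi:` / `hdl:` form and rejects everything else. A minimal standalone sketch of that kind of normalization follows. It is not the bean's actual code: `PidNormalizationSketch` and `normalize` are hypothetical names, the resolver host list is taken from the URLs used in the `selectIdentifier` tests, and the mapping of the `doi.org` URLs to `doi:` is an assumption, since the corresponding assertions are collapsed in this view.

```java
class PidNormalizationSketch {

    // Resolver hosts seen in the test data, paired with the protocol prefix
    // each one is assumed to map to.
    private static final String[][] RESOLVERS = {
        {"doi.org/", "doi:"},
        {"dx.doi.org/", "doi:"},
        {"hdl.handle.net/", "hdl:"}
    };

    // Returns the identifier in "doi:..." or "hdl:..." form, or null when
    // the input is not recognized as a persistent id.
    public static String normalize(String identifier) {
        if (identifier == null) {
            return null;
        }
        if (identifier.startsWith("doi:") || identifier.startsWith("hdl:")) {
            return identifier; // already in compact form
        }
        String withoutScheme = identifier.replaceFirst("^https?://", "");
        if (withoutScheme.equals(identifier)) {
            return null; // neither a compact pid nor an http(s) URL
        }
        for (String[] resolver : RESOLVERS) {
            if (withoutScheme.startsWith(resolver[0])) {
                return resolver[1] + withoutScheme.substring(resolver[0].length());
            }
        }
        return null; // some other URL, e.g. a landing page
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://hdl.handle.net/10.7910/DVN/TJCLKP"));
        System.out.println(normalize("https://dataverse.org"));
    }
}
```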