Merge pull request #11049 from IQSS/10909-oai-identifiers-as-pids
10982 10909 Allow using OAI-PMH identifiers as persistent ids of harvested datasets
ofahimIQSS authored Nov 27, 2024
2 parents bcb441e + 40fe665 commit 3c427c1
Showing 13 changed files with 704 additions and 17 deletions.
5 changes: 5 additions & 0 deletions doc/release-notes/11049-oai-identifiers-as-pids.md
@@ -0,0 +1,5 @@
## When harvesting, Dataverse can now use the identifier from the OAI-PMH record header as the persistent id for the harvested dataset.

This allows harvesting from sources that do not include a persistent id in their oai_dc metadata records, but use valid DOIs or handles as the OAI-PMH record header identifiers.

It is also possible to optionally configure a harvesting client to use this OAI-PMH identifier as the **preferred** choice for the persistent id. See the [Harvesting Clients API](https://guides.dataverse.org/en/6.5/api/native-api.html#create-a-harvesting-client) section of the Guides, #11049 and #10982 for more information.
2 changes: 2 additions & 0 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -5254,6 +5254,7 @@ Shows a Harvesting Client with a defined nickname::
"dataverseAlias": "fooData",
"nickName": "myClient",
"set": "fooSet",
"useOaiIdentifiersAsPids": false,
"schedule": "none",
"status": "inActive",
"lastHarvest": "Thu Oct 13 14:48:57 EDT 2022",
@@ -5288,6 +5289,7 @@ The following optional fields are supported:
- style: Defaults to "default" - a generic OAI archive. (Make sure to use "dataverse" when configuring harvesting from another Dataverse installation).
- customHeaders: This can be used to configure this client with a specific HTTP header that will be added to every OAI request. This is to accommodate a use case where the remote server requires this header to supply some form of a token in order to offer some content not available to other clients. See the example below. Multiple headers can be supplied separated by `\\n` - actual "backslash" and "n" characters, not a single "new line" character.
- allowHarvestingMissingCVV: Flag to allow datasets to be harvested with Controlled Vocabulary Values that existed in the originating Dataverse Project but are not in the harvesting Dataverse Project. (Default is false). Currently only settable using API.
- useOaiIdentifiersAsPids: Defaults to false; if set to true, the harvester will attempt to use the identifier from the OAI-PMH record header as the **first choice** for the persistent id of the harvested dataset. When set to false, Dataverse will still attempt to use this identifier, but only if none of the `<dc:identifier>` entries in the OAI_DC record contain a valid persistent id (this is new as of v6.5).
Generally, the API will accept the output of the GET version of the API for an existing client as valid input, but some fields will be ignored. For example, as of writing this there is no way to configure a harvesting schedule via this API.
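
The precedence described for `useOaiIdentifiersAsPids` can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the Dataverse implementation: the class `PidCandidateOrder`, its methods, and the simplified `looksLikePid` check are hypothetical names introduced here, and the recognized prefixes are assumed from the identifier forms exercised in this commit's unit tests.

```java
import java.util.ArrayList;
import java.util.List;

public class PidCandidateOrder {

    // Assumption: a simplified stand-in for real persistent id parsing.
    private static final String[] PID_PREFIXES = {
        "doi:", "hdl:",
        "https://doi.org/", "http://doi.org/",
        "https://dx.doi.org/", "http://dx.doi.org/",
        "https://hdl.handle.net/", "http://hdl.handle.net/"
    };

    static boolean looksLikePid(String candidate) {
        if (candidate == null) {
            return false;
        }
        for (String prefix : PID_PREFIXES) {
            // Require something after the prefix, so "doi:" alone is rejected.
            if (candidate.startsWith(prefix) && candidate.length() > prefix.length()) {
                return true;
            }
        }
        return false;
    }

    // Returns the first candidate that looks like a persistent id, or null.
    static String choose(List<String> dcIdentifiers, String oaiIdentifier,
                         boolean useOaiIdentifiersAsPids) {
        List<String> candidates = new ArrayList<>();
        if (oaiIdentifier != null && useOaiIdentifiersAsPids) {
            // Flag set: the OAI-PMH header identifier is the first choice.
            candidates.add(oaiIdentifier);
        }
        candidates.addAll(dcIdentifiers);
        if (oaiIdentifier != null && !useOaiIdentifiersAsPids) {
            // Flag not set: the header identifier is only a last resort.
            candidates.add(oaiIdentifier);
        }
        for (String candidate : candidates) {
            if (looksLikePid(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

With the flag left at false, a valid `<dc:identifier>` wins over the header identifier; the header identifier is used only when none of the `<dc:identifier>` values qualify.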
1 change: 1 addition & 0 deletions src/main/java/edu/harvard/iq/dataverse/DataCitation.java
@@ -792,6 +792,7 @@ private GlobalId getPIDFrom(DatasetVersion dsv, DvObject dv) {
if (!dsv.getDataset().isHarvested()
|| HarvestingClient.HARVEST_STYLE_VDC.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())
|| HarvestingClient.HARVEST_STYLE_ICPSR.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())
|| HarvestingClient.HARVEST_STYLE_DEFAULT.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())
|| HarvestingClient.HARVEST_STYLE_DATAVERSE
.equals(dsv.getDataset().getHarvestedFrom().getHarvestStyle())) {
if(!isDirect()) {
@@ -150,12 +150,16 @@ public DatasetDTO processXML( XMLStreamReader xmlr, ForeignMetadataFormatMapping

}

// Helper method for importing harvested Dublin Core xml.
// Helper methods for importing harvested Dublin Core xml.
// Dublin Core is considered a mandatory, built in metadata format mapping.
// It is distributed as required content, in reference_data.sql.
// Note that arbitrary formatting tags are supported for the outer xml
// wrapper. -- L.A. 4.5
public DatasetDTO processOAIDCxml(String DcXmlToParse) throws XMLStreamException {
return processOAIDCxml(DcXmlToParse, null, false);
}

public DatasetDTO processOAIDCxml(String DcXmlToParse, String oaiIdentifier, boolean preferSuppliedIdentifier) throws XMLStreamException {
// look up DC metadata mapping:

ForeignMetadataFormatMapping dublinCoreMapping = findFormatMappingByName(DCTERMS);
@@ -185,18 +189,37 @@ public DatasetDTO processOAIDCxml(String DcXmlToParse) throws XMLStreamException

datasetDTO.getDatasetVersion().setVersionState(DatasetVersion.VersionState.RELEASED);

// Our DC import handles the contents of the dc:identifier field
// as an "other id". In the context of OAI harvesting, we expect
// the identifier to be a global id, so we need to rearrange that:
// In some cases, the identifier that we want to use for the dataset is
// already supplied to the method explicitly. For example, in some
// harvesting cases we'll want to use the OAI identifier (the identifier
// from the <header> section of the OAI record) for that purpose, without
// expecting to find a valid persistent id in the body of the DC record:

String identifier = getOtherIdFromDTO(datasetDTO.getDatasetVersion());
logger.fine("Imported identifier: "+identifier);
String globalIdentifier;

String globalIdentifier = reassignIdentifierAsGlobalId(identifier, datasetDTO);
logger.fine("Detected global identifier: "+globalIdentifier);
if (oaiIdentifier != null) {
logger.fine("Attempting to use " + oaiIdentifier + " as the persistentId of the imported dataset");

globalIdentifier = reassignIdentifierAsGlobalId(oaiIdentifier, datasetDTO);
} else {
// Our DC import handles the contents of the dc:identifier field
// as an "other id". Unless we are using an externally supplied
// global id, we will be using the first such "other id" that we
// can parse and recognize as the global id for the imported dataset
// (note that this is the default behavior during harvesting),
// so we need to reassign it accordingly:
String identifier = selectIdentifier(datasetDTO.getDatasetVersion(), oaiIdentifier, preferSuppliedIdentifier);
logger.fine("Imported identifier: " + identifier);

globalIdentifier = reassignIdentifierAsGlobalId(identifier, datasetDTO);
logger.fine("Detected global identifier: " + globalIdentifier);
}

if (globalIdentifier == null) {
throw new EJBException("Failed to find a global identifier in the OAI_DC XML record.");
String exceptionMsg = oaiIdentifier == null ?
"Failed to find a global identifier in the OAI_DC XML record." :
"Failed to parse the supplied identifier as a valid Persistent Id";
throw new EJBException(exceptionMsg);
}

return datasetDTO;
@@ -344,8 +367,20 @@ private FieldDTO makeDTO(DatasetFieldType dataverseFieldType, FieldDTO value, St
return value;
}

private String getOtherIdFromDTO(DatasetVersionDTO datasetVersionDTO) {
public String selectIdentifier(DatasetVersionDTO datasetVersionDTO, String suppliedIdentifier) {
return selectIdentifier(datasetVersionDTO, suppliedIdentifier, false);
}

private String selectIdentifier(DatasetVersionDTO datasetVersionDTO, String suppliedIdentifier, boolean preferSuppliedIdentifier) {
List<String> otherIds = new ArrayList<>();

if (suppliedIdentifier != null && preferSuppliedIdentifier) {
// This supplied identifier (in practice, this is likely the OAI-PMH
// identifier from the <record> <header> section) will be our first
// choice candidate for the pid of the imported dataset:
otherIds.add(suppliedIdentifier);
}

for (Map.Entry<String, MetadataBlockDTO> entry : datasetVersionDTO.getMetadataBlocks().entrySet()) {
String key = entry.getKey();
MetadataBlockDTO value = entry.getValue();
@@ -363,6 +398,16 @@ private String getOtherIdFromDTO(DatasetVersionDTO datasetVersionDTO) {
}
}
}

if (suppliedIdentifier != null && !preferSuppliedIdentifier) {
// Unless specifically instructed to prefer this extra identifier
// (in practice, this is likely the OAI-PMH identifier from the
// <record> <header> section), we will try to use it as the *last*
// possible candidate for the pid, so, adding it to the end of the
// list:
otherIds.add(suppliedIdentifier);
}

if (!otherIds.isEmpty()) {
// We prefer doi or hdl identifiers like "doi:10.7910/DVN/1HE30F"
for (String otherId : otherIds) {
@@ -208,7 +208,13 @@ public JsonObjectBuilder handleFile(DataverseRequest dataverseRequest, Dataverse
}

@TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
public Dataset doImportHarvestedDataset(DataverseRequest dataverseRequest, HarvestingClient harvestingClient, String harvestIdentifier, String metadataFormat, File metadataFile, Date oaiDateStamp, PrintWriter cleanupLog) throws ImportException, IOException {
public Dataset doImportHarvestedDataset(DataverseRequest dataverseRequest,
HarvestingClient harvestingClient,
String harvestIdentifier,
String metadataFormat,
File metadataFile,
Date oaiDateStamp,
PrintWriter cleanupLog) throws ImportException, IOException {
if (harvestingClient == null || harvestingClient.getDataverse() == null) {
throw new ImportException("importHarvestedDataset called with a null harvestingClient, or an invalid harvestingClient.");
}
@@ -244,8 +250,8 @@ public Dataset doImportHarvestedDataset(DataverseRequest dataverseRequest, Harve
} else if ("dc".equalsIgnoreCase(metadataFormat) || "oai_dc".equals(metadataFormat)) {
logger.fine("importing DC "+metadataFile.getAbsolutePath());
try {
String xmlToParse = new String(Files.readAllBytes(metadataFile.toPath()));
dsDTO = importGenericService.processOAIDCxml(xmlToParse);
String xmlToParse = new String(Files.readAllBytes(metadataFile.toPath()));
dsDTO = importGenericService.processOAIDCxml(xmlToParse, harvestIdentifier, harvestingClient.isUseOaiIdentifiersAsPids());
} catch (IOException | XMLStreamException e) {
throw new ImportException("Failed to process Dublin Core XML record: "+ e.getClass() + " (" + e.getMessage() + ")");
}
@@ -252,8 +252,16 @@ public void setAllowHarvestingMissingCVV(boolean allowHarvestingMissingCVV) {
this.allowHarvestingMissingCVV = allowHarvestingMissingCVV;
}

// TODO: do we need "orphanRemoval=true"? -- L.A. 4.4
// TODO: should it be @OrderBy("startTime")? -- L.A. 4.4
private boolean useOaiIdAsPid;

public boolean isUseOaiIdentifiersAsPids() {
return useOaiIdAsPid;
}

public void setUseOaiIdentifiersAsPids(boolean useOaiIdAsPid) {
this.useOaiIdAsPid = useOaiIdAsPid;
}

@OneToMany(mappedBy="harvestingClient", cascade={CascadeType.REMOVE, CascadeType.MERGE, CascadeType.PERSIST})
@OrderBy("id")
private List<ClientHarvestRun> harvestHistory;
@@ -1052,6 +1052,7 @@ public String parseHarvestingClient(JsonObject obj, HarvestingClient harvestingC
harvestingClient.setHarvestingSet(obj.getString("set",null));
harvestingClient.setCustomHttpHeaders(obj.getString("customHeaders", null));
harvestingClient.setAllowHarvestingMissingCVV(obj.getBoolean("allowHarvestingMissingCVV", false));
harvestingClient.setUseOaiIdentifiersAsPids(obj.getBoolean("useOaiIdentifiersAsPids", false));

return dataverseAlias;
}
@@ -1013,6 +1013,7 @@ public static JsonObjectBuilder json(HarvestingClient harvestingClient) {
add("status", harvestingClient.isHarvestingNow() ? "inProgress" : "inActive").
add("customHeaders", harvestingClient.getCustomHttpHeaders()).
add("allowHarvestingMissingCVV", harvestingClient.getAllowHarvestingMissingCVV()).
add("useOaiIdentifiersAsPids", harvestingClient.isUseOaiIdentifiersAsPids()).
add("lastHarvest", harvestingClient.getLastHarvestTime() == null ? null : harvestingClient.getLastHarvestTime().toString()).
add("lastResult", harvestingClient.getLastResult()).
add("lastSuccessful", harvestingClient.getLastSuccessfulHarvestTime() == null ? null : harvestingClient.getLastSuccessfulHarvestTime().toString()).
2 changes: 2 additions & 0 deletions src/main/resources/db/migration/V6.4.0.3.sql
@@ -0,0 +1,2 @@
-- Add this boolean flag to accommodate a new harvesting client feature
ALTER TABLE harvestingclient ADD COLUMN IF NOT EXISTS useOaiIdAsPid BOOLEAN DEFAULT FALSE;
2 changes: 1 addition & 1 deletion src/main/webapp/search-include-fragment.xhtml
@@ -582,7 +582,7 @@
<h:outputText value="#{bundle['retentionExpired']}" styleClass="label label-warning" rendered="#{SearchIncludeFragment.isRetentionExpired(result)}"/>
<h:outputText value="#{DatasetUtil:getLocaleExternalStatus(result.externalStatus)}" styleClass="label label-info" rendered="#{!empty result.externalStatus and SearchIncludeFragment.canPublishDataset(result.entityId)}"/>
<h:outputText value="#{result.userRole}" styleClass="label label-primary" rendered="#{!empty result.userRole}"/>
<h:outputText value="#{bundle['incomplete']}" styleClass="label label-danger" rendered="#{!SearchIncludeFragment.isValid(result)}"/>
<h:outputText value="#{bundle['incomplete']}" styleClass="label label-danger" rendered="#{!result.harvested and !SearchIncludeFragment.isValid(result)}"/>
</div>
<div class="card-preview-icon-block text-center">
<a rel="nofollow" href="#{!SearchIncludeFragment.rootDv and !result.isInTree ? result.datasetUrl : widgetWrapper.wrapURL(result.datasetUrl)}" target="#{(!SearchIncludeFragment.rootDv and !result.isInTree and widgetWrapper.widgetView) or result.harvested ? '_blank' : ''}" aria-label="#{result.title}">
@@ -1,21 +1,70 @@
package edu.harvard.iq.dataverse.api.imports;

import edu.harvard.iq.dataverse.api.dto.DatasetDTO;
import edu.harvard.iq.dataverse.api.dto.DatasetVersionDTO;

import org.apache.commons.io.FileUtils;
import com.google.gson.Gson;
import java.io.File;
import java.io.IOException;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.junit.jupiter.MockitoExtension;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNull;

import java.nio.charset.StandardCharsets;

@ExtendWith(MockitoExtension.class)
public class ImportGenericServiceBeanTest {

@InjectMocks
private ImportGenericServiceBean importGenericService;

@Test
public void testReassignIdentifierAsGlobalId() {
void testIdentifierHarvestableWithOtherID() throws IOException {
// "otherIdValue" containing the value : doi:10.7910/DVN/TJCLKP
File file = new File("src/test/resources/json/importGenericWithOtherId.json");
String text = FileUtils.readFileToString(file, StandardCharsets.UTF_8);
DatasetVersionDTO dto = new Gson().fromJson(text, DatasetVersionDTO.class);

assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "https://doi.org/10.7910/DVN/TJCLKP"));
// junk or null
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "junk"));
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, null));
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "http://www.example.com"));
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "https://dataverse.org"));
}

@Test
void testIdentifierHarvestableWithoutOtherID() throws IOException {
// Does not contain data of type "otherIdValue"
File file = new File("src/test/resources/json/importGenericWithoutOtherId.json");
String text = FileUtils.readFileToString(file, StandardCharsets.UTF_8);
DatasetVersionDTO dto = new Gson().fromJson(text, DatasetVersionDTO.class);

// non-URL
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "doi:10.7910/DVN/TJCLKP"));
assertEquals("hdl:10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "hdl:10.7910/DVN/TJCLKP"));
// HTTPS
assertEquals("https://doi.org/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "https://doi.org/10.7910/DVN/TJCLKP"));
assertEquals("https://dx.doi.org/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "https://dx.doi.org/10.7910/DVN/TJCLKP"));
assertEquals("https://hdl.handle.net/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "https://hdl.handle.net/10.7910/DVN/TJCLKP"));
// HTTP (no S)
assertEquals("http://doi.org/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "http://doi.org/10.7910/DVN/TJCLKP"));
assertEquals("http://dx.doi.org/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "http://dx.doi.org/10.7910/DVN/TJCLKP"));
assertEquals("http://hdl.handle.net/10.7910/DVN/TJCLKP", importGenericService.selectIdentifier(dto, "http://hdl.handle.net/10.7910/DVN/TJCLKP"));
// junk or null
assertNull(importGenericService.selectIdentifier(dto, "junk"));
assertNull(importGenericService.selectIdentifier(dto, null));
assertNull(importGenericService.selectIdentifier(dto, "http://www.example.com"));
assertNull(importGenericService.selectIdentifier(dto, "https://dataverse.org"));
}

@Test
void testReassignIdentifierAsGlobalId() {
// non-URL
assertEquals("doi:10.7910/DVN/TJCLKP", importGenericService.reassignIdentifierAsGlobalId("doi:10.7910/DVN/TJCLKP", new DatasetDTO()));
assertEquals("hdl:10.7910/DVN/TJCLKP", importGenericService.reassignIdentifierAsGlobalId("hdl:10.7910/DVN/TJCLKP", new DatasetDTO()));
@@ -29,6 +78,8 @@ public void testReassignIdentifierAsGlobalId() {
assertEquals("hdl:10.7910/DVN/TJCLKP", importGenericService.reassignIdentifierAsGlobalId("http://hdl.handle.net/10.7910/DVN/TJCLKP", new DatasetDTO()));
// junk
assertNull(importGenericService.reassignIdentifierAsGlobalId("junk", new DatasetDTO()));
assertNull(importGenericService.reassignIdentifierAsGlobalId("http://www.example.com", new DatasetDTO()));
assertNull(importGenericService.reassignIdentifierAsGlobalId("https://dataverse.org", new DatasetDTO()));
}

}
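
The input and output pairs asserted in `testReassignIdentifierAsGlobalId` suggest a normalization from resolver URLs to the compact `doi:` and `hdl:` forms. The sketch below reproduces only that observable mapping. It is an assumption-laden illustration, not the actual `reassignIdentifierAsGlobalId` method: `PidNormalizer` and `normalize` are hypothetical names, and the regular expressions cover only the resolver hosts that appear in the tests.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PidNormalizer {

    // Assumption: only the resolver hosts seen in the tests are recognized.
    private static final Pattern DOI_URL =
            Pattern.compile("^https?://(?:dx\\.)?doi\\.org/(.+)$");
    private static final Pattern HDL_URL =
            Pattern.compile("^https?://hdl\\.handle\\.net/(.+)$");

    // Returns the identifier in "doi:..." or "hdl:..." form, or null if it
    // is not recognized as a persistent id.
    static String normalize(String identifier) {
        if (identifier == null) {
            return null;
        }
        if (identifier.startsWith("doi:") || identifier.startsWith("hdl:")) {
            // Already in compact form.
            return identifier;
        }
        Matcher doi = DOI_URL.matcher(identifier);
        if (doi.matches()) {
            return "doi:" + doi.group(1);
        }
        Matcher hdl = HDL_URL.matcher(identifier);
        if (hdl.matches()) {
            return "hdl:" + hdl.group(1);
        }
        return null;
    }
}
```

Returning null for unrecognized input mirrors the `assertNull` cases for "junk" and plain web URLs, and leaves it to the caller to decide whether a missing persistent id is an error.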