Duplicate Entities Detected in OpenCitations Meta #28

Open
eliarizzetto opened this issue Sep 5, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@eliarizzetto
Collaborator

Issue Description:

We have detected the presence of duplicate entities in OpenCitations Meta across several entity types, specifically concerning Bibliographic Resources, Responsible Agents, and Identifiers. Below is a summary of the problem and examples illustrating the issue.
See also issue #24.

Summary of the Problem:

  1. Bibliographic Resources (and related Identifier entities): In some cases, multiple Bibliographic Resource (BR) entities are linked to the same identifier value (e.g., a DOI or an ISSN). This amounts to duplication, e.g. distinct BR entities for journal articles that are all associated with the same DOI. The issue arises in one of two ways:

    • Multiple Identifier entities carry the same value (i.e. the Identifier entities are themselves duplicates).
    • A single Identifier entity being linked to multiple Bibliographic Resource entities.
  2. Responsible Agents: Similar duplication occurs with Responsible Agents, such as authors. In some cases, multiple entities represent the same real-world individual and are all associated with the same ORCID.
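The two causes listed under point 1 can be told apart once the (entity, Identifier entity, literal value) links have been exported from the triplestore. The following is a minimal Python sketch of that classification, not part of the original analysis. The function name `classify_duplicates` and the short sample URIs are hypothetical, invented for illustration only.

```python
from collections import defaultdict


def classify_duplicates(links):
    """Split duplication into its two causes.

    links: iterable of (entity_uri, identifier_uri, literal_value) tuples,
    e.g. one tuple per BR/Identifier/value combination exported from Meta.
    """
    ids_by_value = defaultdict(set)    # literal value -> Identifier entities
    entities_by_id = defaultdict(set)  # Identifier entity -> linked entities

    for entity, identifier, value in links:
        ids_by_value[value].add(identifier)
        entities_by_id[identifier].add(entity)

    # Cause 1: several Identifier entities carry the same literal value.
    duplicate_identifiers = {
        value: sorted(ids) for value, ids in ids_by_value.items() if len(ids) > 1
    }
    # Cause 2: one Identifier entity is linked to several entities.
    shared_identifiers = {
        ident: sorted(ents) for ident, ents in entities_by_id.items() if len(ents) > 1
    }
    return duplicate_identifiers, shared_identifiers


# Hypothetical sample data (not real Meta URIs or identifier values).
links = [
    ("br/A", "id/1", "doi-x"),
    ("br/B", "id/2", "doi-x"),   # same value, two Identifier entities
    ("br/C", "id/3", "issn-y"),
    ("br/D", "id/3", "issn-y"),  # one Identifier entity, two BRs
    ("br/E", "id/4", "doi-z"),   # not duplicated
]
duplicate_identifiers, shared_identifiers = classify_duplicates(links)
```

The same function applies to Responsible Agents if the tuples are built from RA entities and their ORCID identifiers.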

Example SPARQL Query for ISSN Duplication:

The following SPARQL query retrieves examples of duplicate Bibliographic Resource entities connected to the same ISSN. It returns 10 pairs of Bibliographic Resources that are linked to the same Identifier entity, and therefore to the same ISSN:

PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>

SELECT DISTINCT ?id (?lit AS ?ISSN) ?br1 ?br2
WHERE {
  ?id datacite:usesIdentifierScheme datacite:issn;
    literal:hasLiteralValue ?lit.

  ?br1 datacite:hasIdentifier ?id.
  ?br2 datacite:hasIdentifier ?id.

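  # Keep only pairs of distinct BRs that share the same Identifier entity.
  FILTER(?br1 != ?br2)
}
# No GROUP BY: every selected variable is a plain (non-aggregated) binding.
LIMIT 10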

Current Results:

| id | ISSN | br1 | br2 |
| --- | --- | --- | --- |
| https://w3id.org/oc/meta/id/06302944976 | 2214-1766 | https://w3id.org/oc/meta/br/06303150256 | https://w3id.org/oc/meta/br/06380151022 |
| https://w3id.org/oc/meta/id/0616014 | 0162-8828 | https://w3id.org/oc/meta/br/062503701865 | https://w3id.org/oc/meta/br/062203701946 |
| https://w3id.org/oc/meta/id/0616014 | 0162-8828 | https://w3id.org/oc/meta/br/0603903711 | https://w3id.org/oc/meta/br/062503701865 |
| https://w3id.org/oc/meta/id/06170244 | 1178-203X | https://w3id.org/oc/meta/br/062103762230 | https://w3id.org/oc/meta/br/06280185247 |
| https://w3id.org/oc/meta/id/0616081 | 1555-6654 | https://w3id.org/oc/meta/br/061203826853 | https://w3id.org/oc/meta/br/061606048 |
| https://w3id.org/oc/meta/id/061402866970 | 1809-9246 | https://w3id.org/oc/meta/br/061203801536 | https://w3id.org/oc/meta/br/061403009914 |
| https://w3id.org/oc/meta/id/06201832116 | 2007-865X | https://w3id.org/oc/meta/br/06203959225 | https://w3id.org/oc/meta/br/06103883902 |
| https://w3id.org/oc/meta/id/061401171 | 1229-5949 | https://w3id.org/oc/meta/br/0614039607 | https://w3id.org/oc/meta/br/061503913707 |
| https://w3id.org/oc/meta/id/06301140758 | 2212-5043 | https://w3id.org/oc/meta/br/06301094885 | https://w3id.org/oc/meta/br/06804190681 |
| https://w3id.org/oc/meta/id/0626020980 | 0873-2159 | https://w3id.org/oc/meta/br/062503758069 | https://w3id.org/oc/meta/br/0603903778 |
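Because the query returns pairs, an Identifier entity shared by more than two BRs appears in several rows (as with id/0616014 above). A possible way to collapse such rows into one cluster per Identifier entity is sketched below in Python. The function name `cluster_pairs` is hypothetical; the sample rows are copied from the results above.

```python
from collections import defaultdict


def cluster_pairs(rows):
    """Merge pairwise result rows into one set of BRs per (id, ISSN).

    rows: iterable of (identifier_uri, issn, br1_uri, br2_uri) tuples,
    i.e. one tuple per row returned by the query.
    """
    clusters = defaultdict(set)
    for identifier, issn, br1, br2 in rows:
        clusters[(identifier, issn)].update((br1, br2))
    # Sort the BR URIs so that the output is deterministic.
    return {key: sorted(brs) for key, brs in clusters.items()}


META = "https://w3id.org/oc/meta/"
rows = [
    (META + "id/0616014", "0162-8828", META + "br/062503701865", META + "br/062203701946"),
    (META + "id/0616014", "0162-8828", META + "br/0603903711", META + "br/062503701865"),
    (META + "id/06170244", "1178-203X", META + "br/062103762230", META + "br/06280185247"),
]
clusters = cluster_pairs(rows)
for (identifier, issn), brs in clusters.items():
    print(issn, identifier, len(brs), "BRs")
```

For these three rows the sketch yields two clusters: three BRs for ISSN 0162-8828 and two for ISSN 1178-203X.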

As part of a preliminary analysis, the scale of the problem was quantified:

  • Duplicate BRs with DOIs: 4,564,263 Bibliographic Resources (BRs) are associated with the same DOI as at least one other BR.
  • Duplicate BRs with ISSNs: 570 BRs are associated with the same ISSN as at least one other BR.

The following SPARQL query was used to obtain the count of affected bibliographic resources:

PREFIX datacite: <http://purl.org/spar/datacite/>
PREFIX literal: <http://www.essepuntato.it/2010/06/literalreification/>

SELECT (COUNT(DISTINCT ?br) AS ?count) WHERE {
  {
    SELECT ?br (COUNT(DISTINCT ?br_other) AS ?shared_br_count) WHERE {
      ?br datacite:hasIdentifier ?id.
      ?id datacite:usesIdentifierScheme datacite:issn; # use datacite:doi here to count DOIs instead
        literal:hasLiteralValue ?lit.
      ?br_other datacite:hasIdentifier ?id_other.
      ?id_other datacite:usesIdentifierScheme datacite:issn; # use datacite:doi here to count DOIs instead
        literal:hasLiteralValue ?lit.
      FILTER(?br != ?br_other)
    }
    GROUP BY ?br
  }
  FILTER(?shared_br_count > 0 )
}
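Unlike the example query, the count query matches BRs on the literal value, so it also covers BRs that reach the same ISSN or DOI through different Identifier entities. As a sanity check on a small export, the same counting logic can be reproduced in memory. This is a rough Python sketch under that assumption; the function name `count_brs_sharing_value` and the sample values are hypothetical.

```python
from collections import defaultdict


def count_brs_sharing_value(pairs):
    """Count BRs that share an identifier value with at least one other BR.

    pairs: iterable of (br_uri, literal_value) tuples for a single
    identifier scheme (e.g. only ISSNs or only DOIs).
    """
    brs_by_value = defaultdict(set)
    for br, value in pairs:
        brs_by_value[value].add(br)

    affected = set()
    for brs in brs_by_value.values():
        if len(brs) > 1:      # the value is attached to two or more distinct BRs
            affected |= brs   # each BR is counted once, as with COUNT(DISTINCT ?br)
    return len(affected)


# Hypothetical sample data (not real Meta URIs or ISSNs).
sample = [
    ("br/1", "issn-a"),
    ("br/2", "issn-a"),
    ("br/3", "issn-b"),  # unique value, not counted
    ("br/1", "issn-c"),
    ("br/4", "issn-c"),
]
affected_count = count_brs_sharing_value(sample)
```

In this sample, br/1 shares a value with both br/2 and br/4 but contributes only once to the total.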