Files: Persistent Identifiers for Files (DOIs for files) #2438

Closed

eaquigley opened this issue Aug 7, 2015 · 100 comments

@eaquigley
Contributor

Since we are moving towards individual pages for files, we also need to think about what the persistent identifier will be for them.

@lwo

lwo commented Sep 29, 2015

I am entering via: https://groups.google.com/forum/#!msg/dataverse-community/gtz2npccWjU/i7_EVs2LBgAJ

... I think persistent identifiers should not be derived from the local file id. An organization may want to migrate from their existing repository solution to Dataverse and be able to import their PIDs if they already have them, and then rebind them to the Dataverse dataset and future file pages. That would be possible with a String type, but not with a number.
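For illustration, a string-typed PID column that is independent of the numeric primary key could look like the minimal JPA-style sketch below; this is an assumption for the sake of the argument, not the actual Dataverse schema.

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

// Hypothetical file entity: the persistent identifier is a free-form string,
// decoupled from the numeric database primary key, so PIDs minted elsewhere
// can be imported and later rebound to the Dataverse file page.
@Entity
public class FileRecord {

    @Id
    @GeneratedValue
    private Long id;              // local, numeric database id

    @Column(unique = true)
    private String persistentId;  // e.g. "doi:10.1234/ABC123" or "hdl:1902.1/XYZ"

    public Long getId() { return id; }

    public String getPersistentId() { return persistentId; }

    public void setPersistentId(String persistentId) { this.persistentId = persistentId; }
}
```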

@djbrooke
Contributor

Sebastian brought this up on the community call 8/16 and asked if it would be included in 4.6. There are currently no plans to work on this for 4.6.

@mheppler
Contributor

mheppler commented Oct 27, 2016

Closing this issue as a duplicate of Files: Need Persistent Identifiers/URL's for Data Files #2700.

@pdurbin
Member

pdurbin commented Feb 17, 2017

A "DOIs for files" thread was just started at https://groups.google.com/d/msg/dataverse-community/JX2GLqPy_yE/6dzgXGVcCAAJ which reminds me that @djbrooke and I discussed this issue as well as #2700 last week.

In short, #2700 was more about a specific need: putting into print instructions for how to download files from an installation of Dataverse. A combination of a DOI and a file name was sufficient for that, so I'll close that issue.

This issue was originally closed as a duplicate of #2700 but this issue is actually the "DOIs for files" issue so I'll reopen it. Note that the title is "persistent identifiers for files" to be more generic than DOIs, to include Handles or whatever other schemes make sense.

@pdurbin pdurbin reopened this Feb 17, 2017
@pdurbin pdurbin changed the title Files: Persistent Identifiers for Files Files: Persistent Identifiers for Files (DOIs for files) Feb 17, 2017
@pdurbin pdurbin added User Role: Curator Curates and reviews datasets, manages permissions and removed zTriaged labels Jun 30, 2017
@scolapasta
Contributor

scolapasta commented Sep 5, 2017

Two discussions have come up recently where we would benefit from persistent IDs for files:

  • supporting the Data Citation recommendations for granularity
  • using the persistent identifiers for provenance (as provenance should never change)

@landreev
Contributor

Adding relevant info about the current state of the branch;
this is from @sekmiller via Slack on Friday:

we've seen a database connection error when the number of files on a dataset is well over 500. The logs show errors like:

Severe:   RAR5031:System Exception
java.lang.NullPointerException
    at com.sun.enterprise.resource.ConnectorXAResource.getResourceHandle(ConnectorXAResource.java:246)
    at com.sun.enterprise.resource.ConnectorXAResource.end(ConnectorXAResource.java:159)
    at com.sun.enterprise.transaction.JavaEETransactionManagerSimplified.delistResource(JavaEETransactionManagerSimplified.java:527)
If you wait long enough the dataset will eventually publish and all of the files will be registered; however, the page is never shown as locked to the user during the process, yet at the end of the process the dataset gets its lock and it doesn't go away.
I put a hack into the Dataset page init such that if a dataset is locked for PIDRegistration but there's no draft version, we try to remove the lock.
Is there a better way to approach this?

To me this looks very wrong. And I don't feel this is really about the right way of handling an exception - why are we getting that exception in the first place? Specifically, a database connection error? Are we somehow trying to register the file DOIs for all 500+ files at the same time, instead of doing it sequentially in the background?
Steve also said that

They at least get identifiers in the database. I didn't check that they are all actually registered at the EZID admin console, but a spot check shows that all the ones I have looked for there have been registered.
  • So at least it looks like the asynchronous job finishes doing what it needs to do, it just never removes the locks properly... Still, this database exception, and the mess it leaves behind, look scary to me. And I don't think we should leave it like this just because 500+ files is an "edge case".
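For context, the workaround Steve describes (checking for a PIDRegistration lock with no remaining draft version and removing it) might look roughly like the sketch below; the class and method names here are assumptions for illustration, not the actual Dataverse code.

```java
// Hypothetical sketch of the workaround described above (names are illustrative,
// not the actual Dataverse API): on dataset page init, if the dataset is still
// locked for PID registration but no draft version remains, assume the async
// publish job already finished and clear the stale lock.
void removeStaleRegistrationLock(Dataset dataset, DatasetService datasetService) {
    boolean lockedForPidRegistration = dataset.isLockedFor(LockReason.PID_REGISTRATION);
    boolean hasDraft = dataset.getLatestVersion().isDraft();

    if (lockedForPidRegistration && !hasDraft) {
        // The background job published the version but never released its lock;
        // removing it here lets the page render the published version.
        datasetService.removeLock(dataset, LockReason.PID_REGISTRATION);
    }
}
```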

@landreev
Contributor

landreev commented Apr 16, 2018

I also added two specific change requests in the review:
one for removing an Async attribute that doesn't seem to serve any purpose, and another for the part in the FinalizeDatasetPublicationCommand where we appear to be trying to delete the same lock twice...
No good reason to think either of these two things is causing trouble, but who knows.

@djbrooke djbrooke assigned landreev and unassigned sekmiller Apr 24, 2018
landreev added a commit that referenced this issue Apr 26, 2018
…n the dataset once it's locked, before the asynchronous part of the publishing process is initiated. (#2438)
landreev added a commit that referenced this issue Apr 30, 2018
…atasetPublicationCommand update the dataset object;

to avoid the condition where the first command is still holding onto the same object as it initiates the execution of the
second command, asynchronously - which can result in concurrency and optimistic lock issues.
Also, removed the redundant/unnecessary merge() calls. (#2438)
@landreev landreev assigned kcondon and unassigned landreev Apr 30, 2018
@landreev
Contributor

A quick summary of the latest developments:

  • The identifiers for files are not immediately assigned on upload; for as long as the file is in draft, it has no global id, so the database ids are used in URLs pointing to the file landing pages, etc.
  • The identifiers are assigned when the version is published.
  • If there are 10 or fewer files (the limit is configurable, but 10 is the default), the identifiers are assigned as part of the single publishing task. Meaning, you click publish, you (eventually) get the success message, and all the files should have global identifiers. (If the global ids cannot be assigned, publish should fail.)
  • If there are more than 10, it goes into the async mode (see the sketch after this list). The version becomes locked, with a message clearly explaining what's going on. While locked, the version should still appear as an unpublished draft. Once the global ids are assigned in the background, the version should get automatically unlocked and appear as published.
  • So both the sync (10 or fewer files) and async modes should be tested separately.
  • Publishing datasets with large numbers of files should definitely be tested carefully. The expected behavior is that it should work with any number of files, with the async part simply taking longer (the rule of thumb is about 0.5 sec per file to register the global id). But the last issue we ran into was that with more than 500 files or so we hit a weird concurrency issue: a conflict between the first (synchronous) Publish command and the asynchronous part (the FinalizePublish command) that was messing up the locking and/or unlocking and sometimes preventing the dataset from getting published. It appears to have been fixed, and everything should be working as advertised now.
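To make the sync/async split above concrete, here is a rough sketch of the publish-time decision; every name in it (settings key, lock reason, helper methods) is an assumption for illustration only, not the actual Dataverse implementation.

```java
// Illustrative sketch of the publish-time decision described above; the names,
// settings key, and helpers are assumptions, not the actual Dataverse code.
void publish(Dataset dataset) {
    // Configurable threshold; 10 files is the default described in the summary.
    int threshold = settings.getInt("FilePIDAsyncRegistrationThreshold", 10);
    int fileCount = dataset.getFiles().size();

    if (fileCount <= threshold) {
        // Sync path: register every file PID as part of the publish itself;
        // if any registration fails, publishing fails.
        for (DataFile file : dataset.getFiles()) {
            pidProvider.registerFilePid(file);
        }
        markAsPublished(dataset);
    } else {
        // Async path: lock the dataset with an explanatory message, register the
        // PIDs in a background job (roughly 0.5 sec per file), then unlock and
        // mark the version as published.
        lock(dataset, LockReason.PID_REGISTRATION,
             "File identifiers are being registered; the dataset will be published automatically when done.");
        finalizePublicationAsync(dataset);
    }
}
```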

@kcondon
Contributor

kcondon commented May 4, 2018

Tested the above checklist from Leonid and all works as described. Also nearly completed the test checklist: https://docs.google.com/document/d/1VOlDJFRy7zoS4LVDerH5TgMYsr3DKkm9ZRNcqajitxY/edit?usp=sharing
Everything works for EZID, but Handles and DataCite do not register ids: Handle does not create them, and DataCite now creates them but does not register them, and is slow.
