Files: Persistent Identifiers for Files (DOIs for files) #2438

Closed

eaquigley opened this issue Aug 7, 2015 · 100 comments

@eaquigley
Contributor

Since we are moving towards individual pages for files, we also need to think about what the persistent identifier will be for them.

@lwo

lwo commented Sep 29, 2015

I am entering via: https://groups.google.com/forum/#!msg/dataverse-community/gtz2npccWjU/i7_EVs2LBgAJ

... I think persistent identifiers should not be derived from the local file id. An organization may want to migrate from their existing repository solution to Dataverse and be able to import their PIDs if they already have them, and then rebind them to the Dataverse dataset and future file pages. That would be possible with a String type, but not with a number.
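For illustration, a string-typed PID column that is independent of the numeric primary key could look like the minimal JPA-style sketch below; this is an assumption for the sake of the argument, not the actual Dataverse schema.

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

// Hypothetical file entity: the persistent identifier is a free-form string,
// decoupled from the numeric database primary key, so PIDs minted elsewhere
// can be imported and later rebound to the Dataverse file page.
@Entity
public class FileRecord {

    @Id
    @GeneratedValue
    private Long id;              // local, numeric database id

    @Column(unique = true)
    private String persistentId;  // e.g. "doi:10.1234/ABC123" or "hdl:1902.1/XYZ"

    public Long getId() { return id; }

    public String getPersistentId() { return persistentId; }

    public void setPersistentId(String persistentId) { this.persistentId = persistentId; }
}
```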

@djbrooke
Contributor

Sebastian brought this up on the community call 8/16 and asked if it would be included in 4.6. There are currently no plans to work on this for 4.6.

@mheppler
Contributor

mheppler commented Oct 27, 2016

Closing this issue as a duplicate of Files: Need Persistent Identifiers/URL's for Data Files #2700.

@pdurbin
Member

pdurbin commented Feb 17, 2017

A "DOIs for files" thread was just started at https://groups.google.com/d/msg/dataverse-community/JX2GLqPy_yE/6dzgXGVcCAAJ which reminds me that @djbrooke and I discussed this issue as well as #2700 last week.

In short, #2700 was more about a specific need: putting into print instructions for how to download files from an installation of Dataverse. A combination of a DOI and a file name was sufficient for that, so I'll close that issue.

This issue was originally closed as a duplicate of #2700 but this issue is actually the "DOIs for files" issue so I'll reopen it. Note that the title is "persistent identifiers for files" to be more generic than DOIs, to include Handles or whatever other schemes make sense.

@pdurbin pdurbin reopened this Feb 17, 2017
@pdurbin pdurbin changed the title Files: Persistent Identifiers for Files Files: Persistent Identifiers for Files (DOIs for files) Feb 17, 2017
@pdurbin pdurbin added User Role: Curator Curates and reviews datasets, manages permissions and removed zTriaged labels Jun 30, 2017
@scolapasta
Contributor

scolapasta commented Sep 5, 2017

Two discussions have come up recently where we would benefit from persistent IDs for files:

  • supporting the Data Citation recommendations for granularity
  • using the persistent identifiers for provenance (as provenance should never change)

@landreev
Contributor

Adding relevant info about the current state of the branch;
this is from @sekmiller via Slack on Friday:

we've seen a database connection error when the number of files on a dataset is well over 500. The logs show errors like:

Severe:   RAR5031:System Exception
java.lang.NullPointerException
    at com.sun.enterprise.resource.ConnectorXAResource.getResourceHandle(ConnectorXAResource.java:246)
    at com.sun.enterprise.resource.ConnectorXAResource.end(ConnectorXAResource.java:159)
    at com.sun.enterprise.transaction.JavaEETransactionManagerSimplified.delistResource(JavaEETransactionManagerSimplified.java:527)
If you wait long enough the dataset will eventually publish and all of the files will be registered; however, the page is never shown as locked to the user during the process, yet at the end of the process the dataset gets its lock and it doesn't go away.
I put a hack into the Dataset page init such that if a dataset is locked for PIDRegistration but there's no draft version, we try to remove the lock.
Is there a better way to approach this?

To me this looks very wrong. And I don't feel this is really about the right way of handling an exception - why are we getting that exception in the first place? Specifically, a database connection error? Are we somehow trying to register the file DOIs for all 500+ files at the same time, instead of doing it sequentially in the background?
Steve also said that

They at least get identifiers in the database. I didn't check that they are all actually registered at the EZID admin console, but a spot check shows that all the ones I have looked for there have been registered.
  • So at least it looks like the asynchronous job finishes doing what it needs to do, it just never removes the locks properly... Still, this database exception, and the mess it leaves behind, look scary to me. And I don't think we should leave it like this just because 500+ files is an "edge case".
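For context, the workaround Steve describes (checking for a PIDRegistration lock with no remaining draft version and removing it) might look roughly like the sketch below; the class and method names here are assumptions for illustration, not the actual Dataverse code.

```java
// Hypothetical sketch of the workaround described above (names are illustrative,
// not the actual Dataverse API): on dataset page init, if the dataset is still
// locked for PID registration but no draft version remains, assume the async
// publish job already finished and clear the stale lock.
void removeStaleRegistrationLock(Dataset dataset, DatasetService datasetService) {
    boolean lockedForPidRegistration = dataset.isLockedFor(LockReason.PID_REGISTRATION);
    boolean hasDraft = dataset.getLatestVersion().isDraft();

    if (lockedForPidRegistration && !hasDraft) {
        // The background job published the version but never released its lock;
        // removing it here lets the page render the published version.
        datasetService.removeLock(dataset, LockReason.PID_REGISTRATION);
    }
}
```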

@landreev
Contributor

landreev commented Apr 16, 2018

I also added two specific change requests in the review:
one for removing an Async attribute that doesn't seem to serve any purpose, and another for the part in the FinalizeDatasetPublicationCommand where we appear to be trying to delete the same lock twice...
No good reason to think either of these two things is causing trouble, but who knows.

@djbrooke djbrooke assigned landreev and unassigned sekmiller Apr 24, 2018
landreev added a commit that referenced this issue Apr 26, 2018
…n the dataset once it's locked, before the asynchronous part of the publishing process is initiated. (#2438)
landreev added a commit that referenced this issue Apr 30, 2018
…atasetPublicationCommand update the dataset object;

to avoid the condition where the first command is still holding onto the same object as it initiates the execution of the
second command, asynchronously - which can result in concurrency and optimistic lock issues.
Also, removed the redundant/unnecessary merge() calls. (#2438)
@landreev landreev assigned kcondon and unassigned landreev Apr 30, 2018
@landreev
Contributor

A quick summary of the latest developments:

  • The identifiers for files are not immediately assigned on upload; for as long as the file is in draft, it has no global id, so the database ids are used in URLs pointing to the file landing pages, etc.
  • The identifiers are assigned when the version is published.
  • If there are 10 or fewer files (the limit is configurable, but 10 is the default), the identifiers are assigned as part of the single publishing task. Meaning, you click publish, you (eventually) get the success message, and all the files should have global identifiers. (If the global ids cannot be assigned, publish should fail.)
  • If there are more than 10, it goes into the async mode (see the sketch after this list). The version becomes locked, with a message clearly explaining what's going on. While locked, the version should still appear as an unpublished draft. Once the global ids are assigned in the background, the version should get automatically unlocked and appear as published.
  • So both the sync (10 or fewer files) and async modes should be tested separately.
  • Publishing datasets with large numbers of files should definitely be tested carefully. The expected behavior is that it should work with any number of files, with the async part simply taking longer (the rule of thumb is about 0.5 sec per file to register the global id). But the last issue we ran into was that with more than 500 files or so we hit a weird concurrency issue: a conflict between the first (synchronous) Publish command and the asynchronous part (the FinalizePublish command) that was messing up the locking and/or unlocking and sometimes preventing the dataset from getting published. It appears to have been fixed, and everything should be working as advertised now.
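To make the sync/async split above concrete, here is a rough sketch of the publish-time decision; every name in it (settings key, lock reason, helper methods) is an assumption for illustration only, not the actual Dataverse implementation.

```java
// Illustrative sketch of the publish-time decision described above; the names,
// settings key, and helpers are assumptions, not the actual Dataverse code.
void publish(Dataset dataset) {
    // Configurable threshold; 10 files is the default described in the summary.
    int threshold = settings.getInt("FilePIDAsyncRegistrationThreshold", 10);
    int fileCount = dataset.getFiles().size();

    if (fileCount <= threshold) {
        // Sync path: register every file PID as part of the publish itself;
        // if any registration fails, publishing fails.
        for (DataFile file : dataset.getFiles()) {
            pidProvider.registerFilePid(file);
        }
        markAsPublished(dataset);
    } else {
        // Async path: lock the dataset with an explanatory message, register the
        // PIDs in a background job (roughly 0.5 sec per file), then unlock and
        // mark the version as published.
        lock(dataset, LockReason.PID_REGISTRATION,
             "File identifiers are being registered; the dataset will be published automatically when done.");
        finalizePublicationAsync(dataset);
    }
}
```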

@kcondon
Contributor

kcondon commented May 4, 2018

Tested the above checklist from Leonid and all works as described. Also nearly completed the test checklist: https://docs.google.com/document/d/1VOlDJFRy7zoS4LVDerH5TgMYsr3DKkm9ZRNcqajitxY/edit?usp=sharing
Everything works for EZID, but Handles and DataCite do not register ids: Handle does not create them, and DataCite now creates them but does not register them, and is slow.
