Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DV cannot parse DOIs that don't have a separator in authority/prefix #3583

Closed
tdilauro opened this issue Jan 20, 2017 · 55 comments
Closed

DV cannot parse DOIs that don't have a separator in authority/prefix #3583

tdilauro opened this issue Jan 20, 2017 · 55 comments

Comments

@tdilauro
Copy link
Contributor

I encountered a serious problem related to retaining our DOIs in Step 4 (batch migrate) of the migration instructions.

The problem appears to originate here. The code is overly restrictive, returning false (error) for perfectly legit DOIs (e.g., doi:10.7281/T1J10120).

This is an urgent bug for us, as I have limited time to get this migration completed and this may be a showstopper.

@djbrooke
Copy link
Contributor

Hey @tdilauro - @scolapasta is running point on migration issues and I'll make sure he checks this out today.

I'll also tag @landreev as it looks like he added code to handle what you mention in a commit that was in 4.5: 7ba9194

Anyway, we'll be in touch as soon as possible.

@djbrooke
Copy link
Contributor

Almost forgot! We should also check with two of the recent upgraders from the Dataverse community - @donsizemore and @4tikhonov - to see if they have any helpful info here.

pdurbin added a commit to pdurbin/dataverse that referenced this issue Jan 20, 2017
I'm able to migrate a dataset with a DOI of "doi:10.7281/T1J10120" but
deprecated constructors are that throw exceptions when a DOI like this
is used. It's unclear how well supported DOIs like this are.
@pdurbin
Copy link
Member

pdurbin commented Jan 20, 2017

Hi @tdilauro I see from your post at https://groups.google.com/d/msg/dataverse-migration-wg/bFYugiKVvd8/IpsLxRB6AwAJ that you were having trouble migrating datasets with DOIs such as "doi:10.7281/T1J10120" to Dataverse 4.3 but can you please try this on Dataverse 4.6? I'm getting different behavior than you reported there. I'm not getting "IllegalArgumentException: Failed to parse identifier: doi:10.7281/T1J10120" but I also don't want to get your hopes up that DOIs like this are completely supported. I suspect there still may be an issue with DOIs like this. We'll see. 😄

On a related note, I played around with the code a bit and pushed some scratch work to pdurbin@ca6ecb4 that I'd be happy to review with @scolapasta @landreev @sekmiller and others.

@scolapasta
Copy link
Contributor

scolapasta commented Jan 20, 2017 via email

@tdilauro
Copy link
Contributor Author

Thanks @pdurbin, @djbrooke, and @scolapasta.

@scolapasta: A key question for me right now is for which DV versions the migration tools and process are supported. I have performed the migration several times against 4.3 and below, but not of the most current releases that supports migration directly from 3.6.3.

@djbrooke
Copy link
Contributor

Hey @tdilauro - I know that you and @scolapasta have been communicating about this, so feel free to leave any helpful notes or scripts here. If this is something that other installations will run into, we should share the knowledge if helpful. Thanks!

@pdurbin
Copy link
Member

pdurbin commented Feb 1, 2017

@tdilauro and I discussed this issue briefly at http://irclog.iq.harvard.edu/dataverse/2017-02-01#i_47985

@tdilauro
Copy link
Contributor Author

tdilauro commented Feb 1, 2017

I was able to work around this issue in 4.3.1 to complete our migration by stripping the DOIs from the 3.6.3-exported DDI and feeding this to batch migration. Since our Handle prefix was not real and not registered, I needed only replace the Handles with their corresponding DOIs in the dataset table once the batch migration was complete. Using the original exported DDI, I was able to create a mapping from the Handles to their corresponding DOIs for each of the datasets. Using the same information, I created a script to rename (mv) the associated directories.

My migration path was 3.6.2 -> 4.3.1, followed by a series of upgrades from 4.3.1 -> ... -> 4.6. I did not have time to test a migration directly from 3.6.2 to 4.6. So, I'm not sure if this issue is resolved or not.

@pdurbin
Copy link
Member

pdurbin commented Feb 1, 2017

@tdilauro I think there's still an issue here. Please take a look at https://github.com/IQSS/dataverse/blob/v4.6/src/test/java/edu/harvard/iq/dataverse/GlobalIdTest.java#L127 for example, which looks like this:

    @Test
    public void testBadIdentifierTwoParts(){
        System.out.println("testBadIdentifierTwoParts");

        exception.expect(IllegalArgumentException.class);
        exception.expectMessage("Failed to parse identifier: doi:2part/blah");
        new GlobalId("doi:2part/blah");
}

@tdilauro you identifiers have two parts (i.e. doi:10.7281/T1J10120) and it sounds like the Dataverse code should be able to parse them.

@pdurbin
Copy link
Member

pdurbin commented Feb 2, 2017

Yesterday @scolapasta mentioned "The shoulder separator should not be part of authority but rather the identifier per examples on EZID site" from #898 which seems related. That issue also references http://ezid.cdlib.org/learn/id_concepts

@tdilauro
Copy link
Contributor Author

tdilauro commented Feb 2, 2017

Thanks for making that connection to #898, @pdurbin. I'll add some comments there.

@qqmyers
Copy link
Member

qqmyers commented Apr 26, 2018

FWIW - We have a working fix for using shoulders that don't end in '/' in QDR (and very strangely our internal issue # was 898 also!). I didn't send it at the time since I left the hardcoding that assumes '/' is the doiSeparator (so it's only a partial fix that doesn't address other separators). It works with the sequential number and random string identifiers and is backward compatible in that existing identifiers (of the form 10.5072/FK2/ABCDEF) remain stored as is (with the shoulder as part of the authority) while ones without a final slash separate the authority and shoulder (which is stored as part of the identifier field - 10.5072 and FK2ABCDEF) in the DB. A new DoiShoulder key allows you to set the shoulder. If a PR is useful, let me know. (We weren't upgrading from 3.x but I think a DB update could migrate old DOIs if that's still useful).

@pdurbin
Copy link
Member

pdurbin commented Apr 26, 2018

@qqmyers yes, please make a pull request so we can debate the approach, at least. Thanks!

@djbrooke
Copy link
Contributor

Sending over to @scolapasta based on a brief discussion

@pdurbin pdurbin removed the User Role: Sysadmin Installs, upgrades, and configures the system, connects via ssh label May 2, 2018
@scolapasta
Copy link
Contributor

@qqmyers thanks for the PR and I think it's a good start to the solution, which is what #898 is all about (or at least, evolved into). That solution being that the shoulder should be stored as part of the identifier always (with our without a separator).

While we normally do favor small batches, in this case I worry that will make migration instructions challenging when we do get the complete fix for #898, since this would create a schism in installations, some with shoulders stored as part of the authority, some with shoulders stored as part of the identifier (and possibly some with both! if they changed their structure at some point)

I'm willing to be convinced otherwise, but until then I lean not testing / merging this PR.

Regardless, it's good to see that others are on the same page about storing the shoulder as part of the identifier and your comments and work on this can hopefully help me champion #898 as a priority (as it helps the goal of reducing forks) for us here, or encourage someone in the community to tackle the complete solution.

@qqmyers
Copy link
Member

qqmyers commented May 4, 2018

@scolapasta - I haven't checked thoroughly, but I think existing entries could be migrated to move the shoulder into the identifier with a db migration and a minor update to DatasetServiceBean.findByGlobalId() (there's ICPSR specific code there to handle identifiers with only one separator character that would instead be used all the time). I didn't make this change when we made the update for QDR to avoid a db migration step, but this change would make all ids stored with authority and 'shoulder+' in the database. If that's useful, let me know and I can do a quick test (we have '10.5072/FK2' authority entries in QDR's test databases, so I can migrate those and try the updated code I'm suggesting.).

If that's not a helpful step forward, that's fine. I didn't submit this back when we developed it for QDR because it was clear that more work would be needed to support other separators/ change of separator over time.

@scolapasta
Copy link
Contributor

I agree that your proposed changes sound like what is needed for #898 and if you have the time to work on that, I would be very happy to help and review, as needed.

@scolapasta
Copy link
Contributor

Identifier still needs to be removed from dataset, but I've asked @sekmiller to just go in and extract it as I'm hoping to get this to QA today.

The only other thing is how we have duplicate code for parsing id's in DVObjectServiceBean and in global id. Would be good to consolidate at some point, but we can tackle separately* (unless for some reason QA takes longer than expected).

(*) if you think you can get to it this afternoon, then feel free to tackle now.

@scolapasta scolapasta removed their assignment May 23, 2018
@kcondon kcondon self-assigned this May 23, 2018
@qqmyers
Copy link
Member

qqmyers commented May 23, 2018

Try this. The only real difference I saw was that GlobalId strips any whitespace, semicolon and single quote from authority and identifier. I left that in and merged the better error checking in from DvObjectServiceBean. So you create a GlobalId from the a string, and then DvObjectServiceBean just reads the parsed protocol, authority, and identifier from it to populate the query.

@scolapasta
Copy link
Contributor

Perfect! Thanks.

@kcondon
Copy link
Contributor

kcondon commented May 24, 2018

Issues so far:

  • setup-all.sh needs to set :Authority and :DoiShoulder (Stephen)
  • update db script does not migrate multiple doi providers correctly.
  • docs need to scrub mention of :DoiSeparator and Add :DoiShoulder (Derek?)
  • setup-optional-harvard.sh was modified to add prod :Authority and :Shoulder values but did not do so previously.

@qqmyers
Copy link
Member

qqmyers commented May 25, 2018

setup-all.sh can drop setting the doiSeparator too.

sekmiller added a commit to QualitativeDataRepository/dataverse that referenced this issue May 25, 2018
pdurbin added a commit to QualitativeDataRepository/dataverse that referenced this issue May 25, 2018
@pdurbin
Copy link
Member

pdurbin commented May 25, 2018

@sekmiller and I just worked on 83a76a0 but I think it's weird that :DoiShoulder continues to be a misnomer:

screen shot 2018-05-25 at 10 44 34 am

@qqmyers
Copy link
Member

qqmyers commented May 25, 2018

The Handle ID generation code does not include the shoulder, so I think it is now DOI specific. The connection with Handles and the requirement to be '/' were both from when it was also used as the authority - identifier separator.

@pdurbin
Copy link
Member

pdurbin commented May 25, 2018

@qqmyers great! If you're confident that the line about the misnomer can be safely removed, can you please go ahead? Thanks!

@qqmyers
Copy link
Member

qqmyers commented May 25, 2018

Sorry - it is still a misnomer, but no longer has to be a '/'. (Regardless of protocol, identifiers are generated the same way in the DatasetServiceBean and DataFileServiceBean, which is then passed to the relevant Id bean for generation.)

@pdurbin
Copy link
Member

pdurbin commented May 25, 2018

@qqmyers thanks for taking a look and for fixing up the docs at cf7a720. It sounds like :DoiShoulder is still a misnomer, which is unfortunate, since it's a new setting.

@qqmyers
Copy link
Member

qqmyers commented May 25, 2018

Looks like there are 18 uses of 'DoiShoulder' in git - I can just go through and change them all ...

@pdurbin
Copy link
Member

pdurbin commented May 25, 2018

@qqmyers judging from baa9052 you're done already! Thanks!!

pdurbin added a commit to QualitativeDataRepository/dataverse that referenced this issue May 31, 2018
pdurbin added a commit to QualitativeDataRepository/dataverse that referenced this issue May 31, 2018
@pdurbin
Copy link
Member

pdurbin commented May 31, 2018

@kcondon reported that he doesn't want :Authority or :Shoulder to be set by the setup-optional-harvard.sh script so I fixed this in d497dee . I also merged the latest from develop into the branch in 5478470.

sekmiller added a commit to QualitativeDataRepository/dataverse that referenced this issue May 31, 2018
sekmiller added a commit to QualitativeDataRepository/dataverse that referenced this issue May 31, 2018
@kcondon kcondon closed this as completed Jun 1, 2018
@kcondon kcondon removed the Status: QA label Jun 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants