-
Notifications
You must be signed in to change notification settings - Fork 490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DV cannot parse DOIs that don't have a separator in authority/prefix #3583
Comments
Hey @tdilauro - @scolapasta is running point on migration issues and I'll make sure he checks this out today. I'll also tag @landreev as it looks like he added code to handle what you mention in a commit that was in 4.5: 7ba9194 Anyway, we'll be in touch as soon as possible. |
Almost forgot! We should also check with two of the recent upgraders from the Dataverse community - @donsizemore and @4tikhonov - to see if they have any helpful info here. |
I'm able to migrate a dataset with a DOI of "doi:10.7281/T1J10120" but deprecated constructors are that throw exceptions when a DOI like this is used. It's unclear how well supported DOIs like this are.
Hi @tdilauro I see from your post at https://groups.google.com/d/msg/dataverse-migration-wg/bFYugiKVvd8/IpsLxRB6AwAJ that you were having trouble migrating datasets with DOIs such as "doi:10.7281/T1J10120" to Dataverse 4.3 but can you please try this on Dataverse 4.6? I'm getting different behavior than you reported there. I'm not getting "IllegalArgumentException: Failed to parse identifier: doi:10.7281/T1J10120" but I also don't want to get your hopes up that DOIs like this are completely supported. I suspect there still may be an issue with DOIs like this. We'll see. 😄 On a related note, I played around with the code a bit and pushed some scratch work to pdurbin@ca6ecb4 that I'd be happy to review with @scolapasta @landreev @sekmiller and others. |
Thanks @pdurbin, @djbrooke, and @scolapasta. @scolapasta: A key question for me right now is for which DV versions the migration tools and process are supported. I have performed the migration several times against 4.3 and below, but not of the most current releases that supports migration directly from 3.6.3. |
Hey @tdilauro - I know that you and @scolapasta have been communicating about this, so feel free to leave any helpful notes or scripts here. If this is something that other installations will run into, we should share the knowledge if helpful. Thanks! |
@tdilauro and I discussed this issue briefly at http://irclog.iq.harvard.edu/dataverse/2017-02-01#i_47985 |
I was able to work around this issue in 4.3.1 to complete our migration by stripping the DOIs from the 3.6.3-exported DDI and feeding this to batch migration. Since our Handle prefix was not real and not registered, I needed only replace the Handles with their corresponding DOIs in the dataset table once the batch migration was complete. Using the original exported DDI, I was able to create a mapping from the Handles to their corresponding DOIs for each of the datasets. Using the same information, I created a script to rename (mv) the associated directories. My migration path was 3.6.2 -> 4.3.1, followed by a series of upgrades from 4.3.1 -> ... -> 4.6. I did not have time to test a migration directly from 3.6.2 to 4.6. So, I'm not sure if this issue is resolved or not. |
@tdilauro I think there's still an issue here. Please take a look at https://github.com/IQSS/dataverse/blob/v4.6/src/test/java/edu/harvard/iq/dataverse/GlobalIdTest.java#L127 for example, which looks like this:
@tdilauro you identifiers have two parts (i.e. doi:10.7281/T1J10120) and it sounds like the Dataverse code should be able to parse them. |
Yesterday @scolapasta mentioned "The shoulder separator should not be part of authority but rather the identifier per examples on EZID site" from #898 which seems related. That issue also references http://ezid.cdlib.org/learn/id_concepts |
FWIW - We have a working fix for using shoulders that don't end in '/' in QDR (and very strangely our internal issue # was 898 also!). I didn't send it at the time since I left the hardcoding that assumes '/' is the doiSeparator (so it's only a partial fix that doesn't address other separators). It works with the sequential number and random string identifiers and is backward compatible in that existing identifiers (of the form 10.5072/FK2/ABCDEF) remain stored as is (with the shoulder as part of the authority) while ones without a final slash separate the authority and shoulder (which is stored as part of the identifier field - 10.5072 and FK2ABCDEF) in the DB. A new DoiShoulder key allows you to set the shoulder. If a PR is useful, let me know. (We weren't upgrading from 3.x but I think a DB update could migrate old DOIs if that's still useful). |
@qqmyers yes, please make a pull request so we can debate the approach, at least. Thanks! |
Sending over to @scolapasta based on a brief discussion |
@qqmyers thanks for the PR and I think it's a good start to the solution, which is what #898 is all about (or at least, evolved into). That solution being that the shoulder should be stored as part of the identifier always (with our without a separator). While we normally do favor small batches, in this case I worry that will make migration instructions challenging when we do get the complete fix for #898, since this would create a schism in installations, some with shoulders stored as part of the authority, some with shoulders stored as part of the identifier (and possibly some with both! if they changed their structure at some point) I'm willing to be convinced otherwise, but until then I lean not testing / merging this PR. Regardless, it's good to see that others are on the same page about storing the shoulder as part of the identifier and your comments and work on this can hopefully help me champion #898 as a priority (as it helps the goal of reducing forks) for us here, or encourage someone in the community to tackle the complete solution. |
@scolapasta - I haven't checked thoroughly, but I think existing entries could be migrated to move the shoulder into the identifier with a db migration and a minor update to DatasetServiceBean.findByGlobalId() (there's ICPSR specific code there to handle identifiers with only one separator character that would instead be used all the time). I didn't make this change when we made the update for QDR to avoid a db migration step, but this change would make all ids stored with authority and 'shoulder+' in the database. If that's useful, let me know and I can do a quick test (we have '10.5072/FK2' authority entries in QDR's test databases, so I can migrate those and try the updated code I'm suggesting.). If that's not a helpful step forward, that's fine. I didn't submit this back when we developed it for QDR because it was clear that more work would be needed to support other separators/ change of separator over time. |
I agree that your proposed changes sound like what is needed for #898 and if you have the time to work on that, I would be very happy to help and review, as needed. |
Identifier still needs to be removed from dataset, but I've asked @sekmiller to just go in and extract it as I'm hoping to get this to QA today. The only other thing is how we have duplicate code for parsing id's in DVObjectServiceBean and in global id. Would be good to consolidate at some point, but we can tackle separately* (unless for some reason QA takes longer than expected). (*) if you think you can get to it this afternoon, then feel free to tackle now. |
Try this. The only real difference I saw was that GlobalId strips any whitespace, semicolon and single quote from authority and identifier. I left that in and merged the better error checking in from DvObjectServiceBean. So you create a GlobalId from the a string, and then DvObjectServiceBean just reads the parsed protocol, authority, and identifier from it to populate the query. |
Perfect! Thanks. |
Issues so far:
|
setup-all.sh can drop setting the doiSeparator too. |
@sekmiller and I just worked on 83a76a0 but I think it's weird that |
The Handle ID generation code does not include the shoulder, so I think it is now DOI specific. The connection with Handles and the requirement to be '/' were both from when it was also used as the authority - identifier separator. |
@qqmyers great! If you're confident that the line about the misnomer can be safely removed, can you please go ahead? Thanks! |
Sorry - it is still a misnomer, but no longer has to be a '/'. (Regardless of protocol, identifiers are generated the same way in the DatasetServiceBean and DataFileServiceBean, which is then passed to the relevant Id bean for generation.) |
Looks like there are 18 uses of 'DoiShoulder' in git - I can just go through and change them all ... |
I encountered a serious problem related to retaining our DOIs in Step 4 (batch migrate) of the migration instructions.
The problem appears to originate here. The code is overly restrictive, returning false (error) for perfectly legit DOIs (e.g., doi:10.7281/T1J10120).
This is an urgent bug for us, as I have limited time to get this migration completed and this may be a showstopper.
The text was updated successfully, but these errors were encountered: