-
Notifications
You must be signed in to change notification settings - Fork 490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Permalinks in S3 storage identifiers: treatment of slashes #10771
Comments
FWIW: I haven't had time to look into this, but I don't think permalinks starting with https:// or having double slashes in general was considered during the design (versus having a base-url and having a URL form like DOIs with https://doi.org/ ). There are probably some parts of the code that assume the last // is a special separator in the storageidentifier, e.g. that everything before it is the bucket name, which I think would break with URLs as permalink identifiers. |
I see. I actually didn't consider the I tried again to configure my Permalink PID provider using that setting. The storage identifier looks fine then (no double slashes). However, the permalink itself is now Is there an intended way to get permalinks that look like the latter? |
Nothing implemented I think. The only use case we had for the initial development was for locally minted ones. In other places (e.g. external controlled vocab), we've set up settings with simple ways to substitute values in to create URLs. So here perhaps something like *.resolver-format={pid} (or "https://example.com/mypermalinks/{pid}" - if there are cases where always prepending the base-url doesn't work) where {pid} would be the bare numeric PID. That, or something similar, would be consistent with the earlier design discussions - I'd be happy to review any PR along those lines or discuss further if you think something else is needed. |
I've just pushed a commit here with an idea of how the handling of the This would allow more flexible permalink formats, not limited to permalinks including the hardcoded "/citation?persistentId=" string. If a I've also updated the (By the way, I think this would also fix a majority of the other issues we had with permalinks #10768 and #10769) |
Thanks - looks good to me and simpler than what I suggested. Your solution assumes the PID will always be at the end of the URL to resolve it, which could be true in practice. If not, we can look for options if/when someone has that use case. |
@vera When you create the PR could you update the documentation with a hint to not use |
Sounds good. The only other issue I see with my suggestion is, it requires updating the Permalink configuration for all installations that have a |
Yeah, probably rare at this point. Since this involves MicroProfile settings, which could be jvm-options or something else, I'd suggest just adding a release note in the PR and pointing out that anyone with a perma base-url configured needs to add ?citation... to it (or whatever the correct change is) when upgrading (and avoiding giving specific instructions for the various MicroProfile options or trying to automate the change in some way). |
If you want to go farther, I think it would make sense to actually consider these errors and block them. I'd also suggest some warning about using chars that have special meaning in URLs - I think things like ? and & would break resolver URLs, XML special chars might break xml metadata exports, etc. Perhaps something vague like "In general, PermaLink authority/shoulder values should be alphanumeric. For other cases, admins may need to consider the potential impact of special characters in resolver URLs, exports, etc. |
Just opened a PR: #10775 For now, we've just added a warning in the docs about special characters in authority/shoulder values, as you suggested. |
After creating a dataset in a Dataverse configured for Permalink PIDs and S3 storage, I noticed that the dataset's storage identifier looks like this:
s3://https://example.com/mypermalinks//XYZ...
(where
https://example.com/mypermalinks/
is the authority I have configured andXYZ...
is the unique part of the identifier)In this example, the Permalink PID that is displayed in the Dataverse UI is
https://example.com/mypermalinks/XYZ...
.Note that the storage identifier contains a double slash preceding the unique part, but the PID does not. Is this a bug in the storage identifier generation?
In my S3 management console as well as using the S3 command line tool
s3cmd
, the double slashes (both the "buggy" slash and the one inhttps://
) are interpreted as a kind of "folder" structure that is weird to navigate.Here is a snippet from the Amazon S3 docs about object key names:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
Is this creation of a folder structure in S3 intended? If yes, I think the folder structure should be created in a way that makes more sense (e.g. don't create a weird empty-name folder because of the double slash at the start of a URL (
https://...
)).In any case, I think it might make sense to URL-encode some or all of the slashes inside Permalink PIDs when generating the storage identifier, as suggested by the Amazon docs (e.g. set a storage identifier like
s3://https%3A%2F%2Fexample.com%2Fmypermalinks%2F%2FXYZ...
(< this would be an option that wouldn't create any nested folder structure inside the S3 bucket)).What do you think?
The text was updated successfully, but these errors were encountered: