Fix handling of storageidentifiers in dataverse_json harvests #7736
Comments
Note that #7325 uses storageidentifiers of this type/form. The use there should probably be consistent with anything done for harvested datasets (or at least not be incompatible with it).
Reading that PR again, I think the final design defines a storage type of 'http', but the storageidentifiers would use the store's label, as with other types, so they would look like trsa:// rather than https://. So there is no direct conflict, but there is a chance for confusion through the current type name (we could rename the type to 'remote'). There is a potential for issues in the storage code overall, though: the code assumes a format of `<storage driver label>://` when trying to identify the right StorageIO class to use, so http(s) would have to become reserved words.
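To make the collision concrete, here is a minimal sketch of the kind of prefix parsing described above, where the driver label is everything before `://`. The class and method names (`StorageIdentifierUtil`, `parseDriverLabel`) are illustrative, not the actual Dataverse code.

```java
public class StorageIdentifierUtil {
    static final String SEPARATOR = "://";

    // Returns the driver label portion of "<label>://<location>",
    // or null when the identifier has no prefix at all
    // (e.g. identifiers harvested from a pre-4.20 Dataverse).
    public static String parseDriverLabel(String storageIdentifier) {
        int idx = storageIdentifier.indexOf(SEPARATOR);
        return idx < 0 ? null : storageIdentifier.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(parseDriverLabel("s3://bucket/key"));      // s3
        System.out.println(parseDriverLabel("trsa://remote/object")); // trsa
        // A remote URL imported verbatim parses as the driver label "https",
        // which is why "http(s)" would have to become reserved words:
        System.out.println(parseDriverLabel("https://dataverse.tdl.org/api/access/datafile/12345")); // https
    }
}
```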
I agree, it could be confusing, at least for a human looking at the database entries. But I don't think it should lead to any real conflicts, even if you define a storage of type "http" with the actual label "http" (like we have with "file:" and "s3:"), because we have other ways to unambiguously tell a harvested dvobject from a real one, without having to rely on the storageidentifier. That said, we could make it more explicit; maybe all the harvested ones should have some reserved prefix...
I am going to open an issue for reviewing and potentially refactoring this setup of "harvested files". (This current issue is still a valid case for a shorter-term fix, though.)
I have also opened #8629 for potentially redesigning the whole scheme of how we handle "harvested files". But this issue should be straightforward enough that we should just go ahead and fix it.
My guess is this is a 10.
Sizing:
Priority Review with Stefano:
…s harvested in the proprietary json format. #7736
Dataverses can harvest metadata from each other in our custom JSON format; our own proprietary export and import are used in the process.
A side effect of this method is that the storageidentifier, i.e. the physical location of the datafile on the remote installation, ends up being imported verbatim into the harvesting Dataverse, instead of the URL of the download API on the remote end.
We now have all these recently harvested files with strange storageidentifiers, like this one:
(no driver prefix; must have been harvested from a pre-4.20 Dataverse)
or this:
(a file in somebody's S3 storage bucket...)
These are of course completely useless for a harvesting installation. We want to handle these the same way as when we harvest DDI between Dataverses, i.e., the dvobjects for these harvested files need to be created with the remote download API URL in the storageidentifier field (which would be
https://dataverse.tdl.org/api/access/datafile/something-something
for the last file above, for example...)
Aside from these entries being useless as imported, this is not urgent, in that we don't use these remote locations for any practical purpose as of now. (That is why we haven't noticed until now.) But it's still messy. (I was completely weirded out when I saw the ones like the first one above; that looked entirely like a local storageidentifier that somehow got created without a driver prefix...)