In exporttree mode store information about file id for each file #146
Right, in Dataverse we call this "replacing" a file: https://guides.dataverse.org/en/5.11/user/dataset-management.html#replace-files and https://guides.dataverse.org/en/5.11/api/native-api.html#replacing-files Before we had the concept of replacing a file, people would simply delete the old file, upload a new one, and publish. (And lots of people still do this.) With the "replace file" functionality, we store the following:
For a little more context (from "export a dataset as native JSON" where you can see info about files):
@pdurbin is there an API to request a file simply based on its MD5 checksum?
@yarikoptic no. File download requests are done by database id or DOI, if available. (DOIs for files are off by default.) My first thought is: what about empty files like …
Yes. The filename in a git-annex repo is contained within the "git tree", which then symlinks to the content. So if we could query content (which could be delivered without any content-disposition filename; the name could be anything, e.g. the checksum), that would be very cool! Does the Dataverse DB schema make it feasible to add requesting content based on MD5?
@pdurbin, could you shine some light on this with respect to identifiers:
I suppose:
@yarikoptic one could certainly query the Dataverse database for files based on checksum. The table in question is "datafile": https://guides.dataverse.org/en/5.6/schemaspy/tables/datafile.html A query would look something like this:
If you're really interested in this feature, please go ahead and create an issue at https://github.com/IQSS/dataverse/issues . I assume you'd want the "File Access" API to support checksums. @bpoldrack yes. Right, at the file level, I wouldn't worry about …
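The database lookup pdurbin describes could be sketched as follows. The column names are taken from the schemaspy page linked above; the actual Dataverse database is PostgreSQL, so the in-memory SQLite engine and the sample row below are stand-ins just to make the sketch self-contained:

```python
import sqlite3

# Stand-in for the Dataverse "datafile" table; real column set is larger.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datafile (id INTEGER, checksumtype TEXT, checksumvalue TEXT)"
)
conn.execute(
    "INSERT INTO datafile VALUES (42, 'MD5', '03f09045f3343b776f7403a43e14341e')"
)

# The lookup itself: find a file's database id from its MD5 checksum.
query = """
SELECT id FROM datafile
WHERE checksumtype = 'MD5' AND checksumvalue = ?
"""
rows = conn.execute(query, ("03f09045f3343b776f7403a43e14341e",)).fetchall()
print(rows)  # [(42,)]
```

Note that this only illustrates the query shape; there is no public Dataverse API exposing such a checksum lookup, which is the point of the feature request.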
@bpoldrack Yes, … The database ids are not truly persistent, of course; but they are definitely unique, and for all practical purposes you can assume they will stay constant for the lifetime of a Dataverse installation. @yarikoptic I'm still trying to understand your use case, specifically why it is necessary to access the file by its checksum. Is it just to be able to produce the URLs for accessing the content of the files, without having to make any intermediate lookups via our API? (Or ...?)
Thanks, @pdurbin and @landreev! Re looking up files based on their MD5: from my point of view this is semi-important for us. On the one hand, it would be kinda nice, since MD5 is DataLad's default annex backend, and being able to look up annex keys directly makes things smooth. However, users can decide to use some other hash instead, and then this lookup doesn't help us much. In any case: per-file lookup likely wouldn't scale nicely, and as bulk info upfront we can already get it, it seems.
Correct, you get the assigned database ids of all the files as you deposit them. So no extra lookups are really needed. |
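Given a file's database id, the download URL can be assembled without any further lookup; the endpoint below is the Dataverse Data Access API's datafile endpoint, and the demo installation base URL is just a placeholder:

```python
def datafile_access_url(base_url: str, file_id: int) -> str:
    """Build a Dataverse file-access URL from a file's database id."""
    return f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"

url = datafile_access_url("https://demo.dataverse.org", 42)
print(url)  # https://demo.dataverse.org/api/access/datafile/42
```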
I'm cleaning up PR #147 before it's ready. Meant to close this issue with it, but not sure about it anymore. I implemented an approach where we do indeed use Dataverse's database IDs and keep track of them via git-annex's … While the special remote implementation is kinda capable of that, if I now go back one commit and try to get that "old" key, I found no way to make annex even try with an export remote. So the only option I see is setting a URL for a key when it is removed (from annex's POV, but not really, since it is published). However, the utility and validity of this depends, I think. So, not sure how to proceed in that regard.

With respect to "keeping track of the ID": done, and in that sense this could be closed. It's already useful without providing easy annex-based access to old content. (Note that the ID tracking is used in non-export mode as well.) WDYT? How does it work in the S3 case you mentioned?
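Tracking a per-key Dataverse id is what the external special remote protocol's SETSTATE/GETSTATE messages are for. The sketch below illustrates the idea only; the dict-backed stub stands in for the real git-annex process that a library like annexremote would talk to:

```python
class StubAnnex:
    """Stand-in for the git-annex side of SETSTATE/GETSTATE."""

    def __init__(self):
        self._state = {}

    def setstate(self, key: str, value: str) -> None:
        # SETSTATE Key Value: persist remote-specific state for this key.
        self._state[key] = value

    def getstate(self, key: str) -> str:
        # GETSTATE Key: git-annex replies with "" when nothing is stored.
        return self._state.get(key, "")

annex = StubAnnex()
key = "MD5E-s13681254--03f09045f3343b776f7403a43e14341e.nii.gz"
annex.setstate(key, "42")   # remember the Dataverse database id after upload
print(annex.getstate(key))  # 42
```

The limitation discussed above remains: having the id recorded does not by itself make annex ask an export remote for an older version of a file.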
It is unclear to me why this is still open. |
It's still open because, in order to properly address what Yarik wanted, we not only need the ids but also support for versioned export mode. Otherwise we cannot get annex to ever ask an export remote for an older version of a file. In order to support versioned export mode, we need to implement support for … And for that, I wanted to enhance …
In #148 (comment) I recorded the fact that this development only superficially sounds like "one more thing here"; it actually requires implementing support for the importtree protocol first. Not here, but in annexremote. Given that the protocol specification itself is labeled as a partially implemented draft by git-annex itself, the scope of this issue is a bit larger than what the discussion here makes it sound.
Given #148 is closed 'wont-fix', there is nothing left to do here. |
For a path which was first published in one version of a dataset on Dataverse, and then modified and published again, it should be possible to access both versions of that path through two different ids for it. It is analogous to having a versionId on a versioned S3 bucket. E.g., a demo for versionId on S3 in our test bucket (…), so you can see 2versions-removed-recreated.txt available with two different versionIds.

Looking inside how git-annex stores it (…): for the key MD5E-s13681254--03f09045f3343b776f7403a43e14341e.nii.gz it is the versionId ZQstkVt.4Szohp204gTDhytGQ7kkqwA_. Let's look in the git-annex branch for that key and how it is stored: (…) STORE-SUCCESS Key ContentIdentifier for a STOREEXPORTEXPECTED invocation. Retrieval (git annex get) should also use such a ContentIdentifier if known/provided, thus allowing access to any (possibly prior) version of any path exported to Dataverse.