
In exporttree mode store information about file id for each file #146

Closed · yarikoptic opened this issue Jul 12, 2022 · 15 comments

@yarikoptic
Member

yarikoptic commented Jul 12, 2022

  • dataverse, as described by @pdurbin, supports identification of a path's version, so each path/version pair can be uniquely identified. I.e., if I have a path that was first published in one version of a dataset on dataverse, and then it was modified and published again, it should be possible to access both versions of that path through two different ids. It is analogous to having versionId on a versioned S3 bucket.
e.g. a demo of versionId on S3 in our test bucket:
$> datalad ls -aL s3://datalad-test0-versioned/2versions-removed-recreated.txt
Bucket info:
  Versioning: {'Versioning': 'Enabled', 'MfaDelete': 'Disabled'}
     Website: datalad-test0-versioned.s3-website-us-east-1.amazonaws.com
         ACL: <Policy: [email protected] (owner) = FULL_CONTROL>
2versions-removed-recreated.txt            2015-11-07T05:23:37.000Z 8 ver:bBVSCB4MdBOeEXDQ2KwrjtrevpwFabaY  acl:<Policy: [email protected] (owner) = FULL_CONTROL>  http://datalad-test0-versioned.s3.amazonaws.com/2versions-removed-recreated.txt?versionId=bBVSCB4MdBOeEXDQ2KwrjtrevpwFabaY [OK]
2versions-removed-recreated.txt            2015-11-07T05:23:37.000Z DeleteMarker
2versions-removed-recreated.txt            2015-11-07T05:23:37.000Z 8 ver:zwW0b567gYO3puJeLMZsOmETqowJnv6l  acl:<Policy: [email protected] (owner) = FULL_CONTROL>  http://datalad-test0-versioned.s3.amazonaws.com/2versions-removed-recreated.txt?versionId=zwW0b567gYO3puJeLMZsOmETqowJnv6l [OK]
2versions-removed-recreated.txt_sameprefix 2015-11-07T05:23:37.000Z 8 ver:cfTdf3N8exZLFg.KcW5szQKrFNLUyCu1  acl:<Policy: [email protected] (owner) = FULL_CONTROL>  http://datalad-test0-versioned.s3.amazonaws.com/2versions-removed-recreated.txt_sameprefix?versionId=cfTdf3N8exZLFg.KcW5szQKrFNLUyCu1 [OK]

So you can see that 2versions-removed-recreated.txt is available with two different versionIds.

  • http://datasets.datalad.org/labs/hasson/narratives/.git is an example of a datalad dataset exported to an S3 bucket with versioning enabled, so you can see git-annex storing versionIds for the files.
Looking inside at how git-annex stores it:
$> git annex whereis sub-001/anat/sub-001_T1w.nii.gz
whereis sub-001/anat/sub-001_T1w.nii.gz (5 copies) 
  	25af5b16-0e8f-4536-a9d9-30dc416fd6fe -- yoh@falkor:/srv/datasets.datalad.org/www/labs/hasson/narratives [origin]
   	7c87d330-1f68-459a-b518-5a0d5fe5ce7b -- nastase@smaug:/mnt/datasets/incoming/nastase/narratives
   	a5565bb2-df69-4eb6-800e-125e7766a53f -- Narratives data collection on PNI server
   	a733154e-1fc3-4fa0-843f-bdf5bd7490d3 -- yoh@smaug:/mnt/datasets/datalad/crawl/labs/hasson/narratives
   	abe19045-3f76-4ec5-a673-1fd3785fa62f -- [fcp-indi]

  fcp-indi: https://s3.amazonaws.com/fcp-indi/data/Projects/narratives/sub-001/anat/sub-001_T1w.nii.gz?versionId=ZQstkVt.4Szohp204gTDhytGQ7kkqwA_
ok

$> ls -l sub-001/anat/sub-001_T1w.nii.gz
lrwxrwxrwx 1 yoh yoh 142 Mar 15  2021 sub-001/anat/sub-001_T1w.nii.gz -> ../../.git/annex/objects/7w/2z/MD5E-s13681254--03f09045f3343b776f7403a43e14341e.nii.gz/MD5E-s13681254--03f09045f3343b776f7403a43e14341e.nii.gz

So for key MD5E-s13681254--03f09045f3343b776f7403a43e14341e.nii.gz the versionId is ZQstkVt.4Szohp204gTDhytGQ7kkqwA_. Let's look in the git-annex branch to see how it is stored for that key:

$> cat ./f1a/188/MD5E-s13681254--03f09045f3343b776f7403a43e14341e.nii.gz.log.rmet
1607610479.067007s abe19045-3f76-4ec5-a673-1fd3785fa62f:V +ZQstkVt.4Szohp204gTDhytGQ7kkqwA_#data/Projects/narratives/sub-001/anat/sub-001_T1w.nii.gz
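
For illustration, the fields of that *.log.rmet line (timestamp, remote uuid with a :V suffix, then +versionId#exported-path, a layout inferred from the single line above) can be pulled apart like this; a minimal Python sketch:

    # field layout inferred from the example *.log.rmet line above
    line = ("1607610479.067007s abe19045-3f76-4ec5-a673-1fd3785fa62f:V "
            "+ZQstkVt.4Szohp204gTDhytGQ7kkqwA_#data/Projects/narratives/"
            "sub-001/anat/sub-001_T1w.nii.gz")

    timestamp, remote_field, value = line.split(" ", 2)
    remote_uuid = remote_field.rsplit(":", 1)[0]   # uuid of the S3 special remote
    version_id, exported_path = value.lstrip("+").split("#", 1)

    print(remote_uuid)    # abe19045-3f76-4ec5-a673-1fd3785fa62f
    print(version_id)     # ZQstkVt.4Szohp204gTDhytGQ7kkqwA_
    print(exported_path)  # data/Projects/narratives/sub-001/anat/sub-001_T1w.nii.gz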
  • a dataset in dataverse needs to be published before a new version of a file can be uploaded
  • git-annex docs at https://git-annex.branchable.com/git-annex-export/ say "some special remotes, notably S3, support keeping track of old versions of files stored in them. If a special remote is set up to do that, it can be used as a key/value store and the limitations in the above paragraph do not apply. ..."
  • https://git-annex.branchable.com/design/external_special_remote_protocol/export_and_import_appendix/ talks about "content identifiers", and it seems that support is as easy as reporting one (in dataverse this could be the file ID) back in STORE-SUCCESS Key ContentIdentifier for a STOREEXPORTEXPECTED invocation.
  • retrieval (i.e. for git annex get) should then also use such a ContentIdentifier if known/provided, thus allowing access to any (possibly prior) version of any path exported to dataverse

@pdurbin

pdurbin commented Jul 13, 2022

Right, in Dataverse we call this "replacing" a file: https://guides.dataverse.org/en/5.11/user/dataset-management.html#replace-files and https://guides.dataverse.org/en/5.11/api/native-api.html#replacing-files

Before we had the concept of replacing a file, people would simply delete the old file, upload a new one, and publish. (And lots of people still do this.)

With the "replace file" functionality, we store the following:

  • previousDataFileId: the file id that was replaced most recently
  • rootDataFileId: the file id that was originally replaced.

For a little more context (from "export a dataset as native JSON" where you can see info about files):

    "files": [
      {
        "label": "file.txt",
        "restricted": false,
        "version": 1,
        "datasetVersionId": 3,
        "dataFile": {
          "id": 5,
          "persistentId": "",
          "pidURL": "",
          "filename": "file.txt",
          "contentType": "text/plain",
          "filesize": 4,
          "storageIdentifier": "file://181f94173f4-5f4de5a4fd27",
          "rootDataFileId": 4,
          "previousDataFileId": 4,
          "md5": "6ddb4095eb719e2a9f0a3f95677d24e0",
          "checksum": {
            "type": "MD5",
            "value": "6ddb4095eb719e2a9f0a3f95677d24e0"
          },
          "creationDate": "2022-07-13"
        }
      }
    ]
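
To pull that per-file information in bulk, the native JSON export can be fetched via the API; a minimal Python sketch, with the installation URL and dataset DOI as hypothetical placeholders:

    import requests

    BASE = "https://demo.dataverse.org"  # hypothetical installation
    PID = "doi:10.5072/FK2/EXAMPLE"      # hypothetical dataset DOI

    resp = requests.get(
        f"{BASE}/api/datasets/export",
        params={"exporter": "dataverse_json", "persistentId": PID},
    )
    resp.raise_for_status()

    # walk the "files" list, assuming the payload is shaped like the snippet above
    for entry in resp.json()["datasetVersion"]["files"]:
        df = entry["dataFile"]
        print(df["id"], df["md5"], entry["label"])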

@yarikoptic
Member Author

@pdurbin is there an API to request a file simply based on its MD5 checksum?

@pdurbin

pdurbin commented Jul 14, 2022

@yarikoptic no. File download requests are done by database id or DOI, if available. (DOIs for files are off by default).

My first thought is: what about empty files, like the __init__.py files that Python uses? All empty files will have the same MD5 but different filenames, etc. Maybe you're only interested in the content of the file and don't care about the filename.

@yarikoptic
Member Author

Maybe you're only interested in the content of the file and don't care about the filename.

Yes. The filename in a git-annex repo is contained within the "git tree", which then symlinks to the content. So if we could query content (it could be delivered without any content-disposition filename; the name could be anything, e.g. the checksum) that would be very cool! Does the dataverse DB schema make it feasible to add requesting content based on MD5?

@bpoldrack
Member

@pdurbin, could you shine some light on this with respect to identifiers:

"files": [
      {
        "label": "file.txt",
        "restricted": false,
        "version": 1,
        "datasetVersionId": 3,
        "dataFile": {
          "id": 5,
          "persistentId": "",
          "pidURL": "",
          "filename": "file.txt",
          "contentType": "text/plain",
          "filesize": 4,
          "storageIdentifier": "file://181f94173f4-5f4de5a4fd27",
          "rootDataFileId": 4,
          "previousDataFileId": 4,
          "md5": "6ddb4095eb719e2a9f0a3f95677d24e0",
          "checksum": {
            "type": "MD5",
            "value": "6ddb4095eb719e2a9f0a3f95677d24e0"
          },
          "creationDate": "2022-07-13"
        }
      }
    ]

I suppose:

  1. id should always be there and probably is the database id, correct? Hence, it's not really guaranteed to be persistent, right?
  2. Availability of persistentId depends on whether the dataset was published with that version of the file and on the dataverse instance's proper setup to provide one, correct?
  3. storageIdentifier: I have no clue so far ;-) Any guarantees for that one?

@pdurbin

pdurbin commented Jul 18, 2022

@yarikoptic one could certainly query the Dataverse database for files based on checksum. The table in question is "datafile": https://guides.dataverse.org/en/5.6/schemaspy/tables/datafile.html

A query would look something like this:

select * from datafile where checksumtype = 'MD5' and checksumvalue = '6ddb4095eb719e2a9f0a3f95677d24e0';

If you're really interested in this feature, please go ahead and create an issue at https://github.com/IQSS/dataverse/issues . I assume you'd want the "File Access" API to support checksums.

@bpoldrack yes id is the database id of the file (id in the datafile table mentioned above). I'd say it's fairly persistent. When do primary keys change? Hopefully not very often! 😄

Right, at the file level, persistentId is only populated if you have the feature turned on and if the file is published. For this reason, it's more reliable to identify files using their id (database id, as discussed above).
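
For instance, with the database id in hand, the Data Access API can fetch the content directly; a minimal Python sketch (base URL and file id are hypothetical placeholders):

    import requests

    BASE = "https://demo.dataverse.org"  # hypothetical installation
    FILE_ID = 5                          # database id, e.g. from the native JSON above

    # /api/access/datafile/{id} streams the file content by database id
    resp = requests.get(f"{BASE}/api/access/datafile/{FILE_ID}")
    resp.raise_for_status()
    with open("file.txt", "wb") as f:
        f.write(resp.content)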

I wouldn't worry about storageIdentifier. 😄 file:// means the file is on the filesystem as opposed to S3 or Swift. You can read more about Dataverse storage options at https://guides.dataverse.org/en/5.11/installation/config.html#file-storage-using-a-local-filesystem-and-or-swift-and-or-object-stores

@landreev

@bpoldrack Yes, storageIdentifier identifies the location of the physical file in storage. Generally these are for internal use only. Note that the storageIdentifier of a file is only guaranteed to be unique within the dataset. In other words there may be multiple datafiles across a Dataverse installation with the storage identifier file://xyz; but the combinations of the persistent id of the dataset + the storage identifier will all be unique.

The database ids are not truly persistent of course; but they are definitely unique, and for all practical purposes you can assume they will stay constant for the lifetime of a Dataverse installation.

@yarikoptic I'm still trying to understand your use case, specifically why it is necessary to access the file by its checksum:

Yes. The filename in a git-annex repo is contained within the "git tree", which then symlinks to the content. So if we could query content (it could be delivered without any content-disposition filename; the name could be anything, e.g. the checksum) that would be very cool! Does the dataverse DB schema make it feasible to add requesting content based on MD5?

So is it just to be able to produce the urls for accessing the content of the files, without having to make any intermediate lookups via our API? (or ...?)

@bpoldrack
Member

Thanks, @pdurbin and @landreev!

Re looking up files based on their MD5:

From my point of view this is semi-important for us. On the one hand, it would be kinda nice, since MD5 is datalad's default annex backend, and being able to look up annex keys directly makes things smooth. However, users can decide to use some other hash instead, and then this lookup doesn't help us much. In any case, per-file lookup likely wouldn't scale nicely. And as bulk info upfront we can already get it, it seems.

@landreev

And as bulk info upfront we can already get it, it seems.

Correct, you get the assigned database ids of all the files as you deposit them. So no extra lookups are really needed.
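
For reference, a minimal sketch of such a deposit in Python, reading the assigned id back from the response (base URL, DOI, and API token are hypothetical placeholders):

    import requests

    BASE = "https://demo.dataverse.org"             # hypothetical installation
    PID = "doi:10.5072/FK2/EXAMPLE"                 # hypothetical dataset DOI
    TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # hypothetical API token

    with open("file.txt", "rb") as f:
        resp = requests.post(
            f"{BASE}/api/datasets/:persistentId/add",
            params={"persistentId": PID},
            headers={"X-Dataverse-key": TOKEN},
            files={"file": f},
        )
    resp.raise_for_status()

    # the deposit response reports the dataFile, including its database id
    file_id = resp.json()["data"]["files"][0]["dataFile"]["id"]
    print(file_id)  # track this id; no extra lookup needed later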

@bpoldrack
Member

I'm cleaning up PR #147 before it's ready. I meant to close this issue with it, but I'm not sure about that anymore:

I implemented an approach where we do indeed use dataverse's database IDs and keep track of them via git-annex's SETSTATE special remote feature. However, that does not allow us to access "old" file content in export mode, @yarikoptic.

While the special remote implementation is kinda capable of that, git-annex isn't. Consider a repo: push its state and content to a dataverse special remote in export mode and "publish" that dataset (as in the dataverse feature, not the old datalad command). Now locally remove a file, commit, and export again. The result would be a draft version on dataverse that doesn't have the file anymore. However, the file is still available from the published version and we know its ID. But:

If I now go back one commit and try to get that "old" key, I have found no way to make annex even try with an export remote.
get, fsck, checkpresent, etc.: with no option do they do anything other than firing up the special remote and exiting again right after taking note that it supports EXPORT. No PREPARE, no CHECKPRESENT, no nothing. Not even setpresent really helps. annex thinks it knows better. ;-)

So the only option I see is setting a URL for a key when it's removed (from annex's POV; not really removed, since it's still published). However, the utility and validity of this depends, I think. So I'm not sure how to proceed in that regard. With respect to "keeping track of the ID": done, and in that sense this issue could be closed. It's already useful without providing easy annex-based access to old content. (Note that the ID tracking is used in non-export mode as well.)
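
For reference, a minimal sketch of the state-tracking part with the annexremote library; the dataverse helpers are hypothetical placeholders, not the actual PR #147 code:

    from annexremote import ExportRemote

    def upload_to_dataverse(local_file, remote_file):
        # hypothetical helper: deposit the file, return the assigned database id
        raise NotImplementedError

    def download_from_dataverse(file_id, local_file):
        # hypothetical helper: fetch content by database id (any published version)
        raise NotImplementedError

    class DataverseRemote(ExportRemote):
        def transferexport_store(self, key, local_file, remote_file):
            file_id = upload_to_dataverse(local_file, remote_file)
            # SETSTATE: remember the id for this key in the git-annex branch
            self.annex.setstate(key, str(file_id))

        def transferexport_retrieve(self, key, local_file, remote_file):
            # GETSTATE: look the id up again and fetch by id rather than by path
            file_id = self.annex.getstate(key)
            download_from_dataverse(int(file_id), local_file)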

WDYT? How does it work in the S3 case you mentioned?

@mih
Member

mih commented Mar 6, 2023

It is unclear to me why this is still open.

@bpoldrack
Member

It's still open because, in order to properly address what Yarik wanted, we not only need the ids but also support for versioned export mode. Otherwise we cannot get annex to ever ask an export remote for an older version of a file. In order to support versioned export mode, we need to implement support for importtree (the special remote protocol only switches to supporting versioned special remotes when both exporttree and importtree are set to yes).

And for that, I wanted to enhance annexremote for import. I didn't manage to finish that yet.

@mih
Member

mih commented Mar 10, 2023

In #148 (comment) I recorded the fact that this development only superficially sounds like "one more thing here"; it actually requires implementing support for the importtree protocol first, not here but in annexremote. Given that the protocol specification itself is labeled a partially implemented draft by git-annex, the scope of this issue is a bit larger than the discussion here makes it sound.

@mih
Member

mih commented Mar 10, 2023

Given that #148 was closed as 'wont-fix', there is nothing left to do here.

@mih mih closed this as completed Mar 10, 2023