
In exporttree mode store information about file id for each file #146

Closed · yarikoptic opened this issue Jul 12, 2022 · 15 comments

@yarikoptic
Member

yarikoptic commented Jul 12, 2022

  • dataverse, as described by @pdurbin, supports identification of a path's version, so each path/version pair can be uniquely identified. I.e., if I have a path that was first published in one version of a dataset on dataverse, and then it was modified and published again, it should be possible to access both versions of that path through two different ids. It is analogous to having versionId on a versioned S3 bucket.
e.g. a demo of versionId on S3 in our test bucket:
$> datalad ls -aL s3://datalad-test0-versioned/2versions-removed-recreated.txt
Bucket info:
  Versioning: {'Versioning': 'Enabled', 'MfaDelete': 'Disabled'}
     Website: datalad-test0-versioned.s3-website-us-east-1.amazonaws.com
         ACL: <Policy: [email protected] (owner) = FULL_CONTROL>
2versions-removed-recreated.txt            2015-11-07T05:23:37.000Z 8 ver:bBVSCB4MdBOeEXDQ2KwrjtrevpwFabaY  acl:<Policy: [email protected] (owner) = FULL_CONTROL>  http://datalad-test0-versioned.s3.amazonaws.com/2versions-removed-recreated.txt?versionId=bBVSCB4MdBOeEXDQ2KwrjtrevpwFabaY [OK]
2versions-removed-recreated.txt            2015-11-07T05:23:37.000Z DeleteMarker
2versions-removed-recreated.txt            2015-11-07T05:23:37.000Z 8 ver:zwW0b567gYO3puJeLMZsOmETqowJnv6l  acl:<Policy: [email protected] (owner) = FULL_CONTROL>  http://datalad-test0-versioned.s3.amazonaws.com/2versions-removed-recreated.txt?versionId=zwW0b567gYO3puJeLMZsOmETqowJnv6l [OK]
2versions-removed-recreated.txt_sameprefix 2015-11-07T05:23:37.000Z 8 ver:cfTdf3N8exZLFg.KcW5szQKrFNLUyCu1  acl:<Policy: [email protected] (owner) = FULL_CONTROL>  http://datalad-test0-versioned.s3.amazonaws.com/2versions-removed-recreated.txt_sameprefix?versionId=cfTdf3N8exZLFg.KcW5szQKrFNLUyCu1 [OK]

So you can see that 2versions-removed-recreated.txt is available with two different versionIds.

  • http://datasets.datalad.org/labs/hasson/narratives/.git is an example of a datalad dataset exported to an S3 bucket with versioning enabled, so you can see git-annex storing versionIds for the files.
Looking inside at how git-annex stores it:
$> git annex whereis sub-001/anat/sub-001_T1w.nii.gz
whereis sub-001/anat/sub-001_T1w.nii.gz (5 copies) 
  	25af5b16-0e8f-4536-a9d9-30dc416fd6fe -- yoh@falkor:/srv/datasets.datalad.org/www/labs/hasson/narratives [origin]
   	7c87d330-1f68-459a-b518-5a0d5fe5ce7b -- nastase@smaug:/mnt/datasets/incoming/nastase/narratives
   	a5565bb2-df69-4eb6-800e-125e7766a53f -- Narratives data collection on PNI server
   	a733154e-1fc3-4fa0-843f-bdf5bd7490d3 -- yoh@smaug:/mnt/datasets/datalad/crawl/labs/hasson/narratives
   	abe19045-3f76-4ec5-a673-1fd3785fa62f -- [fcp-indi]

  fcp-indi: https://s3.amazonaws.com/fcp-indi/data/Projects/narratives/sub-001/anat/sub-001_T1w.nii.gz?versionId=ZQstkVt.4Szohp204gTDhytGQ7kkqwA_
ok

$> ls -l sub-001/anat/sub-001_T1w.nii.gz
lrwxrwxrwx 1 yoh yoh 142 Mar 15  2021 sub-001/anat/sub-001_T1w.nii.gz -> ../../.git/annex/objects/7w/2z/MD5E-s13681254--03f09045f3343b776f7403a43e14341e.nii.gz/MD5E-s13681254--03f09045f3343b776f7403a43e14341e.nii.gz

So for key MD5E-s13681254--03f09045f3343b776f7403a43e14341e.nii.gz the versionId is ZQstkVt.4Szohp204gTDhytGQ7kkqwA_. Let's look in the git-annex branch to see how it is stored for that key:

$> cat ./f1a/188/MD5E-s13681254--03f09045f3343b776f7403a43e14341e.nii.gz.log.rmet
1607610479.067007s abe19045-3f76-4ec5-a673-1fd3785fa62f:V +ZQstkVt.4Szohp204gTDhytGQ7kkqwA_#data/Projects/narratives/sub-001/anat/sub-001_T1w.nii.gz
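
For illustration, the fields of that *.log.rmet line (timestamp, remote uuid with a :V suffix, then +versionId#exported-path, a layout inferred from the single line above) can be pulled apart like this; a minimal Python sketch:

    # field layout inferred from the example *.log.rmet line above
    line = ("1607610479.067007s abe19045-3f76-4ec5-a673-1fd3785fa62f:V "
            "+ZQstkVt.4Szohp204gTDhytGQ7kkqwA_#data/Projects/narratives/"
            "sub-001/anat/sub-001_T1w.nii.gz")

    timestamp, remote_field, value = line.split(" ", 2)
    remote_uuid = remote_field.rsplit(":", 1)[0]   # uuid of the S3 special remote
    version_id, exported_path = value.lstrip("+").split("#", 1)

    print(remote_uuid)    # abe19045-3f76-4ec5-a673-1fd3785fa62f
    print(version_id)     # ZQstkVt.4Szohp204gTDhytGQ7kkqwA_
    print(exported_path)  # data/Projects/narratives/sub-001/anat/sub-001_T1w.nii.gz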
  • a dataset in dataverse needs to be published before a new version of a file can be uploaded
  • git-annex docs at https://git-annex.branchable.com/git-annex-export/ say "some special remotes, notably S3, support keeping track of old versions of files stored in them. If a special remote is set up to do that, it can be used as a key/value store and the limitations in the above paragraph do not apply. ..."
  • https://git-annex.branchable.com/design/external_special_remote_protocol/export_and_import_appendix/ talks about "content identifiers", and it seems that support is as easy as reporting one (in dataverse this could be the file ID) back in STORE-SUCCESS Key ContentIdentifier for a STOREEXPORTEXPECTED invocation.
  • retrieval (i.e. for git annex get) should then also use such a ContentIdentifier if known/provided, thus allowing access to any (possibly prior) version of any path exported to dataverse

@pdurbin

pdurbin commented Jul 13, 2022

Right, in Dataverse we call this "replacing" a file: https://guides.dataverse.org/en/5.11/user/dataset-management.html#replace-files and https://guides.dataverse.org/en/5.11/api/native-api.html#replacing-files

Before we had the concept of replacing a file, people would simply delete the old file, upload a new one, and publish. (And lots of people still do this.)

With the "replace file" functionality, we store the following:

  • previousDataFileId: the file id that was replaced most recently
  • rootDataFileId: the file id that was originally replaced.

For a little more context (from "export a dataset as native JSON" where you can see info about files):

    "files": [
      {
        "label": "file.txt",
        "restricted": false,
        "version": 1,
        "datasetVersionId": 3,
        "dataFile": {
          "id": 5,
          "persistentId": "",
          "pidURL": "",
          "filename": "file.txt",
          "contentType": "text/plain",
          "filesize": 4,
          "storageIdentifier": "file://181f94173f4-5f4de5a4fd27",
          "rootDataFileId": 4,
          "previousDataFileId": 4,
          "md5": "6ddb4095eb719e2a9f0a3f95677d24e0",
          "checksum": {
            "type": "MD5",
            "value": "6ddb4095eb719e2a9f0a3f95677d24e0"
          },
          "creationDate": "2022-07-13"
        }
      }
    ]
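
To pull that per-file information in bulk, the native JSON export can be fetched via the API; a minimal Python sketch, with the installation URL and dataset DOI as hypothetical placeholders:

    import requests

    BASE = "https://demo.dataverse.org"  # hypothetical installation
    PID = "doi:10.5072/FK2/EXAMPLE"      # hypothetical dataset DOI

    resp = requests.get(
        f"{BASE}/api/datasets/export",
        params={"exporter": "dataverse_json", "persistentId": PID},
    )
    resp.raise_for_status()

    # walk the "files" list, assuming the payload is shaped like the snippet above
    for entry in resp.json()["datasetVersion"]["files"]:
        df = entry["dataFile"]
        print(df["id"], df["md5"], entry["label"])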

@yarikoptic
Member Author

@pdurbin is there an API to request a file simply based on its MD5 checksum?

@pdurbin

pdurbin commented Jul 14, 2022

@yarikoptic no. File download requests are done by database id or DOI, if available. (DOIs for files are off by default).

My first thought is: what about empty files, like the __init__.py files that Python uses? All empty files will have the same MD5 but different filenames, etc. Maybe you're only interested in the content of the file and don't care about the filename.

@yarikoptic
Member Author

Maybe you're only interested in the content of the file and don't care about the filename.

Yes. The filename in a git-annex repo is contained within the "git tree", which then symlinks to the content. So if we could query content (it could be delivered without any content-disposition filename; the name could be anything, e.g. the checksum) that would be very cool! Does the dataverse DB schema make it feasible to add requesting content based on MD5?

@bpoldrack
Member

@pdurbin, could you shine some light on this with respect to identifiers:

"files": [
      {
        "label": "file.txt",
        "restricted": false,
        "version": 1,
        "datasetVersionId": 3,
        "dataFile": {
          "id": 5,
          "persistentId": "",
          "pidURL": "",
          "filename": "file.txt",
          "contentType": "text/plain",
          "filesize": 4,
          "storageIdentifier": "file://181f94173f4-5f4de5a4fd27",
          "rootDataFileId": 4,
          "previousDataFileId": 4,
          "md5": "6ddb4095eb719e2a9f0a3f95677d24e0",
          "checksum": {
            "type": "MD5",
            "value": "6ddb4095eb719e2a9f0a3f95677d24e0"
          },
          "creationDate": "2022-07-13"
        }
      }
    ]

I suppose:

  1. id should always be there and probably is the database id, correct? Hence, it's not really guaranteed to be persistent, right?
  2. Availability of persistentId depends on whether the dataset was published with that version of the file and on the dataverse instance's proper setup to provide one, correct?
  3. storageIdentifier: I have no clue so far ;-) Any guarantees for that one?

@pdurbin

pdurbin commented Jul 18, 2022

@yarikoptic one could certainly query the Dataverse database for files based on checksum. The table in question is "datafile": https://guides.dataverse.org/en/5.6/schemaspy/tables/datafile.html

A query would look something like this:

select * from datafile where checksumtype = 'MD5' and checksumvalue = '6ddb4095eb719e2a9f0a3f95677d24e0';

If you're really interested in this feature, please go ahead and create an issue at https://github.com/IQSS/dataverse/issues . I assume you'd want the "File Access" API to support checksums.

@bpoldrack yes id is the database id of the file (id in the datafile table mentioned above). I'd say it's fairly persistent. When do primary keys change? Hopefully not very often! 😄

Right, at the file level, persistentId is only populated if you have the feature turned on and if the file is published. For this reason, it's more reliable to identify files using their id (database id, as discussed above).
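
For instance, with the database id in hand, the Data Access API can fetch the content directly; a minimal Python sketch (base URL and file id are hypothetical placeholders):

    import requests

    BASE = "https://demo.dataverse.org"  # hypothetical installation
    FILE_ID = 5                          # database id, e.g. from the native JSON above

    # /api/access/datafile/{id} streams the file content by database id
    resp = requests.get(f"{BASE}/api/access/datafile/{FILE_ID}")
    resp.raise_for_status()
    with open("file.txt", "wb") as f:
        f.write(resp.content)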

I wouldn't worry about storageIdentifier. 😄 file:// means the file is on the filesystem as opposed to S3 or Swift. You can read more about Dataverse storage options at https://guides.dataverse.org/en/5.11/installation/config.html#file-storage-using-a-local-filesystem-and-or-swift-and-or-object-stores

@landreev

@bpoldrack Yes, storageIdentifier identifies the location of the physical file in storage. Generally these are for internal use only. Note that the storageIdentifier of a file is only guaranteed to be unique within the dataset. In other words there may be multiple datafiles across a Dataverse installation with the storage identifier file://xyz; but the combinations of the persistent id of the dataset + the storage identifier will all be unique.

The database ids are not truly persistent of course; but they are definitely unique, and for all practical purposes you can assume they will stay constant for the lifetime of a Dataverse installation.

@yarikoptic I'm still trying to understand your use case, specifically why it is necessary to access the file by its checksum:

Yes. The filename in a git-annex repo is contained within the "git tree", which then symlinks to the content. So if we could query content (it could be delivered without any content-disposition filename; the name could be anything, e.g. the checksum) that would be very cool! Does the dataverse DB schema make it feasible to add requesting content based on MD5?

So is it just to be able to produce the urls for accessing the content of the files, without having to make any intermediate lookups via our API? (or ...?)

@bpoldrack
Member

Thanks, @pdurbin and @landreev!

Re looking up files based on their MD5:

From my point of view this is semi-important for us. On the one hand, it would be kinda nice, since MD5 is datalad's default annex backend, and being able to look up annex keys directly makes things smooth. However, users can decide to use some other hash instead, and then this lookup doesn't help us much. In any case, per-file lookup likely wouldn't scale nicely. And as bulk info upfront we can already get it, it seems.

@landreev

And as bulk info upfront we can already get it, it seems.

Correct, you get the assigned database ids of all the files as you deposit them. So no extra lookups are really needed.
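
For reference, a minimal sketch of such a deposit in Python, reading the assigned id back from the response (base URL, DOI, and API token are hypothetical placeholders):

    import requests

    BASE = "https://demo.dataverse.org"             # hypothetical installation
    PID = "doi:10.5072/FK2/EXAMPLE"                 # hypothetical dataset DOI
    TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # hypothetical API token

    with open("file.txt", "rb") as f:
        resp = requests.post(
            f"{BASE}/api/datasets/:persistentId/add",
            params={"persistentId": PID},
            headers={"X-Dataverse-key": TOKEN},
            files={"file": f},
        )
    resp.raise_for_status()

    # the deposit response reports the dataFile, including its database id
    file_id = resp.json()["data"]["files"][0]["dataFile"]["id"]
    print(file_id)  # track this id; no extra lookup needed later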

@bpoldrack
Member

I'm cleaning up PR #147 before it's ready. I meant to close this issue with it, but I'm not sure about that anymore:

I implemented an approach where we do indeed use dataverse's database IDs and keep track of them via git-annex's SETSTATE special remote feature. However, that does not allow us to access "old" file content in export mode, @yarikoptic.

While the special remote implementation is kinda capable of that, git-annex isn't. Consider a repo: push its state and content to a dataverse special remote in export mode and "publish" that dataset (as in the dataverse feature, not the old datalad command). Now locally remove a file, commit, and export again. The result would be a draft version on dataverse that doesn't have the file anymore. However, the file is still available from the published version and we know its ID. But:

If I now go back one commit and try to get that "old" key, I have found no way to make annex even try with an export remote.
get, fsck, checkpresent, etc.: with no option do they do anything other than firing up the special remote and exiting again right after taking note that it supports EXPORT. No PREPARE, no CHECKPRESENT, no nothing. Not even setpresent really helps. annex thinks it knows better. ;-)

So the only option I see is setting a URL for a key when it's removed (from annex's POV; not really removed, since it's still published). However, the utility and validity of this depends, I think. So I'm not sure how to proceed in that regard. With respect to "keeping track of the ID": done, and in that sense this issue could be closed. It's already useful without providing easy annex-based access to old content. (Note that the ID tracking is used in non-export mode as well.)
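
For reference, a minimal sketch of the state-tracking part with the annexremote library; the dataverse helpers are hypothetical placeholders, not the actual PR #147 code:

    from annexremote import ExportRemote

    def upload_to_dataverse(local_file, remote_file):
        # hypothetical helper: deposit the file, return the assigned database id
        raise NotImplementedError

    def download_from_dataverse(file_id, local_file):
        # hypothetical helper: fetch content by database id (any published version)
        raise NotImplementedError

    class DataverseRemote(ExportRemote):
        def transferexport_store(self, key, local_file, remote_file):
            file_id = upload_to_dataverse(local_file, remote_file)
            # SETSTATE: remember the id for this key in the git-annex branch
            self.annex.setstate(key, str(file_id))

        def transferexport_retrieve(self, key, local_file, remote_file):
            # GETSTATE: look the id up again and fetch by id rather than by path
            file_id = self.annex.getstate(key)
            download_from_dataverse(int(file_id), local_file)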

WDYT? How does it work in the S3 case you mentioned?

@mih
Member

mih commented Mar 6, 2023

It is unclear to me why this is still open.

@bpoldrack
Member

It's still open because, in order to properly address what Yarik wanted, we not only need the ids but also support for versioned export mode. Otherwise we cannot get annex to ever ask an export remote for an older version of a file. In order to support versioned export mode, we need to implement support for importtree (the special remote protocol only switches to supporting versioned special remotes when both exporttree and importtree are set to yes).

And for that, I wanted to enhance annexremote for import. I didn't manage to finish that yet.

@mih
Member

mih commented Mar 10, 2023

In #148 (comment) I recorded the fact that this development only superficially sounds like "one more thing here"; it actually requires implementing support for the importtree protocol first, not here but in annexremote. Given that the protocol specification itself is labeled a partially implemented draft by git-annex, the scope of this issue is a bit larger than the discussion here makes it sound.

@mih
Member

mih commented Mar 10, 2023

Given that #148 was closed as 'wont-fix', there is nothing left to do here.

@mih mih closed this as completed Mar 10, 2023