Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datalad sensitive marking fixes #739

Merged
merged 4 commits into from
Feb 24, 2024

Conversation

bpinsard
Copy link
Contributor

Mark sensitive files added in the commit not all globbed.

Currently all globbed files in the current tree get marked as sensitive which could overwrite previously removed metas (eg after defacing), it now filter file in the commit where converted files are added.

Add metadata rather than init to fix reruns issue.

If one runs heudiconv on a branch, remove the metadata (eg after defacing) then remove that branch and then rerun the same conversion on the same dataset, it seems like the new files with same names inherit the absence of metadata from the previous files (from git-annex branch). This is unexpected as the git-annex hash should differ so .met file should not be picked-up, but there is some git-annex black magic going on here. The issue is that sensitive files do not get marked as non-sensitive.

@bpinsard bpinsard force-pushed the fix/git-annex-sensitive branch from bd91bbb to 6ba6cba Compare February 20, 2024 19:02
Copy link

codecov bot commented Feb 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (d3385d7) 81.97% compared to head (78db99e) 82.01%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #739      +/-   ##
==========================================
+ Coverage   81.97%   82.01%   +0.04%     
==========================================
  Files          41       41              
  Lines        4149     4159      +10     
==========================================
+ Hits         3401     3411      +10     
  Misses        748      748              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, thank you!

But I think we can avoid referring to last_commit since we are not even 100% certain it has happened just now, and we can explicitly have the list of just committed files ;-)

Above in a call to ds.save - get the results and go through them -- they should include all files which were add'/committ'ed . Eg.

❯ rm existed && echo "replaced" >| existed; touch newfile; python -c "import datalad.api as dl; print(dl.save('.'))"
<frozen importlib._bootstrap>:1047: ImportWarning: _SixMetaPathImporter.find_spec() not found; falling back to find_module()
add(ok): newfile (file)                                                                                                                                                              
add(ok): existed (file)                                                                                                                                                              
save(ok): . (dataset)                                                                                                                                                                
action summary:                                                                                                                                                                      
  add (ok: 2)
  save (ok: 1)
[{'action': 'add', 'path': '/tmp/testds/newfile', 'type': 'file', 'refds': '/tmp/testds', 'status': 'ok', 'message': '', 'key': 'MD5E-s0--d41d8cd98f00b204e9800998ecf8427e'}, {'action': 'add', 'path': '/tmp/testds/existed', 'type': 'file', 'refds': '/tmp/testds', 'status': 'ok', 'message': '', 'key': 'MD5E-s9--d908d26cac8092d475f40a5179ca6347'}, {'action': 'save', 'type': 'dataset', 'path': '/tmp/testds', 'refds': '/tmp/testds', 'status': 'ok'}]

so go through those with action: 'add' and it should give you all the paths which were just saved if any

@bpinsard bpinsard force-pushed the fix/git-annex-sensitive branch from f9c37d1 to 2844d54 Compare February 21, 2024 15:19
@bpinsard
Copy link
Contributor Author

I don't think it should be limited to the one added, eg. if heudiconv is run with overwrite (and overwritten annexed files are unlocked) we want to (re) mark the updated annexed files as sensitive in case the metadata was removed on the overwritten files.

if not paths:
return
lgr.debug("Marking %d files with distribution-restrictions field", len(paths))
# set_metadata can be a bloody generator
res = ds.repo.set_metadata(
paths, init=dict([("distribution-restrictions", "sensitive")]), recursive=True
paths, add=dict([("distribution-restrictions", "sensitive")]), recursive=True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I worried in case of --overwrite we might keep adding multiple distribution-restrictions=sensitive... but it seems that same value is not getting duplicated, only a new one

❯ git annex metadata --set distribution-restrictions+=sensitive sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb
metadata sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb 
  distribution-restrictions=sensitive
  distribution-restrictions-lastchanged=2024-02-24@00-12-47
  lastchanged=2024-02-24@00-12-47
ok
(recording state in git...)
❯ git annex metadata --set distribution-restrictions+=sensitive sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb
metadata sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb 
  distribution-restrictions=sensitive
  distribution-restrictions-lastchanged=2024-02-24@00-12-49
  lastchanged=2024-02-24@00-12-49
ok
(recording state in git...)
❯ git annex metadata  sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb
metadata sub-YutaMouse20_ses-YutaMouse20-140321_behavior+ecephys.nwb 
  distribution-restrictions=sensitive
  distribution-restrictions-lastchanged=2024-02-24@00-12-49
  lastchanged=2024-02-24@00-12-49

only the time stamp would be changed ... since we are filtering on only saved files -- I think that should be good

@yarikoptic
Copy link
Member

I don't think it should be limited to the one added,

here "add" is more like git add -- something was committed since was modified so it should have been fine too

❯ echo 123 >> 123; datalad save 123
add(ok): sub-YutaMouse20/123 (file)                                                                 
save(ok): . (dataset)                                                                               
action summary:                                                                                     
  add (ok: 1)
  save (ok: 1)
❯ echo 123 >> 123; datalad save 123
add(ok): sub-YutaMouse20/123 (file)                                                                 
save(ok): . (dataset)   

well, let's hope that with filtering on globs nothing unexpected would happen -- let's try it out!

@yarikoptic yarikoptic merged commit bf79098 into nipy:master Feb 24, 2024
10 checks passed
@yarikoptic yarikoptic added the bug label Feb 24, 2024
Copy link

🚀 PR was released in v1.0.2 🚀

bpinsard added a commit to UNFmontreal/heudiconv that referenced this pull request Oct 21, 2024
yarikoptic added a commit that referenced this pull request Oct 25, 2024
Fix assignment of sensitive git-annex metadata data via glob patterns (regression introduced by #739)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants