-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Seriesgrouping #684
base: master
Are you sure you want to change the base?
Seriesgrouping #684
Conversation
You may want to consider #682 first (changes are included in this PR) |
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #684 +/- ##
==========================================
+ Coverage 81.94% 81.98% +0.04%
==========================================
Files 41 41
Lines 4132 4153 +21
==========================================
+ Hits 3386 3405 +19
- Misses 746 748 +2 ☔ View full report in Codecov by Sentry. |
@yarikoptic As discussed in #624 I implemented a hash to shorten the name, but the full Series UID could be used too |
In general I think the approach makes sense in general. Re why not before: I guess (@satra and @mih might correct if have any recoll of the motivations) is that series id and name (instead of UID) were chosen for human accessibility since they make it easier to understand which particular series in question while annotating TL;DR: we need to approach it with caution and thus need more checking/work. Some reservations/concerns and thinking out loud:
|
@yarikoptic thank you very much for the feedback Regarding scope of use case: i think it is actually common to have artificially split a single BIDS session into two when the scanning protocol is too long (e.g. AM: anat, resting state, ... PM: diffusion MRI, ...) or the subject needs to exit. It is actually mentioned explicitely in the BIDS definition of session (def number 5):
The fact that not many people reported it i think is because users did not know what went wrong and would not think this may be due to heudiconv (who would think of blaming such a magnificent tool ?) Regarding change of behavior i totally agree that this should be done carefully. I think what seems to be your preferred option would be best:
I will update the PR with that implemented. If the error is too much for users (for which the use case of needing to keep previous data without uid is justified) then maybe we could add later a temporary option |
@yarikoptic I modified the code to check that the conversion table does not use deprecated series IDs. Otherwise it recomputes the conversation table with the non-deprecated series IDs. I think in the end this may be less frustrating than an error since it identifies the deprecated IDs and recomputes the conversion table accordingly. I also set the hash length to 8 as you suggested. |
Hello @yarikoptic does it look like the code is ok now ? Cheers ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yarikoptic I modified the code to check that the conversion table does not use deprecated series IDs. Otherwise it recomputes the conversation table with the non-deprecated series IDs.
I worry that it is a bit too intrusive, in my comment above I thought it might be better to be more polite and not to just abolish the work people might have done already by just saying that "now we need another way". But Ok, I guess I can swallow this, we would just announce some "major"ish release.
But one major concern about the PR is absent "setting in stone" for such new behaviors in tests: not a single test was changed or added, thus suggesting actual problem of us not bothering to test that behavior before. Could you please add/change some test so it does get duplicated seqinfo (I guess we could just copy some subfolder with dicoms into a new one thus causing duplicates) and seeing it fail?
I also left some comments on how to possibly shorten that function to determine if outdated using re
-- can become just 1 liner. But we would need to have a test either way.
"""Return True if any series_id in the info_file is in deprecated format.""" | ||
info = read_config(info_file) | ||
series_ids = [series_id for sublist in info.values() for series_id in sublist] | ||
if any(is_deprecated_seriesid(series_id) for series_id in series_ids): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be shorter
if any(is_deprecated_seriesid(series_id) for series_id in series_ids): | |
if any(map(is_deprecated_seriesid, series_ids)): |
but it is ok either way.
"Found deprecated series identifier in info file in the format" | ||
"<series num>-<protocol name> instead of <series num>-<protocol name>-<UID hash>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Found deprecated series identifier in info file in the format" | |
"<series num>-<protocol name> instead of <series num>-<protocol name>-<UID hash>" | |
"Found deprecated series identifier in info file in the format " | |
"<series num>-<protocol name> instead of <series num>-<protocol name>-<UID hash>. " |
if len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex): | ||
return True | ||
# If all previous tests passed, then the series_id is not deprecated | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex): | |
return True | |
# If all previous tests passed, then the series_id is not deprecated | |
return False | |
return len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or may be better to use re
(might need an import)
if len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex): | |
return True | |
# If all previous tests passed, then the series_id is not deprecated | |
return False | |
return re.match(f"[{string.hexdigits}]{{{len_hash_hex}}}$") |
this function needs a unittest which would be trivial to do
if series_id.count('-') <= 1: | ||
return True | ||
# Check the first part of the series_id is a number | ||
series_number = series_id.split('-')[0] | ||
if not series_number.isdigit(): | ||
return True | ||
# Check the last part of the series_id is a hash in hexadecimal format of length len_hash_hex | ||
hash_hex = series_id.split('-')[-1] | ||
hex_digits = set(string.hexdigits) | ||
if len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex): | ||
return True | ||
# If all previous tests passed, then the series_id is not deprecated | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall can be
if series_id.count('-') <= 1: | |
return True | |
# Check the first part of the series_id is a number | |
series_number = series_id.split('-')[0] | |
if not series_number.isdigit(): | |
return True | |
# Check the last part of the series_id is a hash in hexadecimal format of length len_hash_hex | |
hash_hex = series_id.split('-')[-1] | |
hex_digits = set(string.hexdigits) | |
if len(hash_hex) != len_hash_hex or not all(c in hex_digits for c in hash_hex): | |
return True | |
# If all previous tests passed, then the series_id is not deprecated | |
return False | |
return not re.match(f"[{string.digits}]+-.+-[{string.hexdigits}]{{{len_hash_hex}}}$") |
to just match regex... again -- just needs a unittest
series_number = series_id.split('-')[0] | ||
if not series_number.isdigit(): | ||
return True | ||
# Check the last part of the series_id is a hash in hexadecimal format of length len_hash_hex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Check the last part of the series_id is a hash in hexadecimal format of length len_hash_hex | |
# Check the last part of the series_id is a hash in hexadecimal format of length len_hash_hex | |
# Note that protocol_name could contain - |
is that correct and that is why [-1]
?
@neurorepro ping on above comments. |
Yes totally for the tests. I actually tested it on two datasets (already having old data, and newly converted) and it went fine. But indeed tests need to be included for continuous testing. I'll get back to you! |
a gentle ping |
This is to address #624