
ENH: Split ICA into multiple steps #865

Merged · 11 commits merged into mne-tools:main on Mar 1, 2024

Conversation

larsoner (Member) commented Feb 28, 2024

Before merging …

  • Changelog has been updated (docs/source/changes.md)

Closes #864
Closes #861
Closes #857
Closes #804

Sample output
$ pytest mne_bids_pipeline/ -k ds000248_ica
...
┌────────┬ preprocessing/_06a1_fit_ica ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│12:49:48│ ⏳️ sub-01 run-01 Processing raw data from sub-01_task-audiovisual_run-01_proc-filt_split-01_raw.fif
│12:49:49│ ⏳️ sub-01 run-01 Applying high-pass filter with 1.0 Hz cutoff …
│12:49:50│ ⏳️ sub-01 run-01 Creating task-related epochs …
│12:49:51│ ⏳️ sub-01 Using PTP rejection thresholds: {'mag': 3e-12, 'grad': 3e-10}
│12:49:51│ ⏳️ sub-01 Saving ICA epochs to disk.
│12:49:51│ ⏳️ sub-01 Calculating ICA solution using method: extended_infomax.
│12:50:14│ ⏳️ sub-01 Fit 31 components (explaining 80.2% of the variance) in 109 iterations.
│12:50:14│ ⏳️ sub-01 Saving ICA solution to disk.
│12:50:14│ ⏳️ sub-01 Initializing ICA.fit report HDF5 file
│12:50:29│ ⏳️ sub-01 Saving ICA.fit report: /home/larsoner/mne_data/derivatives/mne-bids-pipeline/ds000248_ica/sub-01/meg/sub-01_task-audiovisual_proc-icafit_report.html
└────────┴ done (42s)
┌────────┬ preprocessing/_06a2_find_ica_artifacts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│12:50:29│ ⏳️ sub-01 Loading ICA solution
│12:50:30│ ⏳️ sub-01 Creating ECG epochs …
│12:50:32│ ⏳️ sub-01 Performing automated ECG artifact detection …
│12:50:33│ ⏳️ sub-01 Detected 5 ECG-related ICs in 166 ECG epochs.
│12:50:33│ ⏳️ sub-01 Creating EOG epochs …
│12:50:33│ ⏳️ sub-01 Performing automated EOG artifact detection …
│12:50:33│ ⏳️ sub-01 Detected 1 EOG-related ICs in 10 EOG epochs.
│12:50:33│ ⏳️ sub-01 Saving ICA solution and detected artifacts to disk.
│12:51:19│ ⏳️ sub-01 Saving ICA report: /home/larsoner/mne_data/derivatives/mne-bids-pipeline/ds000248_ica/sub-01/meg/sub-01_task-audiovisual_proc-ica+components_report.html
│12:51:19│ ⏳️ sub-01 ICA completed. Please carefully review the extracted ICs in the report sub-01_task-audiovisual_proc-ica+components_report.h5, and mark all components you wish to reject as 'bad' in sub-01_task-audiovisual_proc-ica_components.tsv
└────────┴ done (50s)
...
┌────────┬ preprocessing/_08a_apply_ica ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│12:51:31│ ⏳️ sub-01 Input: sub-01_task-audiovisual_split-01_epo.fif
│12:51:31│ ⏳️ sub-01 Output: sub-01_task-audiovisual_proc-ica_epo.fif
│12:51:32│ ⏳️ sub-01 Rejecting ICs: 0, 1, 4, 9, 14, 20
│12:51:33│ ⏳️ sub-01 Saving reconstructed epochs after ICA.
│12:51:45│ ⏳️ sub-01 Adding ICA to report.
│12:51:57│ ⏳️ sub-01 Saving ICA.apply report: /home/larsoner/mne_data/derivatives/mne-bids-pipeline/ds000248_ica/sub-01/meg/sub-01_task-audiovisual_report.html
│12:51:57│ ⏳️ sub-01 run-01 Writing sub-01_task-audiovisual_run-01_proc-clean_raw.fif …
│12:51:59│ ⏳️ sub-01 run-01 Adding cleaned raw data to report
│12:52:05│ ⏳️ sub-01 run-01 Saving report: /home/larsoner/mne_data/derivatives/mne-bids-pipeline/ds000248_ica/sub-01/meg/sub-01_task-audiovisual_report.html
└────────┴ done (35s)
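
For orientation, the three steps correspond roughly to the following plain MNE-Python operations (a minimal sketch, not the pipeline's actual code; the input file name and rejection thresholds come from the log above, while the event handling and remaining parameters are illustrative):

```python
import mne

# Step 1 (_06a1_fit_ica): filter, epoch, and fit ICA.
raw = mne.io.read_raw_fif(
    "sub-01_task-audiovisual_run-01_proc-filt_split-01_raw.fif"
)
raw.load_data().filter(l_freq=1.0, h_freq=None)  # 1.0 Hz high-pass before ICA
events, event_id = mne.events_from_annotations(raw)  # illustrative event source
epochs = mne.Epochs(
    raw, events, event_id,
    reject=dict(mag=3e-12, grad=3e-10),  # the PTP thresholds from the log
    preload=True,
)
ica = mne.preprocessing.ICA(
    n_components=0.8,  # keep components explaining ~80% of the variance
    method="infomax", fit_params=dict(extended=True),  # extended_infomax
)
ica.fit(epochs)

# Step 2 (_06a2_find_ica_artifacts): automated ECG/EOG component detection.
ecg_inds, _ = ica.find_bads_ecg(mne.preprocessing.create_ecg_epochs(raw))
eog_inds, _ = ica.find_bads_eog(mne.preprocessing.create_eog_epochs(raw))
ica.exclude = sorted(set(ecg_inds) | set(eog_inds))

# Step 3 (_08a_apply_ica): apply the (possibly user-edited) exclusions.
clean_epochs = ica.apply(epochs.copy())
```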
  • Splits ICA into: 1) fitting + epoch creation, 2) detection of ECG/EOG artifacts, 3) applying. To do this, two ICAs (and two reports) now get saved: one for the original fit (proc="icafit"), and a second one after the automated detection.

  • Updates our docs with some stuff about caching (figured I'd take a stab at it while updating docs for ICA)

  • Pretty sure this closes "Logger reports subject, session, run even if not requested" (#804) by cleaning up run/session handling a bit, but if you think it's related to stuff other than ICA, feel free to remove that from the closes lines above @hoechenberger

  • Fixes a bug where we put _components.tsv in the out_files dict. Our caching code checks that not just the input files but also the output files have the expected hashes. Since users are expected to modify this file, it should not be in out_files for the _06 fit/detect step(s); otherwise any user modification becomes a cache miss, and the step will re-run and overwrite their changes (!). A sketch of this check follows the log excerpts below. For example, on main:

    ┌────────┬ preprocessing/_06a2_find_ica_artifacts 
    │12:52:50│ 🚫 sub-01 Output file hash mismatch for /home/larsoner/mne_data/derivatives/mne-bids-pipeline/ds000248_ica/sub-01/meg/sub-01_task-audiovisual_proc-ica_components.tsv, will recompute …
    │12:52:50│ ⏳️ sub-01 Loading ICA solution
    │12:52:51│ ⏳️ sub-01 Creating ECG epochs …
    

    on this PR:

    ┌────────┬ preprocessing/_06a2_find_ica_artifacts 
    │12:59:34│ ✅ sub-01 Computation unnecessary (cached) …
    └────────┴ done (1s)
    ...
    ┌────────┬ preprocessing/_08a_apply_ica 
    │12:59:34│ ⏳️ sub-01 Input: sub-01_task-audiovisual_split-01_epo.fif
    │12:59:34│ ⏳️ sub-01 Output: sub-01_task-audiovisual_proc-ica_epo.fif
    │12:59:35│ ⏳️ sub-01 Rejecting ICs: 0, 1, 2, 4, 9, 14, 20
    │12:59:37│ ⏳️ sub-01 Saving reconstructed epochs after ICA.
    ...
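
To make the output-hash check concrete, here is a minimal sketch of the idea (with hypothetical helper names `file_hash` / `outputs_unchanged`; the pipeline's real caching code is more involved):

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash a file's bytes so out-of-band modifications are detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def outputs_unchanged(out_files: dict[str, Path], cached: dict[str, str]) -> bool:
    """Cache hit only if every recorded output still has its recorded hash."""
    return all(
        path.exists() and file_hash(path) == cached.get(name)
        for name, path in out_files.items()
    )


# The bug: listing the user-editable _components.tsv in out_files means any
# manual edit changes its hash, so the step re-runs and overwrites the edit.
# The fix: keep that file out of out_files for the fit/detect steps.
```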
    

Comment on lines 289 to 290
# We need to make the logs more compact to be able to write Excel format
# (32767 char limit)
hoechenberger (Member):

I don't understand this; how does compressing the logs help with writing the (then-uncompressed) data to Excel?

larsoner (Member, Author):

Xlsx has a 32767-character limit for each cell. The cell data for entries in what was formerly the cfg column of the spreadsheet gets compressed, both as JSON (removing indentation and unnecessary whitespace) and then with zlib. This compressed data is now written to a cfg_zlib column instead.
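
In other words, something along these lines (a minimal sketch; whether the compressed bytes are base64-encoded, and the exact column handling, are assumptions here):

```python
import base64
import json
import zlib

cfg_dict = {"subject": "01", "l_freq": 1.0, "runs": ["01"]}  # toy example

# Compact JSON first (no indentation or extra whitespace), then zlib, then
# base64 so the cell holds a plain string under the xlsx 32767-char limit.
cfg_json = json.dumps(cfg_dict, separators=(",", ":"))
cfg_zlib = base64.b64encode(zlib.compress(cfg_json.encode())).decode("ascii")
assert len(cfg_zlib) <= 32767

# Reading the value back reverses the steps:
assert json.loads(zlib.decompress(base64.b64decode(cfg_zlib))) == cfg_dict
```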

hoechenberger (Member):

Does that mean that it's not human-readable anymore?

Comment on lines 289 to 292
# We need to make the logs more compact to be able to write Excel format
# (32767 char limit per cell), in particular the "cfg" column has very large
# cells, so replace the "cfg" column with a "cfg_zip" column (parentheticals
# contain real-world example numbers used to illustrate the compression):
larsoner (Member, Author):

@hoechenberger better?

hoechenberger (Member):

It helps my understanding, thanks! But does that mean that this column isn't human-readable anymore?

larsoner (Member, Author):

Yep, no longer human-readable. On main, part of it is human-readable, but the vast majority of it is just completely missing!

larsoner (Member, Author):

... at least that's the case for the ERP_CORE dataset I was looking at, which is probably a worst case. There are probably some steps for which compression is not necessary, but it's hard to know in advance which is which.

I think a big part of it might have been the support for a custom Montage -- the json_tricks dump of that was super long.

hoechenberger (Member) commented Feb 29, 2024:

Could we perhaps create a new sheet and write one line per row (easy) or one config setting per row (more difficult) instead? This would surely keep us below the character limit per cell and retain readability, as we wouldn't need to use compression.

larsoner (Member, Author):
So it looks like right now in main:

  • Each row in the spreadsheet is a log from a (usually parallel) function call
  • Each column is an entry like exec_params, subject, session, etc.
  • Each sheet is a pipeline step

For example (with my changes here):

[Screenshot from 2024-02-29 15-29-50]

So I'm not quite sure how to map the config -- which could change from step to step, and potentially even from call to call -- to new sheet(s).

Maybe instead it could be N new cfg.<attr> columns for the N config attributes of the given step? Presumably these would nearly (but not necessarily!) match across rows (runs/subjects/whatever), and they would differ quite a bit across sheets (pipeline steps). This would require un-serializing the JSON, getting a list of attributes, and mapping these into the Series, but that's not too bad. It makes the SimpleNamespace object much harder to un-serialize, but it makes the params (much) more human-readable, I think. A sketch of the idea is below.
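
A quick sketch of that idea (assuming the cfg cell deserializes to a flat dict; all names here are illustrative):

```python
import json

import pandas as pd

# One row per (usually parallel) function call; "cfg" holds the step's config.
logs = pd.DataFrame(
    {
        "subject": ["01", "02"],
        "cfg": ['{"l_freq": 1.0, "runs": ["01"]}'] * 2,
    }
)

# Expand the JSON into one cfg.<attr> column per config attribute.
cfg_cols = pd.json_normalize(list(logs["cfg"].map(json.loads))).add_prefix("cfg.")
logs = pd.concat([logs.drop(columns="cfg"), cfg_cols], axis=1)
print(list(logs.columns))  # ['subject', 'cfg.l_freq', 'cfg.runs']
```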

larsoner (Member, Author):

This was actually fairly easy to implement, so I tried it; this is the result:

[Screenshot from 2024-02-29 16-09-09]

I'll push a commit in case it's a reasonable way to go.

hoechenberger (Member) commented Mar 1, 2024:

Thanks for looking into this!

I question whether we need this level of "differentiability" across participants, sessions, tasks, and steps.

I claim that currently, the only really reliable and "sound" way to use the pipeline is to process all data with the same configuration. If different settings are supposed to be used for different sessions or tasks, a new, separate derivatives folder should be created.

If we follow this premise, we could create a single Excel sheet where one row corresponds to one line (or one setting) in the config file that was used.

WDYT?

larsoner (Member, Author) commented Mar 1, 2024:

One thing to keep in mind is that the whole config isn't currently available at this level. This isn't everything from config.py -- it's only the stuff used by the current step. And it can include things like which runs are being used (which could in principle vary by participant someday). So the differences don't come from using different config.py files; rather, even given a static config.py, the set of params (and potentially even which runs are used for a given subject) could differ.

So yes, we could in principle do what you want, but it would be different information from what's in there now.

hoechenberger (Member):

Got you. In that case, let's maybe keep it the way it is right now.

larsoner added this to the 1.6 milestone on Feb 29, 2024
larsoner merged commit 94256f7 into mne-tools:main on Mar 1, 2024 · 52 checks passed
larsoner deleted the ica branch on March 1, 2024 at 13:24