
ENH: Split ICA into multiple steps #865

Merged · 11 commits merged into mne-tools:main on Mar 1, 2024

Conversation

larsoner (Member) commented Feb 28, 2024

Before merging …

  • Changelog has been updated (docs/source/changes.md)

Closes #864
Closes #861
Closes #857
Closes #804

Sample output
$ pytest mne_bids_pipeline/ -k ds000248_ica
...
┌────────┬ preprocessing/_06a1_fit_ica ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│12:49:48│ ⏳️ sub-01 run-01 Processing raw data from sub-01_task-audiovisual_run-01_proc-filt_split-01_raw.fif
│12:49:49│ ⏳️ sub-01 run-01 Applying high-pass filter with 1.0 Hz cutoff …
│12:49:50│ ⏳️ sub-01 run-01 Creating task-related epochs …
│12:49:51│ ⏳️ sub-01 Using PTP rejection thresholds: {'mag': 3e-12, 'grad': 3e-10}
│12:49:51│ ⏳️ sub-01 Saving ICA epochs to disk.
│12:49:51│ ⏳️ sub-01 Calculating ICA solution using method: extended_infomax.
│12:50:14│ ⏳️ sub-01 Fit 31 components (explaining 80.2% of the variance) in 109 iterations.
│12:50:14│ ⏳️ sub-01 Saving ICA solution to disk.
│12:50:14│ ⏳️ sub-01 Initializing ICA.fit report HDF5 file
│12:50:29│ ⏳️ sub-01 Saving ICA.fit report: /home/larsoner/mne_data/derivatives/mne-bids-pipeline/ds000248_ica/sub-01/meg/sub-01_task-audiovisual_proc-icafit_report.html
└────────┴ done (42s)
┌────────┬ preprocessing/_06a2_find_ica_artifacts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│12:50:29│ ⏳️ sub-01 Loading ICA solution
│12:50:30│ ⏳️ sub-01 Creating ECG epochs …
│12:50:32│ ⏳️ sub-01 Performing automated ECG artifact detection …
│12:50:33│ ⏳️ sub-01 Detected 5 ECG-related ICs in 166 ECG epochs.
│12:50:33│ ⏳️ sub-01 Creating EOG epochs …
│12:50:33│ ⏳️ sub-01 Performing automated EOG artifact detection …
│12:50:33│ ⏳️ sub-01 Detected 1 EOG-related ICs in 10 EOG epochs.
│12:50:33│ ⏳️ sub-01 Saving ICA solution and detected artifacts to disk.
│12:51:19│ ⏳️ sub-01 Saving ICA report: /home/larsoner/mne_data/derivatives/mne-bids-pipeline/ds000248_ica/sub-01/meg/sub-01_task-audiovisual_proc-ica+components_report.html
│12:51:19│ ⏳️ sub-01 ICA completed. Please carefully review the extracted ICs in the report sub-01_task-audiovisual_proc-ica+components_report.h5, and mark all components you wish to reject as 'bad' in sub-01_task-audiovisual_proc-ica_components.tsv
└────────┴ done (50s)
...
┌────────┬ preprocessing/_08a_apply_ica ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│12:51:31│ ⏳️ sub-01 Input: sub-01_task-audiovisual_split-01_epo.fif
│12:51:31│ ⏳️ sub-01 Output: sub-01_task-audiovisual_proc-ica_epo.fif
│12:51:32│ ⏳️ sub-01 Rejecting ICs: 0, 1, 4, 9, 14, 20
│12:51:33│ ⏳️ sub-01 Saving reconstructed epochs after ICA.
│12:51:45│ ⏳️ sub-01 Adding ICA to report.
│12:51:57│ ⏳️ sub-01 Saving ICA.apply report: /home/larsoner/mne_data/derivatives/mne-bids-pipeline/ds000248_ica/sub-01/meg/sub-01_task-audiovisual_report.html
│12:51:57│ ⏳️ sub-01 run-01 Writing sub-01_task-audiovisual_run-01_proc-clean_raw.fif …
│12:51:59│ ⏳️ sub-01 run-01 Adding cleaned raw data to report
│12:52:05│ ⏳️ sub-01 run-01 Saving report: /home/larsoner/mne_data/derivatives/mne-bids-pipeline/ds000248_ica/sub-01/meg/sub-01_task-audiovisual_report.html
└────────┴ done (35s)
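
For orientation, the three steps correspond roughly to the following plain MNE-Python operations (a minimal sketch, not the pipeline's actual code; the input file name and rejection thresholds come from the log above, while the event handling and remaining parameters are illustrative):

```python
import mne

# Step 1 (_06a1_fit_ica): filter, epoch, and fit ICA.
raw = mne.io.read_raw_fif(
    "sub-01_task-audiovisual_run-01_proc-filt_split-01_raw.fif"
)
raw.load_data().filter(l_freq=1.0, h_freq=None)  # 1.0 Hz high-pass before ICA
events, event_id = mne.events_from_annotations(raw)  # illustrative event source
epochs = mne.Epochs(
    raw, events, event_id,
    reject=dict(mag=3e-12, grad=3e-10),  # the PTP thresholds from the log
    preload=True,
)
ica = mne.preprocessing.ICA(
    n_components=0.8,  # keep components explaining ~80% of the variance
    method="infomax", fit_params=dict(extended=True),  # extended_infomax
)
ica.fit(epochs)

# Step 2 (_06a2_find_ica_artifacts): automated ECG/EOG component detection.
ecg_inds, _ = ica.find_bads_ecg(mne.preprocessing.create_ecg_epochs(raw))
eog_inds, _ = ica.find_bads_eog(mne.preprocessing.create_eog_epochs(raw))
ica.exclude = sorted(set(ecg_inds) | set(eog_inds))

# Step 3 (_08a_apply_ica): apply the (possibly user-edited) exclusions.
clean_epochs = ica.apply(epochs.copy())
```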
  • Splits ICA into: 1) fitting + epoch creation, 2) detection of ECG/EOG artifacts, 3) applying. To do this, two ICAs (and two reports) now get saved: one for the original fit (proc="icafit"), and a second one after the automated detection.

  • Updates our docs with some stuff about caching (figured I'd take a stab at it while updating docs for ICA)

  • Pretty sure this closes "Logger reports subject, session, run even if not requested" (#804) by cleaning up run/session handling a bit, but if you think it's related to stuff other than ICA, feel free to remove that from the closes lines above @hoechenberger

  • Fixes a bug where we put _components.tsv in the out_files dict. Our caching code checks that not just the input files but also the output files have the expected hashes. Since users are expected to modify this file, it should not be in out_files for the _06 fit/detect step(s); otherwise any user modification becomes a cache miss, and the step will re-run and overwrite their changes (!). A sketch of this check follows the log excerpts below. For example, on main:

    ┌────────┬ preprocessing/_06a2_find_ica_artifacts 
    │12:52:50│ 🚫 sub-01 Output file hash mismatch for /home/larsoner/mne_data/derivatives/mne-bids-pipeline/ds000248_ica/sub-01/meg/sub-01_task-audiovisual_proc-ica_components.tsv, will recompute …
    │12:52:50│ ⏳️ sub-01 Loading ICA solution
    │12:52:51│ ⏳️ sub-01 Creating ECG epochs …
    

    on this PR:

    ┌────────┬ preprocessing/_06a2_find_ica_artifacts 
    │12:59:34│ ✅ sub-01 Computation unnecessary (cached) …
    └────────┴ done (1s)
    ...
    ┌────────┬ preprocessing/_08a_apply_ica 
    │12:59:34│ ⏳️ sub-01 Input: sub-01_task-audiovisual_split-01_epo.fif
    │12:59:34│ ⏳️ sub-01 Output: sub-01_task-audiovisual_proc-ica_epo.fif
    │12:59:35│ ⏳️ sub-01 Rejecting ICs: 0, 1, 2, 4, 9, 14, 20
    │12:59:37│ ⏳️ sub-01 Saving reconstructed epochs after ICA.
    ...
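
To make the output-hash check concrete, here is a minimal sketch of the idea (with hypothetical helper names `file_hash` / `outputs_unchanged`; the pipeline's real caching code is more involved):

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    """Hash a file's bytes so out-of-band modifications are detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def outputs_unchanged(out_files: dict[str, Path], cached: dict[str, str]) -> bool:
    """Cache hit only if every recorded output still has its recorded hash."""
    return all(
        path.exists() and file_hash(path) == cached.get(name)
        for name, path in out_files.items()
    )


# The bug: listing the user-editable _components.tsv in out_files means any
# manual edit changes its hash, so the step re-runs and overwrites the edit.
# The fix: keep that file out of out_files for the fit/detect steps.
```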
    

Comment on lines 289 to 290
# We need to make the logs more compact to be able to write Excel format
# (32767 char limit)
hoechenberger (Member):

I don't understand this; how does compressing the logs help with writing the (then-uncompressed) data to Excel?

larsoner (Member, Author):

Xlsx has a 32767-character limit for each cell. The cell data for entries in what was formerly the cfg column of the spreadsheet gets compressed, both as JSON (removing indentation and unnecessary whitespace) and then with zlib. This compressed data is now written to a cfg_zlib column instead.
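
In other words, something along these lines (a minimal sketch; whether the compressed bytes are base64-encoded, and the exact column handling, are assumptions here):

```python
import base64
import json
import zlib

cfg_dict = {"subject": "01", "l_freq": 1.0, "runs": ["01"]}  # toy example

# Compact JSON first (no indentation or extra whitespace), then zlib, then
# base64 so the cell holds a plain string under the xlsx 32767-char limit.
cfg_json = json.dumps(cfg_dict, separators=(",", ":"))
cfg_zlib = base64.b64encode(zlib.compress(cfg_json.encode())).decode("ascii")
assert len(cfg_zlib) <= 32767

# Reading the value back reverses the steps:
assert json.loads(zlib.decompress(base64.b64decode(cfg_zlib))) == cfg_dict
```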

hoechenberger (Member):

Does that mean that it's not human-readable anymore?

Comment on lines 289 to 292
# We need to make the logs more compact to be able to write Excel format
# (32767 char limit per cell), in particular the "cfg" column has very large
# cells, so replace the "cfg" column with a "cfg_zip" column (parentheticals
# contain real-world example numbers used to illustrate the compression):
larsoner (Member, Author):

@hoechenberger better?

hoechenberger (Member):

It helps my understanding, thanks! But does that mean that this column isn't human-readable anymore?

larsoner (Member, Author):

Yep, no longer human-readable. On main, part of it is human-readable, but the vast majority of it is just completely missing!

larsoner (Member, Author):

... at least that's the case for the ERP_CORE dataset I was looking at, which is probably a worst case. There are probably some steps for which compression is not necessary, but it's hard to know in advance which is which.

I think a big part of it might have been the support for a custom Montage -- the json_tricks dump of that was super long.

hoechenberger (Member) commented Feb 29, 2024:

Could we perhaps create a new sheet and write one line per row (easy) or one config setting per row (more difficult) instead? This would surely keep us below the character limit per cell and retain readability, as we wouldn't need to use compression.

larsoner (Member, Author):
So it looks like right now in main:

  • Each row in the spreadsheet is a log from a (usually parallel) function call
  • Each column is an entry like exec_params, subject, session, etc.
  • Each sheet is a pipeline step

For example (with my changes here):

[Screenshot from 2024-02-29 15-29-50]

So I'm not quite sure how to map the config -- which could change from step to step, and potentially even from call to call -- to new sheet(s).

Maybe instead it could be N new cfg.<attr> columns for the N config attributes of the given step? Presumably these would nearly (but not necessarily!) match across rows (runs/subjects/whatever), and they would differ quite a bit across sheets (pipeline steps). This would require un-serializing the JSON, getting a list of attributes, and mapping these into the Series, but that's not too bad. It makes the SimpleNamespace object much harder to un-serialize, but it makes the params (much) more human-readable, I think. A sketch of the idea is below.
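
A quick sketch of that idea (assuming the cfg cell deserializes to a flat dict; all names here are illustrative):

```python
import json

import pandas as pd

# One row per (usually parallel) function call; "cfg" holds the step's config.
logs = pd.DataFrame(
    {
        "subject": ["01", "02"],
        "cfg": ['{"l_freq": 1.0, "runs": ["01"]}'] * 2,
    }
)

# Expand the JSON into one cfg.<attr> column per config attribute.
cfg_cols = pd.json_normalize(list(logs["cfg"].map(json.loads))).add_prefix("cfg.")
logs = pd.concat([logs.drop(columns="cfg"), cfg_cols], axis=1)
print(list(logs.columns))  # ['subject', 'cfg.l_freq', 'cfg.runs']
```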

larsoner (Member, Author):

This was actually fairly easy to implement, so I tried it; this is the result:

[Screenshot from 2024-02-29 16-09-09]

I'll push a commit in case it's a reasonable way to go.

hoechenberger (Member) commented Mar 1, 2024:

Thanks for looking into this!

I question whether we need this level of "differentiability" across participants, sessions, tasks, and steps.

I claim that currently, the only really reliable and "sound" way to use the pipeline is to process all data with the same configuration. If different settings are supposed to be used for different sessions or tasks, a new, separate derivatives folder should be created.

If we follow this premise, we could create a single Excel sheet where one row corresponds to one line (or one setting) in the config file that was used.

WDYT?

larsoner (Member, Author) commented Mar 1, 2024:

One thing to keep in mind is that the whole config isn't currently available at this level. This isn't everything from config.py -- it's only the stuff used by the current step. And it can include things like which runs are being used (which could in principle vary by participant someday). So the differences don't come from using different config.py files; rather, even given a static config.py, the set of params (and potentially even which runs are used for a given subject) could differ.

So yes, we could in principle do what you want, but it would be different information from what's in there now.

hoechenberger (Member):

Got you. In that case, let's maybe keep it the way it is right now.

larsoner added this to the 1.6 milestone on Feb 29, 2024
larsoner merged commit 94256f7 into mne-tools:main on Mar 1, 2024 · 52 checks passed
larsoner deleted the ica branch on March 1, 2024 at 13:24