
Fix train metrics #868

Merged
merged 18 commits on Jul 19, 2024

Conversation

VukW
Contributor

@VukW VukW commented May 15, 2024

Fixes: N.A.

Proposed Changes

  • The PR addresses a few important bugs in how train metrics are calculated.
  • Because of how the batches are structured, during multi-batch training only the first item of each batch was used to calculate metrics.
  • The current codebase is therefore correct only with a batch size of 1.
  • Validation/testing/inference runs are unaffected.
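As a rough illustration of the bug (hypothetical, simplified code, not the actual GaNDLF metric logic): taking only index 0 of each batch skews the epoch metric, while iterating over every sample in the batch fixes it.

```python
def epoch_accuracy(predictions, labels, batch_size=4):
    """Toy per-epoch accuracy, averaged over all samples.

    The bug pattern was equivalent to looking only at batch_pred[0]
    below, i.e. the first item of each batch; with batch_size == 1
    the two behaviours coincide, which is why a batch size of 1 worked.
    """
    total, count = 0.0, 0
    for start in range(0, len(predictions), batch_size):
        batch_pred = predictions[start:start + batch_size]
        batch_true = labels[start:start + batch_size]
        # Fixed behaviour: every sample in the batch contributes.
        for p, t in zip(batch_pred, batch_true):
            total += float(p == t)
            count += 1
    return total / count
```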

Checklist

  • CONTRIBUTING guide has been followed.
  • PR is based on the current GaNDLF master.
  • Non-breaking change (does not break existing functionality): provide as many details as possible for any breaking change.
  • Function/class source code documentation added/updated (ensure typing is used to provide type hints, including and not limited to using Optional if a variable has a pre-defined value).
  • Code has been blacked for style consistency and linting.
  • If applicable, version information has been updated in GANDLF/version.py.
  • If adding a git submodule, add to list of exceptions for black styling in pyproject.toml file.
  • Usage documentation has been updated, if appropriate.
  • Tests added or modified to cover the changes; if coverage is reduced, please give explanation.
  • If customized dependency installation is required (i.e., a separate pip install step is needed for PR to be functional), please ensure it is reflected in all the files that control the CI, namely: python-test.yml, and all docker files [1,2,3].

VukW added 5 commits May 9, 2024 01:03
1. Return the output of the whole batch, not just one item.
2. Make the ground truth & prediction arrays take `q_samples_per_volume` into account (the total dataset size during one epoch equals len(data) * q_samples_per_volume; so if the dataset df contains 100 records, q_samples_per_volume = 10 (the default), and the batch size is 4, there would be 250 batches of 4 elements each).
3. Make the ground truth take into account that train_dataloader is shuffled, so the ground truth is now sorted in the same order as the predictions and the train_dataloader.
To ensure values in the CSV are always written in the same order as the header.
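The epoch-size arithmetic from commit message 2 above can be sanity-checked in a few lines (numbers taken from that comment; the variable names are ours):

```python
records = 100              # rows in the dataset dataframe
q_samples_per_volume = 10  # patches sampled per volume (the stated default)
batch_size = 4

samples_per_epoch = records * q_samples_per_volume  # 100 * 10 = 1000
num_batches = samples_per_epoch // batch_size       # 1000 / 4  = 250
print(num_batches)  # 250
```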
Contributor

github-actions bot commented May 15, 2024

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@sarthakpati sarthakpati self-requested a review May 19, 2024 03:12
@sarthakpati sarthakpati marked this pull request as draft May 19, 2024 03:12
@sarthakpati
Collaborator

Converting to draft until the tests are passing.

@VukW VukW force-pushed the fix_train_metrics branch from 65f3d8a to 3987439 on May 21, 2024 20:51
@VukW
Contributor Author

VukW commented May 21, 2024

@sarthakpati It seems to me I have fixed the code for the usual segmentation cases.
However, I found that the code is essentially broken for some specific architectures (deep_* and sdnet), where the model returns a list instead of a single Tensor. The reason is that we strongly assume the output is a Tensor (for example, when we aggregate segmentation results during the validation step).
In the master branch the code does not fail, but it calculates validation metrics in the wrong way:

  • we take just the first element of the list (one tensor)
  • AFAIU, for sdnet it works only if batch_size >= 5 (as the prediction here is BxCxHxWxD, we take not the first piece of the output but the first batch element's prediction), and even then not properly
  • for the deep_* models the prediction here is also BxCxHxWxD and not a list of tensors

I can't see right now how to fix that easily without massive refactoring, so I put the crutch back (at least the code doesn't fail now). On the one hand, train metrics are averaged per sample and thus calculated properly; on the other hand, validation metrics are broken. I'd strongly prefer to disable / remove these list-output architectures from GaNDLF for now; but what do you think?
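The "crutch" described above presumably amounts to something like the following sketch (hypothetical helper name, not the actual GaNDLF code): collapse a list output to its first element so downstream code that assumes a single tensor keeps working.

```python
def collapse_output(model_output):
    """If the model returns a list/tuple of outputs (e.g. deep
    supervision heads in deep_* models, or sdnet's multiple tensors),
    keep only the first element; single-tensor outputs pass through.

    Note: as discussed above, this keeps the code from failing but
    silently drops the remaining heads, so validation metrics for
    these architectures remain wrong.
    """
    if isinstance(model_output, (list, tuple)):
        return model_output[0]
    return model_output
```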

@VukW VukW force-pushed the fix_train_metrics branch from f94cc15 to 26b33a9 on May 21, 2024 22:14
@VukW VukW marked this pull request as ready for review May 23, 2024 11:55
@VukW VukW changed the title [WIP] Fix train metrics Fix train metrics May 23, 2024
Geeks-Sid
Geeks-Sid previously approved these changes Jun 3, 2024
@sarthakpati sarthakpati requested a review from a team as a code owner June 4, 2024 15:12
@sarthakpati
Collaborator

@szmazurek - can you confirm if the BraTS training is working for you?

@szmazurek
Collaborator

> @szmazurek - can you confirm if the BraTS training is working for you?

Re-launched the training after pulling yesterday's merge by @Geeks-Sid. Will keep you updated if it runs, keeping the rest the same.

@szmazurek
Collaborator

szmazurek commented Jun 4, 2024

> @szmazurek - can you confirm if the BraTS training is working for you?

> Re-launched the training after pulling yesterday's merge by @Geeks-Sid. Will keep you updated if it runs, keeping the rest the same.

Still negative. The output:

Looping over training data:   0%|          | 0/6255 [02:51<?, ?it/s]
ERROR: Traceback (most recent call last):
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/bin/gandlf_run", line 126, in <module>
    main_run(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/cli/main_run.py", line 92, in main_run
    TrainingManager_split(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/training_manager.py", line 173, in TrainingManager_split
    training_loop(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/compute/training_loop.py", line 445, in training_loop
    epoch_train_loss, epoch_train_metric = train_network(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/compute/training_loop.py", line 171, in train_network
    total_epoch_train_metric[metric] += metric_val
ValueError: operands could not be broadcast together with shapes (4,) (3,) (4,) 

I am using the flexinet cosine annealing config as provided for BraTS 2021.

@sarthakpati @VukW @Geeks-Sid any ideas? Did you maybe succeed?
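The ValueError in the traceback above is a plain NumPy shape mismatch: the running per-label total has 4 entries while the incoming batch metric has only 3. A minimal reproduction (the values below are made up; only the shapes matter):

```python
import numpy as np

total_epoch_train_metric = np.zeros(4)   # accumulator sized for 4 entries
metric_val = np.array([0.9, 0.8, 0.7])   # dice_per_label returned only 3

try:
    total_epoch_train_metric += metric_val
except ValueError as err:
    # "operands could not be broadcast together with shapes (4,) (3,) (4,)"
    print(err)
```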

Review comment on GANDLF/compute/forward_pass.py (outdated, resolved)
@VukW
Contributor Author

VukW commented Jun 5, 2024

@szmazurek Can you please show the exact config you're using? I'm not familiar with the BraTS challenge :)

@sarthakpati
Collaborator

@szmazurek - I would assume I will also get the same error. I am currently running other jobs so I don't have any free slots to queue up any other training.

@szmazurek
Collaborator

Hey dears, so, apparently commenting out one parameter in the config made it work! The problem was the metric option dice_per_label; with it commented out, there was no error. I will tackle that further. I also sent you the example config via Gmail, @VukW.

@sarthakpati
Collaborator

> Hey dears, so, apparently commenting out one parameter in the config made it work! The problem was the metric option dice_per_label; with it commented out, there was no error. I will tackle that further. I also sent you the example config via Gmail, @VukW.

This will need more investigation - can you please open a new issue to track it?

(per-label metrics are not counted for one of the classes, thus the metric shape may differ)
@VukW
Contributor Author

VukW commented Jun 6, 2024

@sarthakpati @szmazurek I caught the bug with @szmazurek's model config. The issue is that when ignore_label_validation is given in the model config, metrics for that specific label are not evaluated, thus the metrics output differs from what I assumed (N_CLASSES). Fix: d0d25fb
Now it works for me.
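The idea of the fix can be sketched roughly as follows (hypothetical helper, not the actual change from commit d0d25fb): when ignore_label_validation excludes one label, pad the shorter per-label metric array back to N_CLASSES before accumulating, so the shapes line up.

```python
import numpy as np

def accumulate_per_label(total, per_label_metric, ignore_label=None):
    """Add a per-label metric array into the running total.

    If one label was ignored during validation, the metric array is one
    entry short of N_CLASSES; re-insert a zero at that position so the
    shapes match (a sketch of the idea, not the exact GaNDLF fix).
    """
    per_label_metric = np.asarray(per_label_metric, dtype=float)
    if ignore_label is not None and per_label_metric.size == total.size - 1:
        per_label_metric = np.insert(per_label_metric, ignore_label, 0.0)
    return total + per_label_metric
```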

@sarthakpati
Collaborator

@szmazurek can you confirm the fix on your end?

@szmazurek
Collaborator

@sarthakpati On it, training scheduled. Thanks @VukW for tackling that!

@szmazurek
Collaborator

szmazurek commented Jun 7, 2024

> @sarthakpati On it, training scheduled. Thanks @VukW for tackling that!

Hey, my initial tests failed; it turned out that the error spotted by @VukW was also present in the validation and test loops. I corrected it and successfully completed an entire training epoch; the changes are applied in commit 5148e86.

EDIT: I have also now initialized the training with the config in the exact same way as you sent me, @sarthakpati; will keep you posted on the results.

@sarthakpati
Collaborator

@VukW - I think the CLA bot is complaining because of 5148e86 ... Can you please remove this?

VukW added 2 commits July 18, 2024 14:52
- the same metric error was occurring in the loops in forward_pass.py; it is now fixed
- an entire epoch completes successfully
Implemented by Szymon Mazurek [email protected]
@VukW VukW force-pushed the fix_train_metrics branch from e81b63c to 0829662 on July 18, 2024 11:55
@VukW
Contributor Author

VukW commented Jul 18, 2024

😁 What a crutch.
@sarthakpati I overrode the branch's commit history, making myself the commit author (sorry, @szmazurek), so the failed check should be fixed now.
But isn't it strange that Szymon's CLA agreement was lost?

@sarthakpati
Collaborator

Thanks!

> But isn't it strange that Szymon's CLA agreement was lost?

Actually, I think he submitted a PR from a machine where git was improperly configured, and his username for that commit was registered as Mazurek, Szymon instead of szmazurek, and thus resulting in the failed CLA check. This usually happens because in the initial git setup step, git asks for Full Name when it should be asking for username, followed by email.

@sarthakpati
Collaborator

Multiple experiments have shown the validity of this PR:

  1. 2D Histology binary segmentation:

[image]

  2. 3D Radiology multi-class segmentation:

[image]

Merging this PR in, and subsequent issues are to be addressed in more PRs.

@sarthakpati sarthakpati merged commit 8b9fb47 into mlcommons:master Jul 19, 2024
20 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jul 19, 2024
4 participants