Fix train metrics #868
Conversation
1. Return the output of the whole batch, not just one item.
2. Size the ground-truth and prediction arrays to take `q_samples_per_volume` into account: the dataset size during one epoch equals `len(data) * q_samples_per_volume`, so if the dataset dataframe contains 100 records, `q_samples_per_volume = 10` (the default), and the batch size is 4, there are 250 batches of 4 elements (see the sketch below).
3. Make the ground truth take into account that `train_dataloader` is shuffled, so the ground truth is now sorted in the same order as the predictions and as `train_dataloader`.
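To make the sizing in point 2 concrete, here is a minimal, hypothetical sketch; the names `num_records`, `q_samples_per_volume`, and `batch_size` are illustrative placeholders, not the actual GaNDLF variables:

```python
import math
import numpy as np

# Illustrative values from the description above.
num_records = 100          # rows in the subject dataframe
q_samples_per_volume = 10  # patches sampled from each volume per epoch (default)
batch_size = 4

# Total samples seen in one epoch and the resulting number of batches.
samples_per_epoch = num_records * q_samples_per_volume   # 100 * 10 = 1000
num_batches = math.ceil(samples_per_epoch / batch_size)  # 1000 / 4 = 250

# Ground-truth / prediction buffers therefore need one slot per sample,
# not one slot per subject.
ground_truth = np.zeros(samples_per_epoch)
predictions = np.zeros(samples_per_epoch)
print(num_batches)  # 250
```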
To ensure values in the CSV are always written in the same order as the header.
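One hypothetical way to enforce that ordering (a sketch, not the exact GaNDLF logging code) is to iterate over the header keys when writing each row:

```python
import csv

# Hypothetical metric logging: iterating over a fixed header guarantees that
# every row is written in the same column order, even if the metrics dict
# was built in a different order.
header = ["epoch", "loss", "dice", "accuracy"]
row = {"dice": 0.91, "loss": 0.12, "accuracy": 0.88, "epoch": 3}

with open("train_logs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerow([row[key] for key in header])
```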
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Converting to draft until the tests are passing.
@sarthakpati It seems to me I fixed the code for the usual segmentation cases.
I can't imagine right now how to fix that easily without massive refactoring, so I returned the crutch (at least the code doesn't fail right now). On one hand, train metrics are calculated by averaging per sample and are thus computed properly. On the other hand, validation metrics are broken. I'd strongly prefer to disable / remove these list architectures from GaNDLF for now; what do you think?
This was turned off as a workaround in mlcommons#870.
Fixes test_train_inference_classification_histology_large_2d (35).
@szmazurek - can you confirm if the BraTS training is working for you?
Re-launched the training after pulling yesterday's merge by @Geeks-Sid. Will keep you updated if it runs, keeping the rest the same.
Still negative. The output:

```
Looping over training data: 0%| | 0/6255 [02:51<?, ?it/s]
ERROR: Traceback (most recent call last):
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/bin/gandlf_run", line 126, in <module>
    main_run(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/cli/main_run.py", line 92, in main_run
    TrainingManager_split(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/training_manager.py", line 173, in TrainingManager_split
    training_loop(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/compute/training_loop.py", line 445, in training_loop
    epoch_train_loss, epoch_train_metric = train_network(
  File "/net/tscratch/people/plgmazurekagh/gandlf/gandlf_env/lib/python3.10/site-packages/GANDLF/compute/training_loop.py", line 171, in train_network
    total_epoch_train_metric[metric] += metric_val
ValueError: operands could not be broadcast together with shapes (4,) (3,) (4,)
```

I am using the flexinet cosine annealing config as provided for BraTS 2021. @sarthakpati @VukW @Geeks-Sid any ideas? Did you maybe succeed?
@szmazurek Can you please show the exact config you're using? I'm not familiar with the BraTS challenge :)
@szmazurek - I would assume I will also get the same error. I am currently running other jobs, so I don't have any free slots to queue up any other training.
Hey dears, so apparently commenting out one parameter in the config made it work! The problem was the metric option.
This will need more investigation - can you please open a new issue to track it?
@sarthakpati @szmazurek I caught the bug with @szmazurek's model config. The issue is that per-label metrics are not counted for one of the classes, so the metric shape may differ.
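To illustrate that failure mode (a minimal sketch, not the actual GaNDLF fix): if one batch reports per-label metrics for only 3 of 4 labels, the running-total update `total_epoch_train_metric[metric] += metric_val` raises exactly the broadcast error shown in the traceback above. A defensive accumulation that pads to a fixed number of labels avoids it; `NUM_LABELS` and the zero-padding strategy here are assumptions for illustration only.

```python
import numpy as np

NUM_LABELS = 4  # assumed number of labels in the config

total_epoch_train_metric = {"dice_per_label": np.zeros(NUM_LABELS)}

def accumulate(total, metric_val, num_labels=NUM_LABELS):
    """Pad a per-label metric to a fixed length before accumulating,
    so a batch missing one label cannot break broadcasting.
    (Zero-padding at the tail is only illustrative; in practice we would
    need to know which label was skipped.)"""
    metric_val = np.asarray(metric_val, dtype=float)
    if metric_val.shape != (num_labels,):
        padded = np.zeros(num_labels)
        padded[: metric_val.size] = metric_val
        metric_val = padded
    return total + metric_val

# One batch returns metrics for only 3 labels, e.g. because one class
# is absent from the sampled patches.
batch_metric = [0.9, 0.8, 0.7]

# total_epoch_train_metric["dice_per_label"] += batch_metric  # ValueError: shapes (4,) (3,)
total_epoch_train_metric["dice_per_label"] = accumulate(
    total_epoch_train_metric["dice_per_label"], batch_metric
)
print(total_epoch_train_metric["dice_per_label"])  # [0.9 0.8 0.7 0. ]
```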
@szmazurek can you confirm the fix on your end?
@sarthakpati On it, training scheduled. Thanks @VukW for tackling that!
Hey, my initial tests failed; it turned out that the error spotted by @VukW was also present in the validation and test loops. I corrected it and successfully completed an entire training epoch; the changes are applied in commit 5148e86. EDIT: I have also now initialized the training with the config in the exact same way as you sent me, @sarthakpati, and will keep you posted on the results.
- The same metric error was occurring in the loops in forward_pass.py; now it is fixed.
- The entire epoch completes successfully.
Implemented by Szymon Mazurek [email protected]
😁 What a crutch
Thanks!
Actually, I think he submitted a PR from a machine where git was improperly configured, and his username for that commit was registered as |
Fixes: N.A.
Proposed Changes
Checklist
- CONTRIBUTING guide has been followed.
- typing is used to provide type hints, including and not limited to using Optional (if a variable has a pre-defined value).
- If a new pip install step is needed for the PR to be functional, please ensure it is reflected in all the files that control the CI, namely: python-test.yml, and all docker files [1,2,3].