Handling empty datasets in distributed metric computation #1242
Comments
@linhr thanks for the clean reproduction code snippet. I was playing around and I confirm that there can be some inconsistencies between dtypes.
Actually, is it correct to distribute the data like that, creating asymmetry? I just replaced the sampler so that there is one batch common to all processes and one batch for rank zero only, and the code seems to run as expected:

```python
import torch
import ignite.distributed as idist
from torch.utils.data import IterableDataset


class SampleDataset(IterableDataset):
    def __iter__(self):
        yield torch.zeros((4, 3)), torch.ones((4, 3))
        if idist.get_rank() == 0:
            yield torch.zeros((2, 3)), torch.ones((2, 3))
```
I agree we should fix the inconsistent dtype between the initialization and the updated accumulators.
Yes, I agree this is a bug; all other metrics use ...
Yes, looks like a bug, but I need to take a deeper look into it... If you would like to help us with a fix for the first two issues, please feel free to send a PR. Thanks again!
This will be fixed once #1238 is merged. All the metrics will create their accumulators on the right device, except Precision and Recall in the non-multilabel case, since we don't know the shape of the updates during initialization (this logic should clarify why).
I'm not sure if moving the check after the all_reduce is sufficient. In some cases you would have a shape mismatch, since the initialized value would just be a 0 while the ranks that actually updated will have some other shape, so the all-reduce would fail. Is there any way to all-reduce only a subset of ranks? If you knew the shape of ...
This should also be fixed once #1238 is merged.
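Regarding the question above about all-reducing only a subset of ranks: torch.distributed does allow collectives over a process group built from an explicit list of ranks, as in the hedged sketch below (the group membership is purely illustrative, and in the metric scenario the set of ranks that actually received data is not known up front, which is the real difficulty):

```python
import torch
import torch.distributed as dist

# Assumption: dist.init_process_group(...) has already been called on every rank.
# new_group() must be called by all processes, even those not in the group.
group = dist.new_group(ranks=[0, 1, 2])

if dist.get_rank() in (0, 1, 2):
    value = torch.ones(5)
    # Sums `value` across ranks 0-2 only; other ranks do not participate.
    dist.all_reduce(value, group=group)
```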
Thanks @vfdev-5 and @n2cholas for the reply! Glad to learn that many of the issues will be fixed in #1238!
Yeah, this issue is about an edge case that is unlikely to happen in typical settings. Here is a situation similar to my use case when I discovered the issue: there are 4 GPUs (and 4 processes) for distributed training; the validation data is partitioned into 3 files, and each process will load one or more data files during validation. We can see that one process does not handle any validation data, so metric calculation needs to take this into account. At the beginning I was puzzled by the error message, since it didn't directly point me to the data imbalance in my setting. :)
My understanding is that moving the check after ... I agree that supporting ...
@linhr thanks for providing the info about your use case. Wouldn't it be better to read everything in all processes and then split the data between processes in a more balanced way? Or to use 3 processes instead of 4?
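As a hedged illustration of that balancing idea (the file names and line-based reader are hypothetical, not from the original setup): read all shards on every process and then slice at the sample level, so each rank gets work even when there are fewer files than processes.

```python
import json

import ignite.distributed as idist


def load_samples(path):
    # Hypothetical reader: one JSON record per line.
    with open(path) as f:
        return [json.loads(line) for line in f]


# Illustrative case: 3 validation shards but 4 processes.
files = ["val-0.jsonl", "val-1.jsonl", "val-2.jsonl"]

# Every process reads everything, then keeps an interleaved slice, so each
# rank ends up with roughly the same number of samples regardless of how
# many files there are.
samples = [s for path in files for s in load_samples(path)]
my_samples = samples[idist.get_rank()::idist.get_world_size()]
```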
I think this is currently a bug, as for other metrics ... EDIT: and it seems like there is a wrong condition in precision/recall to raise ...
Yes, I agree. This could add an overhead (by pre-checking input size, etc.) for nominal cases, which is not that good.
In the multiclass case, we have ...
Thanks @vfdev-5 !
Yes, definitely. Besides fixing the issues for the edge case, I agree it is more reasonable to distribute data in a more balanced way, depending on how many processes we have.
Agreed!
Thanks for the explanation! For this non-average multiclass case, I didn't realize ... So the empty dataset is an issue for the non-average multiclass case as well. (My original post only looked at multilabel cases.) After thinking about it a bit more, I found we can probably solve the problem with two all_reduce calls, as in the script below:

```python
# test.py
import time

import torch

import ignite.distributed as idist


def all_reduce_metric(value):
    (size,) = value.shape
    max_size = idist.all_reduce(size, "MAX")
    output = torch.zeros((max_size,))
    output[:size] = value
    return idist.all_reduce(output)


def test(local_rank):
    before = torch.arange(idist.get_rank())
    after = all_reduce_metric(before)  # this is not an in-place operation
    time.sleep(idist.get_rank())  # sleep to avoid messing up `print()`
    print('before:', before, 'after:', after)


with idist.Parallel(backend="gloo") as parallel:
    parallel.run(test)
```

Run the distributed command:

```
python -m torch.distributed.launch --nproc_per_node=4 --use_env test.py
```

Output:
Let me know if the approach makes sense. I'm happy to discuss improvements/alternatives. :) For the ...
@linhr yes, this can be one of the approaches to handle variable-size data. But calling the first all_reduce on equal-sized data will be what I called the overhead in a previous message. I was thinking whether we could make the metric's collective ops like ...
Yes, here the idea is the same as what you did for all-reduce:
a) we have to create an empty max-size tensor as a placeholder (maybe undef values should be marked differently, as NaN for example);
b) put the data into the placeholders;
c) all-gather => a big tensor, i.e. a single placeholder x world_size, on each process;
d) remove the undef values and compute the metric.
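A minimal sketch of steps a)-d), assuming a flat 1-D float tensor per rank and using NaN as the marker for undefined slots (the helper name is made up for illustration):

```python
import torch

import ignite.distributed as idist


def all_gather_uneven(value):
    # a) create a max-size placeholder, marking unused slots as NaN
    (size,) = value.shape
    max_size = int(idist.all_reduce(size, "MAX"))
    placeholder = torch.full((max_size,), float("nan"))
    # b) put this rank's data into the placeholder
    placeholder[:size] = value
    # c) all-gather => one big tensor of length max_size * world_size
    gathered = idist.all_gather(placeholder)
    # d) drop the NaN padding before computing the metric
    return gathered[~torch.isnan(gathered)]
```

This of course assumes the real values are never NaN themselves; otherwise a different marker or an explicit mask would be needed.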
Since v1.7.0 PyTorch seems to support uneven inputs across participating processes: https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join
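For context, a self-contained sketch of that `join()` context manager using the same `idist.Parallel`/gloo setup as earlier in this thread; note that it handles uneven inputs for DDP gradient synchronization during training, not for ignite's metric collectives:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

import ignite.distributed as idist


def training(local_rank):
    model = DDP(nn.Linear(3, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # Uneven inputs: rank 0 processes one extra batch.
    n_batches = 2 if idist.get_rank() == 0 else 1
    with model.join():  # available since PyTorch 1.7
        for _ in range(n_batches):
            optimizer.zero_grad()
            loss = model(torch.zeros(4, 3)).sum()
            loss.backward()
            optimizer.step()


with idist.Parallel(backend="gloo") as parallel:
    parallel.run(training)
```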
🐛 Bug description
Metric computation does not work properly in distributed settings when some processes do not handle any batch in the dataset. It becomes a problem when small validation or test datasets are distributed to processes in an imbalanced manner.
How to Reproduce
Create a Python script named `main.py` with the following content. Run the following command inside a CPU Docker container with PyTorch and Ignite installed.
Problem 1
The command terminated with an error. Part of the output is shown below.

It seems there is a type inconsistency (`int` vs `float`) inside `idist.all_reduce()` when calling `compute()`, because not all processes have called `update()` at least once. A simple fix could be changing this line to `self._sum = 0.0`. However, this issue could affect other metrics as well. We probably need unit tests for such a scenario for all metrics.
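To make the dtype point concrete, here is a simplified stand-in for a metric's accumulator (illustrative only, not the actual Loss implementation): with `self._sum = 0` a rank that never calls `update()` would pass an `int` into the collective call while the others pass a `float`.

```python
import ignite.distributed as idist


class SumAccumulator:
    # Simplified stand-in for a metric's internal state, for illustration only.
    def reset(self):
        self._sum = 0.0  # float literal keeps the dtype consistent across ranks
        self._num_examples = 0

    def update(self, batch_loss, n):
        self._sum += float(batch_loss) * n
        self._num_examples += n

    def compute(self):
        # Every rank participates in the reduction, including those whose
        # accumulators are still at their initial values because they saw
        # no batches at all.
        total = idist.all_reduce(self._sum)
        count = idist.all_reduce(self._num_examples)
        return total / count
```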
Problem 2
In the above script, if we change
Loss(...)
to the precision or recall metric (e.g.Precision()
), we get the following error message.The issue is the verification should actually be moved after
idist.all_reduce()
. Although some processes may have seen empty dataset, the metric is still valid collectively.Problem 3
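A hedged sketch of that ordering (a simplified free function with illustrative arguments, not the actual Precision code): reduce the per-rank counters first, and only check for emptiness on the aggregated count.

```python
import ignite.distributed as idist
from ignite.exceptions import NotComputableError


def compute_precision(true_positives, positives, num_examples):
    # The arguments are this rank's local counters; ranks that saw an empty
    # shard simply contribute zeros to the reduction.
    true_positives = idist.all_reduce(true_positives)
    positives = idist.all_reduce(positives)
    num_examples = idist.all_reduce(num_examples)

    # Verify only the aggregated count: the metric is still valid
    # collectively even if some processes received no data at all.
    if num_examples == 0:
        raise NotComputableError("Precision needs at least one example across all processes.")
    return true_positives / (positives + 1e-15)
```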
Problem 3
After fixing Problem 2, there is still an issue with multi-label precision or recall. For example, changing `Loss(...)` to `Precision(is_multilabel=True, average=True)` and running the script will give the following error:

The issue is with this line. Because not all processes have called `update()` at least once, there is again a type inconsistency, where in some processes `self._true_positives` is of type `float` while in other processes it is a scalar tensor.
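A small sketch of the kind of fix this implies (an illustrative class, not the actual Precision code): initialize the accumulators as scalar tensors so every rank carries the same type into the collective call, whether or not it ever called `update()`.

```python
import torch


class PrecisionLikeState:
    # Minimal stand-in for the metric's accumulator handling, illustration only.
    def __init__(self, device="cpu"):
        self._device = device
        self.reset()

    def reset(self):
        # Scalar tensors instead of the Python float 0: ranks that never call
        # update() then hold the same type as ranks that did, so the later
        # all_reduce does not mix a plain float with a scalar tensor.
        self._true_positives = torch.tensor(0.0, device=self._device)
        self._positives = torch.tensor(0.0, device=self._device)

    def update(self, true_positives, positives):
        self._true_positives = self._true_positives + true_positives
        self._positives = self._positives + positives
```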
Environment
- How Ignite was installed (`conda`, `pip`, source): `pip`