-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle edge cases in DecentralizedAverager #171
Merged
Merged
Changes from 12 commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
6a2bcad
move metadata serialization outside user scope
justheuristic b8f0c68
Merge branch 'master' of github.com:learning-at-home/hivemind
justheuristic 4c8bd1c
actualize docstrings
justheuristic 0eb600f
actualize docstrings
justheuristic 153235e
retry after network erros
justheuristic 79c2fc8
retry after network errors
justheuristic 56de7da
retry after network errors
justheuristic 239daf6
raise AllreduceException on partial tensor
justheuristic 26837a9
test split/combine tensors, combine corrupted stream
justheuristic 440964a
remove unnecessary trailing spaces in test_util_modules.py
justheuristic b998053
address review by mryab@
mryab 9ea8abf
typo
justheuristic c677433
Update tests/test_util_modules.py
justheuristic 81be9a0
typo
justheuristic c667c65
Merge remote-tracking branch 'origin/averager_edgecase' into averager…
justheuristic 22c15a0
typo
justheuristic File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -1,5 +1,6 @@ | ||||||
import asyncio | ||||||
import torch | ||||||
import numpy as np | ||||||
|
||||||
import pytest | ||||||
import hivemind | ||||||
|
@@ -57,7 +58,7 @@ def test_mpfuture_cancel(): | |||||
with pytest.raises(RuntimeError): | ||||||
future.set_result(123) | ||||||
with pytest.raises(RuntimeError): | ||||||
future.set_exception(NotImplementedError) | ||||||
future.set_exception(NotImplementedError()) | ||||||
assert future.cancelled() and future.done() and not future.running() | ||||||
|
||||||
|
||||||
|
@@ -192,6 +193,42 @@ def test_serialize_tensor(): | |||||
restored = hivemind.combine_from_streaming(chunks) | ||||||
assert torch.allclose(hivemind.deserialize_torch_tensor(restored), tensor) | ||||||
|
||||||
|
||||||
def test_split_parts(): | ||||||
tensor = torch.randn(910, 512) | ||||||
serialized_tensor_part = hivemind.utils.serialize_torch_tensor(tensor, allow_inplace=False) | ||||||
chunks1 = list(hivemind.utils.split_for_streaming(serialized_tensor_part, 16384)) | ||||||
assert len(chunks1) == int(np.ceil(tensor.numel() * 4 / 16384)) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
||||||
chunks2 = list(hivemind.utils.split_for_streaming(serialized_tensor_part, 10_000)) | ||||||
assert len(chunks2) == int(np.ceil(tensor.numel() * 4 / 10_000)) | ||||||
justheuristic marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||
|
||||||
chunks3 = list(hivemind.utils.split_for_streaming(serialized_tensor_part, 10 ** 9)) | ||||||
assert len(chunks3) == 1 | ||||||
|
||||||
compressed_tensor_part = hivemind.utils.serialize_torch_tensor(tensor, hivemind.CompressionType.FLOAT16, | ||||||
allow_inplace=False) | ||||||
chunks4 = list(hivemind.utils.split_for_streaming(compressed_tensor_part, 16384)) | ||||||
assert len(chunks4) == int(np.ceil(tensor.numel() * 2 / 16384)) | ||||||
|
||||||
combined1 = hivemind.utils.combine_from_streaming(chunks1) | ||||||
combined2 = hivemind.utils.combine_from_streaming(iter(chunks2)) | ||||||
combined3 = hivemind.utils.combine_from_streaming(chunks3) | ||||||
combined4 = hivemind.utils.combine_from_streaming(chunks4) | ||||||
for combined in combined1, combined2, combined3: | ||||||
assert torch.allclose(tensor, hivemind.deserialize_torch_tensor(combined), rtol=1e-5, atol=1e-8) | ||||||
|
||||||
assert torch.allclose(tensor, hivemind.deserialize_torch_tensor(combined4), rtol=1e-3, atol=1e-3) | ||||||
|
||||||
combined_incomplete = hivemind.utils.combine_from_streaming(chunks4[:5]) | ||||||
combined_incomplete2 = hivemind.utils.combine_from_streaming(chunks4[:1]) | ||||||
combined_incomplete3 = hivemind.utils.combine_from_streaming(chunks4[:-1]) | ||||||
for combined in combined_incomplete, combined_incomplete2, combined_incomplete3: | ||||||
with pytest.raises(RuntimeError): | ||||||
_ = hivemind.deserialize_torch_tensor(combined) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is assignment necessary? |
||||||
# note: we rely on this being RuntimeError in hivemind.client.averager.allreduce.AllreduceProtocol | ||||||
|
||||||
|
||||||
def test_generic_data_classes(): | ||||||
from hivemind.utils import ValueWithExpiration, HeapEntry, DHTExpiration | ||||||
|
||||||
|
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we should narrow down the scope of this exception:
combine_from_streaming
inside the try-except blockdeserialize_torch_tensor
. If you still want an AllreduceException here, you can then catch a specific SerializerExceptionThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that errors in combine_from_streaming only manifest in deserialize_torch_tensor .