
Fix random freezes in averager.step, improve error handling #254

Merged

8 commits merged into master from master-patch-shmemory on May 6, 2021

Conversation

@justheuristic (Member) commented May 6, 2021

This patch introduces three changes:

@justheuristic (Member, Author) commented May 6, 2021

So, here's how this might explain the error reported earlier by @nevec:

  • initializing small tensors with ones/zeros -> no hang
  • initializing large tensors with ones/zeros -> averager hangs
  • initializing large tensors with empty -> no hang
  • initializing large tensors with randn -> no hang

Maybe the following explains it (see the sketch after this list):

  • large ones/zeros are filled by multiple threads, which initializes OMP -> a subsequent fork -> pytorch ops in the forked child hang
  • small ones/zeros make do with a single thread -> no OMP -> no hang
  • empty needs no item assignment, just malloc -> no OMP -> no hang
  • randn uses an RNG that is inherently sequential -> no OMP -> no hang
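
If that's right, the hang should reproduce with nothing but a large fill followed by a fork. A minimal sketch of the suspected failure mode (the tensor size and the hang itself are assumptions: whether it actually deadlocks depends on the PyTorch build, its OpenMP runtime, and the OS; fork() is POSIX-only):

    import multiprocessing as mp

    import torch


    def child():
        # any parallelized tensor op in the forked child may block forever if
        # the parent's OpenMP thread pool was already spun up before fork()
        x = torch.zeros(10_000_000)
        print("child ok:", x.sum().item())


    if __name__ == "__main__":
        _warmup = torch.ones(10_000_000)  # large fill -> multi-threaded -> initializes OMP
        p = mp.get_context("fork").Process(target=child)  # "fork" is the risky start method
        p.start()
        p.join()

If the hypothesis holds, the usual mitigations are keeping the parent single-threaded before forking (torch.set_num_threads(1) or OMP_NUM_THREADS=1), switching to the "spawn" start method, or allocating with torch.empty so no parallel fill runs; this excerpt doesn't show which of these the patch actually uses.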

@justheuristic justheuristic changed the title Patch random hanging and errors Fixed random hangs in averager.step, improved error handling May 6, 2021
@mryab mryab changed the title Fixed random hangs in averager.step, improved error handling Fix random hangs in averager.step, improve error handling May 6, 2021
@mryab mryab changed the title Fix random hangs in averager.step, improve error handling Fix random freezes in averager.step, improve error handling May 6, 2021
    @@ -199,6 +200,10 @@ async def request_join_group(self, leader: Endpoint, expiration_time: DHTExpirat
                if call is not None:
                    call.cancel()
                return None
            except (grpc.RpcError, grpc.aio.AioRpcError, InternalError, StopAsyncIteration) as e:
                logger.warning(f"{self} - failed to request potential leader {leader}: {e}")
A reviewer (Member) commented on the logger.warning line above:
Suggested change

    -            logger.warning(f"{self} - failed to request potential leader {leader}: {e}")
    +            logger.exception(f"{self} - failed to request potential leader {leader}: {e}")

@justheuristic (Member, Author) replied:
fixed it, thanks
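
For context on the accepted suggestion: logging.Logger.exception logs at ERROR level and appends the active traceback, while warning records only the formatted message, so the swap makes the failure's origin visible in the logs. A standalone illustration:

    import logging

    logging.basicConfig(level=logging.DEBUG)
    logger = logging.getLogger("averager")

    try:
        raise ConnectionError("leader dropped the stream")
    except ConnectionError as e:
        logger.warning(f"failed to request potential leader: {e}")    # message only
        logger.exception(f"failed to request potential leader: {e}")  # message + traceback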

@justheuristic justheuristic merged commit 94b9db0 into master May 6, 2021
@justheuristic justheuristic deleted the master-patch-shmemory branch May 6, 2021 15:26