
Cluster duplicate handler dying from RuntimeError due to dict change while iterating it #160

Closed
michaelweiser opened this issue Jun 10, 2020 · 2 comments · Fixed by #161

@michaelweiser
Contributor

The following traceback has been seen during shutdown at least once with v2.0:

peekaboo[984]: peekaboo.queuing - (Worker-9) - INFO - Worker 9: Stopped
peekaboo[984]: peekaboo.queuing - (MainThread) - DEBUG - 1: 32 workers still running
peekaboo[984]: peekaboo.db - (MainThread) - DEBUG - Clearing database of all in-flight samples of instance 5010.
peekaboo[984]: peekaboo.db - (MainThread) - DEBUG - Clearing database of all stale in-flight samples (900 seconds)
peekaboo[984]: Exception in thread ClusterDuplicateHandler:
peekaboo[984]: Traceback (most recent call last):
peekaboo[984]:   File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
peekaboo[984]:     self.run()
peekaboo[984]:   File "/opt/peekaboo/local/lib/python3.6/site-packages/peekaboo/queuing.py", line 327, in run
peekaboo[984]:     self.job_queue.submit_cluster_duplicates()
peekaboo[984]:   File "/opt/peekaboo/local/lib/python3.6/site-packages/peekaboo/queuing.py", line 168, in submit_cluster_duplicates
peekaboo[984]:     for sample_hash, sample_duplicates in self.cluster_duplicates.items():
peekaboo[984]: RuntimeError: dictionary changed size during iteration
peekaboo[984]: peekaboo.daemon - (MainThread) - DEBUG - Removing PID file /var/run/peekaboo/peekaboo.pid
systemd[1]: Stopped Peekaboo Extended Email Attachment Behavior Observation Owl.

There seems to be some kind of race in Queue.shut_down() between the cluster duplicate handler and queue shutdown. This is odd because shutting down the duplicate handler is the very first thing triggered, so it should not be running another cleanup pass while the queue is still shutting down its workers.

@michaelweiser michaelweiser added this to the 2.1 milestone Jun 10, 2020
@michaelweiser michaelweiser self-assigned this Jun 10, 2020
@michaelweiser michaelweiser changed the title Cluster duplicate backlog can be corrupted Cluster duplicate handler exception on shutdown Jun 10, 2020
@michaelweiser
Contributor Author

This also happens during normal operation and is due to the changed implementation of dict.items() in Python 3: it now returns a view, which does not respond well to the dictionary being changed during iteration.
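
To illustrate the Python 3 behaviour described above, here is a minimal, self-contained repro (hypothetical dict contents, not the actual Peekaboo data structures):

```python
# Minimal repro (hypothetical data, not Peekaboo code): mutating a dict
# while iterating its items() view raises RuntimeError in Python 3.
cluster_duplicates = {"hash-a": ["dup-1"], "hash-b": ["dup-2"]}

try:
    for sample_hash, duplicates in cluster_duplicates.items():
        # simulate an entry being removed mid-iteration, e.g. by another code path
        del cluster_duplicates[sample_hash]
except RuntimeError as error:
    print(error)  # dictionary changed size during iteration
```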

@michaelweiser
Contributor Author

michaelweiser commented Jun 10, 2020

In our case the cluster duplicate handler thread seems to die mid-operation from the RuntimeError, but the traceback is only logged at shutdown. (This may also be an effect of systemd caching/delaying stderr, but the delay was quite long in the one case I observed. When run interactively from the command line the traceback appears immediately.)

michaelweiser added a commit to michaelweiser/PeekabooAV that referenced this issue Jun 10, 2020
With Python 3 the cluster duplicate handler would die from RuntimeErrors
because the items() accessor of the duplicate backlog dict is a
view/iterator that doesn't respond well to the dict changing while being
iterated. Prevent the RuntimeError by iterating over the items of a copy
of the dict while changing the original, similar to what we're already
doing in the cuckoo job tracker for almost the same reason.

Fixes scVENUS#160.
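
For illustration, a minimal sketch of the pattern the commit message describes, using simplified, hypothetical names rather than the actual PeekabooAV code:

```python
# Sketch of the fix pattern (assumed names, not the real submit_cluster_duplicates()):
# iterate over a snapshot of the backlog so the original dict can be modified
# safely while looping.
cluster_duplicates = {"hash-a": ["dup-1"], "hash-b": ["dup-2"]}

for sample_hash, duplicates in list(cluster_duplicates.items()):
    # safe to drop the entry from the original dict while iterating the copy
    submitted = cluster_duplicates.pop(sample_hash)
    print("submitted %d duplicate(s) of %s" % (len(submitted), sample_hash))
```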
@michaelweiser michaelweiser changed the title Cluster duplicate handler exception on shutdown Cluster duplicate handler dying from RuntimeError due to dict change while iterating it Jun 10, 2020
michaelweiser added a commit that referenced this issue Jun 10, 2020
michaelweiser added a commit that referenced this issue Jun 11, 2020