Parallel map function can hang during interruption or externally killed workers #212

Open
Purg opened this issue Mar 3, 2016 · 7 comments

@Purg
Member

Purg commented Mar 3, 2016

When Ctrl-C'ing a parallel map in progress, a deadlock can occur.

It has also been seen that workers doing web requests can lock up, possibly because the request waits indefinitely. When the threads or processes are then killed externally, the function deadlocks and can't clean itself up properly.
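
For illustration, here is a minimal sketch of one common way to keep a parallel map responsive to Ctrl-C and to workers that die early: the collecting loop only ever blocks with a timeout, and it checks that workers are still alive before waiting again. This is not SMQTK's `parallel_map`; the name `interruptible_map`, its parameters, and the thread-based design are all assumptions made for the example.

```python
import queue
import threading


def interruptible_map(func, items, workers=2, poll=0.1):
    """Hypothetical thread-based map; not the SMQTK implementation."""
    stop = threading.Event()
    tasks = queue.Queue()
    results = queue.Queue()
    items = list(items)
    for idx, item in enumerate(items):
        tasks.put((idx, item))

    def worker():
        # All tasks are queued up front, so an empty queue means "done".
        while not stop.is_set():
            try:
                idx, item = tasks.get_nowait()
            except queue.Empty:
                return
            try:
                results.put((idx, func(item), None))
            except Exception as exc:
                # Hand the error to the main thread instead of dying silently.
                results.put((idx, None, exc))

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(workers)]
    for t in threads:
        t.start()

    out = [None] * len(items)
    received = 0
    try:
        while received < len(items):
            try:
                # Timed wait: the main thread never blocks indefinitely.
                idx, value, exc = results.get(timeout=poll)
            except queue.Empty:
                # If every worker is gone and nothing is pending, waiting
                # longer would hang forever, so fail loudly instead.
                if not any(t.is_alive() for t in threads) and results.empty():
                    raise RuntimeError("workers exited before finishing")
                continue
            if exc is not None:
                raise exc
            out[idx] = value
            received += 1
    finally:
        # Runs on success, on errors, and on KeyboardInterrupt alike.
        stop.set()
        for t in threads:
            t.join(timeout=poll * 10)
    return out
```

With this sketch, `interruptible_map(lambda x: x * x, range(5))` returns `[0, 1, 4, 9, 16]`. A worker stuck in an infinite wait would still prevent completion, but the main thread stays interruptible, and because the workers are daemon threads they do not keep the interpreter alive after the main thread exits.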

@Purg Purg added the bug label Mar 3, 2016
@Purg Purg self-assigned this Mar 3, 2016
@danlamanna
Member

danlamanna commented Nov 16, 2016

Since this can happen in the middle of GPU work, the process can be left in a state where the GPU never gets to free its memory.

FWIW, the best course of action seems to be stopping X, running nvidia-smi --gpu-reset, and starting X again.

@Purg
Member Author

Purg commented Nov 16, 2016

Haven't seen that yet... I'm assuming it happened to you?

@danlamanna
Member

Yes.

@Purg
Member Author

Purg commented Nov 16, 2016

Welp, more reason to fix this thing again...

@chrismattmann
Contributor

I'm seeing something similar here, @danlamanna and @Purg, when trying the SMQTK quickstart with docker. I have 50 images and it just hangs building the network... sometimes it gets to batch 2, sometimes it stays in batch 1:

I0422 04:53:40.881229    18 net.cpp:752] Ignoring source layer loss
  DEBUG - 2018-04-22 04:53:40,950 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Network data shape: (10, 3, 227, 227)
  DEBUG - 2018-04-22 04:53:40,950 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer
  DEBUG - 2018-04-22 04:53:40,950 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -> {'data': (10, 3, 227, 227)}
  DEBUG - 2018-04-22 04:53:40,951 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Loading image mean
  DEBUG - 2018-04-22 04:53:40,952 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Image mean file not a numpy array, assuming protobuf binary.
  DEBUG - 2018-04-22 04:53:41,325 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -- mean
  DEBUG - 2018-04-22 04:53:41,325 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -- transpose
  DEBUG - 2018-04-22 04:53:41,325 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._setup_network - Initializing data transformer -- channel swap
   INFO - 2018-04-22 04:53:41,329 - __main__.run_file_list - Computing descriptors
  DEBUG - 2018-04-22 04:53:41,330 - smqtk.compute_functions.compute_many_descriptors - Using single async call
  DEBUG - 2018-04-22 04:53:41,331 - smqtk.compute_functions.compute_many_descriptors - Computing descriptors
  DEBUG - 2018-04-22 04:53:41,331 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Checking content types; aggregating data/descriptor elements.
  DEBUG - 2018-04-22 04:53:41,332 - smqtk.utils.parallel[check-file-type].parallel_map - Using all cores (2)
  DEBUG - 2018-04-22 04:53:42,613 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.report_progress - Loops per second 29.597158 (avg 29.597158) (31 this interval / 31 total)
  DEBUG - 2018-04-22 04:53:43,505 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Given 49 unique data elements
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - 0 descriptors already computed
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Converting deque to tuple for segmentation
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Processing 6 batches of size 8
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Processing tail group of size 1
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator.compute_descriptor_async - Starting batch: 1 of 6
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._process_batch - Updating network data layer shape (8 images)
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.algorithms.descriptor_generator.caffe_descriptor.CaffeDescriptorGenerator._process_batch - Loading image pixel arrays
  DEBUG - 2018-04-22 04:53:43,912 - smqtk.utils.parallel.parallel_map - Using all cores (2)

Any ideas?
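
One way to find out where a hang like this is stuck is to have Python dump every thread's stack once a call has run longer than expected. The sketch below assumes Python 3.3+, where `faulthandler` is part of the standard library; `run_with_hang_dump` and its parameters are hypothetical names for the example, not part of SMQTK.

```python
import faulthandler
import sys


def run_with_hang_dump(func, timeout=30.0, file=None):
    """Run func(); while it is still running after `timeout` seconds,
    periodically write all thread stacks to `file` (default: stderr).

    Hypothetical helper for diagnosing hangs; `file` must be a real file
    object with a file descriptor, and must stay open during the call.
    """
    if file is None:
        file = sys.stderr
    # repeat=True re-dumps every `timeout` seconds while func() is stuck.
    faulthandler.dump_traceback_later(timeout, repeat=True, file=file)
    try:
        return func()
    finally:
        # Always disarm the watchdog, even if func() raised.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the call that hangs, for example `run_with_hang_dump(lambda: some_long_call(), timeout=60)` where `some_long_call` stands in for the stuck step, would show which line each thread is blocked on. It only reports on threads in the current process; worker processes would need their own watchdog.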

@chrismattmann
Contributor

BTW I'm using the SMQTK and Image Space quickstart dockers... the ones that reference one another.

@chrismattmann
Contributor

FWIW I was able to get this working, but only by repeatedly stopping and starting the smqtk-services docker, over and over. Occasionally it works all the way through for my 6 batches of ~50 images, but 90% of the time it just hangs.
