New Accuracy Layer on GPU interferes with training #5981

Closed
vlomonaco opened this issue Oct 15, 2017 · 9 comments · Fixed by #6202

vlomonaco commented Oct 15, 2017

Issue summary

Using the "Accuracy" layer in the "Training net" on GPU breaks the training. The layer somehow interferes with the gradient. Loss explodes quickly and Train/Test Accuracies stall to 1.

Steps to reproduce

  1. Download the latest version of Caffe (commit 691febc).
  2. Compile it with this Makefile.config.
  3. Run a network that has an Accuracy layer in the TRAIN phase (see the sketch below).
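For illustration, a minimal prototxt fragment of the kind of setup that triggers the problem; the layer and blob names here are placeholders, not taken from the reporter's network:

layer {
  name: "accuracy_train"     # placeholder name
  type: "Accuracy"
  bottom: "fc_out"           # placeholder prediction blob
  bottom: "label"
  top: "accuracy"
  include { phase: TRAIN }   # an Accuracy layer active during training exposes the bug
}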

My system configuration

Operating system: Ubuntu 14.04.5 LTS
CUDA version (if applicable): 7.0
CUDNN version (if applicable): 4.7
BLAS: libblas.so.3
Python version: 3.5

How to fix it

Any of these three workarounds fixes it (see the fragment after this list):

  • Remove the Accuracy layer from the training net; there is no problem with phase: TEST.
  • Change the back-end to CPU.
  • Roll back to before commit 62e0c85 (which I suspect introduced the issue).
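To make the first two workarounds concrete, here is a hedged sketch (layer and blob names are placeholders): either restrict the Accuracy layer to the TEST phase in the net prototxt, or switch the solver to the CPU back-end in the solver prototxt.

# net prototxt: run Accuracy only during testing
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc_out"           # placeholder prediction blob
  bottom: "label"
  top: "accuracy"
  include { phase: TEST }
}

# solver prototxt: alternative workaround, avoid the GPU code path entirely
solver_mode: CPU
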
Noiredd (Member) commented Oct 16, 2017

I can reproduce this behavior for the GPU implementation. CPU does not seem to be affected. I'll be looking into this today; in the meantime, could you take a look too, @shaibagon?

EDIT: My suspicion after half an hour of tinkering: could it be that this memory actually is used for something? That is, we use it as temporary memory, but Caffe actually does propagate back from there?

Noiredd added the bug label Oct 16, 2017
shaibagon (Member) commented:

@Noiredd Is it possible this causes the issue? I will look into it.

Noiredd (Member) commented Oct 16, 2017

Removing the if and replacing it with an unconditional NOT_IMPLEMENTED; did not change anything.

However, forcing:
caffe_gpu_set(bottom[0]->count(), Dtype(0), acc_data);
caffe_gpu_set(bottom[0]->count(), Dtype(0), counts);
at the end of Forward_gpu() fixes the problem, supporting my guess that Caffe indeed propagates from there.
This is not counter-intuitive. Think of intermediate classifiers: if a prediction blob A is a bottom to an Accuracy layer but also a bottom to, say, an InnerProduct layer, we want the InnerProduct to propagate. By reusing the blob's gradient memory as scratch space, we effectively override those other gradients.
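For context, here is a sketch of the pattern under discussion; it is not the actual Caffe source, only an illustration of a Forward_gpu() that borrows the bottom blobs' diff memory as scratch space and then wipes it as in the diagnostic above (the accuracy kernels themselves are elided):

#include <vector>
#include "caffe/layers/accuracy_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

// Illustrative sketch, not the real implementation: acc_data and counts
// alias gradient memory that any other layer sharing bottom[0] / bottom[1]
// will read during the backward pass.
template <typename Dtype>
void AccuracyLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  Dtype* acc_data = bottom[0]->mutable_gpu_diff();  // borrowed as scratch
  Dtype* counts = bottom[1]->mutable_gpu_diff();    // borrowed as scratch

  // (kernels that fill acc_data / counts and reduce them into top[0] go here)

  // Diagnostic workaround from this thread: zero the scratch so leftover
  // values are not later treated as gradients by layers below.
  caffe_gpu_set(bottom[0]->count(), Dtype(0), acc_data);
  caffe_gpu_set(bottom[1]->count(), Dtype(0), counts);
}

}  // namespace caffe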

shaibagon (Member) commented:

@Noiredd if you set counts and acc_data to zero, you are setting the gradients to zero. Thus, if Caffe does propagate from there, you have just killed the gradients.
I suppose a proper fix would require allocating an internal blob to be used as a buffer (see the sketch below).
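For reference, here is a minimal sketch of that idea; the class and member names (AccuracyLayerSketch, scratch_) are hypothetical, and this is not necessarily how the eventual fix is implemented:

#include <vector>
#include "caffe/blob.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

// Sketch only: the layer owns a private scratch blob instead of borrowing
// the bottom blobs' diff memory, so backward passes through shared bottoms
// are left untouched.
template <typename Dtype>
class AccuracyLayerSketch {
 public:
  void Reshape(const std::vector<Blob<Dtype>*>& bottom,
               const std::vector<Blob<Dtype>*>& top) {
    scratch_.ReshapeLike(*bottom[0]);  // size the buffer with the predictions
  }

  void Forward_gpu(const std::vector<Blob<Dtype>*>& bottom,
                   const std::vector<Blob<Dtype>*>& top) {
    Dtype* acc_data = scratch_.mutable_gpu_data();       // private scratch, not a diff
    caffe_gpu_set(scratch_.count(), Dtype(0), acc_data);  // reset before use
    // accuracy kernels would write intermediate results into acc_data,
    // leaving bottom[0]->gpu_diff() and bottom[1]->gpu_diff() alone
  }

 private:
  Blob<Dtype> scratch_;  // the internal buffer suggested above
};

}  // namespace caffe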

Noiredd (Member) commented Oct 16, 2017

@shaibagon Of course, this was just to prove that the problem is indeed there. I can come up with a fix in a while - unless you want to take it from here? Since you fathered this PR ;)

shaibagon (Member) commented:

@Noiredd if it is okay with you, I'd appreciate it if you could take it from here. I am not as available for caffe as I used to be :(

Noiredd (Member) commented Oct 16, 2017

@vlomonaco Check PR #5987 - does it solve the issue for you?

vlomonaco (Author) commented:

Hi @Noiredd, thank you for the fix in less than 24 hours! It works!

duygusar commented Jan 12, 2018

Hi, I have the exact same problem; somehow @Noiredd's fix didn't work for me. Besides, I have my Accuracy layer only in the TEST phase, so I don't know why I am having this problem. My batch size is not small and I have enough memory, which rules out the other causes I have come across.
