New Accuracy Layer on GPU interferes with training #5981

Closed
vlomonaco opened this issue Oct 15, 2017 · 9 comments · Fixed by #6202

vlomonaco commented Oct 15, 2017

Issue summary

Using the "Accuracy" layer in the "Training net" on GPU breaks the training. The layer somehow interferes with the gradient. Loss explodes quickly and Train/Test Accuracies stall to 1.

Steps to reproduce

  1. Download the latest version of Caffe (commit 691febc).
  2. Compile it with this Makefile.config.
  3. Run a network that has an Accuracy layer in the TRAIN phase (see the sketch below).
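For illustration, a minimal prototxt fragment of the kind of setup that triggers the problem; the layer and blob names here are placeholders, not taken from the reporter's network:

layer {
  name: "accuracy_train"     # placeholder name
  type: "Accuracy"
  bottom: "fc_out"           # placeholder prediction blob
  bottom: "label"
  top: "accuracy"
  include { phase: TRAIN }   # an Accuracy layer active during training exposes the bug
}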

My system configuration

Operating system: Ubuntu 14.04.5 LTS
CUDA version (if applicable): 7.0
CUDNN version (if applicable): 4.7
BLAS: libblas.so.3
Python version: 3.5

How to fix it

Any of these three workarounds fixes it (see the fragment after this list):

  • Remove the Accuracy layer from the training net; there is no problem with phase: TEST.
  • Change the back-end to CPU.
  • Roll back to before commit 62e0c85 (which I suspect introduced the issue).
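To make the first two workarounds concrete, here is a hedged sketch (layer and blob names are placeholders): either restrict the Accuracy layer to the TEST phase in the net prototxt, or switch the solver to the CPU back-end in the solver prototxt.

# net prototxt: run Accuracy only during testing
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc_out"           # placeholder prediction blob
  bottom: "label"
  top: "accuracy"
  include { phase: TEST }
}

# solver prototxt: alternative workaround, avoid the GPU code path entirely
solver_mode: CPU
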
Noiredd (Member) commented Oct 16, 2017

I can reproduce this behavior for the GPU implementation. CPU does not seem to be affected. I'll be looking into this today; in the meantime, could you take a look too, @shaibagon?

EDIT: My suspicion after half an hour of tinkering: could it be that this memory actually is used for something? That is, we use it as temporary memory, but Caffe actually does propagate back from there?

Noiredd added the bug label Oct 16, 2017
shaibagon (Member) commented:

@Noiredd Is it possible this causes the issue? I will look into it.

Noiredd (Member) commented Oct 16, 2017

Removing the if and replacing it with an unconditional NOT_IMPLEMENTED; did not change anything.

However, forcing:
caffe_gpu_set(bottom[0]->count(), Dtype(0), acc_data);
caffe_gpu_set(bottom[0]->count(), Dtype(0), counts);
at the end of Forward_gpu() fixes the problem, supporting my guess that Caffe indeed propagates from there.
This is not counter-intuitive. Think of intermediate classifiers: if a prediction blob A is a bottom to an Accuracy layer but also a bottom to, say, an InnerProduct layer, we want the InnerProduct to propagate. By reusing the blob's gradient memory as scratch space, we effectively override those other gradients.
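For context, here is a sketch of the pattern under discussion; it is not the actual Caffe source, only an illustration of a Forward_gpu() that borrows the bottom blobs' diff memory as scratch space and then wipes it as in the diagnostic above (the accuracy kernels themselves are elided):

#include <vector>
#include "caffe/layers/accuracy_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

// Illustrative sketch, not the real implementation: acc_data and counts
// alias gradient memory that any other layer sharing bottom[0] / bottom[1]
// will read during the backward pass.
template <typename Dtype>
void AccuracyLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  Dtype* acc_data = bottom[0]->mutable_gpu_diff();  // borrowed as scratch
  Dtype* counts = bottom[1]->mutable_gpu_diff();    // borrowed as scratch

  // (kernels that fill acc_data / counts and reduce them into top[0] go here)

  // Diagnostic workaround from this thread: zero the scratch so leftover
  // values are not later treated as gradients by layers below.
  caffe_gpu_set(bottom[0]->count(), Dtype(0), acc_data);
  caffe_gpu_set(bottom[1]->count(), Dtype(0), counts);
}

}  // namespace caffe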

shaibagon (Member) commented:

@Noiredd if you set counts and acc_data to zero, you are setting the gradients to zero. Thus, if Caffe does propagate from there, you have just killed the gradients.
I suppose a proper fix would require allocating an internal blob to be used as a buffer (see the sketch below).
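For reference, here is a minimal sketch of that idea; the class and member names (AccuracyLayerSketch, scratch_) are hypothetical, and this is not necessarily how the eventual fix is implemented:

#include <vector>
#include "caffe/blob.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

// Sketch only: the layer owns a private scratch blob instead of borrowing
// the bottom blobs' diff memory, so backward passes through shared bottoms
// are left untouched.
template <typename Dtype>
class AccuracyLayerSketch {
 public:
  void Reshape(const std::vector<Blob<Dtype>*>& bottom,
               const std::vector<Blob<Dtype>*>& top) {
    scratch_.ReshapeLike(*bottom[0]);  // size the buffer with the predictions
  }

  void Forward_gpu(const std::vector<Blob<Dtype>*>& bottom,
                   const std::vector<Blob<Dtype>*>& top) {
    Dtype* acc_data = scratch_.mutable_gpu_data();       // private scratch, not a diff
    caffe_gpu_set(scratch_.count(), Dtype(0), acc_data);  // reset before use
    // accuracy kernels would write intermediate results into acc_data,
    // leaving bottom[0]->gpu_diff() and bottom[1]->gpu_diff() alone
  }

 private:
  Blob<Dtype> scratch_;  // the internal buffer suggested above
};

}  // namespace caffe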

Noiredd (Member) commented Oct 16, 2017

@shaibagon Of course, this was just to prove that the problem is indeed there. I can come up with a fix in a while - unless you want to take it from here? Since you fathered this PR ;)

shaibagon (Member) commented:

@Noiredd if it is okay with you, I'd appreciate it if you could take it from here. I am not as available for caffe as I used to be :(

Noiredd (Member) commented Oct 16, 2017

@vlomonaco Check PR #5987 - does it solve the issue for you?

vlomonaco (Author) commented:

Hi @Noiredd, thank you for the fix in less than 24 hours! It works!

duygusar commented Jan 12, 2018

Hi, I have the exact same problem; somehow @Noiredd's fix didn't work for me. Besides, I have my Accuracy layer only in the TEST phase, so I don't know why I am having this problem. My batch size is not small and I have enough memory, which rules out the other causes I have come across.
