TestPowerGradientShiftZero, TestPowerGradient fail with certain boost #1252

Closed

shelhamer opened this issue Oct 10, 2014 · 14 comments

@shelhamer (Member)

The PowerLayer::Backward checks seem to fail with certain versions of boost on OS X / Ubuntu.

boost 1.55 passes, but boost 1.56 and 1.57 fail.

[ RUN      ] PowerLayerTest/0.TestPowerGradientShiftZero
./include/caffe/test/test_gradient_check_util.hpp:166: Failure
The difference between computed_gradient and estimated_gradient is 0.16171693801879883, which exceeds threshold_ * scale, where
computed_gradient evaluates to 6.6543664932250977,
estimated_gradient evaluates to 6.8160834312438965, and
threshold_ * scale evaluates to 0.068160831928253174.
debug: (top_id, top_data_id, blob_id, feat_id)=0,65,0,65; feat = 0.027440188452601433; objective+ = 0.55363553762435913; objective- = 0.41731387376785278
[  FAILED  ] PowerLayerTest/0.TestPowerGradientShiftZero, where TypeParam = caffe::FloatCPU (3 ms)

#######################################################################################################################
[ RUN      ] PowerLayerTest/1.TestPowerGradientShiftZero
./include/caffe/test/test_gradient_check_util.hpp:166: Failure
The difference between computed_gradient and estimated_gradient is 0.66645549482483268, which exceeds threshold_ * scale, where
computed_gradient evaluates to 9.0545713301684909,
estimated_gradient evaluates to 9.7210268249933236, and
threshold_ * scale evaluates to 0.097210268249933243.
debug: (top_id, top_data_id, blob_id, feat_id)=0,66,0,66; feat = 0.016829367669263143; objective+ = 0.48941214282974049; objective- = 0.29499160632987403
./include/caffe/test/test_gradient_check_util.hpp:166: Failure
The difference between computed_gradient and estimated_gradient is 0.48462038369962279, which exceeds threshold_ * scale, where
computed_gradient evaluates to 8.4754232528829139,
estimated_gradient evaluates to 8.9600436365825367, and
threshold_ * scale evaluates to 0.089600436365825362.
debug: (top_id, top_data_id, blob_id, feat_id)=0,71,0,71; feat = 0.01869104873835549; objective+ = 0.50171265479916738; objective- = 0.32251178206751663
./include/caffe/test/test_gradient_check_util.hpp:166: Failure
The difference between computed_gradient and estimated_gradient is 0.24489777273781588, which exceeds threshold_ * scale, where
computed_gradient evaluates to 7.3061654292184715,
estimated_gradient evaluates to 7.5510632019562873, and
threshold_ * scale evaluates to 0.075510632019562873.
debug: (top_id, top_data_id, blob_id, feat_id)=0,99,0,99; feat = 0.023657563288965969; objective+ = 0.53224239845290788; objective- = 0.38122113441378214
[  FAILED  ] PowerLayerTest/1.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleCPU (4 ms)

#######################################################################################################################
[ RUN      ] PowerLayerTest/1.TestPowerGradient
./include/caffe/test/test_gradient_check_util.hpp:166: Failure
The difference between computed_gradient and estimated_gradient is 1.206900511134485, which exceeds threshold_ * scale, where
computed_gradient evaluates to 10.15816285551514,
estimated_gradient evaluates to 11.365063366649625, and
threshold_ * scale evaluates to 0.11365063366649626.
debug: (top_id, top_data_id, blob_id, feat_id)=0,57,0,57; feat = 2.9055876775560447; objective+ = 0.46979585546340097; objective- = 0.24249458813040844
[  FAILED  ] PowerLayerTest/1.TestPowerGradient, where TypeParam = caffe::DoubleCPU (3 ms)
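(For reference, the checker's estimated_gradient is a central finite difference of the objective, so the numbers in each failure are related by

estimated_gradient = (objective+ - objective-) / (2 * stepsize)

Plugging in the first failure's values, with the 1e-2 stepsize these runs appear to use, gives (0.55363553762435913 - 0.41731387376785278) / 0.02 ≈ 6.81608, which matches the logged estimated_gradient up to float rounding; threshold_ * scale is then just 1% of that magnitude.)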
@mprat (Contributor) commented Nov 13, 2014

This failed for me with the native BLAS provided by OS X 10.9, so I tried OpenBLAS, and it gave me the same 3 errors. Does anyone have any suggestions for getting OpenBLAS to work?

@II-Matto

I built Caffe with MKL and also encountered such failures; to be specific, there were actually six failed tests. The boost library used is the newest version, 1.57.0, with Anaconda Python 2.7.

[----------] Global test environment tear-down
[==========] 838 tests from 169 test cases ran. (1664414 ms total)
[ PASSED ] 832 tests.
[ FAILED ] 6 tests, listed below:
[ FAILED ] PowerLayerTest/0.TestPowerGradientShiftZero, where TypeParam = caffe::FloatCPU
[ FAILED ] PowerLayerTest/1.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleCPU
[ FAILED ] PowerLayerTest/1.TestPowerGradient, where TypeParam = caffe::DoubleCPU
[ FAILED ] PowerLayerTest/2.TestPowerGradientShiftZero, where TypeParam = caffe::FloatGPU
[ FAILED ] PowerLayerTest/3.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleGPU
[ FAILED ] PowerLayerTest/3.TestPowerGradient, where TypeParam = caffe::DoubleGPU

Do the failures indicate that Caffe will not work correctly? How should I deal with them?

@mprat (Contributor) commented Nov 14, 2014

I also just tried compiling with MKL for all my libraries and I am still getting the errors in TestPowerGradient. @II-Matto, you have 6 errors and I have 3 because you are using GPU compilation and I am not.

@mprat (Contributor) commented Nov 14, 2014

I got it to work with MKL and Boost 1.55.

@geekan commented Nov 16, 2014

I've faced the same problem and solved it: I uninstalled Boost 1.56, installed Boost 1.55, and reinstalled Caffe, and all tests passed (with OpenBLAS).

@relh commented Jan 12, 2015

Still having the same errors with Boost 1.57; downgrading to 1.55 solved the problem.

@lou-k commented Jan 14, 2015

I think the BLAS issue here is a red herring; the tests passed for me with Atlas and Boost 1.55.

Boost 1.56 failed with both OpenBLAS and Atlas.

@relh commented Jan 14, 2015

Agreed, a boost problem then.

@svanschalkwyk

I'm getting it with Boost 1.54.0 on Ubuntu 14.04, with MKL from Intel C++ version 15.
Any other ideas?

@shelhamer shelhamer changed the title TestPowerGradientShiftZero, TestPowerGradient fail with vecLib on OS X TestPowerGradientShiftZero, TestPowerGradient fail with vecLib with certain boost Jan 20, 2015
@shelhamer shelhamer changed the title TestPowerGradientShiftZero, TestPowerGradient fail with vecLib with certain boost TestPowerGradientShiftZero, TestPowerGradient fail with certain boost Jan 20, 2015
@lazywei commented Jan 31, 2015

Confirming the same problem here: CentOS 6, Boost 1.57, mkl-203.

[----------] Global test environment tear-down
[==========] 838 tests from 169 test cases ran. (98290 ms total)
[  PASSED  ] 832 tests.
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] PowerLayerTest/0.TestPowerGradientShiftZero, where TypeParam = caffe::FloatCPU
[  FAILED  ] PowerLayerTest/1.TestPowerGradient, where TypeParam = caffe::DoubleCPU
[  FAILED  ] PowerLayerTest/1.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleCPU
[  FAILED  ] PowerLayerTest/2.TestPowerGradientShiftZero, where TypeParam = caffe::FloatGPU
[  FAILED  ] PowerLayerTest/3.TestPowerGradient, where TypeParam = caffe::DoubleGPU
[  FAILED  ] PowerLayerTest/3.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleGPU

@dgolden1 (Contributor) commented Feb 3, 2015

@shelhamer, as users, should we be concerned about the test failures with Boost 1.57? Will Caffe give erroneous results? Or can we ignore the failures for now?

@blackyang

Same problem (6 failed tests) at first, on OS X 10.9.5 with Atlas, Boost 1.57, and Anaconda Python 2.7. After downgrading boost to 1.55 (everything else unchanged) and reinstalling Caffe, it works now.

@shelhamer (Member Author)

While I can't dismiss these numerical errors, the consolation is that they are isolated to PowerLayer, and quite rare at that: only 1-3 out of 120 elements are out of tolerance. So only models that define a POWER layer, or make the rare choice of the WITHIN_CHANNEL mode of the LRN layer, are at risk -- and even these might be fine. That said, these errors are worth resolving.

shelhamer added a commit to shelhamer/caffe that referenced this issue Feb 6, 2015
The gradient checker fails on certain elements of the PowerLayer checks,
but only 1-3 sometimes fail out of the 120 elements tested. This is not
due to any numerical issue in the PowerLayer, but the distribution of
the random inputs for the checks.

boost 1.56 switched the normal distribution RNG engine from Box-Muller
to Ziggurat.
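To make the RNG dependence concrete, here is a minimal sketch (an illustration, not Caffe's actual filler code; the seed 1701 mirrors the seed Caffe's tests conventionally use, taken here as an assumption). The same seeded engine yields a different Gaussian sample stream under boost <= 1.55 (Box-Muller) than under boost >= 1.56 (Ziggurat), so the gradient checks run on different random inputs.

```cpp
// Sketch only (assumption: Caffe's GaussianFiller draws through
// boost::normal_distribution). Compiled against boost <= 1.55 vs >= 1.56,
// the same seed prints different values, because normal_distribution
// switched from Box-Muller to the Ziggurat algorithm in 1.56.
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <iostream>

int main() {
  boost::mt19937 rng(1701);  // fixed seed, as in Caffe's tests (assumption)
  boost::random::normal_distribution<double> gauss(0.0, 1.0);  // mean 0, sigma 1
  for (int i = 0; i < 5; ++i) {
    std::cout << gauss(rng) << std::endl;  // stream depends on the boost version
  }
  return 0;
}
```

Same program, different boost, different inputs: tests that happened to avoid the PowerLayer's ill-conditioned region under 1.55 can land in it under 1.56/1.57.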
@shelhamer (Member Author)

I looked into this a little and @jeffdonahue was quick to note that the boost RNG is used by all the fillers regardless of mode -- and I found this boost thread on RNG which notes that the normal distribution RNG was rewritten for the 1.56 release. A little good old-fashioned hand calculation confirmed this is nothing more than a precision error, so #1840 fixes it by reducing the step size for the finite differencing.

There's no need to keep to boost 1.55.

(For those who like RNG, the switch was from Box-Muller to Ziggurat.)
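To see the precision error concretely, here is a standalone sketch (not Caffe code; the power 0.37 and the point x = 0.0274 are illustrative assumptions chosen to match the magnitude of the feat values logged above, not the test's actual parameters). A central difference of f(x) = x^p with fractional p is badly biased when the step h is comparable to x, and shrinking h restores agreement with the analytic derivative.

```cpp
// Standalone sketch (not Caffe code): why a finite-difference check on a
// PowerLayer-like f(x) = x^p fails near small x with step h = 1e-2 but
// passes with h = 1e-3. Parameters are illustrative assumptions.
#include <cmath>
#include <cstdio>

int main() {
  const double p = 0.37;    // a fractional power
  const double x = 0.0274;  // same order as the "feat" values in the logs
  const double analytic = p * std::pow(x, p - 1.0);  // f'(x) = p * x^(p-1)

  const double steps[] = {1e-2, 1e-3};
  for (double h : steps) {
    // Central difference, as in test_gradient_check_util.hpp
    const double estimated = (std::pow(x + h, p) - std::pow(x - h, p)) / (2.0 * h);
    std::printf("h = %g: estimated = %.6f, analytic = %.6f, relative error = %.3f%%\n",
                h, estimated, analytic,
                100.0 * std::fabs(estimated - analytic) / analytic);
  }
  return 0;
}
```

With h = 1e-2 the relative error is a couple of percent, past the checker's 1e-2 tolerance; with h = 1e-3 it drops well below it. The layer's computed gradient is exact; only the finite-difference reference is off.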

pannous pushed a commit to pannous/caffe that referenced this issue Feb 6, 2015
slayton58 pushed a commit to slayton58/caffe that referenced this issue Mar 4, 2015