[MXNET-560] Add temperature parameter in Softmax operator #11466
Conversation
…forest/incubator-mxnet into feature/operator_enhancement
@rahul003 @haojin2 @eric-haibin-lin @sandeep-krishnamurthy @Roshrini @szha Thanks all for your reviews on my previous [WIP] PR. I have created this new one for review because of the mess-up caused by my previous merge. This PR also addresses the comments raised in the previous one.
src/operator/nn/softmax-inl.h
Outdated
@@ -53,7 +53,7 @@ struct log_softmax_fwd {
 template<typename OP, typename DType, int ndim>
 inline void Softmax(Stream<cpu> *s, DType *in, DType *out,
-                    Shape<ndim> shape, int axis) {
+                    Shape<ndim> shape, int axis, const float temperature) {
should we support double precision and half precision too?
This is a parameter, so it should be of a specific type. I think to ensure the highest precision we should take this parameter in as a double type and cast it to the DType during runtime.
Thanks for the suggestion. I have changed the data type from float to the generic DType.
@pengzhao-intel @zheng-da do you know if MKLDNN supports the temperature parameter? If not, then we will just have to fall back to fcompute.
Thanks for the contribution. It looks in good shape.
data = np.random.uniform(-2, 2, size=shape)
sym = mx.sym.softmax(axis=axis, temperature=temp)
expected = np_softmax(data, axis=axis, temperature=temp)
check_symbolic_forward(sym, [data], [expected])
Any reason why check_symbolic_backward is not tested?
Added a check_symbolic_backward test.
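For reference, a rough sketch of what such a backward check could look like (not the exact test code). It reuses `data`, `sym`, `expected`, `axis`, and `temp` from the forward check above and computes the expected gradient analytically in NumPy, with the extra 1/temperature factor discussed further down in this thread:

```python
# Sketch only: analytic softmax-with-temperature gradient as the reference.
out_grad = np.ones_like(data)
inner = (out_grad * expected).sum(axis=axis, keepdims=True)
expected_grad = expected * (out_grad - inner) / temp  # softmax Jacobian, scaled by 1/temperature
check_symbolic_backward(sym, [data], [out_grad], [expected_grad])
```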
src/operator/nn/softmax.cc
Outdated
@@ -77,10 +78,13 @@ MXNET_OPERATOR_REGISTER_UNARY(softmax)
The resulting array contains elements in the range (0,1) and the elements along the given axis sum up to 1.

.. math::
   softmax(\mathbf{z/t})_j = \frac{e^{z_j/t}}{\sum_{k=1}^K e^{z_k/t}}
Why not remove line 82?
Removed
@eric-haibin-lin Addressed your comments. Please review again at your convenience. Thanks.
@slitsey Could you please check whether this enhancement meets your request?
@sxjscience You raised a good point; I should have thought it through earlier. Running the forward() function gives the same result if you just feed data/t as the input. However, I am not sure the backward() function will behave the same, since the derivative of the softmax function is different with the temperature parameter in the exponential term. We need more expert opinion on this.
@apeforest @sxjscience I think backpropagation will be different for a SoftmaxOutput final layer, because the derivative of the softmax with temperature has an extra factor of 1/temperature. This leads to an extra factor of 1/temperature in all the d(error)/d(weight) terms. (Someone is welcome to double-check my math.) So it's not the same as uniformly scaling the data. Another way to think about this is that the derivatives are still taken with respect to the unscaled data: d f(x/T)/dx is what is used in backpropagation, not d f(x/T)/d(x/T), where f is the usual softmax function (temperature of 1). See also the comment here by isarandi on the first answer.
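Writing out the chain rule makes the extra factor explicit. With s_j = softmax(z/T)_j = e^{z_j/T} / \sum_k e^{z_k/T}, differentiating with respect to the unscaled input z_i gives

\frac{\partial s_j}{\partial z_i} = \frac{1}{T}\, s_j \left( \delta_{ij} - s_i \right)

so every input gradient picks up a factor of 1/T compared with the T = 1 case, which is exactly the 1/temperature factor described above.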
@slitsey It's actually doable with existing layers: do an _mul_scalar first and then feed the output to a normal softmax; the backward of _mul_scalar will apply the 1/T to the gradient during back propagation.
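A minimal sketch of that composition (assuming the example temperature `T` below; multiplying a Symbol by a Python scalar should lower to an _mul_scalar node, whose backward pass multiplies the incoming gradient by the same 1/T):

```python
import mxnet as mx

T = 0.5                                  # example temperature (assumption for illustration)
data = mx.sym.Variable('data')
scaled = data * (1.0 / T)                # becomes an _mul_scalar node in the graph
out = mx.sym.softmax(scaled, axis=-1)    # plain softmax on the scaled logits
```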
@haojin2 That's a good point. Would that still be as computationally efficient? Would adding an extra layer like this make things appreciably slower than having a native parameter, or would it be essentially the same?
@slitsey I don't think it would introduce a lot of overhead, plus we wouldn't be making that many changes if we use existing ops.
@haojin2 That's what I figured. In that case, I leave it up to the community to decide how to proceed. If we decide not to incorporate the parameter natively and go with the _mul_scalar layer solution, I would at least request a documentation change to SoftmaxOutput pointing this out as an example use case of chaining these two layers, given how prominent it is.
@slitsey Actually, going in the direction I just described would require some change in the Python code for SoftmaxOutput, or we can make a new API named something like SoftmaxWithTemperature, and we can then have the proper documentation there.
@apeforest I'm fine with that idea. Would we maintain the softmax vs SoftmaxOutput distinction too? So two new API additions?
Well, we also have the SoftmaxActivation thing... so I guess three new API additions then. But however we do this, there will be some API changes, and since the usage of this is not as big as the T=1 case, I would prefer that we add this in the form of new APIs. Just to be clear, I'm not opposed to adding a new option to the existing APIs, but I also want to make sure we have minimal performance regression and introduce the least confusion for our users.
@haojin2 That makes sense. To play devil's advocate, I would argue that having six different variations of softmax is a bit ridiculous, and more confusing than adding a single parameter to existing functions. Since the default behavior is unchanged, most users won't even know the difference from a UX perspective. You can't create a new function every time someone wants to add a new feature to an existing function; that kind of function proliferation generates far more user confusion than adding a parameter they can safely ignore. The performance regression argument is much more significant. However, while I'm not an expert on this, I suspect that adding an if statement to a function's execution is about as minimal a performance drop as you can get (especially given that it's just deciding whether or not to perform a floating point division operation!). Most other modifications would have a more substantial impact. Feel free to correct me if I'm mistaken on this point. Finally, making an enhancement to an important function like softmax can itself induce use of that feature. I wouldn't be surprised if the fraction of softmax users that employ this feature ends up being higher than one may think, given that it's a hyperparameter applicable to most use cases that tends to be ignored only out of unfamiliarity or to maintain simplicity. Anyway, I'll let everyone else weigh in. For my own use case, having new SoftmaxWithTemperature functions would be fine, but I suspect for the library as a whole, it may be better to incorporate the change within the existing functions.
Actually, something I did not mention is compatibility: modified APIs will affect the exported JSON file for models and may cause problems for earlier versions of MXNet. As for the performance issue, it's more than an additional if statement; you can see that it also affects MKLDNN's version.
@haojin2 If we decided to just call _mul_scalar under the hood without actually modifying the backend implementation, then we wouldn't have the compatibility issue, right? If that's the case, adding an option to the existing softmax function seems the more desirable approach.
That sounds like a reasonable approach. How would that affect performance?
If we add an if statement up front instead of inside every operation, the performance impact should be minimal.
@haojin2 Does this approach sound like something you would be comfortable with? It seems like it resolves the issues we've discussed in an elegant and efficient way.
@haojin2 Changed the temperature parameter to optional as we discussed. Added a unit test.
@piiswrong @reminisce @anirudh2290 @eric-haibin-lin Please give a review on this when you have time, thanks!
@eric-haibin-lin Whom should I ask to merge this PR if it's approved?
src/operator/nn/softmax-inl.h
Outdated
for (index_t j = 0; j < M; ++j) {
  sum += std::exp(in[base + j*sa] - mmax);
}
if (temperature == 1.0) {
Could you add a comment explaining why you are branching on 1.0, and also the fact that this is not useful for GPU?
Generally it's good practice to add comments around seemingly obvious code that needs special handling, or code that you only figured out after scratching your head :)
By default the value of temperature is 1.0; users will set other values mainly in reinforcement learning use cases. On CPU, the compiler cannot optimize away this "divide-by-1.0" computation at runtime, so I added a branch here. The performance difference was measured using the example shown in the Description of this PR. The branch is not added in the GPU kernel because branching would add extra overhead on GPU.
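To make the rationale concrete, here is a NumPy-style sketch of the same idea (not the actual kernel code): branch once on the temperature so the default path never pays for the divide.

```python
import numpy as np

def softmax_with_temperature(x, axis=-1, temperature=1.0):
    z = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    if temperature != 1.0:                    # skip the divide entirely in the common T == 1.0 case
        z = z / temperature
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)
```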
I meant in the code :)
Thanks for your suggestion. I have added two comments in the code to make this clear for other developers in the future.
* Add temperature parameter in softmax operator and add a unit test
* Optimize runtime when temperature is set to default 1.0
Description
This PR addresses the request to have a native temperature parameter in the softmax functions. See the issue for a more detailed discussion.
I have added the temperature parameter to the softmax operator. By default the temperature value is 1.0, and both functions behave the same as when the temperature parameter is not specified.
Verified the change with a quick check in Python.
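A minimal sketch of that kind of check (not the original verification snippet), assuming the temperature argument added by this PR is exposed on mx.nd.softmax:

```python
import mxnet as mx
import numpy as np

x = mx.nd.array(np.random.uniform(-2, 2, size=(3, 4)))
default_out = mx.nd.softmax(x, axis=-1)                    # temperature defaults to 1.0
explicit_out = mx.nd.softmax(x, axis=-1, temperature=1.0)  # should match the default exactly
sharper_out = mx.nd.softmax(x, axis=-1, temperature=0.5)   # lower temperature -> sharper distribution

print(np.allclose(default_out.asnumpy(), explicit_out.asnumpy()))
```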
It should print True, confirming that the default output matches the explicit temperature=1.0 output.
Checklist
Essentials
Changes
Comments