[MXNET-560] Add temperature parameter in Softmax operator #11466
Conversation
…forest/incubator-mxnet into feature/operator_enhancement
@rahul003 @haojin2 @eric-haibin-lin @sandeep-krishnamurthy @Roshrini @szha Thanks all for your reviews on my previous [WIP] PR. I have created this new one for review because of the mess-up caused by my previous merge. This PR also addresses the comments raised in the previous one.
src/operator/nn/softmax-inl.h
Outdated
@@ -53,7 +53,7 @@ struct log_softmax_fwd {
 template<typename OP, typename DType, int ndim>
 inline void Softmax(Stream<cpu> *s, DType *in, DType *out,
-                    Shape<ndim> shape, int axis) {
+                    Shape<ndim> shape, int axis, const float temperature) {
should we support double precision and half precision too?
This is a parameter, so it should be of a specific type. I think to ensure the highest precision we should take this parameter in as a double type and cast it to the DType during runtime.
Thanks for the suggestion. I have changed the data type from float to the generic DType.
@pengzhao-intel @zheng-da do you know if MKLDNN supports the temperature parameter? If not, then we will just have to fall back to fcompute.
Thanks for the contribution. It looks in good shape.
data = np.random.uniform(-2, 2, size=shape)
sym = mx.sym.softmax(axis=axis, temperature=temp)
expected = np_softmax(data, axis=axis, temperature=temp)
check_symbolic_forward(sym, [data], [expected])
Any reason why check_symbolic_backward is not tested?
Added a check_symbolic_backward test.
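For reference, a rough sketch of what such a backward check could look like (not the exact test code). It reuses `data`, `sym`, `expected`, `axis`, and `temp` from the forward check above and computes the expected gradient analytically in NumPy, with the extra 1/temperature factor discussed further down in this thread:

```python
# Sketch only: analytic softmax-with-temperature gradient as the reference.
out_grad = np.ones_like(data)
inner = (out_grad * expected).sum(axis=axis, keepdims=True)
expected_grad = expected * (out_grad - inner) / temp  # softmax Jacobian, scaled by 1/temperature
check_symbolic_backward(sym, [data], [out_grad], [expected_grad])
```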
src/operator/nn/softmax.cc
Outdated
@@ -77,10 +78,13 @@ MXNET_OPERATOR_REGISTER_UNARY(softmax)
The resulting array contains elements in the range (0,1) and the elements along the given axis sum up to 1.

.. math::
   softmax(\mathbf{z/t})_j = \frac{e^{z_j/t}}{\sum_{k=1}^K e^{z_k/t}}
Why not remove line 82?
Removed
@eric-haibin-lin Addressed your comments. Please review again at your convenience. Thanks.
@slitsey Could you please check whether this enhancement meets your request?
@sxjscience You raised a good point; I should have thought it through earlier. Running the forward() function gives the same result if you just feed data/t as the input. However, I am not sure the backward() function will behave the same, since the derivative of the softmax function is different with the temperature parameter in the exponential term. We need more expert opinion on this.
@apeforest @sxjscience I think backpropagation will be different for a SoftmaxOutput final layer, because the derivative of the softmax with temperature has an extra factor of 1/temperature. This leads to an extra factor of 1/temperature in all the d(error)/d(weight) terms. (Someone is welcome to double-check my math.) So it's not the same as uniformly scaling the data. Another way to think about this is that the derivatives are still taken with respect to the unscaled data: d f(x/T)/dx is what is used in backpropagation, not d f(x/T)/d(x/T), where f is the usual softmax function (temperature of 1). See also the comment here by isarandi on the first answer.
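Writing out the chain rule makes the extra factor explicit. With s_j = softmax(z/T)_j = e^{z_j/T} / \sum_k e^{z_k/T}, differentiating with respect to the unscaled input z_i gives

\frac{\partial s_j}{\partial z_i} = \frac{1}{T}\, s_j \left( \delta_{ij} - s_i \right)

so every input gradient picks up a factor of 1/T compared with the T = 1 case, which is exactly the 1/temperature factor described above.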
@slitsey It's actually doable with existing layers: do an _mul_scalar first and then feed the output to a normal softmax; the backward of _mul_scalar will apply the 1/T to the gradient during back propagation.
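A minimal sketch of that composition (assuming the example temperature `T` below; multiplying a Symbol by a Python scalar should lower to an _mul_scalar node, whose backward pass multiplies the incoming gradient by the same 1/T):

```python
import mxnet as mx

T = 0.5                                  # example temperature (assumption for illustration)
data = mx.sym.Variable('data')
scaled = data * (1.0 / T)                # becomes an _mul_scalar node in the graph
out = mx.sym.softmax(scaled, axis=-1)    # plain softmax on the scaled logits
```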
@haojin2 That's a good point. Would that still be as computationally efficient? Would adding an extra layer like this make things appreciably slower than having a native parameter, or would it be essentially the same?
@slitsey I don't think it would introduce a lot of overhead, plus we wouldn't be making that many changes if we use existing ops.
@haojin2 That's what I figured. In that case, I leave it up to the community to decide how to proceed. If we decide not to incorporate the parameter natively and go with the _mul_scalar layer solution, I would at least request a documentation change to SoftmaxOutput pointing this out as an example use case of chaining these two layers, given how prominent it is.
@slitsey Actually, going in the direction I just described would require some change in the Python code for SoftmaxOutput, or we can make a new API named something like SoftmaxWithTemperature, and we can then have the proper documentation there.
@apeforest I'm fine with that idea. Would we maintain the softmax vs SoftmaxOutput distinction too? So two new API additions?
Well, we also have the SoftmaxActivation thing... so I guess three new API additions then. But however we do this, there will be some API changes, and since the usage of this is not as big as the T=1 case, I would prefer that we add this in the form of new APIs. Just to be clear, I'm not opposed to adding a new option to the existing APIs, but I also want to make sure we have minimal performance regression and introduce the least confusion for our users.
@haojin2 That makes sense. To play devil's advocate, I would argue that having six different variations of softmax is a bit ridiculous, and more confusing than adding a single parameter to existing functions. Since the default behavior is unchanged, most users won't even know the difference from a UX perspective. You can't create a new function every time someone wants to add a new feature to an existing function; that kind of function proliferation generates far more user confusion than adding a parameter they can safely ignore. The performance regression argument is much more significant. However, while I'm not an expert on this, I suspect that adding an if statement to a function's execution is about as minimal a performance drop as you can get (especially given that it's just deciding whether or not to perform a floating point division operation!). Most other modifications would have a more substantial impact. Feel free to correct me if I'm mistaken on this point. Finally, making an enhancement to an important function like softmax can itself induce use of that feature. I wouldn't be surprised if the fraction of softmax users that employ this feature ends up being higher than one may think, given that it's a hyperparameter applicable to most use cases that tends to be ignored only out of unfamiliarity or to maintain simplicity. Anyway, I'll let everyone else weigh in. For my own use case, having new SoftmaxWithTemperature functions would be fine, but I suspect for the library as a whole, it may be better to incorporate the change within the existing functions.
Actually, something I did not mention is compatibility: modified APIs will affect the exported JSON file for models and may cause problems for earlier versions of MXNet. As for the performance issue, it's more than an additional if statement; you can see that it also affects MKLDNN's version.
@haojin2 If we decided to just call _mul_scalar under the hood without actually modifying the backend implementation, then we wouldn't have the compatibility issue, right? If that's the case, adding an option to the existing softmax function seems the more desirable approach.
That sounds like a reasonable approach. How would that affect performance?
If we add an if statement up front instead of inside every operation, the performance impact should be minimal.
@haojin2 Does this approach sound like something you would be comfortable with? It seems like it resolves the issues we've discussed in an elegant and efficient way.
@haojin2 Changed the temperature parameter to optional as we discussed. Added a unit test.
@piiswrong @reminisce @anirudh2290 @eric-haibin-lin Please give a review on this when you have time, thanks!
@eric-haibin-lin Whom should I ask to merge this PR if it's approved?
src/operator/nn/softmax-inl.h
Outdated
for (index_t j = 0; j < M; ++j) {
  sum += std::exp(in[base + j*sa] - mmax);
}
if (temperature == 1.0) {
Could you add a comment explaining why you are branching on 1.0, and also the fact that this is not useful for GPU?
Generally it's good practice to add comments around seemingly obvious code that needs special handling, or code that you only figured out after scratching your head :)
By default the value of temperature is 1.0; users will set other values mainly in reinforcement learning use cases. On CPU, the compiler cannot optimize away this "divide-by-1.0" computation at runtime, so I added a branch here. The performance difference was measured using the example shown in the Description of this PR. The branch is not added in the GPU kernel because branching would add extra overhead on GPU.
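To make the rationale concrete, here is a NumPy-style sketch of the same idea (not the actual kernel code): branch once on the temperature so the default path never pays for the divide.

```python
import numpy as np

def softmax_with_temperature(x, axis=-1, temperature=1.0):
    z = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    if temperature != 1.0:                    # skip the divide entirely in the common T == 1.0 case
        z = z / temperature
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)
```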
I meant in the code :)
Thanks for your suggestion. I have added two comments in the code to make this clear for other developers in the future.
* Add temperature parameter in softmax operator and add a unit test
* Optimize runtime when temperature is set to default 1.0
Description
This PR addresses the request to have a native temperature parameter in the softmax functions. See the issue for a more detailed discussion.
I have added the temperature parameter to the softmax operator. By default the temperature value is 1.0, and both functions behave the same as when the temperature parameter is not specified.
Verified the change with a quick check in Python.
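A minimal sketch of that kind of check (not the original verification snippet), assuming the temperature argument added by this PR is exposed on mx.nd.softmax:

```python
import mxnet as mx
import numpy as np

x = mx.nd.array(np.random.uniform(-2, 2, size=(3, 4)))
default_out = mx.nd.softmax(x, axis=-1)                    # temperature defaults to 1.0
explicit_out = mx.nd.softmax(x, axis=-1, temperature=1.0)  # should match the default exactly
sharper_out = mx.nd.softmax(x, axis=-1, temperature=0.5)   # lower temperature -> sharper distribution

print(np.allclose(default_out.asnumpy(), explicit_out.asnumpy()))
```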
It should print True, confirming that the default output matches the explicit temperature=1.0 output.
Checklist
Essentials
Changes
Comments