Losses to implement and losses naming conventions #38
Comments
But that is exactly what I meant, and this is what I don't like: CE is not just a loss, it is just CE. You can use it as a loss, but you can also use anything else as a loss, and you can also use CE for other purposes, not just as a loss. I made this comment to say that we maybe should not exactly follow the PyTorch naming scheme in this case, or at least we should think about it.
Yeah sure, but this would just be our internal structuring, so this doesn't really matter too much. The user would simply use ...
I don't quite understand. What exactly do you want to test? That e.g. the ...?
Ohh, I misread that, my bad.
I think we should think about good tests for the losses, since they are a crucial part of the optimization. My suggestion assumed we see the ...
This is the Loss class, which then refers to the loss in ...
For the implementation of the function, yes, but we should have a Module that just calls this function with the ...
I did not put that much thought into it yet; I just could not come up with a quick idea on how to test it. But if you say it's not hard, then it's fine.
My suggestion was to implement the loss function without the Loss postfix as a standard function (like you suggested), and then have a class with the loss name similar to PyTorch. Because while I agree with you that the functions can be used without their loss intention, I feel like for a lot of users it would be more straightforward to actually have ...
But I don't understand that reasoning.
Oh, now I think I see your problem. No, I don't necessarily want one Module and one Function (that was bad wording, I guess); in general this could also be 2x Module or 2x Function. But I would argue for having that "non-loss" version wrapped in a "loss" version which handles the potential loss settings.
I still don't really follow that reasoning.
Or: ...
What is the point of that? And what do you mean by "potential loss settings"? There is never anything specific to a loss. Whatever you do is just some mathematical calculation; only at the end do you declare it as a loss.
Yes, you are right about that. My thinking was about how I would approach a situation where I have to build a model and use a loss. And while I agree with you that once you "think" about it, doing something like `loss = loss_scale * CE(x)` seems fine, I am not sure if I would come up with that right away, while having `loss = CELoss(x, scale=loss_scale)`, which internally does this, would make it clearer for a user. That could also then mark the output as a loss, etc. I think most of all this is a design decision, and I am not sure if in the end users would do this wrapping by themselves. Because if they do, then we could also already do it here, in my opinion.
Note that the ... I don't think that ...
How is that not totally obvious? Although your code example misses the main thing which makes it a loss, which is the ...
And surely, our documentation would also just have a small part on the losses.
I also think calling the losses just ...

Regarding cross-entropy in particular, RETURNN currently has some heuristic for this: you can pass it a probability distribution generated from a softmax, and it will then compute the CE on the logits before the softmax. I heard this is important for numerical stability.
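For context, a minimal sketch of the numerical-stability point (plain TensorFlow, not RETURNN code; shapes are made up for illustration): computing CE directly from the logits with the fused op avoids the log(softmax(x)) underflow that the naive formulation can hit, which is effectively what the RETURNN heuristic recovers.

```python
import tensorflow as tf

logits = tf.random.normal([8, 1000])                          # [B, C], unnormalized scores
labels = tf.random.uniform([8], maxval=1000, dtype=tf.int32)  # [B], sparse class indices

# Naive: softmax first, then log -- can underflow to log(0) = -inf for extreme logits.
probs = tf.nn.softmax(logits)
ce_naive = -tf.reduce_sum(tf.one_hot(labels, 1000) * tf.math.log(probs), axis=-1)

# Stable: fused op that works directly on the logits.
ce_stable = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
```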
Ah, this is another thing which I found somewhat inconsistent about the PyTorch loss module names: some of them actually get logits (unnormalized), while others get log-prob or prob. E.g. ...

Surely, we can make the heuristic work. It is also not so complicated. We can check (e.g. on the RETURNN side) whether the input is a tensor right after a ... Or we can make a variant of ... Or (just like ...) ...

But yes, this is a good point which we should decide.
Another thing: the targets of ... (Similar as e.g. ...)
I would like to have as few heuristics as possible. RETURNN already has many hidden heuristics for the losses that are non-verbose to the user, and I think it would be good to have this as explicit as possible (in the name and documentation of the module). In the extreme case maybe something like ...

Another topic: how would we write Modules that have optimized calls in TensorFlow? From my current knowledge, I would write L2 like this:

```python
class L2(Module):
  def __init__(self, reduction=None):
    """
    :param str|None reduction: None, "mean" or "sum", will reduce over everything except batch
    """
    super().__init__()
    assert reduction in [None, "mean", "sum"]
    self.reduction = reduction

  def forward(self, inp1, inp2) -> LayerRef:
    # element-wise squared difference via an eval layer
    out = eval(concat(inp1, inp2), eval="tf_compat.v1.squared_difference(source(0),source(1))")
    if self.reduction:
      # reduce over everything except the batch dim
      out = reduce(out, mode=self.reduction, axes="except_batch")
    return out
```

So here is also the question: should L2 be written "explicitly", should it only "wrap" the TF call, and should it include the reduction or should this be separate?
Note that reduction over batch (for framewise losses also over time) is something which RETURNN handles. It is also important that it stays this way, because RETURNN must see the batch dim and time dim to be able to properly accumulate losses over the epoch. If we want to move that logic somehow over here, it's not clear how we would be able to handle that. Reduction over any other axes (in the case of L2 or so) should (could?) be done explicitly before (as mean, sum, or whatever you like). (I think the current behavior in RETURNN is: whenever there are other axes in addition to batch or time, this is allowed, and it will just accumulate them with sum (edit: or mean).) Btw, some comments on your code example:
We should make use of TF-optimized functions when possible. Efficiency is one of the main principles of RETURNN.
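To illustrate one of the variants in question (a sketch only, with assumed names, not an agreed API): a plain function that only wraps the fused TF op, with the reduction over the feature axis left explicit to the user, so that a [B,T] tensor remains for RETURNN to accumulate.

```python
import tensorflow as tf

def squared_difference(a: tf.Tensor, b: tf.Tensor) -> tf.Tensor:
  """Element-wise (a - b) ** 2 via the fused TF op; no reduction, shape stays [B, T, F]."""
  return tf.math.squared_difference(a, b)

# User code (or a thin L2/MSE wrapper) reduces explicitly over the feature axis only,
# leaving [B, T] so that RETURNN can still accumulate over batch and time:
# loss = tf.reduce_mean(squared_difference(a, b), axis=-1)
```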
All axes are flattened and then combined with reduce_mean, so this is not the sum, right? From returnn/tf/layers/basic.py:8904: ...
Yes, of course... oops.
Btw, on terminology: L2 is actually a norm. Or more specifically, it is the Lebesgue space using the 2-norm. Also, it is actually ...

L2-distance is also common though, and means `sqrt(sum((x - y)**2))`. For me, I would expect that a function ...

We should very clearly reflect in our function name when we actually mean the squared difference (so just ...).
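To make the terminology point concrete, a small sketch (NumPy; the function names here are assumptions for illustration only) of the three different quantities being discussed:

```python
import numpy as np

def sum_squared_difference(x, y):
  """sum((x - y)**2) over the feature axis -- no sqrt, so not a true L2 distance."""
  return np.sum((x - y) ** 2, axis=-1)

def l2_distance(x, y):
  """sqrt(sum((x - y)**2)) -- the actual L2 / Euclidean distance."""
  return np.sqrt(sum_squared_difference(x, y))

def mean_squared_error(x, y):
  """mean((x - y)**2) over the feature axis -- the usual MSE."""
  return np.mean((x - y) ** 2, axis=-1)
```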
Some open questions, and my current opinion:

- Module or function, or both? I would vote for function only, to keep consistent with #21, where we have the simple convention: Modules are used when they contain some parameter, functions otherwise.
- Do we want the ...?
- Do we want ...?
- Do we want ...?
- Do we want ...?
- How would we handle label smoothing?
I tend to prefer the last option with explicit code, although I also see that this is so common that maybe the second option is also fine? I don't know...

What functions do we want to introduce, and using what names? (PyTorch loss functions) ...
Others? (We don't really need to have all potential ones right away. This can just be extended later. We only need the most important ones in the beginning.)
We want ...
In general, I think this comes down to the question of how much of the code is easy for the user to think of, and where the border for this is. I think we should always answer these kinds of questions by asking ourselves: "Is it reasonable to assume that someone who will use this can write it themselves, or would it require quite some knowledge/searching to come up with?" Once we agree on one of the two, this directly leads to whether we should implement it.
Agreed, so either the 2nd or the 3rd option, in my opinion. For label smoothing I might tend to 2 right now, but maybe others can also comment.
If we name functions "error", doesn't this go in the same direction as giving functions the ...
Ok. So we would have 3 functions then: ...
This can vary a lot, depending on the user and their background. Some people might not even know about cross entropy; they just want to train the model in whatever way. In practice, people will probably still use some small code snippets from someone anyway. This can be in the documentation: some small code snippet showing how you would do label smoothing (just the code snippet I posted above). Automatic code completion and IDE support is another thing though, where your IDE can easily show you the variants. And as label smoothing is really so common now, maybe we can anyway also add ...

Or I wonder how good GitHub Copilot will be at coming up with the snippet when I just type the prefix ...
Just as a random reference, Fairseq has ...
Another question: should it be ...? When you look at the mathematical definition of cross entropy, ...
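For reference, the standard textbook definition (a well-known formula, not a decision from this thread), with p the true/target distribution and q the estimated one; a small NumPy sketch:

```python
import numpy as np

def cross_entropy_reference(p, q):
  """H(p, q) = -sum_x p(x) * log(q(x)): the expectation of -log q under the target distribution p."""
  return -np.sum(p * np.log(q), axis=-1)
```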
Note: We should also add (non-differentiable) functions for frame-error, edit-distance, etc. (See #57.)
Also related is #17 on dim tags. Many of these losses will reduce over some axis, usually the feature dim axis. But maybe that should be explicit?
I want to ask again whether these functions should deal with both dense and sparse targets, or whether this should be separated and explicit as well. If this is separate, then we would already have 6 functions now. Is this good? And maybe there are other things? Also note: if we don't separate sparse inputs, then we need to implement it such that it can handle both dense and sparse inputs. This implies that we need one of these: ...
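For illustration only (a sketch with assumed names, not one of the concrete options above and not the decided API): a single function could dispatch on whether the target is sparse and use the corresponding fused TF op in each case.

```python
import tensorflow as tf

def cross_entropy_from_logits(target, logits, target_is_sparse: bool):
  """target: [B,T] int class indices if sparse, else [B,T,C] probabilities; logits: [B,T,C]; returns [B,T]."""
  if target_is_sparse:
    return tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target, logits=logits)
  return tf.nn.softmax_cross_entropy_with_logits(labels=target, logits=logits)
```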
Note that ... And we also have shape information and ... But I would just use ... However, we should somehow make sure that it uses ...
We have ... I decided to go with a (required) option ... It also wraps ...
One important aspect which is missing is the flattening (#68). This is also for efficiency, because we don't want to do unnecessary softmax computations. But in any case, we should deal with this separately (#68). (Edit: Done as well.)
I think we now have a good starting point with ...

One (for us) relevant remaining loss is CTC. Here the question is a bit how generic we want to have it. In any case, there should be a ... Once we have ...
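As a rough illustration of a plain, non-generic variant (a sketch only; the function and argument names are assumptions, not the decided API), wrapping the standard TF op:

```python
import tensorflow as tf

def ctc_loss_from_logits(logits, targets, logits_lens, targets_lens, blank_index):
  """logits: [B,T,C] batch-major; targets: [B,S] dense label indices; returns loss per sequence, [B]."""
  return tf.nn.ctc_loss(
    labels=targets, logits=logits,
    label_length=targets_lens, logit_length=logits_lens,
    logits_time_major=False, blank_index=blank_index)
```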
We also have ...

So I will close this now. In case something is buggy, please open a new issue. In case you want a specific missing loss function, please open a new issue.
Some modules we should implement:

- `CrossEntropy` or `CE`. Should this cover both dense and sparse targets, or do we want a separate module for the sparse case, like `SparseCrossEntropy` or so? Should this potentially also allow for logits? Log-probs?
- `KL` or `KullbackLeiblerDivergence`
- `BinaryCrossEntropy` or `BCE`
- `L2Dist` (absolute or mean?) Or `MSE` or `MeanSquaredError`? (The mean reduction is over the feature axis, not over time or batch.)
- `L1Dist` (absolute or mean?) Or `MeanL1Dist`?
- `Ctc` or `CTC` or `CtcLogProb`
- `CosineSimilarity`

I don't like the naming of the PyTorch losses too much here. They have the postfix `Loss` on all of them, although these modules are generic and not necessarily just for loss computation (although that's probably their most common usage).

Also, `CrossEntropyLoss` is actually log-softmax + CE together, so very much like the TF `tf.nn.sparse_softmax_cross_entropy_with_logits`. And there is a separate `NLLLoss`, which is just like `CrossEntropyLoss` but it doesn't take logits but log-probs instead. I find this naming confusing.

Also the question is how we should handle things like label smoothing. On the RETURNN side (and also in TF), it is just an option to the CE loss. On the PyTorch side, it was not part of the official PyTorch API for a long time (some background here); it was only very recently added (pytorch/pytorch#7455, pytorch/pytorch#63122), as an option `label_smoothing` to `CrossEntropyLoss`. An alternative would be that the user makes this more explicit, like:
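A minimal sketch of what such explicit user code could look like (written in plain TensorFlow for illustration, not the returnn-common API being designed here; shapes and names are made up):

```python
import tensorflow as tf

# Sketch only: the user does label smoothing explicitly instead of passing
# a label_smoothing option to the CE loss.
smoothing = 0.1
num_classes = 1000
targets = tf.random.uniform([8, 20], maxval=num_classes, dtype=tf.int32)  # [B,T] sparse targets
logits = tf.random.normal([8, 20, num_classes])                           # [B,T,C]

targets_dense = tf.one_hot(targets, depth=num_classes)                    # [B,T,C]
targets_smooth = (1.0 - smoothing) * targets_dense + smoothing / num_classes
loss = tf.nn.softmax_cross_entropy_with_logits(labels=targets_smooth, logits=logits)  # [B,T]
```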
Note also that the loss accumulation over the dataset and handling of calculating the correct average (mean) is handled by RETURNN. All such losses would just yield a vector of shape [B] or [B,T].