Write custom loss function code to match Yang's loss function #39

Closed
mwinton opened this issue Nov 6, 2018 · 4 comments
mwinton commented Nov 6, 2018

No description provided.

mwinton added the models label Nov 6, 2018
mwinton commented Nov 6, 2018

https://github.com/zcyang/imageqa-san/blob/master/src/san_att_conv_twolayer_theano.py#L369-L372

# Pick each sample's predicted probability at its true label index.
prob_y = prob[T.arange(prob.shape[0]), label]
pred_label = T.argmax(prob, axis=1)
# sum or mean?
# (negative mean log-likelihood over the batch)
cost = -T.mean(T.log(prob_y))

prob is the output of the final softmax over the 1000 candidate answers, so prob_y is the array of each sample's predicted probability at its true label (one value per sample in the batch). Then they take the negative mean of the log of this array.
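
For concreteness, here is a minimal NumPy sketch of the same indexing trick and negative mean log (toy numbers, not from the repo):

import numpy as np

# Toy batch: 3 samples, 4 candidate answers (softmax output).
prob = np.array([[0.70, 0.10, 0.10, 0.10],
                 [0.20, 0.50, 0.20, 0.10],
                 [0.25, 0.25, 0.25, 0.25]])
label = np.array([0, 1, 3])  # true answer index per sample

# Same indexing trick as Yang's code: each sample's probability
# at its true label.
prob_y = prob[np.arange(prob.shape[0]), label]   # -> [0.70, 0.50, 0.25]

# Negative mean log-probability over the batch.
cost = -np.mean(np.log(prob_y))
print(cost)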

mwinton added the P1 label Nov 9, 2018
mwinton added this to the Build Initial Model milestone Nov 9, 2018
mwinton removed the P1 label Nov 11, 2018
mwinton self-assigned this Nov 13, 2018
mwinton commented Nov 16, 2018

Yang's code is actually a shortcut to categorical cross-entropy loss. The first line of code above -- the lookup of prob at the label's index -- basically eliminates all the terms from the softmax output that would be zeroed out by the true labels when you multiply [0|1] * log(predicted probability).

Then T.mean() is the same as tf.reduce_mean(), just taking the mean over the batch dimension (i.e. over axis=0). From what I read, this can be helpful for normalizing the loss by batch size when dealing with batches of different sizes.

In the Keras code, categorical cross-entropy does a tf.reduce_sum(), but that is the sum of p log q over all classes for a particular sample, reducing to one loss number per sample (i.e. over axis=-1). That function doesn't show how Keras/TF handle the batch dimension, but presumably they take a mean there too.
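
To make the correspondence explicit, here is a minimal sketch of a custom Keras loss matching -log prob_y, assuming one-hot y_true (the name yang_loss is just illustrative; Keras's built-in categorical_crossentropy computes the same per-sample value):

from keras import backend as K

def yang_loss(y_true, y_pred):
    # Clip to avoid log(0), then sum y_true * log(y_pred) over the class
    # axis; with one-hot y_true only the true label's term survives,
    # which is exactly -log prob_y for each sample.
    y_pred = K.clip(y_pred, K.epsilon(), 1.0)
    return -K.sum(y_true * K.log(y_pred), axis=-1)

# model.compile(optimizer='adam', loss=yang_loss, metrics=['accuracy'])

Keras then takes a (weighted) mean of the per-sample values over the batch, which matches the T.mean() above.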

mwinton commented Nov 16, 2018

Confirmed that Keras is taking averages over batches:

"For training loss, keras does a running average over the batches. For validation loss, a conventional average over all the batches in validation data is performed. The training accuracy is the average of the accuracy values for each batch of training data during training."

keras-team/keras#10426

mwinton closed this as completed Nov 16, 2018
chjatala commented

Confirmed that Keras is taking averages over batches:

"For training loss, keras does a running average over the batches. For validation loss, a conventional average over all the batches in validation data is performed. The training accuracy is the average of the accuracy values for each batch of training data during training."

keras-team/keras#10426

"running average over the batches": what are the parameters for the running average?

Is it simple moving average? In that case, how many previous batches average is calculated?
Or is it cumulative moving average? In that case, is it computed from the start of training till current training step, or over a single epoch?
Or is it exponential moving average? What is discount factor?
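
(Not an answer from the Keras source -- just a sketch of what two of these candidates would look like over a stream of per-batch losses; all numbers and the alpha value below are made up.)

batch_losses = [1.2, 0.9, 0.8, 0.7]
batch_sizes = [32, 32, 32, 16]

# Cumulative (sample-weighted) moving average from the start of the epoch.
total, seen = 0.0, 0
for loss, n in zip(batch_losses, batch_sizes):
    total += loss * n
    seen += n
    print('cumulative average:', total / seen)

# Exponential moving average with a hypothetical discount factor alpha.
alpha, ema = 0.9, None
for loss in batch_losses:
    ema = loss if ema is None else alpha * ema + (1 - alpha) * loss
    print('exponential average:', ema)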
