Adjusting softmax function #8

Closed
jakkarn opened this issue Mar 25, 2021 · 4 comments

jakkarn commented Mar 25, 2021

The preference for each pair of video clips is calculated based on a softmax over the predicted latent reward values for each clip. In the paper, "Rather than applying a softmax directly...we assume there is a 10% chance that the human responds uniformly at random. Conceptually this adjustment is needed because human raters have a constant probability of making an error, which doesn’t decay to 0 as the difference in reward difference becomes extreme." I wasn't sure how to implement this - at least, I couldn't see a way to implement it that would actually affect the gradients - so we just do the softmax directly.
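For reference, the plain softmax described above amounts to something like this (a throwaway NumPy sketch with made-up names, not the repo's actual code; the paper sums the predicted per-frame rewards for each clip):

import numpy as np

def plain_softmax_preference(rewards1, rewards2):
    # Sum the predicted per-frame rewards for each clip, then take a softmax
    # over the two sums to get the probability that clip 1 is preferred.
    logits = np.array([np.sum(rewards1), np.sum(rewards2)])
    exps = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exps[0] / exps.sum()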

After talking about this with a friend who has more background in statistics, my understanding is this: they adjust the predictor model's probability p1 before using it in the loss function, on the assumption that

  • 90% of the human's decisions are rational, and
  • 10% of the decisions are random or simply mistakes.

And the adjusted probability should then be:

p2 = 0.9*p1 + 0.1*0.5 = 0.9*p1 + 0.05
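
For concreteness, something like this (a rough sketch with made-up names; p1 is the softmax output and (mu1, mu2) is the human label from the paper):

import numpy as np

def adjusted_preference(p1, epsilon=0.1):
    # 90% chance the human judges rationally, 10% chance of a uniform random response.
    return (1 - epsilon) * p1 + epsilon * 0.5

def pair_loss(p1, mu1, mu2, epsilon=0.1):
    # Cross-entropy of the human label against the adjusted probability.
    p2 = adjusted_preference(p1, epsilon)
    return -(mu1 * np.log(p2) + mu2 * np.log(1 - p2))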

I'm not completely sure this is the correct way, but I think so. Great work implementing this btw!

mrahtz commented Mar 25, 2021

Thanks for commenting!

p2 = 0.9*p1 + 0.1*0.5 = 0.9*p1 + 0.05

I think my initial intuition for how to implement it was something similar, but since in this case the cross-entropy loss is only being applied over p2 and 1 - p2, the loss for each pair is a function of either p2 or 1 - p2 alone - so the extra 0.05 wouldn't affect the gradients, would it? (I guess it'll make some difference at inference time, but I don't think it'll affect training, will it?)

(I was about to post this reply, then thought "Wait, what about the softmax?" - but the softmax has already been applied in order to get to p1 in the first place.)

jakkarn commented Mar 26, 2021

Hm... There is this mention in the article:

Conceptually this adjustment is needed because human raters have a constant probability of making an error, which doesn’t decay to 0 as the difference in reward difference becomes extreme.

So I'm thinking: does this have a significant impact on the loss from each entry, μ(1)*log(p1) + μ(2)*log(1 - p1), when the network is more certain of the rewards? Having just one of those probabilities decay to zero (or close to zero) could have a huge impact on the resulting sum (the cross-entropy loss), since the logarithm tends to negative infinity there. Couldn't it?
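
A quick numerical illustration of what I mean (made-up numbers; epsilon = 0.1 as in the paper, and assuming the human preferred clip 2 so the loss term is -log(1 - p)):

import numpy as np

epsilon = 0.1
for p1 in [0.5, 0.99, 0.999999]:
    raw = -np.log(1 - p1)                    # unadjusted: blows up as p1 -> 1
    p2 = (1 - epsilon) * p1 + epsilon * 0.5  # adjusted: 1 - p2 >= epsilon/2 = 0.05
    adj = -np.log(1 - p2)                    # bounded by -log(0.05), roughly 3.0
    print(f"p1={p1}: unadjusted={raw:.2f}, adjusted={adj:.2f}")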

Also, the article actually reads "rather than applying a softmax directly...". Does that imply that they adjust it before the softmax? I couldn't get that to make sense in my mind, which is why I assumed it works the way we both seem to have thought.

mrahtz commented Mar 29, 2021

Sooooo I asked Jan Leike nicely and it turns out he still had a copy of the original code for the paper lying around :)

Searching for '0.1' and '0.9' in the codebase, the only relevant thing I could find is:

def __init__(self, *args, epsilon=0.1, **kwargs):
    self.epsilon = epsilon
...

def p_preferred(self, obs1=None, act1=None, obs2=None, act2=None):
    reward = [prediction1, prediction2]
    mean_rewards = tf.reduce_mean(tf.stack(reward, axis=1), axis=2)
    p_preferred_raw = tf.nn.softmax(mean_rewards)
    return (1 - self.epsilon) * p_preferred_raw + self.epsilon * 0.5

This is the approach you originally suggested, which, yeah, I'm pretty sure doesn't affect gradients...so either a) this just isn't necessary and they never found out because they didn't do an ablation, or b) it only makes a difference at inference time in situations where, as you say, the reward predictor is particularly sure. (Though...considering that the most important inference function is predicted reward, rather than predicted preference, I lean towards a). shrug Everyone makes mistakes. Or it serves some completely different purpose that's not occurring to me right now...)

[image: "case closed" Sherlock meme]

mrahtz closed this as completed Mar 29, 2021

jakkarn commented Mar 30, 2021

Okay! Nice of you to take the time. I'm playing around with this a bit now, so I'll leave a comment if I find anything substantial related to this.

I love Sherlock btw. Made me laugh when I saw your meme.
