Values used for normalized score calculation #1
I couldn't find the values used for the normalized score calculation either in the paper or in the repo. It would be convenient if we were able to compare new methods based on the same metric (mean normalized return). Also, the values themselves do not appear anywhere in the paper, only in the figures, which is a bit confusing.

Comments
Good catch! Here are the expert scores; we'll update the repo + paper to include these numbers.
@MishaLaskin Also, how are the error bars for Fig. 3 calculated? I tried using the std over tasks for https://paperswithcode.com/task/unsupervised-reinforcement-learning, but it's definitely too large. Did you take the std of the expert scores into account? If so, then under a normal approximation for the ratio distribution, the usual std of a sum could probably be used. In that case, can you share the stds too?
The error bars are standard errors (so they take the number of seeds run into account to get a tighter estimate of the mean). Because the expert scores are only there for normalization purposes, we just divide by the expert score without considering its standard deviation.
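As a concrete illustration of the procedure described in this comment, here is a minimal numpy sketch; the function name, the per-seed returns, and the expert score are hypothetical placeholders, not values from the paper. The per-seed returns are divided by the expert score, which is treated as an exact constant, and the error bar is the standard error over seeds.

```python
import numpy as np

def normalized_score_with_se(returns_per_seed, expert_score):
    """Mean normalized return and its standard error across seeds.

    `expert_score` is treated as an exact constant, so its own
    uncertainty is ignored (as in the procedure described above).
    """
    returns_per_seed = np.asarray(returns_per_seed, dtype=float)
    normalized = returns_per_seed / expert_score             # scale by expert score
    mean = normalized.mean()
    se = normalized.std(ddof=1) / np.sqrt(len(normalized))   # standard error over seeds
    return mean, se

# hypothetical numbers, for illustration only
mean, se = normalized_score_with_se([612.0, 655.0, 640.0], expert_score=900.0)
```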
I think the expert std should be taken into account (if it has a very high std, then it's pretty clear the estimates are less confident). https://en.wikipedia.org/wiki/Ratio_distribution#Uncorrelated_noncentral_normal_ratio says the distribution of the ratio can be approximated by a normal one, taking both stds into account (and converging to the plainly normalized std when the expert std is very low).
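For reference, a first-order (delta-method) version of the normal approximation for the ratio of two uncorrelated variables, in the spirit of the Wikipedia section linked above, could be sketched as follows; the function name and the numbers are hypothetical, not taken from the paper.

```python
import numpy as np

def ratio_mean_and_std(mu_x, sigma_x, mu_y, sigma_y):
    """First-order normal approximation for Z = X / Y with X, Y uncorrelated.

    mean(Z) ~ mu_x / mu_y
    Var(Z)  ~ (mu_x / mu_y)^2 * ((sigma_x / mu_x)^2 + (sigma_y / mu_y)^2)
    When sigma_y -> 0 this reduces to sigma_x / mu_y, i.e. the plainly
    normalized std obtained by treating the expert score as exact.
    """
    mean_z = mu_x / mu_y
    std_z = abs(mean_z) * np.sqrt((sigma_x / mu_x) ** 2 + (sigma_y / mu_y) ** 2)
    return mean_z, std_z

# hypothetical agent (X) and expert (Y) statistics, for illustration only
mean_z, std_z = ratio_mean_and_std(mu_x=640.0, sigma_x=25.0, mu_y=900.0, sigma_y=10.0)
```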
That's a fair point, but we are using the expert scores only as a way to display scores (it's just a scaling factor); we could instead just use the raw scores. Additionally, for all envs considered, the expert scores, which are the result of running supervised RL for 2M steps, have low variance (see @denisyarats' pytorch_sac and DrQ / DrQv2 repos), so this shouldn't be an issue.