RAD results may be incorrect. #5
Hi, thanks for using rliable! The RAD scores reported in the paper correspond to the RAD results in the NeurIPS'21 paper Tactical Optimism and Pessimism for Deep Reinforcement Learning (see Table 2). These scores were provided by @jparkerholder and @tedmoskovitz, who confirmed that they used the RAD codebase released by the original authors, the only differences being a batch size of 128 instead of 512 and reporting results over 10 seeds. They also confirmed that these results are much better than the results originally reported in the RAD paper. Please let me know if you have any other questions.
Thank you @agarwl! I will run the experiments again with a batch size of 128.
I found a paper on arXiv, KSL, that reran RAD with a batch size of 128, and their results are very close to ours. Specifically, the RAD results at 100k in Table 2 from TOP (mentioned above) correspond to policy steps in KSL's terms. For Walker and Finger, 100k steps in TOP = 200k steps in KSL (env_steps). For Cheetah, Cup and Reacher, 100k steps in TOP = 400k steps in KSL (env_steps). For Cartpole, 100k steps in TOP = 800k steps in KSL (env_steps). For further evidence, you can also refer to the appendix of DrQ, where the batch size is set to 128: simply reducing the batch size does degrade performance. Besides, their 100k results (with batch size 128) are consistent with the results reported in KSL. This further suggests that the RAD results in rliable may be imprecise. I have attached my reproduced results with a batch size of 128 here. Note that "step" represents policy steps, so all the following results are evaluated at 100k environment steps (= policy steps times action repeat).
Therefore, I think the current RAD scores are evaluated at policy steps rather than environment steps. This would explain why RAD greatly outperforms DrQ in the paper at 100k on DMC (I would expect them to perform similarly). If this is the case, do we need to evaluate RAD again with the proper protocol?
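To make the correspondence above concrete, here is a small sketch (my own, not from the thread) that converts policy steps to environment steps using the ratios quoted above; the specific task names are the usual DMC-100k tasks and are my assumption:

```python
# Sketch (not from this thread): convert policy/agent steps to environment
# steps using the ratios quoted above (env steps = policy steps * action repeat).
# Task names are the standard DMC-100k tasks and are my assumption.
ACTION_REPEAT = {
    "walker_walk": 2,        # 100k policy steps -> 200k env steps
    "finger_spin": 2,
    "cheetah_run": 4,        # 100k policy steps -> 400k env steps
    "ball_in_cup_catch": 4,
    "reacher_easy": 4,
    "cartpole_swingup": 8,   # 100k policy steps -> 800k env steps
}

def env_steps(policy_steps: int, task: str) -> int:
    """Environment steps corresponding to the given number of policy steps."""
    return policy_steps * ACTION_REPEAT[task]

for task in ACTION_REPEAT:
    print(f"{task}: 100k policy steps = {env_steps(100_000, task):,} env steps")
```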
Thanks @TaoHuang13 for the update! This is indeed confusing, as I previously confirmed with the authors of TOP-RAD that results were reported at 100k and 500k environment steps rather than agent steps. I'd like to wait for their clarification on the discrepancy between their reported RAD results and yours. In the meantime, if you could provide your raw RAD results for at least 5 seeds (preferably 10) on the 6 tasks in DMC 100k and 500k, that would be great; it would allow me to update the figures / RAD scores in our NeurIPS paper. Re reporting, I'd suggest using the results you can replicate in your setup and the protocols in rliable (performance profiles, aggregate metrics with CIs, probability of improvement, etc.).
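As a concrete illustration of that suggestion, a minimal sketch of computing aggregate metrics with bootstrap CIs via rliable, assuming the replicated RAD scores are collected into a (seeds × tasks) array of normalized scores (the data below is a random placeholder, not real results):

```python
# Minimal sketch of the rliable-based reporting suggested above, assuming
# `rad_scores` is a (num_seeds x num_tasks) array of normalized scores at
# 100k environment steps (random placeholder data here, not real results).
import numpy as np
from rliable import library as rly
from rliable import metrics

rng = np.random.default_rng(0)
rad_scores = rng.random((10, 6))          # stand-in for 10 seeds x 6 DMC tasks
score_dict = {"RAD": rad_scores}

# IQM and mean with stratified-bootstrap confidence intervals.
aggregate_fn = lambda x: np.array([metrics.aggregate_iqm(x),
                                   metrics.aggregate_mean(x)])
point_est, interval_est = rly.get_interval_estimates(
    score_dict, aggregate_fn, reps=2000)
print("IQM, mean:", point_est["RAD"])
print("95% CIs:", interval_est["RAD"])
```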
Sure! I'd be happy to share my raw results with you in a few days, since we have only tested DMC 100k so far. Meanwhile, please let me know if there are any updates about the performance discrepancy :)
Hi @TaoHuang13 and @agarwl -- thanks for bringing this to our attention! We've just looked at the code again, and the short answer is: we think you're right, and we sincerely apologize to both of you for the confusion. To briefly explain: we built TOP on top of the original RAD code: https://github.com/tedmoskovitz/TOP/blob/master/dmc/top_train.py. Thank you for your understanding, and apologies once again for the confusion.
Thanks for the clarification, Ted! Since you are going to update the results, can you please also provide the updated scores and upload the raw scores here?
Thank you very much for understanding, and absolutely--I'm not at home at the moment but I'll post them by the end of the day.
Hi all, I've attached the raw result files for 0-100k agent steps to this post. It turns out that I logged evaluation every 10k agent steps, so I don't actually have saved results at exactly 12.5k/25k agent steps (i.e., 100k env steps for walker and finger). The scores in those ranges do seem much more consistent with what you've found, @TaoHuang13. Each file is a 10 x 11 csv, where each row is a run and contains the scores at 0, 10k, 20k, ..., 100k agent steps. I'll have to re-run the code for the other environments to get exactly 100k environment steps, and I'll post those results when I have them, though it may be a few days due to the holidays. Thank you once again for bringing this to our attention! rad_finger.csv
Thanks @tedmoskovitz for the clarification! It is indeed hard to notice the step setting in the original RAD code; we only found it because of our computational limitations, hah. Now there is more room for improving performance. We are looking forward to your new results. But for now, let's have a good holiday :) Merry Christmas @agarwl @tedmoskovitz~
Hi @TaoHuang13 and @agarwl -- Thanks so much for your patience! I'm attaching the results for cartpole, cheetah, cup, and reacher to this comment. Each csv contains a 10 (seeds) x 6 array with eval results at 0, 100k, ..., 500k environment steps. I will post again once we've updated our paper as well. Happy holidays to both of you, and thank you once again!
Thanks @tedmoskovitz! If you can also post the results for
Right--here they are! Apologies for the delay. In this case, the arrays are 10 x 26, where each row contains the eval results for 0, 20k, 40k, ..., 100k, ..., 500k environment steps.
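For anyone reusing the attached files, a small sketch of pulling the 100k and 500k environment-step columns out of one of these 10 x 26 arrays; the file name here is my assumption, not an actual attachment name from the thread:

```python
# Sketch: read one of the attached 10 x 26 CSVs (rows = seeds, columns = eval
# scores at 0, 20k, 40k, ..., 500k environment steps) and pull out the 100k
# and 500k columns. The file name is an assumption.
import numpy as np

scores = np.loadtxt("rad_cartpole.csv", delimiter=",")   # shape (10, 26)
env_steps = np.arange(0, 520_000, 20_000)                # 0, 20k, ..., 500k

for target in (100_000, 500_000):
    col = int(np.where(env_steps == target)[0][0])
    vals = scores[:, col]
    print(f"{target:,} env steps: mean {vals.mean():.1f} +/- {vals.std():.1f} ({len(vals)} seeds)")
```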
Thank you @tedmoskovitz! The current results seem more reasonable :)
Cool! Thanks, Rishabh. I'm attaching the updated TOP-RAD results btw, in case either of you is interested. Results relative to regular RAD are roughly analogous to the original results, but actually even better (relative to RAD) in the 100k regime, so that's nice. We'll be updating the paper in the next few days; we just need to re-run some ablations. top-rad_cartpole.csv
Hi @tedmoskovitz! Thank you for sharing your results. We are considering whether to add TOP-RAD as a baseline. One quick question: which batch size are you using for RAD and TOP-RAD (128 or 512)?
Of course! That would be great. We used a batch size of 128 for both.
Hi guys-- just wanted to let you know that we've posted the updated paper for TOP: https://arxiv.org/abs/2102.03765. Thank you once again @TaoHuang13 for finding this mistake (we've added you in the acknowledgments), and @agarwl thank you for your understanding!
Hi @agarwl. I found that the 'step' in RAD's 'eval.log' refers to the policy step, while the 'step' in 'xxx--eval_scores.npy' refers to the environment step. We know that environment step = policy step * action_repeat.
Here is the problem: if you use the results at 100k steps in 'eval.log', you are actually evaluating the scores at 100k * action_repeat environment steps. This leads to an overestimation of RAD. I wonder whether you did such an incorrect evaluation, or whether you took the results in 'xxx--eval_scores.npy', which are correct in terms of 'steps'. You may refer to a similar question in MishaLaskin/rad#15.
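A tiny worked example of that mismatch (my own sketch; the action repeat of 4 is an assumption, e.g. for cheetah_run):

```python
# The 'step' column in eval.log counts policy steps, so reading it as an
# environment step inflates the data budget by the action repeat.
action_repeat = 4
logged_step = 100_000                           # 'step' in eval.log (policy steps)
actual_env_steps = logged_step * action_repeat  # what the agent actually consumed
print(f"Scores read at 'step'=100k correspond to {actual_env_steps:,} env steps.")
```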
I reproduced the results of RAD locally, and my results are much worse than the ones reported in your paper; I list them in the figure below.
Comparing the means for each task, there is clearly a huge gap, and my results are close to the ones reported by the DrQ authors (see the table in MishaLaskin/rad#1). I guess you may have evaluated scores at incorrect environment steps? Could you please offer more details on how RAD was evaluated? Thanks :)