From 9d905b4a93065357e6b53993ec37862e6afc5852 Mon Sep 17 00:00:00 2001
From: Joe Vincent

Comparing Policies
Here we apply our statistical bounds to the recent results from the RT-2 paper, where the authors compare their RT-2 policy to a VC-1 policy in three settings designed to test emergent capabilities in symbol understanding, reasoning, and human recognition.
For each setting, we find that the 95% confidence intervals for the two policies' success rates are disjoint, and we conclude with 95% confidence that RT-2 outperforms VC-1.
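
The disjointness check described above can be sketched in a few lines. This is an illustrative sketch only: it substitutes the standard Wilson score interval for the statistical bounds used in this work, the function names `wilson_interval` and `intervals_disjoint` are hypothetical, and the success counts are made-up placeholders, not results from the RT-2 paper.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Two-sided Wilson score interval for a binomial success rate.

    Assumption: Wilson is a stand-in here, not the bound used in the text.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    # Normal quantile for the requested two-sided confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


def intervals_disjoint(a, b):
    """True when the closed intervals a and b share no point."""
    return a[1] < b[0] or b[1] < a[0]


# Hypothetical counts for one setting (placeholders, not the paper's data).
policy_a = wilson_interval(successes=45, trials=50)
policy_b = wilson_interval(successes=10, trials=50)

print("policy A interval:", policy_a)
print("policy B interval:", policy_b)
print("disjoint:", intervals_disjoint(policy_a, policy_b))
```

With real data, the same check would be repeated once per setting, using each policy's observed successes and trial count.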