The same action prediction gets different evaluation metrics in different runs #20
Comments
Hi, can you share the inference and prediction files? Also, I can understand why you think the bug is in the metrics, but just so that I have enough info to reproduce this, can you tell me whether you trained a new ET model or used one of the released checkpoints? Could you also share the exact commands used to run inference in the two cases? Did anything change (for example, the random seed)? I realize randomness is probably not the issue here, but if the bug is elsewhere and results in a corrupted agent state, I would need to be able to replicate how the exact agent state you got in both cases was created.
Hi, sorry for the late response! The inference and prediction files are available here. The inference is based on your released ET model using the following command:
Hi @594zyc, apologies for the delay in response. After taking a look at your inference files, I think this behaviour is likely caused by a bug we also found internally, where some object properties do not get properly reset between episodes. We have fixed this in commit 974b3f1013e1cacc4b21d6eb65e84a1d33f82c18, so hopefully if you pull the latest mainline you shouldn't see this issue anymore.
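For context, here is a minimal sketch of that failure mode; the class, object, and method names below are made up for illustration and are not taken from the repo:

```python
# Hypothetical sketch: simulator/object state that persists across episodes
# unless it is explicitly rebuilt in reset(). None of these names come from
# the actual codebase.

class ObjectState:
    def __init__(self):
        self.is_toggled_on = False


class EpisodeRunner:
    def __init__(self):
        # Object registry shared by the runner. If it is not rebuilt between
        # episodes, properties mutated in episode N leak into episode N+1 and
        # can make a goal condition look "already satisfied".
        self.objects = {}

    def reset(self):
        # The fix amounts to fully resetting per-episode state here, before
        # any actions from the new episode are applied.
        self.objects = {"Faucet_1": ObjectState()}

    def run_episode(self, actions, reset_state=True):
        if reset_state:
            self.reset()  # skipping this makes evaluation run-dependent
        for obj_id in actions:
            self.objects[obj_id].is_toggled_on = True
        # Toy "goal": the faucet must be on at the end of the episode.
        return self.objects["Faucet_1"].is_toggled_on


runner = EpisodeRunner()
print(runner.run_episode(["Faucet_1"]))           # True: goal met by acting
print(runner.run_episode([], reset_state=False))  # True: stale state, spurious success
print(runner.run_episode([]))                     # False: correct once state is reset
```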
Hi,
I ran the baseline ET model and found that two different runs produce significantly different evaluation metrics (this might be related to issue #10).
Run1:
Run2:
After taking a close look at the output, I found that in some episodes the same set of predicted actions results in different evaluation metrics in different runs. For example, in 66957a984ae5a714_f28d.edh4, the inference output for the first run is:

While for the second run it is:
So the first evaluation result does not make sense, since there should be no way for the model to succeed without performing any manipulation actions.
The first run was done on an AWS EC2 p3.8xlarge instance, while the second run used a p3.16xlarge. All other settings are the same. The full evaluation logs are available here: [run 1] [run 2]
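For reference, this is the kind of per-episode comparison I did to find the mismatching instances. The sketch below assumes each run writes a JSON file mapping EDH instance id to its metrics dict, which may not match the evaluation script's actual output format:

```python
# Hypothetical helper: report episodes whose metrics differ between two runs.
# Adapt load_metrics() to whatever the evaluation script actually writes.

import json
import sys


def load_metrics(path):
    with open(path) as f:
        return json.load(f)  # assumed: {instance_id: {metric_name: value}}


def diff_runs(path_a, path_b):
    run_a, run_b = load_metrics(path_a), load_metrics(path_b)
    for instance_id in sorted(set(run_a) & set(run_b)):
        if run_a[instance_id] != run_b[instance_id]:
            print(f"{instance_id}:")
            print(f"  run 1: {run_a[instance_id]}")
            print(f"  run 2: {run_b[instance_id]}")


if __name__ == "__main__":
    diff_runs(sys.argv[1], sys.argv[2])
```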
Do you have any idea what might be causing this? Thanks!