The same action prediction gets different evaluation metrics in different runs #20

Open
594zyc opened this issue Jun 14, 2022 · 3 comments

@594zyc

594zyc commented Jun 14, 2022

Hi,

I ran the baseline ET model and found that two runs produce significantly different evaluation metrics (this might be related to issue #10).
Run 1:

SR: 77/608 = 0.127
GC: 487/3526 = 0.138
PLW SR: 0.026
PLW GC: 0.093

Run 2:

SR: 52/608 = 0.086
GC: 321/3526 = 0.091
PLW SR: 0.007
PLW GC: 0.034

After taking a close look at the output, I found that in some episodes the same sequence of predicted actions yields different evaluation metrics across runs. For example, for instance 66957a984ae5a714_f28d.edh4, the inference output of the first run is:

"66957a984ae5a714_f28d.edh4": {
        "instance_id": "66957a984ae5a714_f28d.edh4",
        "game_id": "66957a984ae5a714_f28d",
        "completed_goal_conditions": 2,
        "total_goal_conditions": 2,
        "goal_condition_success": 1,
        "success_spl": 0.55,
        "path_len_weighted_success_spl": 12.100000000000001,
        "goal_condition_spl": 0.55,
        "path_len_weighted_goal_condition_spl": 12.100000000000001,
        "gt_path_len": 22,
        "reward": 0,
        "success": 1,
        "traj_len": 40,
        "predicted_stop": 0,
        "num_api_fails": 30,
        "error": 0,
        "init_success": true,
        "pred_actions": [
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ]
        ]
    }

For the second run it is:

"66957a984ae5a714_f28d.edh4": {
        "instance_id": "66957a984ae5a714_f28d.edh4",
        "game_id": "66957a984ae5a714_f28d",
        "completed_goal_conditions": 0,
        "total_goal_conditions": 2,
        "goal_condition_success": 0.0,
        "success_spl": 0.0,
        "path_len_weighted_success_spl": 0.0,
        "goal_condition_spl": 0.0,
        "path_len_weighted_goal_condition_spl": 0.0,
        "gt_path_len": 22,
        "reward": 0.0,
        "success": 0,
        "traj_len": 40,
        "predicted_stop": 0,
        "num_api_fails": 30,
        "error": 0,
        "init_success": true,
        "pred_actions": [
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ],
            [
                "Forward",
                null
            ]
        ]
    }
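
As a sanity check on the two records above, the path-length-weighted fields in the first run can be reproduced from `gt_path_len` and `traj_len`. This is a minimal sketch that assumes the usual SPL definition, `success * gt_path_len / max(traj_len, gt_path_len)`, and that the path-length-weighted value is that SPL multiplied by `gt_path_len`. Both formulas are assumptions inferred from the numbers shown here, and the helper name `spl` is hypothetical; the repository's own metric code is authoritative.

```python
def spl(success: float, gt_path_len: int, traj_len: int) -> float:
    """Success weighted by path length (assumed definition, see lead-in)."""
    return success * gt_path_len / max(traj_len, gt_path_len)


# Values copied from the first run's record for 66957a984ae5a714_f28d.edh4.
run1 = {"success": 1, "gt_path_len": 22, "traj_len": 40}

success_spl = spl(run1["success"], run1["gt_path_len"], run1["traj_len"])
plw_success_spl = success_spl * run1["gt_path_len"]

print(round(success_spl, 2), round(plw_success_spl, 2))  # 0.55 12.1
```

Under these assumptions the arithmetic itself is consistent in both runs; the fields that actually disagree are `success` and `completed_goal_conditions`, which are inputs to these formulas.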

So the first evaluation result does not make sense: the predicted actions are all Forward, and the model should have no chance of succeeding without performing any manipulation actions.

The first run was done on an AWS EC2 p3.8 instance and the second on a p3.16. All other settings were the same. The full evaluation logs are available here: [run 1] [run 2]
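
One way to list every affected episode, rather than spotting them by hand, is to compare the two logs programmatically. The sketch below assumes each log is a JSON object keyed by instance id with records shaped like the excerpts above; the function names and the file names in the usage note are hypothetical.

```python
import json
from pathlib import Path

# Fields that should agree whenever the predicted actions agree.
METRIC_KEYS = ("success", "completed_goal_conditions", "goal_condition_success")


def load_results(path: str) -> dict:
    """Load a JSON object mapping instance_id -> per-episode record."""
    return json.loads(Path(path).read_text())


def find_inconsistent(run_a: dict, run_b: dict) -> list:
    """Return instance ids whose pred_actions match but whose metrics differ."""
    inconsistent = []
    for instance_id in sorted(run_a.keys() & run_b.keys()):
        a, b = run_a[instance_id], run_b[instance_id]
        if a.get("pred_actions") != b.get("pred_actions"):
            continue  # different behaviour, so different metrics are expected
        if any(a.get(key) != b.get(key) for key in METRIC_KEYS):
            inconsistent.append(instance_id)
    return inconsistent
```

Usage would look like `find_inconsistent(load_results("run1.json"), load_results("run2.json"))`, where both file names are placeholders for the two evaluation logs.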

Do you have any idea what causes this? Thanks.

@aishwaryap
Contributor

Hi, can you share the inference and pred_actions files generated for that particular EDH instance in the two runs? I need those to debug issues with the metrics calculation.

I can understand why you think the bug is in the metrics, but so that I have enough information to reproduce this, can you tell me whether you trained a new ET model or used one of the released checkpoints?

Also, could you share the exact commands used to run inference in the two cases? Did anything change (for example, the random seed)? I realize randomness is probably not the issue here, but if the bug is elsewhere and results in a corrupted agent state, I would need to be able to replicate how the exact agent state you got in both cases was created.

@594zyc
Author

594zyc commented Jun 29, 2022

Hi,

Sorry for the late response! The inference and prediction files are available here.

Inference was run with your released ET model using the following command:

teach_inference \
    --data_dir $DATA_DIR \
    --output_dir /home/ubuntu/teach-eval/et/predictions \
    --metrics_file /home/ubuntu/teach-eval/et/metrics/metrics_seen.txt \
    --images_dir $IMAGE_DIR \
    --split valid_seen \
    --model_module teach.inference.et_model \
    --model_class ETModel \
    --model_dir ./models/baseline_models/et \
    --visual_checkpoint ./models/et_pretrained_models/fasterrcnn_model.pth \
    --object_predictor ./models/et_pretrained_models/maskrcnn_model.pth \
    --seed 4 \
    --num_processes 16

@aishwaryap
Contributor

Hi @594zyc, apologies for the delayed response. After taking a look at your inference files, I think this behaviour is likely caused by a bug we also found internally, where some object properties are not properly reset between episodes. We fixed this in commit 974b3f1013e1cacc4b21d6eb65e84a1d33f82c18, so if you pull the latest mainline you should hopefully no longer see this issue.
I would appreciate it if you could post an update here after you test this, so that I can decide whether this issue has been fully resolved.
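
To make the failure mode concrete, here is a toy illustration of how object properties that survive a reset could let an all-Forward episode register as a success. Everything in it is hypothetical: `ToySimulator`, the `Mug`/`isFilled` property, and the `Fill` action are invented for the example and are not the TEACh or simulator API, nor the actual change made in the commit above.

```python
class ToySimulator:
    """Hypothetical stand-in for a simulator; not the TEACh API."""

    def __init__(self):
        self.object_state = {"Mug": {"isFilled": False}}

    def reset(self, clear_objects: bool) -> None:
        # A correct reset restores object properties; the buggy one skips it.
        if clear_objects:
            self.object_state = {"Mug": {"isFilled": False}}

    def step(self, action: str) -> None:
        if action == "Fill":
            self.object_state["Mug"]["isFilled"] = True
        # "Forward" only moves the agent and never touches object state.

    def goal_met(self) -> bool:
        return self.object_state["Mug"]["isFilled"]


def run_episode(sim: ToySimulator, actions: list, clear_objects: bool) -> bool:
    sim.reset(clear_objects)
    for action in actions:
        sim.step(action)
    return sim.goal_met()


forward_only = ["Forward"] * 40

buggy = ToySimulator()
run_episode(buggy, ["Fill"], clear_objects=False)   # earlier episode on the same simulator
leaked = run_episode(buggy, forward_only, clear_objects=False)

fixed = ToySimulator()
run_episode(fixed, ["Fill"], clear_objects=True)
clean = run_episode(fixed, forward_only, clear_objects=True)

print(leaked, clean)  # True False
```

In this toy setting the outcome of the all-Forward episode depends on which episode ran before it on the same simulator, so identical predicted actions can be scored differently from one run to the next.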
