Much higher scores when evaluating Episodic Transformer baselines for EDH instances #10

Open
yingShen-ys opened this issue Feb 3, 2022 · 4 comments



yingShen-ys commented Feb 3, 2022

Hello,

I have finished evaluating the Episodic Transformer baselines for the TEACh Benchmark Challenge on the valid_seen split.

However, one odd thing I found is that our reproduced results are much higher than those reported in the paper. The results are shown below (all values are percentages). There is a total of 608 EDH instances (valid_seen) in the metrics file, which matches the number in the paper.

|  | SR [TLW] | GC [TLW] |
| --- | --- | --- |
| Reproduced | 13.8 [3.2] | 14 [8.7] |
| Reported in the paper | 5.76 [0.90] | 7.99 [1.65] |

I believe I am using the correct checkpoints, and the only change I made to the code is the one mentioned in #9.

I am running on an AWS instance. I started the X server and installed all requirements and prerequisites without issues, and the inference process ran without errors.

Here is the script I used for evaluation.

#!/bin/sh

export AWS_ROOT=/home/ubuntu/workplace
export ET_DATA=$AWS_ROOT/data
export TEACH_ROOT_DIR=$AWS_ROOT/teach
export TEACH_SRC_DIR=$TEACH_ROOT_DIR/src
export ET_ROOT=$TEACH_SRC_DIR/teach/modeling/ET
export ET_LOGS=$TEACH_ROOT_DIR/src/teach/modeling/ET/checkpoints
export INFERENCE_OUTPUT_PATH=$TEACH_ROOT_DIR/inference_output
export PYTHONPATH=$TEACH_SRC_DIR:$ET_ROOT:$PYTHONPATH
export SPLIT=valid_seen

cd $TEACH_ROOT_DIR
python src/teach/cli/inference.py \
    --model_module teach.inference.et_model \
    --model_class ETModel \
    --data_dir $ET_DATA \
    --output_dir $INFERENCE_OUTPUT_PATH/inference__teach_et_trial_$SPLIT \
    --split $SPLIT \
    --metrics_file $INFERENCE_OUTPUT_PATH/metrics__teach_et_trial_$SPLIT.json \
    --seed 4 \
    --model_dir $ET_DATA/baseline_models/et \
    --object_predictor $ET_LOGS/pretrained/maskrcnn_model.pth \
    --visual_checkpoint $ET_LOGS/pretrained/fasterrcnn_model.pth \
    --device "cpu" \
    --images_dir $INFERENCE_OUTPUT_PATH/images

I wonder whether the data split provided in the dataset is the same as the one used in the paper. If so, what could explain this difference?

Please let me know if anyone else is getting similar results. Thank you!

@aishwaryap (Contributor)
Hi @yingShen-ys

That sounds like a reasonable result. I will leave the issue open, however, so that we can see whether others are able to reproduce it.
While the dataset split itself has not changed, we have made some improvements to the inference code, which have resulted in higher scores. If you want to run inference the way it was done in the paper, add the argument --skip_edh_history to your inference command. You can see what this argument does by checking ETModel.start_new_edh_instance().
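For example, here is a sketch of the command from the original post with that flag appended (assuming --skip_edh_history is a plain boolean switch; all other arguments are unchanged from your script):

# Paper-style inference: skip the EDH history when starting each instance
python src/teach/cli/inference.py \
    --model_module teach.inference.et_model \
    --model_class ETModel \
    --data_dir $ET_DATA \
    --output_dir $INFERENCE_OUTPUT_PATH/inference__teach_et_trial_$SPLIT \
    --split $SPLIT \
    --metrics_file $INFERENCE_OUTPUT_PATH/metrics__teach_et_trial_$SPLIT.json \
    --seed 4 \
    --model_dir $ET_DATA/baseline_models/et \
    --object_predictor $ET_LOGS/pretrained/maskrcnn_model.pth \
    --visual_checkpoint $ET_LOGS/pretrained/fasterrcnn_model.pth \
    --device "cpu" \
    --images_dir $INFERENCE_OUTPUT_PATH/images \
    --skip_edh_history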

Best,
Aishwarya

@yingShen-ys (Author)


Got it. Thank you for the clarification.

@Ji4chenLi


We got a similar result. In fact, the results can differ significantly when training on different machines.

@aishwaryap (Contributor)

I believe the differences are less likely to be due to the machine used for training and more likely to be an effect of the random seed. We have also seen this behavior when training the ET model on ALFRED.
