Much higher scores when evaluating Episodic Transformer baselines for EDH instances #10

Open
yingShen-ys opened this issue Feb 3, 2022 · 4 comments



yingShen-ys commented Feb 3, 2022

Hello,

I have finished evaluating the Episodic Transformer baselines for the TEACh Benchmark Challenge on the valid_seen split.

However, one odd thing I found is that our reproduced results are much higher than those reported in the paper. The results are shown below (all values are percentages). There is a total of 608 EDH instances (valid_seen) in the metrics file, which matches the number in the paper.

|  | SR [TLW] | GC [TLW] |
| --- | --- | --- |
| Reproduced | 13.8 [3.2] | 14 [8.7] |
| Reported in the paper | 5.76 [0.90] | 7.99 [1.65] |

I believe I am using the correct checkpoints, and the only change I made to the code is the one mentioned in #9.

I am running on an AWS instance. I started the X server and installed all requirements and prerequisites without issues, and the inference process ran without errors.

Here is the script I used for evaluation.

#!/bin/sh

export AWS_ROOT=/home/ubuntu/workplace
export ET_DATA=$AWS_ROOT/data
export TEACH_ROOT_DIR=$AWS_ROOT/teach
export TEACH_SRC_DIR=$TEACH_ROOT_DIR/src
export ET_ROOT=$TEACH_SRC_DIR/teach/modeling/ET
export ET_LOGS=$TEACH_ROOT_DIR/src/teach/modeling/ET/checkpoints
export INFERENCE_OUTPUT_PATH=$TEACH_ROOT_DIR/inference_output
export PYTHONPATH=$TEACH_SRC_DIR:$ET_ROOT:$PYTHONPATH
export SPLIT=valid_seen

cd $TEACH_ROOT_DIR
python src/teach/cli/inference.py \
    --model_module teach.inference.et_model \
    --model_class ETModel \
    --data_dir $ET_DATA \
    --output_dir $INFERENCE_OUTPUT_PATH/inference__teach_et_trial_$SPLIT \
    --split $SPLIT \
    --metrics_file $INFERENCE_OUTPUT_PATH/metrics__teach_et_trial_$SPLIT.json \
    --seed 4 \
    --model_dir $ET_DATA/baseline_models/et \
    --object_predictor $ET_LOGS/pretrained/maskrcnn_model.pth \
    --visual_checkpoint $ET_LOGS/pretrained/fasterrcnn_model.pth \
    --device "cpu" \
    --images_dir $INFERENCE_OUTPUT_PATH/images

I wonder whether the data split provided in the dataset is the same as the one used in the paper. If so, what could explain this difference?

Please let me know if anyone else is getting similar results. Thank you!

@aishwaryap (Contributor)
Hi @yingShen-ys

That sounds like a reasonable result. I will leave the issue open, however, so that we can see whether others are able to reproduce it.
While the dataset split itself has not changed, we have made some improvements to the inference code, which have resulted in higher scores. If you want to run inference the way it was done in the paper, add the argument --skip_edh_history to your inference command. You can see what this argument does by checking ETModel.start_new_edh_instance().
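For example, here is a sketch of the command from the original post with that flag appended (assuming --skip_edh_history is a plain boolean switch; all other arguments are unchanged from your script):

# Paper-style inference: skip the EDH history when starting each instance
python src/teach/cli/inference.py \
    --model_module teach.inference.et_model \
    --model_class ETModel \
    --data_dir $ET_DATA \
    --output_dir $INFERENCE_OUTPUT_PATH/inference__teach_et_trial_$SPLIT \
    --split $SPLIT \
    --metrics_file $INFERENCE_OUTPUT_PATH/metrics__teach_et_trial_$SPLIT.json \
    --seed 4 \
    --model_dir $ET_DATA/baseline_models/et \
    --object_predictor $ET_LOGS/pretrained/maskrcnn_model.pth \
    --visual_checkpoint $ET_LOGS/pretrained/fasterrcnn_model.pth \
    --device "cpu" \
    --images_dir $INFERENCE_OUTPUT_PATH/images \
    --skip_edh_history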

Best,
Aishwarya

@yingShen-ys (Author)


Got it. Thank you for the clarification.

@Ji4chenLi


We got a similar result. In fact, the results can differ significantly when training on different machines.

@aishwaryap (Contributor)

I believe the differences are less likely to be due to the machine used for training and more likely to be an effect of the random seed. We have also seen this behavior when training the ET model on ALFRED.
