Yes, each step-by-step instruction has a corresponding subgoal in the training and validation trajectories. If you exploit this alignment during training, please follow the submission guidelines when submitting to the leaderboard.
You should be able to achieve a >99% success rate on training and validation tasks with the ground-truth actions and masks from the dataset. Occasionally, non-deterministic behaviors in THOR can lead to failures, but they are extremely rare.
Mask prediction is an important part of the ALFRED challenge. Unlike in non-interactive environments (e.g., vision-and-language navigation), here the agent must specify exactly what it wants to interact with.
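As a rough illustration, an interaction mask predicted over the current frame can be resolved to a concrete simulator object by comparing it against per-object instance segmentation masks. The sketch below is a minimal example under the assumption that the predicted mask is a binary (H, W) NumPy array and that your environment wrapper exposes a dict of per-object masks (as THOR's event.instance_masks does when instance segmentation is enabled); it is not the exact resolution logic used by ALFRED.

```python
import numpy as np

def resolve_interaction_object(pred_mask, instance_masks):
    """Pick the object whose instance mask best overlaps the predicted mask.

    pred_mask: binary (H, W) NumPy array predicted by the agent.
    instance_masks: dict mapping objectId -> binary (H, W) mask, e.g. from a
        THOR event with instance segmentation enabled (assumed wiring).
    Returns the objectId with the highest IoU, or None if nothing overlaps.
    """
    best_id, best_iou = None, 0.0
    for obj_id, obj_mask in instance_masks.items():
        inter = np.logical_and(pred_mask, obj_mask).sum()
        union = np.logical_or(pred_mask, obj_mask).sum()
        iou = inter / union if union > 0 else 0.0
        if iou > best_iou:
            best_id, best_iou = obj_id, iou
    return best_id
```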
Why does feat_conv.pt in the Full Dataset have 10 more frames than the number of images?
The last 10 frames are copies of the features from the last image frame.
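If you want the features to align one-to-one with the images on disk, you can simply drop the trailing copies. The snippet below is a minimal sketch; the trajectory path and image directory name are placeholders, and the assumption is only that feat_conv.pt holds one feature row per frame followed by the 10 repeated rows.

```python
import glob
import torch

# Placeholder path into one Full Dataset trajectory.
traj_dir = "data/full_2.1.0/train/<task>/<trial>"
feats = torch.load(f"{traj_dir}/feat_conv.pt")               # one feature row per frame
num_images = len(glob.glob(f"{traj_dir}/raw_images/*.jpg"))  # image directory name is an assumption

# The last 10 rows just repeat the final image's features, so trimming them
# restores a 1:1 alignment between features and saved images.
assert feats.shape[0] == num_images + 10
feats = feats[:num_images]
```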
Yes. Run the training script with --use_templated_goals.
You can use augment_trajectories.py to replay all the trajectories and augment the visual observations. At each step, use the THOR API to look around and take 6-12 shots of the surroundings, then stitch these shots together into a panoramic image for that frame. You might have to set 'forceAction': True for smooth MoveAhead/Rotate/Look actions. Note that getting panoramic images at test time would incur the additional cost of having the agent look around. A minimal sketch of the capture-and-stitch step is shown below.
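The sketch below assumes an ai2thor-style controller whose step() takes an action dict and returns an event with a .frame RGB array; the number of views, the reliance on the controller's configured rotation step (90° by default in ALFRED, so you would need a smaller rotation increment to get 6-12 shots), and the naive horizontal stitching are illustrative choices, not part of augment_trajectories.py.

```python
import numpy as np

def capture_panorama(env, num_views=4):
    """Rotate the agent in place and stitch the captured views into one panorama.

    env: ai2thor-style controller; env.step(dict) returning an event with a
         .frame (H, W, 3) RGB array is an assumption about your wrapper.
    num_views: number of rotations; with the default 90-degree rotation step,
         4 views cover a full turn (use a smaller step for more views).
    """
    views = []
    for _ in range(num_views):
        event = env.step(dict(action="RotateLeft", forceAction=True))
        views.append(event.frame)
    # Simple horizontal concatenation; a true panorama may need warping/blending.
    return np.concatenate(views, axis=1)
```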
Why does feat_conv.pt in Modeling Quickstart contain fewer frames than in the Full Dataset?
The Full Dataset contains extracted ResNet features for every frame in ['images'], which includes filler frames in between low-level actions (used to generate smooth videos), whereas Modeling Quickstart only contains features for each low_idx, i.e., the frames observed after taking each low-level action.
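To go from the Full Dataset features to one feature per low-level action yourself, you can subsample by low_idx. The sketch below assumes the ALFRED traj_data.json layout, where ['images'] lists one entry per frame with a 'low_idx' field; the paths are placeholders, and whether you keep the first or last frame for each index depends on how you define the post-action observation.

```python
import json
import torch

# Placeholder paths into one Full Dataset trajectory.
traj_dir = "data/full_2.1.0/train/<task>/<trial>"
with open(f"{traj_dir}/traj_data.json") as f:
    traj = json.load(f)

full_feats = torch.load(f"{traj_dir}/feat_conv.pt")  # one row per frame, incl. filler frames

# Keep one frame index per low-level action (here: the first frame recorded
# for each low_idx), which approximates the Modeling Quickstart features.
seen, keep = set(), []
for i, im in enumerate(traj["images"]):
    if im["low_idx"] not in seen:
        seen.add(im["low_idx"])
        keep.append(i)
quickstart_like_feats = full_feats[keep]
```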
Yes, run the training script with --fast_epoch.