The download steps are closely adapted from SynSin. Download the dataset from RealEstate10K.
Store the files in the following structure. The `${REAL_ESTATE_10K}/test/` and `${REAL_ESTATE_10K}/train/` folders store the original text files. The frames need to be extracted based on these text files; we extract them to `${REAL_ESTATE_10K}/frames`. Some videos may be missing, so we use a few additional files, as described below. The files `${REAL_ESTATE_10K}/frames/train/video_loc.txt` and `${REAL_ESTATE_10K}/frames/test/video_loc.txt` store the locations of the extracted videos. Finally, for each extracted video located at `${REAL_ESTATE_10K}/frames/train/${path_totrain_vid1}/*.jpg`, we create a new text file `${REAL_ESTATE_10K}/frames/train/${path_totrain_vid1}.txt` that stores the metadata for each frame (this is necessary as there may be some errors in the extraction process). This file has the same structure as the original text file, except that rows for images that were not extracted have been removed.
After following the above, you should have the following structure:
- ${REAL_ESTATE_10K}/test/*.txt
- ${REAL_ESTATE_10K}/train/*.txt
- ${REAL_ESTATE_10K}/frames/train/
- ${REAL_ESTATE_10K}/frames/train/video_loc.txt
- ${REAL_ESTATE_10K}/frames/train/${path_totrain_vid1}/*.jpg
- ${REAL_ESTATE_10K}/frames/train/${path_totrain_vid1}.txt
...
- ${REAL_ESTATE_10K}/frames/train/${path_totrain_vidN}/*.jpg
- ${REAL_ESTATE_10K}/frames/train/${path_totrain_vidN}.txt
- ${REAL_ESTATE_10K}/frames/test/
- ${REAL_ESTATE_10K}/frames/test/video_loc.txt
- ${REAL_ESTATE_10K}/frames/test/${path_totest_vid1}/*.jpg
- ${REAL_ESTATE_10K}/frames/test/${path_totest_vid1}.txt
...
- ${REAL_ESTATE_10K}/frames/test/${path_totest_vidN}/*.jpg
- ${REAL_ESTATE_10K}/frames/test/${path_totest_vidN}.txt
where `${REAL_ESTATE_10K}/frames/train/video_loc.txt` contains:
${path_totrain_vid1}
...
${path_totrain_vidN}
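For reference, the per-video metadata files and video_loc.txt can be generated with a short script along the following lines. This is a minimal sketch, not the exact extraction code: it assumes extracted frames are named `<timestamp>.jpg`, that each video's frame folder shares the name of its original text file, and that the first line of each original text file is the video URL; adapt it to your extraction setup.

```python
import os
from pathlib import Path

# Sketch only: filter each original RealEstate10K text file down to the frames
# that were actually extracted, and record the video folder in video_loc.txt.
# Frame naming ("<timestamp>.jpg") and folder naming are assumptions; adjust to
# match your extraction script.
REAL_ESTATE_10K = Path(os.environ.get("REAL_ESTATE_10K", "/path/to/RealEstate10K"))

def build_metadata(split: str = "train") -> None:
    frames_root = REAL_ESTATE_10K / "frames" / split
    video_locs = []
    for orig_txt in sorted((REAL_ESTATE_10K / split).glob("*.txt")):
        vid_name = orig_txt.stem                      # e.g. ${path_totrain_vid1}
        frame_dir = frames_root / vid_name
        if not frame_dir.is_dir():
            continue                                  # video missing entirely, skip
        lines = orig_txt.read_text().splitlines()
        header, rows = lines[0], lines[1:]            # first line is the video URL
        kept = [r for r in rows
                if (frame_dir / f"{r.split()[0]}.jpg").exists()]  # drop failed frames
        if not kept:
            continue
        (frames_root / f"{vid_name}.txt").write_text("\n".join([header] + kept) + "\n")
        video_locs.append(vid_name)
    (frames_root / "video_loc.txt").write_text("\n".join(video_locs) + "\n")

if __name__ == "__main__":
    build_metadata("train")
    build_metadata("test")
```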
We test on a small subset (3.6k frames) of RealEstate10K. To facilitate replicability, we will share these frames privately with users who have already agreed to the terms for downloading RealEstate10K and agreed to the terms of this Google Form.
Update the paths in `./options/options.py` for the dataset being used.
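The option names below are illustrative only (they are not necessarily the names defined in options.py); they simply indicate the kind of paths to point at the extracted data:

```python
# Illustrative only: the actual option names in ./options/options.py may differ.
REAL_ESTATE_10K = "/path/to/RealEstate10K"

dataset_paths = {
    "train_data_path": f"{REAL_ESTATE_10K}/frames/train/",
    "test_data_path":  f"{REAL_ESTATE_10K}/frames/test/",
}
```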
Some data-path variables in the shell scripts have been set to my local directory /x/cnris/... as an example. CUDA_VISIBLE_DEVICES is also set in each script to indicate the GPUs it requires. Change both as needed.
Training is split into 3 components:
- Train VQ-VAE
  - `sh scripts/extract_vqvae_dset_realestate.sh`: select a subset of images from RealEstate10K to use for training (32k) & validation (8k)
  - `sh scripts/train_vqvae_realestate.sh`: train for 150 epochs with batch size 120
  - `sh scripts/extract_code_realestate.sh`: with the trained VQ-VAE, extract codes for the 40k set
- Train the depth, projection, and refinement module (using the frozen VQ-VAE)
  - `sh scripts/train_dpr_realestate.sh`: train for 250 epochs with batch size 12 on the full RealEstate10K dataset
  - `sh scripts/extract_pixcnn_orders_realestate.sh`: with the trained depth model, extract the orderings used for outpainting on the 40k set
- Train Custom-Order PixelCNN++ (using VQ-VAE embeddings and orderings from the depth model)
  - `sh scripts/train_lmconv_realestate.sh`: train for 150 epochs with batch size 120
TIP: Autoregressive sampling is slow, especially when using many samples! The evaluation code below runs on a single GPU; for faster inference, we recommend running it multiple times across GPUs with different splits of the evaluation set.
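For example, a simple way to parallelize is to launch one process per GPU on a different slice of the evaluation set. The split arguments passed to the script below are hypothetical; the provided eval scripts may need to be modified to accept them.

```python
import os
import subprocess

# Hypothetical parallel launcher: run the evaluation script once per GPU, each
# on a different slice of the evaluation set. The (num_splits, split_id)
# positional arguments are illustrative; pass whatever split arguments your
# eval script actually supports.
gpus = [0, 1, 2, 3]
procs = []
for split_id, gpu in enumerate(gpus):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["sh", "scripts/eval_quality_realestate.sh",
         str(len(gpus)), str(split_id)],
        env=env))
for p in procs:
    p.wait()
```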
Our pretrained model is available for download here. SynSin - 6X (our main baseline: SynSin trained on rotations consistent with our model) is available here. SynSin and several other baselines are available for download at its GitHub. Place these models in a new directory `modelcheckpoints/realestate`.
Evaluating quality evaluates a single predicted output given an input image and camera transform. Evaluation consists of two steps:
- `sh scripts/eval_quality_realestate.sh`: predict outputs for all images
- `python calc_errors_quality.py`: compare outputs to ground truth using FID, Perc Sim, and PSNR. Adjust the variables `names`, `base`, `imagedir`, `copy`, and `sampled` as needed
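For reference, the PSNR part of the comparison can be computed as in the sketch below; FID and Perc Sim additionally require their respective pretrained networks, which calc_errors_quality.py handles.

```python
import numpy as np
from PIL import Image

def psnr(pred_path: str, gt_path: str) -> float:
    """Peak signal-to-noise ratio (dB) between two same-size RGB images."""
    pred = np.asarray(Image.open(pred_path).convert("RGB"), dtype=np.float64) / 255.0
    gt   = np.asarray(Image.open(gt_path).convert("RGB"),   dtype=np.float64) / 255.0
    mse = np.mean((pred - gt) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)
```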
Due to updated versions of PyTorch, seeding, etc., results will closely match the paper but not exactly. Outputs generated by our model that precisely replicate the paper results are available here.
Evaluating consistency measures the agreement between two predicted outputs given an input image and camera transform. The first output uses the full rotation and translation; the second uses a transform halfway between the full transform and the input.
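One way to construct the halfway transform is to slerp the rotation and halve the translation; this interpolation choice is an assumption for illustration, and the released evaluation scripts define the transforms actually used.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def halfway_pose(R_full: np.ndarray, t_full: np.ndarray):
    """Pose halfway between the identity (input view) and the full transform.

    The interpolation choice (slerp on rotation, linear on translation) is an
    assumption made for illustration only.
    """
    key_rots = Rotation.from_matrix(np.stack([np.eye(3), R_full]))
    slerp = Slerp([0.0, 1.0], key_rots)
    R_half = slerp(0.5).as_matrix()
    t_half = 0.5 * t_full
    return R_half, t_half
```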
We use two methods of evaluating consistency.
- Using camera transformations involving both rotation and translation, via `sh scripts/eval_consistency_realestate.sh`. There is no clear automated metric for this, so we rely on A/B testing to compare consistency across models.
- Keeping position fixed and applying only rotation. This allows us to use a homography to compare the consistency of overlapping predicted regions using Perc Sim and PSNR. First, download the relevant information to compute homographies (here, here, and here), put these in `data/`, and unzip. Next, run `sh scripts/eval_consistency_homography_realestate.sh`, then `python calc_errors_consistency_homography_realestate.py`, changing variables similarly to `calc_errors.py` as needed.
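As a rough illustration of the homography-based comparison in (2): given a 3x3 homography H mapping pixels of the first prediction into the frame of the second, the overlapping regions can be compared as below. This is a sketch only; the provided scripts handle the actual homography data and metrics.

```python
import cv2
import numpy as np

def overlap_psnr(img_a: np.ndarray, img_b: np.ndarray, H: np.ndarray) -> float:
    """Warp prediction A into B's frame with homography H and compute PSNR
    on the overlapping (valid) region only. Images are HxWx3 uint8 arrays;
    H maps pixel coordinates of A to B. Illustration only."""
    h, w = img_b.shape[:2]
    warped = cv2.warpPerspective(img_a, H, (w, h))
    mask = cv2.warpPerspective(np.ones(img_a.shape[:2], np.uint8), H, (w, h)) > 0
    diff = (warped.astype(np.float64) - img_b.astype(np.float64)) / 255.0
    mse = np.mean(diff[mask] ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)
```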
Due to updated versions of PyTorch, seeding, etc., results will closely match the paper but not exactly. Outputs generated by our model that precisely replicate the paper results are available here for (1) and here for homography (2).