
Add RAFT model for optical flow #5022

Merged: 18 commits into pytorch:main on Dec 6, 2021

Conversation

@NicolasHug (Member) commented Dec 2, 2021

Towards #4644

This PR adds the RAFT model, along with its basic building blocks (feature encoder, correlation block, update block, etc.) and two model builder functions: raft_large() and raft_small().

The architecture (not the code!) is exactly the same as the original implementation from https://github.com/princeton-vl/RAFT, which will allow us to support the original paper's weights if we want to. This architecture differs slightly from what is described in the paper. I have annotated the paper with these differences, hoping this can help the review:
RAFT - Recurrent All-Pairs Field Transforms for Optical Flow (1).pdf

A summary of the differences:

[summary image]


API

The RAFT class accepts torch.nn.Module instances as input and offers a low-level API.

RAFT(
    feature_encoder=feature_encoder,
    context_encoder=context_encoder,
    corr_block=corr_block,
    update_block=update_block,
    mask_predictor=mask_predictor,
)

The model builder functions raft_large() and raft_small() are higher-level and do not require any parameters. They can, however, take the same parameters as the RAFT class to override the defaults, e.g.:

model = raft_large(mask_predictor=MyOwnMaskPredictor())  # all the other building blocks will use the default architecture

The building blocks like FeatureEncoder, ResidualBlock, UpdateBlock, etc. are (sort of publicly) available in torchvision.models.optical_flow.raft, but are not exposed in __all__.
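
For reference, a minimal usage sketch of the builders (illustration only: it assumes the forward pass takes two image batches with H and W divisible by 8, as asserted in this PR, and returns the list of iterative flow predictions as in the original RAFT implementation):

import torch
from torchvision.models.optical_flow import raft_small

model = raft_small(pretrained=False).eval()

img1 = torch.rand(1, 3, 368, 496)  # H and W must be divisible by 8
img2 = torch.rand(1, 3, 368, 496)

with torch.no_grad():
    flow_predictions = model(img1, img2)  # one flow estimate per recurrent update

flow = flow_predictions[-1]  # the last estimate is the most refined
print(flow.shape)  # torch.Size([1, 2, 368, 496])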


Still left TODO, here or in follow-up PRs:

  • Add an expect test for both raft() and raft_small()
  • Add better docs
  • Add type annotations (🥲)

In follow up PRs I will also submit the training reference and associated transforms.

I will write review comments below to highlight important bits, or things where I'm not too sure what the best approach is.

cc @datumbox

@facebook-github-bot commented Dec 2, 2021

💊 CI failures summary and remediations

As of commit f077d7c (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@NicolasHug (Member, Author) left a comment:

Here are my initial comments, @fmassa @datumbox @haooooooqi!

@@ -0,0 +1 @@
from ._raft.raft import RAFT, raft, raft_small
@NicolasHug (Member, Author) commented:

We should expose the building blocks as well, stuff like ResidualBlock, FeatureEncoder, etc.

Should we expose these in a models.optical_flow.raft namespace?

I remember @datumbox mentioning a few issues when we have both a module and a function with the same name. Ideally, I'd like to keep the raft() name for the function builder, but I'm happy to get your thoughts on this

@datumbox (Contributor) commented Dec 2, 2021:

Blocking:

See @pmeier's #4867 (comment) around this. We might need to rename this to raft_large similar to mobilenet_v3_large and mobilenet_v3_small.

Question:

Can you talk about the choice to create a second private submodule _raft? Why is that?

FYI:

We should expose the building blocks as well, stuff like ResidualBlock, FeatureEncoder, etc.

If you check existing models, such as resnet and faster_rcnn you will see we don't expose these on the __all__ or anywhere else. We had discussions about what this means but we didn't have a consensus (public API vs developer API discussion). I would be in favour of not exposing these publicly to be consistent with everywhere else. We should discuss and resolve this once we have time in the roadmap.

@NicolasHug (Member, Author) commented Dec 2, 2021:

Thanks for the feedback

Can you talk about the choice to create a second private submodule _raft? Why is that?

I wrote it like this out of habit but I had no intention of keeping it this way

Instead of renaming raft() to raft_large(), would it be OK to rename the .py file to something like raft_core.py or raft_implem.py?

We would have raft(), raft_small() and RAFT available from torchvision.models.optical_flow and then we would be able to access the other building blocks like ResidualBlock in torchvision.models.optical_flow.raft_core -- I would make sure to exclude them from __all__.

Would that be OK?

@datumbox (Contributor) commented:

Though calling it raft_large would make it consistent with other small/large models, I can see that in the paper they don't actually call it large. They do use Raft-s or "small" for the small version, so that's canonical at least. So I understand why you want to avoid naming it like that.

Renaming the file should be OK but @pmeier should confirm. Though I'm not sure what the actual name should be. We typically name the files after their algorithm and we let the model builder have extra info on the variant.

@fmassa thoughts on this?

@NicolasHug (Member, Author) commented:

OK, I'll use raft.py and raft_large() for now to move forward. We can revisit later if needed, thanks for the input

A Collaborator commented:

Not commenting on the name, because I don't have an opinion on that. We just need to make sure that we don't have an attribute with the same name as a module in the same namespace. Otherwise we can no longer access the module. Thus having raft.py and def raft() is problematic, but raft.py and raft_large() is fine.



class ResidualBlock(nn.Module):
# This is pretty similar to resnet.BasicBlock except for one call to relu, and the bias terms
@NicolasHug (Member, Author) commented:

We can probably merge this with the resnet implementation (same for BottleneckBlock below)... but this might make the API awkward and bloated. For now I'd say it's simpler and safer to keep these implementations separate

@datumbox (Contributor) commented:

FYI: Agreed. One more reason why we shouldn't expose this on the __all__.

A Member commented:

Let's keep this here as is

Comment on lines 150 to 151
self.convflow1 = ConvNormActivation(2, flow_layers[0], norm_layer=None, kernel_size=7)
self.convflow2 = ConvNormActivation(flow_layers[0], flow_layers[1], norm_layer=None, kernel_size=3)
@NicolasHug (Member, Author) commented:

Does it make sense to use a ConvNormActivation while passing norm_layer=None? No strong opinion from me

@datumbox (Contributor) commented:

Nit: It's up to you. The class was recently updated to support this, but I checked and so far all uses of ConvNormActivation pass a non-None value, so you would be the first to use this idiom. It saves you a few lines of code, though not many.

A Member commented:

I'm ok with this.
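
For context, a minimal sketch of what the norm_layer=None idiom amounts to, assuming the behavior of torchvision.ops.misc.ConvNormActivation at the time (no norm layer appended, conv bias enabled when there is no norm); the channel sizes here are example values, not the ones in the PR:

import torch.nn as nn
from torchvision.ops.misc import ConvNormActivation

# With norm_layer=None the block reduces to conv + activation (no norm layer),
# and the conv bias is enabled since there is no norm to absorb it.
convflow1 = ConvNormActivation(2, 128, norm_layer=None, kernel_size=7)

# Roughly equivalent hand-written version:
convflow1_manual = nn.Sequential(
    nn.Conv2d(2, 128, kernel_size=7, padding=3, bias=True),
    nn.ReLU(inplace=True),
)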

flow_head_hidden_size,
# Mask predictor
use_mask_predictor,
**kwargs,
@NicolasHug (Member, Author) commented:

kwargs are here to override the RAFT class parameters, e.g. to override the entire feature_encoder, or the whole update_block.
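
To illustrate the intent with a toy stand-in (this is not the PR's code, and build_toy_raft is a hypothetical name): a block passed through kwargs replaces the default one entirely, while the remaining parameters build the defaults.

import torch.nn as nn

def build_toy_raft(**kwargs):
    # Each block can be overridden wholesale via kwargs; otherwise a default is built.
    feature_encoder = kwargs.pop("feature_encoder", None) or nn.Conv2d(3, 64, kernel_size=7, stride=8, padding=3)
    update_block = kwargs.pop("update_block", None) or nn.Identity()
    return nn.ModuleDict({"feature_encoder": feature_encoder, "update_block": update_block})

default_model = build_toy_raft()
custom_model = build_toy_raft(update_block=nn.Conv2d(64, 64, kernel_size=3, padding=1))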



class FeatureEncoder(nn.Module):
def __init__(self, *, block=ResidualBlock, layers=(64, 64, 96, 128, 256), norm_layer=nn.BatchNorm2d):
@NicolasHug (Member, Author) commented:

I named this layers to be sort of consistent with the ResNet class. I'm happy to consider other names though



class RecurrentBlock(nn.Module):
def __init__(self, *, input_size, hidden_size, kernel_size=((1, 5), (5, 1)), padding=((0, 2), (2, 0))):
@NicolasHug (Member, Author) commented:

Where possible I tried to write defaults that will correspond to the normal RAFT model (kernel_size, padding). I refrained from doing that for the input and output shapes though. Thoughts?

@datumbox (Contributor) commented:

FYI: I don't know how standard these values are. If you think multiple input_size/hidden_size combinations will keep using the same kernels, then that's OK. From other models I think that's reasonable.
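
For background, a sketch of the kind of convolutional GRU these defaults correspond to (following the ConvGRU/SepConvGRU structure of the original implementation; this is an illustration, not the PR's exact code):

import torch
import torch.nn as nn

class ConvGRUSketch(nn.Module):
    def __init__(self, *, input_size, hidden_size, kernel_size, padding):
        super().__init__()
        in_ch = hidden_size + input_size
        self.convz = nn.Conv2d(in_ch, hidden_size, kernel_size, padding=padding)
        self.convr = nn.Conv2d(in_ch, hidden_size, kernel_size, padding=padding)
        self.convq = nn.Conv2d(in_ch, hidden_size, kernel_size, padding=padding)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                         # update gate
        r = torch.sigmoid(self.convr(hx))                         # reset gate
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))  # candidate state
        return (1 - z) * h + z * q

# The default kernel_size/padding pairs correspond to two such units applied in
# sequence: one with a horizontal (1, 5) kernel and one with a vertical (5, 1) kernel.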



class MaskPredictor(nn.Module):
def __init__(self, *, in_channels, hidden_size, multiplier=0.25):
@NicolasHug (Member, Author) commented:

I'm keeping the 0.25 default for consistency with the original code, but I'm tempted to set it to 1 to encourage users not to use it

@datumbox (Contributor) commented:

FYI: Given that TorchVision is supposed to follow the papers very closely, using the 0.25 aligns with this principle. It will also make it easier to port weights. Personally I'm fine with how you implement and document it. I would also recommend writing a blogpost about your implementation (similar to 1, 2), as this will let you discuss such details more thoroughly and help new joiners understand the implementation.
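
For context, a sketch of how such a multiplier is typically applied, assuming it plays the same role as the 0.25 scaling of the predicted upsampling mask in the original RAFT code (this class is illustrative; the PR's MaskPredictor may differ in details):

import torch.nn as nn

class MaskPredictorSketch(nn.Module):
    def __init__(self, *, in_channels, hidden_size, multiplier=0.25):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, hidden_size, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # 8x8 upsampling weights over a 3x3 neighbourhood -> 9 * 8 * 8 = 576 channels
            nn.Conv2d(hidden_size, 8 * 8 * 9, kernel_size=1),
        )
        self.multiplier = multiplier

    def forward(self, hidden_state):
        # Scaling the logits (0.25 in the original code) is meant to balance gradients.
        return self.multiplier * self.conv(hidden_state)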

Comment on lines 365 to 367
# As in the original paper, the actual output of the context encoder is split in 2 parts:
# - one part is used to initialize the hidden state of the recurrent units of the update block
# - the rest is the "actual" context.
@NicolasHug (Member, Author) commented:

This is a bit that I did not see mentioned in the paper. I wish we could separate the initialization of the hidden state from the context encoder, but this would likely prevent us from exactly reproducing the original implementation.
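
For reference, a sketch of the split described above, following the original implementation (the variable names and the 128/128 channel sizes of RAFT-large are illustrative):

import torch
import torch.nn.functional as F

hidden_state_size, context_size = 128, 128
context_out = torch.rand(1, hidden_state_size + context_size, 46, 62)  # output of the context encoder

hidden_state, context = torch.split(context_out, [hidden_state_size, context_size], dim=1)
hidden_state = torch.tanh(hidden_state)  # initializes the recurrent unit's hidden state
context = F.relu(context)                # the "actual" context features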


batch_size, _, h, w = image1.shape
torch._assert((h, w) == image2.shape[-2:], "input images should have the same shape")
torch._assert((h % 8 == 0) and (w % 8 == 0), "input image H and W should be divisible by 8")
@NicolasHug (Member, Author) commented:

This "8" downscaling factor is hard-coded in different places. We should be able to generalize it in future versions.

@datumbox (Contributor) commented:

FYI: other models have similar requirements. An example is MobileNets which use the _make_divisible() method located here. You could create a similar helper method and generalize but that's a NIT.
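
As a side note, callers can also lift the restriction themselves by padding the inputs up to the next multiple of 8 and cropping the flow back afterwards. A hypothetical helper sketch, similar in spirit to the InputPadder used in the original RAFT repo (not part of this PR):

import torch.nn.functional as F

def pad_to_multiple_of_8(img1, img2):
    # Pad H and W up to the next multiple of 8 (replicate padding keeps borders sensible).
    h, w = img1.shape[-2:]
    pad_h = (-h) % 8
    pad_w = (-w) % 8
    padding = (0, pad_w, 0, pad_h)  # left, right, top, bottom
    return F.pad(img1, padding, mode="replicate"), F.pad(img2, padding, mode="replicate"), padding

def unpad_flow(flow, padding):
    _, pad_w, _, pad_h = padding
    h, w = flow.shape[-2:]
    return flow[..., : h - pad_h, : w - pad_w]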

@datumbox (Contributor) left a comment:

@NicolasHug Great work.

I'm reviewing code practices and the API, not ML validity, as that is something you already checked with @haooooooqi.

I clearly marked my comments as FYI (just for discussion), Questions (where I just want more info), NIT (non-blocking, could optionally be addressed in follow-up PRs) and Blocking (which I think need to be done here).

My Blocking comments are minimal, as you can see below, and can be addressed easily. After addressing them, ping me to approve, but I would wait for @fmassa's review prior to merging.


@fmassa (Member) left a comment:

Thanks a lot @NicolasHug!
Approving to unblock, but check @datumbox's blocking comments as well before merging.



def grid_sample(img, absolute_grid, *args, **kwargs):
"""Same as torch's grid_sample, with absolute pixel coordinates instead of normalized coordinates."""
A Member commented:

Food for thought: would it be a significant limitation if we were to return the flows in normalized coordinates, and perform all computations in normalized coordinates?

Or is it standard that flow images are in absolute coordinates?

One benefit of keeping it in relative coordinates is that you don't need to multiply by the scaling factor when upsampling an image.
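
For context, a sketch of what such a wrapper typically does, assuming the usual conversion from absolute pixel coordinates to the [-1, 1] range expected by torch.nn.functional.grid_sample (illustrative; the PR's helper may differ in details):

import torch
import torch.nn.functional as F

def grid_sample_absolute(img, absolute_grid, mode="bilinear", align_corners=True):
    # absolute_grid has shape (N, H_out, W_out, 2) with (x, y) in pixel units.
    h, w = img.shape[-2:]
    xgrid, ygrid = absolute_grid.split([1, 1], dim=-1)
    xgrid = 2 * xgrid / (w - 1) - 1  # map [0, w-1] -> [-1, 1]
    ygrid = 2 * ygrid / (h - 1) - 1  # map [0, h-1] -> [-1, 1]
    normalized_grid = torch.cat([xgrid, ygrid], dim=-1)
    return F.grid_sample(img, normalized_grid, mode=mode, align_corners=align_corners)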

def raft_large(*, pretrained=False, progress=True, **kwargs):

    if pretrained:
        raise NotImplementedError("Pretrained weights aren't available yet")
@datumbox (Contributor) commented Dec 2, 2021:

Perhaps NotImplementedError is indeed the right exception here, but for now I would recommend throwing:

Suggested change:
-    raise NotImplementedError("Pretrained weights aren't available yet")
+    raise ValueError(f"No checkpoint is available for model.")

This is because some tests check for this specific exception:

if "No checkpoint is available" in msg:
pytest.skip(msg)

@@ -818,5 +818,31 @@ def test_detection_model_trainable_backbone_layers(model_fn, disable_weight_load
assert n_trainable_params == _model_tests_values[model_name]["n_trn_params_per_layer"]


@needs_cuda
@pytest.mark.parametrize("model_builder", (models.optical_flow.raft_large, models.optical_flow.raft_small))
@pytest.mark.parametrize("scripted", (False, True))
@NicolasHug (Member, Author) commented:

I'm parametrizing over this because testing with _check_jit_scriptable unfortunately fails on very few entries, e.g.:

Mismatched elements: 153 / 11520 (1.3%)
Greatest absolute difference: 0.0002608299255371094 at index (0, 0, 79, 45) (up to 0.0001 allowed)
Greatest relative difference: 0.021354377198448304 at index (0, 1, 53, 68) (up to 0.0001 allowed)

I could add tol parameters to the check, but I feel like the current test is just as good.
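
For context, a sketch of what such a parametrized smoke test could look like (illustrative only; the real test in this PR also carries @needs_cuda and may check different things):

import pytest
import torch
from torchvision import models

@pytest.mark.parametrize("model_builder", (models.optical_flow.raft_large, models.optical_flow.raft_small))
@pytest.mark.parametrize("scripted", (False, True))
def test_raft_smoke(model_builder, scripted):
    torch.manual_seed(0)
    model = model_builder().eval()
    if scripted:
        model = torch.jit.script(model)
    img1 = torch.rand(1, 3, 64, 80)
    img2 = torch.rand(1, 3, 64, 80)
    with torch.no_grad():
        flow_predictions = model(img1, img2)
    # Only check the shape of the final (most refined) prediction, instead of
    # comparing eager vs. scripted outputs with a tight tolerance.
    assert flow_predictions[-1].shape == (1, 2, 64, 80)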

@NicolasHug mentioned this pull request on Dec 3, 2021.
@NicolasHug merged commit 01ffb3a into pytorch:main on Dec 6, 2021.

@NicolasHug (Member, Author) commented:

Test failures are unrelated, merging. Thanks a lot for the reviews!!

facebook-github-bot pushed a commit that referenced this pull request Dec 9, 2021
Reviewed By: NicolasHug

Differential Revision: D32950937

fbshipit-source-id: 7e024dad4c3d55bc832beadfd1b3ffe867f238f3