Add VideoMAE #17821
Conversation
Thanks for working on this model, left a few comments.
```python
def resize_video(self, video, size, resample="bilinear"):
    return [self.resize(frame, size, resample) for frame in video]

def crop_video(self, video, size):
    return [self.center_crop(frame, size) for frame in video]

def normalize_video(self, video, mean, std):
    return [self.normalize(frame, mean, std) for frame in video]
```
For tensors, we should implement something using PyTorch here, as iterating through the frames will be super slow.
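To illustrate the point about speed: if the video already arrives as a single tensor of shape (T, C, H, W), normalization can be done in one broadcasted operation instead of a per-frame loop. A hypothetical sketch (the helper name and signature are illustrative, not the PR's API):

```python
import torch


def normalize_video_tensor(video: torch.Tensor, mean, std) -> torch.Tensor:
    """Normalize a (T, C, H, W) float tensor in one broadcasted op.

    `mean` and `std` are per-channel sequences. Illustrative sketch only,
    not part of the PR's feature extractor.
    """
    mean = torch.as_tensor(mean, dtype=video.dtype).view(1, -1, 1, 1)
    std = torch.as_tensor(std, dtype=video.dtype).view(1, -1, 1, 1)
    return (video - mean) / std


video = torch.rand(16, 3, 224, 224)  # 16 frames of a 224x224 RGB video
out = normalize_video_tensor(video, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
```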
The original implementation also iterates over the frames (they use cv2 instead of Pillow for each frame): https://github.com/MCG-NJU/VideoMAE/blob/bd18ef559b31bc69c6c2dc91e3fdd09343016f00/functional.py#L26
Each frame can either be a NumPy array or a PIL image.
Or do you mean when you provide a single tensor of shape (B, T, C, H, W)? Because currently the feature extractor only accepts lists of PIL images or tensors.
Normalization would be way faster if done on one big tensor, if we have tensors here. Likewise, it would be faster done once on one big NumPy array if we have NumPy arrays.
If we have a list of PIL images, it's converted anyway.
Oh yes, sorry, you probably only meant normalization. Should we add a `normalize_video` method to `image_utils.py` that accepts either a list of NumPy arrays or PIL images?
That works too.
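As a rough sketch of what such a helper could look like for the NumPy case (hypothetical, not the final implementation added to `image_utils.py`, which would also need to handle PIL images): stack the frames into one array and normalize with broadcasting.

```python
import numpy as np


def normalize_video(video, mean, std):
    """Normalize a list of (H, W, C) NumPy frames in one broadcasted op.

    Illustrative sketch only; `mean` and `std` are per-channel sequences.
    """
    frames = np.stack(video).astype(np.float32)  # (T, H, W, C)
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (frames - mean) / std  # broadcasts over the channel axis


video = [np.random.rand(224, 224, 3).astype(np.float32) for _ in range(16)]
out = normalize_video(video, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
```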
```diff
@@ -0,0 +1,18 @@
+# import torch
```
This file should not be added to the PR.
```diff
@@ -0,0 +1,55 @@
+import numpy as np
```
This one shouldn't be either.
Still should not be there.
@NielsRogge do you have any ETA on this feature? I am developing a video classification fine-tuning framework and would love to use this model if it gets merged into main! Currently the only video model is Perceiver IO, right?
LGTM! 🎥 Just one small nit comment.
The only question I have is about the `videomae/test.py` and `videomae/test_model.py` files and whether they're in the right place, as they look more like scripts.
```python
>>> model = VideoMAEForPreTraining.from_pretrained("nanjing/vit-mae-base")

>>> pixel_values = feature_extractor(video, return_tensors="pt").pixel_values
>>> bool_masked_pos = ...
```
Definition missing here
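For context, `bool_masked_pos` is a boolean mask over the patch sequence indicating which patches are masked during pretraining. One possible definition, sketched here as an assumption (the shapes, the 90% masking ratio, and the seq_len computation are illustrative, not copied from the PR):

```python
import torch

# Illustrative only: number of patches for a 16-frame, 224x224 video
# with 16x16 spatial patches and a tubelet size of 2.
num_frames, tubelet_size = 16, 2
image_size, patch_size = 224, 16
seq_len = (num_frames // tubelet_size) * (image_size // patch_size) ** 2

# Randomly mask ~90% of the patches (a high ratio is typical for MAE-style
# video pretraining). True = masked.
num_masked = int(0.9 * seq_len)
noise = torch.rand(1, seq_len)            # one random score per patch
ids = torch.argsort(noise, dim=1)         # random permutation of patch indices
bool_masked_pos = torch.zeros(1, seq_len, dtype=torch.bool)
bool_masked_pos[0, ids[0, :num_masked]] = True
```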
Make sure to remove all test scripts before merging (they really shouldn't be added in the first place; please take more care of the files you add with `git`).
```python
from .utils.constants import (  # noqa: F401
    IMAGENET_DEFAULT_MEAN,
    IMAGENET_DEFAULT_STD,
    IMAGENET_STANDARD_MEAN,
    IMAGENET_STANDARD_STD,
)
```
@sgugger is this ok?
There still seems to be an issue with the docs.
@LysandreJik yes, I was aware of that; it should be fixed now. Please don't merge yet, I'm transferring checkpoints and updating the conversion script.
The documentation is not available anymore as the PR was closed or merged.
* First draft
* Add VideoMAEForVideoClassification
* Improve conversion script
* Add VideoMAEForPreTraining
* Add VideoMAEFeatureExtractor
* Improve VideoMAEFeatureExtractor
* Improve docs
* Add first draft of model tests
* Improve VideoMAEForPreTraining
* Fix base_model_prefix
* Make model take pixel_values of shape (B, T, C, H, W)
* Add loss computation of VideoMAEForPreTraining
* Improve tests
* Improve model tests
* Make all tests pass
* Add VideoMAE to main README
* Add tests for VideoMAEFeatureExtractor
* Add integration test
* Improve conversion script
* Rename patch embedding class
* Remove VideoMAELayer from init
* Update design of patch embeddings
* Improve comments
* Improve conversion script
* Improve conversion script
* Add conversion of pretrained model
* Add loss verification of pretrained model
* Add loss verification of unnormalized targets
* Add integration test for pretraining model
* Apply suggestions from code review
* Fix bug to make feature extractor resize only shorter edge
* Address more comments
* Improve normalization of videos
* Add doc examples
* Move constants to dedicated script
* Remove scripts
* Transfer checkpoints, fix docs
* Update script
* Update image mean and std
* Fix doc tests
* Set return_tensors to NumPy by default
* Revert the previous change

Co-authored-by: Niels Rogge <[email protected]>
What does this PR do?
This PR adds VideoMAE, which extends ViTMAE to videos.
The only difference between VideoMAE and ViT is that you need to replace `nn.Conv2d` by `nn.Conv3d` in the patch embedding class. 😂

To do:

- `VideoMAEFeatureExtractor` (should we keep it, or rename to `VideoMAEProcessor`, `VideoMAEPreprocessor`?)
- `pixel_values` of shape (batch_size, num_frames, num_channels, height, width). The original implementation uses (B, C, T, H, W)
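To illustrate the two points above together, here is a minimal, hypothetical patch-embedding sketch (class and parameter names are assumptions, not copied from the PR). Since `nn.Conv3d` expects (B, C, T, H, W), pixel values passed as (B, T, C, H, W) need a permute first:

```python
import torch
from torch import nn


class TubeletEmbedding(nn.Module):
    """Illustrative sketch of a VideoMAE-style patch embedding."""

    def __init__(self, num_channels=3, hidden_size=768, patch_size=16, tubelet_size=2):
        super().__init__()
        # The ViT patch embedding's nn.Conv2d becomes an nn.Conv3d whose
        # kernel/stride also span `tubelet_size` frames along the time axis.
        self.projection = nn.Conv3d(
            num_channels,
            hidden_size,
            kernel_size=(tubelet_size, patch_size, patch_size),
            stride=(tubelet_size, patch_size, patch_size),
        )

    def forward(self, pixel_values):
        # pixel_values: (B, T, C, H, W), the shape this PR's model takes.
        pixel_values = pixel_values.permute(0, 2, 1, 3, 4)  # -> (B, C, T, H, W)
        embeddings = self.projection(pixel_values)          # (B, hidden, T', H', W')
        return embeddings.flatten(2).transpose(1, 2)        # (B, seq_len, hidden)


x = torch.rand(1, 16, 3, 224, 224)  # 16 frames of a 224x224 RGB video
tokens = TubeletEmbedding()(x)      # seq_len = (16/2) * (224/16)**2 = 1568
```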