Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NOMRG] TransformsV2 questions / comments #7092

Closed
wants to merge 15 commits into from

Conversation

NicolasHug
Copy link
Member

Just writing down my thoughts / questions below. Some of the points are already addressed / tracked in #7082.

@NicolasHug
Copy link
Member Author

As discussed offline with @pmeier , we should bring back the get_params() method for BC reason and to minimize adoption friction. The main use-case of these methods was for all inputs (e.g. images and masks) to be transformed with the same RNG. Users had to call get_params() on the class, and pass the returned value to the functional version of the class. Something like:

class RandomCrop:
def __init__(self, size):
self.size = size
def __call__(self, image, target):
image = pad_if_smaller(image, self.size)
target = pad_if_smaller(target, self.size, fill=255)
crop_params = T.RandomCrop.get_params(image, (self.size, self.size))
image = F.crop(image, *crop_params)
target = F.crop(target, *crop_params)
return image, target

In V2 this is all seamlessly supported, but in order to minimize adoption friction of the new transforms, it might be a good idea to keep those methods around (as deprecated).


Another use-case was to support transforming different batches (available at different times) with the same RNG as in #3001 (comment). This is something that should be supported in the future in a much better way, once there is a more fine-grained RNG support #7027. For now, this can still be supported if we bring back the get_params() methods.

@pmeier
Copy link
Collaborator

pmeier commented Jan 16, 2023

Re get_params: we have 9 transforms currently in v1 that follow that have such a static method:

I'm going to send a draft PR as a basis for discussion of how I envision that we could reinstate them.

format: BoundingBoxFormat
format: BoundingBoxFormat # TODO: do not use a builtin?
# TODO: This is the size of the image, not the box. Maybe make this explicit in the name?
# Note: if this isn't user-facing, the TODO is not critical at all
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is user facing. In general, the metadata of the datapoints is considered public.

Copy link
Collaborator

@vfdev-5 vfdev-5 Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spatial_size was renamed from image_size once we added support to Videos if I recall correctly. #6736

But I agree it can be unclear if spatial_size refers to bbox or something else...

@@ -268,4 +302,4 @@ def gaussian_blur(self, kernel_size: List[int], sigma: Optional[List[float]] = N


InputType = Union[torch.Tensor, PIL.Image.Image, Datapoint]
InputTypeJIT = torch.Tensor
InputTypeJIT = torch.Tensor # why alias it?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To have an easier time looking up what the actual type should be. Meaning if you see *JIT as annotation, you can simply look up * to see what the actual type should be. If we would use InputType and torch.Tensor, this relation is gone.


# Do we want to keep these public?
# This is probalby related to internal customer needs. TODO for N: figure that out
# This is also related to allow user-defined subclasses and transforms (in anticipation of)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, resize is the most problematic one, since it actually overrides the (deprecated) tensor method with a completely different behavior.

@@ -11,6 +11,12 @@
L = TypeVar("L", bound="_LabelBase")


# Do we have transforms that change the categories?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No

@@ -11,6 +11,12 @@
L = TypeVar("L", bound="_LabelBase")


# Do we have transforms that change the categories?
# Why do we need the labels to be datapoints?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because some transformations need this information. Examples are

Comment on lines +72 to +74
# Saw this from [Feedback] thread (https://github.com/pytorch/vision/issues/6753#issuecomment-1308295943)
# Not sure I understand the need to return a Tensor instead of and Image
# Is the inconsistency "worth" it? How can we make sure this isn't too unexpected?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that most of our image kernels assume that an image is in the range [0, 1.0 if image.is_floating_point() else torch.iinfo(image.dtype).max] and this assumption is violated after normalization. The image is no longer an RGB image and we felt we needed to make this explicit. Of course any ops not related to color like crop will still work as before.

As for the impact, I feel this is fine. Normalization is usually one of the last steps if not the last step before something is passed into the model. Plus, since we have the tensor to image fallback, users can still use ops like crop afterwards without issues.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the details!

For me to better understand the trade-off, I'll try to find reasons to keep the output as Image instead of tensor. LMK what you think.

we have the tensor to image fallback

I guess that also means one can do e.g. Compose(Normalize(), SomeColorTransform()) and while Normalize will output a float tensor, SomeColorTransform() will treat it as a 0-1 image? Basically, the unwrapping may not be preventing any bug?

I'm also wondering whether this might become obsolete / overkill if we remove the ColorSpace metadata, which as discussed previously is currently redundant with num_channels?

Of course any ops not related to color like crop will still work as before.

Do we have transforms that rely on the ColorSpace meta-data right now? I could only find RandomPhotometricDistort (and it can probably do without)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that also means one can do e.g. Compose(Normalize(), SomeColorTransform()) and while Normalize will output a float tensor, SomeColorTransform() will treat it as a 0-1 image? Basically, the unwrapping may not be preventing any bug?

Yes and yes. This is more of an idealistic choice. Value range checks are too expensive at runtime and this we can only assume the range. Meaning, this unwrapping (or rather the absent re-wrapping) will basically have no effect at runtime.

I'm also wondering whether this might become obsolete / overkill if we remove the ColorSpace metadata, which as discussed previously is currently redundant with num_channels?

I don't think so. datapoints.Image (and datapoints.Video) carry more implicit assumptions than just the public metadata. An Image instance implicitly communicates that its values are in the correct range, whereas torch.Tensor is completely free. Of course if one uses it as image the same implicit assumptions apply, but this comes from the context rather than the type.

Do we have transforms that rely on the ColorSpace meta-data right now? I could only find RandomPhotometricDistort (and it can probably do without)

For reference:

if isinstance(inpt, (datapoints.Image, datapoints.Video)):
output = inpt.wrap_like(inpt, output, color_space=datapoints.ColorSpace.OTHER) # type: ignore[arg-type]

First of all, it is an oversight that we are using two different idioms. Both Normalize and RandomPhotometricDistort should either return a torch.Tensor or an Image with color_space=ColorSpace.OTHER.

To answer the question, yes it can do without. Apart from that, ConvertColorSpace also uses it, but we could also guess it from the channels. Although that would drop the guard that we currently have of transforming arbitrary "images", i.e. the normalized ones, to a different color space. Whether this is a good thing is probably open for discussion.

Copy link
Collaborator

@vfdev-5 vfdev-5 Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both Normalize and RandomPhotometricDistort should either return a torch.Tensor or an Image with color_space=ColorSpace.OTHER.

I think the reason for RandomPhotometricDistort to return an Image is that adjusting hue, saturation, etc should not go outside the original color space (RGB -> RGB), however Normalize returns a tensor due to the fact that float Image is supposed to be between [0, 1] but output normalized tensor can have any kind of ranges... So, I think it is fine that they do not return the same type of object

Copy link
Member Author

@NicolasHug NicolasHug Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not go outside the original color space (RGB -> RGB),

But then why setting it to ColorSpace.OTHER instead of keeping as ColorSpace.RGB?


It seems like there are 2 different concepts:

  • The color-space of an image: RGB vs RGBA vs greyscale (... vs. YUV vs CMYK which aren't supported in torchvision)
  • The range of values, and the current assumption that it is [0, 1.0 if image.is_floating_point() else torch.iinfo(image.dtype).max]

Conceptually these are 2 different things to me. E.g. I would argue that an RGB image in [0-1] is still an RGB image even after it is normalized: the range assumption may be broken, but it's still an RGB image - just in a non-standard range. Not sure I'm convincing but:

  • normalize (out = (X - mean) / std) is "just" a generalization of convertImageDtype (out = X / 255), and convertImageDtype preserves the ColorSpace attribute.
  • passing an RGB image to normalize changes its range, but it doesn't convert it to a greyscale or RGBA or YUV / CMYK.
  • Right now there's a 1:1 mapping between the ColorSpace and num_channels, so unless a transform changes num_channels there's no reason it should also change the color space

So I think there's an argument to be made to return Images instead of Tensors after normalize: I agree that the assumption about the range is broken, but a) this assumption isn't encoded in the ColorSpace anyway (it's encoded in the dtype) and b) returning a Tensor doesn't prevent any more issue than returning an Image.

Nor sure if I'm too clear but happy to chat more about this!

Copy link
Collaborator

@vfdev-5 vfdev-5 Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, these are colorspace and data range are two different things but there are relations between them especially when we have to convert from one colorspace to another. For example, RGB to HSV is well-defined for [0-256] and [0-1] ranges, produced HSV image has predefined ranges for 8-bit, 16-bit or float32 dtypes (ref famous opencv link).
So, I think once RGB image is normalized with mean, std we can't convert it correctly to HSV space or to Gray scale...

But then why setting it to ColorSpace.OTHER instead of keeping as ColorSpace.RGB?

Maybe, due to the fact that we do not have BGR or GRB and other permutations of RGB...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely true that these 2 concepts start to overlap when we're talking about more color-spaces like HSV / CYMK / YUV etc.

But these aren't supported in torchvision, and there is no immediate plan to support them either (I'm not saying never, but still). So at this point in time, the distinction may be somewhat premature and may lock us into a design that we might never actually need?

Copy link
Collaborator

@vfdev-5 vfdev-5 Jan 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true that we do not have HSV and other colorspaces right now and I agree that we probably wont support them since tomorrow. However, we are already doing RGB to grayscale and adjusting hue/saturation by internally doing rgb to hsv conversion. I'm happy to return Image from Normalize if it helps with API adoption etc but we have to pay attention on output correctness and potential issues this step could bring...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is definitely valid, we should also make sure that returing an Image doesn't lock us either. Do we foresee any hard issue by returning an Image?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we foresee any hard issue by returning an Image?

I don't. Like I stated above, this was an idealistic choice in the first place. Due to our fallback, it made no difference at runtime whatsoever.

Comment on lines +44 to +46
# Also this Seems like a subset of query_chw():
# - can we just have query_chw()?
# - if not, can we share code between the 2?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I lost track of what exactly is the status here. We have quite few helpers for this and I'm not sure of the state. Worth looking into.

Comment on lines +59 to +62
# (This comment applies to the other query_() utils):
# Does this happen?
# What can go terribly wrong if we don't raise an error?
# Should we just document that this returns the size of the first inpt to avoid an error?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this happen?

Hopefully not. If it does, the input data is setup incorrectly. For example two images in the same sample of different shape, or an bounding_box.spatial_size attribute out of sync with the corresponding image.

What can go terribly wrong if we don't raise an error?

The first problem is that we needed to decide which value we return. sizes.pop() is not deterministic, so that would be quite bad.

Even if we can decide on a strategy to return a value, best case scenario is that the transform that requested the size fails loudly, albeit with a less expressive error message, when trying to transform an item that doesn't match the returned value. Worst case, the transform silently passes on bad input data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first problem is that we needed to decide which value we return. sizes.pop() is not deterministic, so that would be quite bad.

If we decide to not check for duplicates we don't need to create a set, so we don't need to pop() and we can return the size of the first input which is deterministic.

My first instinct would be to be defensive and check for duplicates as done in the current code. I wonder however if there's value in applying the "define errors out of existence" principle here. As you mentioned, the main risk is that the transforms silently returns wrong results - although it's not clear to me yet in which specific scenario that would happen.

(I'm not advocating for either option at this point, I'm merely trying to figure out the tradeoffs)

Do we have a sense of the performance hit incurred by the duplicate checks? If this gets called on large-ish batches (256 samples) with multiple inputs (images, masks, bbox, etc.), does it become noticeable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a sense of the performance hit incurred by the duplicate checks?

Nope. We only tried and benchmarked on our references and there we didn't notice anything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it this way then, thanks

raise ValueError(f"Found multiple HxW dimensions in the sample: {sequence_to_str(sorted(sizes))}")
h, w = sizes.pop()
return h, w


# when can types_or_checks be a callable?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_simple_tensor was the motivating example here.

if isinstance(obj, type_or_check) if isinstance(type_or_check, type) else type_or_check(obj):
return True
return False


# Are these public because they are needed for users to implement custom transforms?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are not strictly needed, but make the input checking a lot easier. If users come and look at our source how we do it, we should give them the same tools.

@@ -94,7 +94,7 @@ def _get_params(self, flat_inputs: List[Any]) -> Dict[str, Any]:
def _transform(
self, inpt: Union[datapoints.ImageType, datapoints.VideoType], params: Dict[str, Any]
) -> Union[datapoints.ImageType, datapoints.VideoType]:
if params["v"] is not None:
if params["v"] is not None: # What is this?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a value tensor or None used to erase the image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically it is the replacement that is put in the "erased" area. In v1, in case we didn't find an area to erase, we return the bounding box of the whole image as well as the image

return 0, 0, img_h, img_w, img

With that we call F.erase unconditionally, which ultimately leads to replacing every value in the original image with itself:

if not inplace:
img = img.clone()
img[..., i : i + h, j : j + w] = v
return img

Since that is quite nonsensical, we opted to also allow None as a return value and use it as a sentinel to do nothing. I think the previous implementation came from a time were JIT didn't support Union (or Optional for that matter) and thus we couldn't return Optional[torch.Tensor].

Comment on lines +32 to +33
# is "feature" now "datapoint" - or is it something else?
# A: yes, TODO: update name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #7117.

@@ -41,13 +43,17 @@ def __new__(
tensor = cls._to_tensor(data, dtype=dtype, device=device, requires_grad=requires_grad)
return tensor.as_subclass(Datapoint)

# Is this still needed, considering we won't be releasing the prototype datasets anytime soon?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #7154.

@NicolasHug NicolasHug mentioned this pull request Feb 10, 2023
49 tasks
@NicolasHug
Copy link
Member Author

Closing - we'll keep track of the progress in other places like #7217 (possibly coming back to this PR)

@NicolasHug NicolasHug closed this Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants