Discussing PWC Section #1

Open
mhamilton723 opened this issue Jun 16, 2022 · 4 comments

Comments

@mhamilton723

Hello, congrats on the release of your fantastic work. I love the fact that you can use language to prompt the segmentation, and we appreciate you citing and comparing against STEGO!

I wanted to reach out about how we might collectively manage the Papers with Code section on unsupervised segmentation. Because CLIP is trained on image-language pairs and you use it to generate the attention maps, I think this work might fall under weakly supervised methods, such as either of these:

https://paperswithcode.com/task/weakly-supervised-object-localization
https://paperswithcode.com/task/weakly-supervised-semantic-segmentation

Let me know what you think about this proposal; I'm happy to discuss it further. Congrats again on making your work public!

Best,
Mark

@hq-deng

hq-deng commented Jun 16, 2022

I have the same confusion. The work seems to avoid being confused with the unsupervised learning setup by describing itself as zero-shot adaptation, yet the experiments compare against unsupervised segmentation methods.
If CLIP is used without any labels, the setting may be close to unsupervised. But if a label is provided for each image, it should count as a weakly supervised setting. Besides, DenseCLIP is trained with pixel-level annotations, so it could be a zero-shot task rather than an unsupervised one.
I'm wondering: if we only use a CLIP model (without image-level or pixel-level labels), how should we define the task? Comparing it against either unsupervised or weakly supervised methods seems unfair.

@noelshin
Owner

Hi both,

Thank you for your input.

First, to clarify a couple of points raised by @hq-deng's comment:

  1. The DenseCLIP model we use is not trained with pixel-level labels. Note that there are two models called DenseCLIP: "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting" https://arxiv.org/pdf/2112.01518.pdf (which does use pixel-level labels) and "DenseCLIP: Extract Free Dense Labels from CLIP" https://arxiv.org/pdf/2112.01071.pdf (which does not). We use the latter.
  2. We do not use image-level labels in the process of ReCo inference or training the segmentation model (ReCo+). One of the backbone models we use in our experiments is pre-trained for classification on ImageNet with labels. Here we follow the use of the word "unsupervised" in previous literature (such as PiCIE https://arxiv.org/pdf/2103.17070.pdf and SegSort https://arxiv.org/pdf/1910.06962.pdf).

@mhamilton723, we did not consider our work to be weakly supervised because we are not training for segmentation on images with class labels (in the same way that PiCIE does not refer to itself as weakly supervised). On the other hand, we recognise that there is a spectrum of supervision from zero supervision up to fully supervised, and that by using CLIP we are not at the "zero" end of the spectrum. As such, a precise name would be useful to avoid confusion.

In response to your question, we thought that perhaps "Unsupervised Semantic Segmentation with Language-Image Pre-training" could be a better fit for the task setting considered by ReCo (and the DenseCLIP baseline we consider in our paper). If this name seems appropriate to you both (feedback is highly welcome - it would be good for us to get the right name), we will create a branch for the task in Papers with Code.

Gyungin

@hq-deng

hq-deng commented Jun 19, 2022

Hello @noelshin,

Thanks for your comprehensive answer. This is a novel and interesting segmentation setting. Although it is difficult to define, you are boldly exploring it. Congratulations on your groundbreaking work.

@mhamilton723
Author

Hey @noelshin, thanks for the detailed reply. I think it would be a good idea to split out a separate leaderboard for methods that use supervised pre-training, as you suggested. In some sense, text labels provide even more supervision than classes or tags, which is why I originally suggested weakly supervised methods. Thanks for being flexible and understanding on this topic :)

3 participants