
[Data2Vec] Add data2vec vision #16760

Merged

29 commits
bfc336b
save intermediate
patrickvonplaten Apr 13, 2022
9c5909f
add vision
patrickvonplaten Apr 13, 2022
bf943a2
add vision
patrickvonplaten Apr 13, 2022
4c98167
save
patrickvonplaten Apr 14, 2022
3d9d6cb
finish models
patrickvonplaten Apr 14, 2022
694de11
finish models
patrickvonplaten Apr 14, 2022
d0a68f9
continue
patrickvonplaten Apr 14, 2022
392069c
finish
patrickvonplaten Apr 14, 2022
2eb555c
up
patrickvonplaten Apr 14, 2022
70ae17d
up
patrickvonplaten Apr 14, 2022
e98cfd8
up
patrickvonplaten Apr 14, 2022
e809dfe
tests all pass
patrickvonplaten Apr 14, 2022
335a2ec
clean up
patrickvonplaten Apr 14, 2022
ec8bc47
up
patrickvonplaten Apr 14, 2022
4c5b840
up
patrickvonplaten Apr 14, 2022
e4136f4
fix bugs in beit
patrickvonplaten Apr 14, 2022
2b65921
correct docs
patrickvonplaten Apr 14, 2022
ede8b78
finish
patrickvonplaten Apr 14, 2022
ac6e79d
finish docs
patrickvonplaten Apr 14, 2022
333d844
make style
patrickvonplaten Apr 14, 2022
6865b02
up
patrickvonplaten Apr 14, 2022
4b17c19
more fixes
patrickvonplaten Apr 14, 2022
b44a7ab
fix type hint
patrickvonplaten Apr 14, 2022
077441e
Merge branch 'main' of https://github.com/huggingface/transformers in…
patrickvonplaten Apr 14, 2022
4951bf9
make style
patrickvonplaten Apr 14, 2022
e6bc294
Apply suggestions from code review
patrickvonplaten Apr 18, 2022
6276bde
Update tests/data2vec/test_modeling_data2vec_vision.py
patrickvonplaten Apr 18, 2022
5bc2aac
Merge branch 'main' of https://github.com/huggingface/transformers in…
patrickvonplaten Apr 18, 2022
f67ba98
fix test
patrickvonplaten Apr 18, 2022
1 change: 1 addition & 0 deletions docs/source/en/index.mdx
@@ -190,6 +190,7 @@ Flax), PyTorch, and/or TensorFlow.
| CTRL | ✅ | ❌ | ✅ | ✅ | ❌ |
| Data2VecAudio | ❌ | ❌ | ✅ | ❌ | ❌ |
| Data2VecText | ❌ | ❌ | ✅ | ❌ | ❌ |
| Data2VecVision | ❌ | ❌ | ✅ | ❌ | ❌ |
| DeBERTa | ✅ | ✅ | ✅ | ✅ | ❌ |
| DeBERTa-v2 | ✅ | ❌ | ✅ | ✅ | ❌ |
| Decision Transformer | ❌ | ❌ | ✅ | ❌ | ❌ |
30 changes: 26 additions & 4 deletions docs/source/en/model_doc/data2vec.mdx
@@ -33,10 +33,13 @@ Models and code are available at www.github.com/pytorch/fairseq/tree/master/exam

Tips:

- Both Data2VecAudio and Data2VecText have been trained using the same self-supervised learning method.
In the case of Data2VecAudio, preprocessing is identical to [`RobertaModel`], including tokenization.
- Data2VecAudio, Data2VecText, and Data2VecVision have all been trained using the same self-supervised learning method.
- For Data2VecAudio, preprocessing is identical to [`Wav2Vec2Model`], including feature extraction.
- For Data2VecText, preprocessing is identical to [`RobertaModel`], including tokenization.
- For Data2VecVision, preprocessing is identical to [`BeitModel`], including feature extraction.

This model was contributed by [edugp](https://huggingface.co/edugp) and [patrickvonplaten](https://huggingface.co/patrickvonplaten).

This model was contributed by [edugp](https://huggingface.co/edugp).
The original code can be found [here](https://github.com/pytorch/fairseq/tree/main/examples/data2vec).


@@ -48,12 +51,16 @@ The original code can be found [here](https://github.com/pytorch/fairseq/tree/ma

[[autodoc]] Data2VecAudioConfig

## Data2VecVisionConfig

[[autodoc]] Data2VecVisionConfig


## Data2VecAudioModel

[[autodoc]] Data2VecAudioModel
- forward


## Data2VecAudioForAudioFrameClassification

[[autodoc]] Data2VecAudioForAudioFrameClassification
@@ -108,3 +115,18 @@ The original code can be found [here](https://github.com/pytorch/fairseq/tree/ma

[[autodoc]] Data2VecTextForQuestionAnswering
- forward

## Data2VecVisionModel

[[autodoc]] Data2VecVisionModel
- forward

## Data2VecVisionForImageClassification

[[autodoc]] Data2VecVisionForImageClassification
- forward

## Data2VecVisionForSemanticSegmentation

[[autodoc]] Data2VecVisionForSemanticSegmentation
- forward
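The three vision classes documented above share BEiT's architecture and API. A minimal sketch of calling the base model, using a tiny randomly initialized config for illustration (the sizes here are arbitrary, not a released checkpoint):

```python
import torch
from transformers import Data2VecVisionConfig, Data2VecVisionModel

# Tiny, randomly initialized config -- for illustration only, not a
# pretrained checkpoint. The field names mirror BeitConfig.
config = Data2VecVisionConfig(
    image_size=32,
    patch_size=16,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
)
model = Data2VecVisionModel(config)
model.eval()

pixel_values = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width)
with torch.no_grad():
    outputs = model(pixel_values)

# (32 / 16) ** 2 = 4 patches, plus one [CLS] token -> sequence length 5
print(outputs.last_hidden_state.shape)  # torch.Size([1, 5, 32])
```

For a real checkpoint, `from_pretrained` with a `data2vec-vision` model id on the Hub replaces the config-based construction.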
1 change: 1 addition & 0 deletions docs/source/en/serialization.mdx
@@ -54,6 +54,7 @@ Ready-made configurations include the following architectures:
- BlenderbotSmall
- CamemBERT
- Data2VecText
- Data2VecVision
- DistilBERT
- ELECTRA
- FlauBERT
26 changes: 24 additions & 2 deletions src/transformers/__init__.py
@@ -170,7 +170,13 @@
"models.convnext": ["CONVNEXT_PRETRAINED_CONFIG_ARCHIVE_MAP", "ConvNextConfig"],
"models.cpm": ["CpmTokenizer"],
"models.ctrl": ["CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP", "CTRLConfig", "CTRLTokenizer"],
"models.data2vec": ["DATA2VEC_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP", "Data2VecAudioConfig", "Data2VecTextConfig"],
"models.data2vec": [
"DATA2VEC_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP",
"DATA2VEC_VISION_PRETRAINED_CONFIG_ARCHIVE_MAP",
"Data2VecAudioConfig",
"Data2VecTextConfig",
"Data2VecVisionConfig",
],
"models.deberta": ["DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaConfig", "DebertaTokenizer"],
"models.deberta_v2": ["DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP", "DebertaV2Config"],
"models.decision_transformer": ["DECISION_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "DecisionTransformerConfig"],
@@ -868,6 +874,7 @@
[
"DATA2VEC_AUDIO_PRETRAINED_MODEL_ARCHIVE_LIST",
"DATA2VEC_TEXT_PRETRAINED_MODEL_ARCHIVE_LIST",
"DATA2VEC_VISION_PRETRAINED_MODEL_ARCHIVE_LIST",
"Data2VecAudioForAudioFrameClassification",
"Data2VecAudioForCTC",
"Data2VecAudioForSequenceClassification",
@@ -882,6 +889,10 @@
"Data2VecTextForTokenClassification",
"Data2VecTextModel",
"Data2VecTextPreTrainedModel",
"Data2VecVisionForImageClassification",
"Data2VecVisionForSemanticSegmentation",
"Data2VecVisionModel",
"Data2VecVisionPreTrainedModel",
]
)
_import_structure["models.deberta"].extend(
@@ -2555,7 +2566,13 @@
from .models.convnext import CONVNEXT_PRETRAINED_CONFIG_ARCHIVE_MAP, ConvNextConfig
from .models.cpm import CpmTokenizer
from .models.ctrl import CTRL_PRETRAINED_CONFIG_ARCHIVE_MAP, CTRLConfig, CTRLTokenizer
from .models.data2vec import DATA2VEC_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP, Data2VecAudioConfig, Data2VecTextConfig
from .models.data2vec import (
DATA2VEC_TEXT_PRETRAINED_CONFIG_ARCHIVE_MAP,
DATA2VEC_VISION_PRETRAINED_CONFIG_ARCHIVE_MAP,
Data2VecAudioConfig,
Data2VecTextConfig,
Data2VecVisionConfig,
)
from .models.deberta import DEBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaConfig, DebertaTokenizer
from .models.deberta_v2 import DEBERTA_V2_PRETRAINED_CONFIG_ARCHIVE_MAP, DebertaV2Config
from .models.decision_transformer import (
@@ -3151,6 +3168,7 @@
from .models.data2vec import (
DATA2VEC_AUDIO_PRETRAINED_MODEL_ARCHIVE_LIST,
DATA2VEC_TEXT_PRETRAINED_MODEL_ARCHIVE_LIST,
DATA2VEC_VISION_PRETRAINED_MODEL_ARCHIVE_LIST,
Data2VecAudioForAudioFrameClassification,
Data2VecAudioForCTC,
Data2VecAudioForSequenceClassification,
@@ -3165,6 +3183,10 @@
Data2VecTextForTokenClassification,
Data2VecTextModel,
Data2VecTextPreTrainedModel,
Data2VecVisionForImageClassification,
Data2VecVisionForSemanticSegmentation,
Data2VecVisionModel,
Data2VecVisionPreTrainedModel,
)
from .models.deberta import (
DEBERTA_PRETRAINED_MODEL_ARCHIVE_LIST,
10 changes: 9 additions & 1 deletion src/transformers/models/auto/configuration_auto.py
@@ -59,6 +59,7 @@
("layoutlmv2", "LayoutLMv2Config"),
("plbart", "PLBartConfig"),
("beit", "BeitConfig"),
("data2vec-vision", "Data2VecVisionConfig"),
("rembert", "RemBertConfig"),
("visual_bert", "VisualBertConfig"),
("canine", "CanineConfig"),
@@ -162,6 +163,7 @@
("layoutlmv2", "LAYOUTLMV2_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("plbart", "PLBART_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("beit", "BEIT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("data2vec-vision", "DATA2VEC_VISION_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("rembert", "REMBERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("visual_bert", "VISUAL_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"),
("canine", "CANINE_PRETRAINED_CONFIG_ARCHIVE_MAP"),
@@ -349,12 +351,18 @@
("layoutxlm", "LayoutXLM"),
("data2vec-audio", "Data2VecAudio"),
("data2vec-text", "Data2VecText"),
("data2vec-vision", "Data2VecVision"),
("dit", "DiT"),
]
)

SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict(
[("openai-gpt", "openai"), ("data2vec-audio", "data2vec"), ("data2vec-text", "data2vec")]
[
("openai-gpt", "openai"),
("data2vec-audio", "data2vec"),
("data2vec-text", "data2vec"),
("data2vec-vision", "data2vec"),
]
)
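All three data2vec variants live in the single `models/data2vec` module, which is why they need entries in the special-case map above rather than the default dash-to-underscore rule. A simplified, self-contained sketch of the library's `model_type_to_module_name` lookup:

```python
from collections import OrderedDict

# Mirrors SPECIAL_MODEL_TYPE_TO_MODULE_NAME from the diff above
SPECIAL_MODEL_TYPE_TO_MODULE_NAME = OrderedDict(
    [
        ("openai-gpt", "openai"),
        ("data2vec-audio", "data2vec"),
        ("data2vec-text", "data2vec"),
        ("data2vec-vision", "data2vec"),
    ]
)


def model_type_to_module_name(key: str) -> str:
    """Simplified sketch: special cases win, otherwise dashes become underscores."""
    if key in SPECIAL_MODEL_TYPE_TO_MODULE_NAME:
        return SPECIAL_MODEL_TYPE_TO_MODULE_NAME[key]
    return key.replace("-", "_")


print(model_type_to_module_name("data2vec-vision"))  # data2vec
print(model_type_to_module_name("layoutlmv2"))       # layoutlmv2
```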


3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -55,6 +55,7 @@
("layoutlmv2", "LayoutLMv2Model"),
("plbart", "PLBartModel"),
("beit", "BeitModel"),
("data2vec-vision", "Data2VecVisionModel"),
("rembert", "RemBertModel"),
("visual_bert", "VisualBertModel"),
("canine", "CanineModel"),
@@ -290,6 +291,7 @@
("vit", "ViTForImageClassification"),
("deit", ("DeiTForImageClassification", "DeiTForImageClassificationWithTeacher")),
("beit", "BeitForImageClassification"),
("data2vec-vision", "Data2VecVisionForImageClassification"),
("segformer", "SegformerForImageClassification"),
("imagegpt", "ImageGPTForImageClassification"),
(
@@ -321,6 +323,7 @@
[
# Model for Semantic Segmentation mapping
("beit", "BeitForSemanticSegmentation"),
("data2vec-vision", "Data2VecVisionForSemanticSegmentation"),
("segformer", "SegformerForSemanticSegmentation"),
("dpt", "DPTForSemanticSegmentation"),
]
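With the mapping entries above, the Auto classes can route a `Data2VecVisionConfig` to the matching head. A sketch using `from_config` (randomly initialized, tiny sizes chosen purely for illustration):

```python
from transformers import AutoModelForImageClassification, Data2VecVisionConfig

# Tiny config for illustration; from_config builds the architecture
# without downloading any pretrained weights.
config = Data2VecVisionConfig(
    image_size=32,
    patch_size=16,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
)
model = AutoModelForImageClassification.from_config(config)
print(type(model).__name__)  # Data2VecVisionForImageClassification
```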
17 changes: 9 additions & 8 deletions src/transformers/models/beit/modeling_beit.py
@@ -702,7 +702,8 @@ def forward(
pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

if not return_dict:
return (sequence_output, pooled_output) + encoder_outputs[1:]
head_outputs = (sequence_output, pooled_output) if pooled_output is not None else (sequence_output,)
return head_outputs + encoder_outputs[1:]

return BeitModelOutputWithPooling(
last_hidden_state=sequence_output,
@@ -713,7 +714,7 @@


class BeitPooler(nn.Module):
def __init__(self, config: BeitModel) -> None:
def __init__(self, config: BeitConfig) -> None:
super().__init__()
self.layernorm = (
nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) if config.use_mean_pooling else None
@@ -736,7 +737,7 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
BEIT_START_DOCSTRING,
)
class BeitForMaskedImageModeling(BeitPreTrainedModel):
def __init__(self, config: BeitModel) -> None:
def __init__(self, config: BeitConfig) -> None:
super().__init__(config)

self.num_labels = config.num_labels
@@ -817,7 +818,7 @@ def forward(
masked_lm_loss = loss_fct(prediction_scores[bool_masked_pos], labels)

if not return_dict:
output = (prediction_scores,) + outputs[2:]
output = (prediction_scores,) + outputs[1:]
return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output

return MaskedLMOutput(
@@ -836,7 +837,7 @@ def forward(
BEIT_START_DOCSTRING,
)
class BeitForImageClassification(BeitPreTrainedModel):
def __init__(self, config: BeitModel) -> None:
def __init__(self, config: BeitConfig) -> None:
super().__init__(config)

self.num_labels = config.num_labels
@@ -1237,7 +1238,7 @@ def forward(
return_dict=return_dict,
)

encoder_hidden_states = outputs.hidden_states if return_dict else outputs[2]
encoder_hidden_states = outputs.hidden_states if return_dict else outputs[1]

# only keep certain features, and reshape
# note that we do +1 as the encoder_hidden_states also includes the initial embeddings
@@ -1268,9 +1269,9 @@

if not return_dict:
if output_hidden_states:
output = (logits,) + outputs[2:]
output = (logits,) + outputs[1:]
else:
output = (logits,) + outputs[3:]
output = (logits,) + outputs[2:]
return ((loss,) + output) if loss is not None else output

return SemanticSegmenterOutput(
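The BEiT changes above all trace back to one bug: when the base model is built without a pooler, the old code returned a `(sequence_output, None)` tuple, so the head classes' `outputs[2:]` slices were off by one. The fix drops the placeholder, shifting the slice indices from 2 to 1. A plain-Python sketch of the before/after tuple shapes (string stand-ins instead of tensors):

```python
def beit_base_outputs(sequence_output, pooled_output, encoder_extras):
    # Fixed behavior from the diff: omit the pooled output entirely when
    # the model has no pooler, instead of keeping a None placeholder.
    head_outputs = (sequence_output, pooled_output) if pooled_output is not None else (sequence_output,)
    return head_outputs + encoder_extras

# With a pooler, hidden states still sit at index 2 ...
with_pooler = beit_base_outputs("seq", "pooled", ("hidden_states",))
# ... without one, they shift to index 1 -- hence outputs[2:] -> outputs[1:]
# in BeitForMaskedImageModeling and the segmentation head in the diff.
without_pooler = beit_base_outputs("seq", None, ("hidden_states",))

print(with_pooler)     # ('seq', 'pooled', 'hidden_states')
print(without_pooler)  # ('seq', 'hidden_states')
```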
26 changes: 26 additions & 0 deletions src/transformers/models/data2vec/__init__.py
@@ -31,6 +31,11 @@
"Data2VecTextConfig",
"Data2VecTextOnnxConfig",
],
"configuration_data2vec_vision": [
"DATA2VEC_VISION_PRETRAINED_CONFIG_ARCHIVE_MAP",
"Data2VecVisionConfig",
"Data2VecVisionOnnxConfig",
],
}

if is_torch_available():
@@ -54,6 +59,14 @@
"Data2VecTextModel",
"Data2VecTextPreTrainedModel",
]
_import_structure["modeling_data2vec_vision"] = [
"DATA2VEC_VISION_PRETRAINED_MODEL_ARCHIVE_LIST",
"Data2VecVisionForImageClassification",
"Data2VecVisionForMaskedImageModeling",
"Data2VecVisionForSemanticSegmentation",
"Data2VecVisionModel",
"Data2VecVisionPreTrainedModel",
]

if TYPE_CHECKING:
from .configuration_data2vec_audio import DATA2VEC_AUDIO_PRETRAINED_CONFIG_ARCHIVE_MAP, Data2VecAudioConfig
@@ -62,6 +75,11 @@
Data2VecTextConfig,
Data2VecTextOnnxConfig,
)
from .configuration_data2vec_vision import (
DATA2VEC_VISION_PRETRAINED_CONFIG_ARCHIVE_MAP,
Data2VecVisionConfig,
Data2VecVisionOnnxConfig,
)

if is_torch_available():
from .modeling_data2vec_audio import (
@@ -84,6 +102,14 @@
Data2VecTextModel,
Data2VecTextPreTrainedModel,
)
from .modeling_data2vec_vision import (
DATA2VEC_VISION_PRETRAINED_MODEL_ARCHIVE_LIST,
Data2VecVisionForImageClassification,
Data2VecVisionForMaskedImageModeling,
Data2VecVisionForSemanticSegmentation,
Data2VecVisionModel,
Data2VecVisionPreTrainedModel,
)

else:
import sys