Transformer-in-Vision

Recent Transformer-based computer vision (CV) and related works. Comments and contributions are welcome!

The transformer is now a basic component, adopted in nearly all AI models. This list is updated irregularly.

New Hope: LLM-in-Vision

Resource

ChatGPT for Robotics: Design Principles and Model Abilities, [Paper], [Code]
DIFFUSIONDB [Page], [Paper]
LAION-5B [Page], [Paper]
LAVIS [Page], [Paper]
Imagen Video [Page], [Paper]
Phenaki [Page], [Paper]
DREAMFUSION [Page], [Paper]
MAKE-A-VIDEO [Page], [Paper]
Stable Diffusion [Page], [Paper]
NUWA-Infinity [Page], [Paper]
Parti [Page], [Code]
Imagen [Page], [Paper]
Gato: A Generalist Agent, [Paper]
PaLM: Scaling Language Modeling with Pathways, [Paper]
DALL·E 2 [Page], [Paper]
SCENIC: A JAX Library for Computer Vision Research and Beyond, [Code]
V-L joint learning study (with good tables): [METER], [Kaleido-BERT]
Attention is all you need, [Paper]
CLIP [Page], [Paper], [Code], [arXiv]
DALL·E [Page], [Code], [Paper]
huggingface/transformers
Kyubyong/transformer, TF
jadore801120/attention-is-all-you-need-pytorch, Torch
krasserm/fairseq-image-captioning
PyTorch Transformers Tutorials
ictnlp/awesome-transformer
basicv8vc/awesome-transformer
dk-liang/Awesome-Visual-Transformer
yuewang-cuhk/awesome-vision-language-pretraining-papers

Survey

(arXiv 2023.2) TRANSFORMER-BASED SENSOR FUSION FOR AUTONOMOUS DRIVING: A SURVEY, [Paper], [Page]
(arXiv 2023.2) Deep Learning for Video-Text Retrieval: a Review, [Paper]
(arXiv 2023.2) Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey, [Paper]
(arXiv 2023.2) Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey, [Paper]
(arXiv 2023.2) Knowledge Distillation in Vision Transformers: A Critical Review, [Paper]
(arXiv 2023.2) A Survey on Efficient Training of Transformers, [Paper]
(arXiv 2023.1) ChatGPT is not all you need. A State of the Art Review of large Generative AI models, [Paper]
(arXiv 2022.12) Transformers in Action Recognition: A Review on Temporal Modeling, [Paper]
(arXiv 2022.11) Vision Transformers in Medical Imaging: A Review, [Paper]
(arXiv 2022.11) A survey on knowledge-enhanced multimodal learning, [Paper]
(arXiv 2022.10) Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, [Paper]
(arXiv 2022.10) A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective, [Paper]
(arXiv 2022.09) VISION TRANSFORMERS FOR ACTION RECOGNITION: A SURVEY, [Paper]
(arXiv 2022.09) Transformers in Remote Sensing: A Survey, [Paper], [Code]
(arXiv 2022.08) 3D Vision with Transformers: A Survey, [Paper], [Code]
(arXiv 2022.08) A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond, [Paper]
(arXiv 2022.07) Vision Transformers: State of the Art and Research Challenges, [Paper]
(arXiv 2022.07) SELF-SUPERVISED LEARNING FOR VIDEOS: A SURVEY, [Paper]
(arXiv 2022.06) Multimodal Learning with Transformers: A Survey, [Paper]
(arXiv 2022.05) Vision Transformer: Vit and its Derivatives, [Paper]
(arXiv 2022.05) Transformers in 3D Point Clouds: A Survey, [Paper]
(arXiv 2022.04) Visual Attention Methods in Deep Learning: An In-Depth Survey, [Paper]
(arXiv 2022.04) Vision-and-Language Pretrained Models: A Survey, [Paper]
(arXiv 2022.03) A Roadmap for Big Model, [Paper]
(arXiv 2022.03) Transformers Meet Visual Learning Understanding: A Comprehensive Review, [Paper](https://arxiv.org/pdf/2203.12944.pdf)
(arXiv 2022.03) Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work, [Paper], [Project]
(arXiv 2022.02) A Survey of Vision-Language Pre-Trained Models, [Paper]
(arXiv 2022.02) VLP: A Survey on Vision-Language Pre-training, [Paper]
(arXiv 2022.02) Transformer for Graphs: An Overview from Architecture Perspective, [Paper]
(arXiv 2022.01) Video Transformers: A Survey, [Paper]
(arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP MLP, [Paper]
(arXiv 2021.11) A Survey of Visual Transformers, [Paper]
(arXiv 2021.09) Survey: Transformer based Video-Language Pre-training, [Paper]
(arXiv 2021.06) A Survey of Transformers, [Paper]
(arXiv 2021.06) Attention mechanisms and deep learning for machine vision: A survey of the state of the art, [Paper]
(arXiv 2021.06) Pre-Trained Models: Past, Present and Future, [Paper]
(arXiv 2021.05) Can Attention Enable MLPs To Catch Up With CNNs? [Paper]
(arXiv 2021.03) A Practical Survey on Faster and Lighter Transformers, [Paper]
(arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]
(arXiv 2021.01) A Survey on Visual Transformer, [Paper]
(arXiv 2020.9) Efficient Transformers: A Survey, [Paper]
(arXiv 2020.1) Transformers in Vision: A Survey, [Paper]

Recent Papers

2023.8

(arXiv 2023.8) VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control, [Paper], [Project]

2023.5

(arXiv 2023.5) Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields, [Paper]

2023.3

(arXiv 2023.3) Query-Dependent Video Representation for Moment Retrieval and Highlight Detection, [Paper], [Code]

2023.2

(arXiv 2023.2) Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities, [Paper]
(arXiv 2023.2) KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer, [Paper], [Code]
(arXiv 2023.2) HUMAN MOTIONFORMER: TRANSFERRING HUMAN MOTIONS WITH VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2023.2) Aligning Text-to-Image Models using Human Feedback, [Paper]
(arXiv 2023.2) Controlled and Conditional Text to Image Generation with Diffusion Prior, [Paper]
(arXiv 2023.2) Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? [Paper], [Code]
(arXiv 2023.2) OBJECT-CENTRIC VIDEO PREDICTION VIA DECOUPLING OF OBJECT DYNAMICS AND INTERACTIONS, [Paper], [Project]
(arXiv 2023.2) Distribution Normalization: An “Effortless” Test-Time Augmentation for Contrastively Learned Visual-language Models, [Paper], [Code]
(arXiv 2023.2) Teaching CLIP to Count to Ten, [Paper], [Project]
(arXiv 2023.2) Designing an Encoder for Fast Personalization of Text-to-Image Models, [Paper], [Project]
(arXiv 2023.2) Side Adapter Network for Open-Vocabulary Semantic Segmentation, [Paper], [Code]
(arXiv 2023.2) Learning Visual Representations via Language-Guided Sampling, [Paper]
(arXiv 2023.2) VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion, [Paper], [Code]
(arXiv 2023.2) Language-Driven Representation Learning for Robotics, [Paper], [Project]
(arXiv 2023.2) A Convolutional Vision Transformer for Semantic Segmentation of Side-Scan Sonar Data, [Paper], [Code]
(arXiv 2023.2) Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN, [Paper], [Code]
(arXiv 2023.2) VIEWCO: DISCOVERING TEXT-SUPERVISED SEGMENTATION MASKS VIA MULTI-VIEW SEMANTIC CONSISTENCY, [Paper], [Code]
(arXiv 2023.2) CertViT: Certified Robustness of Pre-Trained Vision Transformers, [Paper], [Code]
(arXiv 2023.2) Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions, [Paper]
(arXiv 2023.2) MaskedKD: Efficient Distillation of Vision Transformers with Masked Images, [Paper]
(arXiv 2023.2) A General Visual Representation Guided Framework with Global Affinity for Weakly Supervised Salient Object Detection, [Paper]
(arXiv 2023.2) ViTA: A Vision Transformer Inference Accelerator for Edge Applications, [Paper]
(arXiv 2023.2) Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer, [Paper], [Code]
(arXiv 2023.2) A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning, [Paper]
(arXiv 2023.2) StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization, [Paper]
(arXiv 2023.2) Meta Style Adversarial Training for Cross-Domain Few-Shot Learning, [Paper]
(arXiv 2023.2) HYNETER: HYBRID NETWORK TRANSFORMER FOR OBJECT DETECTION, [Paper]
(arXiv 2023.2) STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training, [Paper]
(arXiv 2023.2) Constraint and Union for Partially-Supervised Temporal Sentence Grounding, [Paper]
(arXiv 2023.2) STB-VMM: Swin Transformer Based Video Motion Magnification, [Paper]
(arXiv 2023.2) Fashion Image Retrieval with Multi-Granular Alignment, [Paper]
(arXiv 2023.2) LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation, [Paper]
(arXiv 2023.2) CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension, [Paper], [Code]
(arXiv 2023.2) MaskSketch: Unpaired Structure-guided Masked Image Generation, [Paper]
(arXiv 2023.2) Single Motion Diffusion, [Paper], [Code]
(arXiv 2023.2) Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction, [Paper], [Code]
(arXiv 2023.2) ANSEL Photobot: A Robot Event Photographer with Semantic Intelligence, [Paper]
(arXiv 2023.2) ForceFormer: Exploring Social Force and Transformer for Pedestrian Trajectory Prediction, [Paper]
(arXiv 2023.2) Video Probabilistic Diffusion Models in Projected Latent Space, [Paper]
(arXiv 2023.2) Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation, [Paper], [Code]
(arXiv 2023.2) Learning to Substitute Ingredients in Recipes, [Paper]
(arXiv 2023.2) Energy Transformer, [Paper]
(arXiv 2023.2) Efficiency 360: Efficient Vision Transformers, [Paper]
(arXiv 2023.2) A-la-carte Prompt Tuning (APT): Combining Distinct Data Via Composable Prompting, [Paper]
(arXiv 2023.2) Effective Data Augmentation With Diffusion Models, [Paper], [Project]
(arXiv 2023.2) PRedItOR: Text Guided Image Editing with Diffusion Prior, [Paper]
(arXiv 2023.2) TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary One-Shot Image Generation, [Paper]
(arXiv 2023.2) Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection, [Paper]
(arXiv 2023.2) MINOTAUR: Multi-task Video Grounding From Multimodal Queries, [Paper]
(arXiv 2023.2) Towards Efficient Visual Adaption via Structural Re-parameterization, [Paper], [Code]
(arXiv 2023.2) Efficient 3D Object Reconstruction using Visual Transformers, [Paper]
(arXiv 2023.2) Retrieval-augmented Image Captioning, [Paper]
(arXiv 2023.2) Robust Human Motion Forecasting using Transformer-based Model, [Paper]
(arXiv 2023.2) VQ3D: Learning a 3D-Aware Generative Model on ImageNet, [Paper], [Project]
(arXiv 2023.2) UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training, [Paper], [Code]
(arXiv 2023.2) A THEORETICAL UNDERSTANDING OF SHALLOW VISION TRANSFORMERS: LEARNING, GENERALIZATION, AND SAMPLE COMPLEXITY, [Paper]
(arXiv 2023.2) A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models, [Paper]
(arXiv 2023.2) Generalized Few-Shot Continual Learning with Contrastive Mixture of Adapters, [Paper], [Code]
(arXiv 2023.2) Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation, [Paper]
(arXiv 2023.2) Towards Local Visual Modeling for Image Captioning, [Paper], [Code]
(arXiv 2023.2) CLIP-RR: IMPROVED CLIP NETWORK FOR RELATION-FOCUSED CROSS-MODAL INFORMATION RETRIEVAL, [Paper]
(arXiv 2023.2) Anticipating Next Active Objects for Egocentric Videos, [Paper], [Code]
(arXiv 2023.2) UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling, [Paper], [Code]
(arXiv 2023.2) TEAM DETR: GUIDE QUERIES AS A PROFESSIONAL TEAM IN DETECTION TRANSFORMERS, [Paper], [Code]
(arXiv 2023.2) ConceptFusion: Open-set Multimodal 3D Mapping, [Paper], [Project]
(arXiv 2023.2) Team Triple-Check at Factify 2: Parameter-Efficient Large Foundation Models with Feature Representations for Multi-Modal Fact Verification, [Paper], [Code]
(arXiv 2023.2) PolyFormer: Referring Image Segmentation as Sequential Polygon Generation, [Paper]
(arXiv 2023.2) Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation, [Paper]
(arXiv 2023.2) TFormer: A Transmission-Friendly ViT Model for IoT Devices, [Paper], [Code]
(arXiv 2023.2) Adding Conditional Control to Text-to-Image Diffusion Models, [Paper], [Code]
(arXiv 2023.2) Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames, [Paper]
(arXiv 2023.2) IS MULTI-MODAL VISION SUPERVISION BENEFICIAL TO LANGUAGE? [Paper]
(arXiv 2023.2) Data-Driven Stochastic Motion Evaluation and Optimization with Image by Spatially-Aligned Temporal Encoding, [Paper]
(arXiv 2023.2) Scaling Vision Transformers to 22 Billion Parameters, [Paper]
(arXiv 2023.2) Adapting Pre-trained Vision Transformers from 2D to 3D through Weight Inflation Improves Medical Image Segmentation, [Paper], [Code]
(arXiv 2023.2) Mitigating Bias in Visual Transformers via Targeted Alignment, [Paper]
(arXiv 2023.2) IH-ViT: Vision Transformer-based Integrated Circuit Appearance Defect Detection, [Paper]
(arXiv 2023.2) Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning, [Paper]
(arXiv 2023.2) Learning by Asking for Embodied Visual Navigation and Task Completion, [Paper]
(arXiv 2023.2) Reversible Vision Transformers, [Paper], [Code1], [Code2]
(arXiv 2023.2) Neural Congealing: Aligning Images to a Joint Semantic Atlas, [Paper], [Project]
(arXiv 2023.2) Adversarial Prompting for Black Box Foundation Models, [Paper]
(arXiv 2023.2) Understanding Why ViT Trains Badly on Small Datasets: An Intuitive Perspective, [Paper], [Code]
(arXiv 2023.2) CROSS-LAYER RETROSPECTIVE RETRIEVING VIA LAYER ATTENTION, [Paper], [Code]
(arXiv 2023.2) Convolutional Neural Networks Trained to Identify Words Provide a Good Account of Visual Form Priming Effects, [Paper]
(arXiv 2023.2) Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models, [Paper]
(arXiv 2023.2) OSRT: Omnidirectional Image Super-Resolution with Distortion-aware Transformer, [Paper]
(arXiv 2023.2) Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval, [Paper], [Code]
(arXiv 2023.2) SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation, [Paper]
(arXiv 2023.2) PhysFormer++: Facial Video-based Physiological Measurement with SlowFast Temporal Difference Transformer, [Paper]
(arXiv 2023.2) Scaling Self-Supervised End-to-End Driving with Multi-View Attention Learning, [Paper]
(arXiv 2023.2) HumanMAC: Masked Motion Completion for Human Motion Prediction, [Paper], [Project]
(arXiv 2023.2) LAMPP: Language Models as Probabilistic Priors for Perception and Action, [Paper]
(arXiv 2023.2) Zero-Shot Robot Manipulation from Passive Human Videos, [Paper], [Project]
(arXiv 2023.2) MixFormer: End-to-End Tracking with Iterative Mixed Attention, [Paper], [Code]
(arXiv 2023.2) LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval, [Paper]
(arXiv 2023.2) V1T: large-scale mouse V1 response prediction using a Vision Transformer, [Paper]
(arXiv 2023.2) AIM: ADAPTING IMAGE MODELS FOR EFFICIENT VIDEO ACTION RECOGNITION, [Paper], [Project]
(arXiv 2023.2) KDEformer: Accelerating Transformers via Kernel Density Estimation, [Paper], [Code]
(arXiv 2023.2) Semantic-Guided Image Augmentation with Pre-trained Models, [Paper]
(arXiv 2023.2) X-ReID: Cross-Instance Transformer for Identity-Level Person Re-Identification, [Paper]
(arXiv 2023.2) MOMA: Distill from Self-Supervised Teachers, [Paper]
(arXiv 2023.2) Learning to Agree on Vision Attention for Visual Commonsense Reasoning, [Paper]
(arXiv 2023.2) Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer, [Paper], [Code]
(arXiv 2023.2) LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers, [Paper]
(arXiv 2023.2) Oscillation-free Quantization for Low-bit Vision Transformers, [Paper]
(arXiv 2023.2) Design Booster: A Text-Guided Diffusion Model for Image Translation with Spatial Layout Preservation, [Paper]
(arXiv 2023.2) Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining, [Paper], [Code]
(arXiv 2023.2) Leaving Reality to Imagination: Robust Classification via Generated Datasets, [Paper], [Code]
(arXiv 2023.2) CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets, [Paper], [Code]
(arXiv 2023.2) Zero-shot Image-to-Image Translation, [Paper], [Project]
(arXiv 2023.2) Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers, [Paper]
(arXiv 2023.2) EXPLICIT BOX DETECTION UNIFIES END-TO-END MULTI-PERSON POSE ESTIMATION, [Paper], [Code]
(arXiv 2023.2) CFFT-GAN: Cross-domain Feature Fusion Transformer for Exemplar-based Image Translation, [Paper]
(arXiv 2023.2) DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps, [Paper]
(arXiv 2023.2) CVTNet: A Cross-View Transformer Network for Place Recognition Using LiDAR Data, [Paper], [Code]
(arXiv 2023.2) DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2023.2) HDFormer: High-order Directed Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2023.2) IC^3: Image Captioning by Committee Consensus, [Paper], [Code]
(arXiv 2023.2) Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt, [Paper]
(arXiv 2023.2) QR-CLIP: Introducing Explicit Open-World Knowledge for Location and Time Reasoning, [Paper]
(arXiv 2023.2) Vision Transformer-based Feature Extraction for Generalized Zero-Shot Learning, [Paper]
(arXiv 2023.2) Multimodal Chain-of-Thought Reasoning in Language Models, [Paper], [Code]
(arXiv 2023.2) CLIPood: Generalizing CLIP to Out-of-Distributions, [Paper]
(arXiv 2023.2) Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment, [Paper]
(arXiv 2023.2) The geometry of hidden representations of large transformer models, [Paper]
(arXiv 2023.2) Debiasing Vision-Language Models via Biased Prompts, [Paper], [Code]
(arXiv 2023.2) COMPOSITIONAL PROMPT TUNING WITH MOTION CUES FOR OPEN-VOCABULARY VIDEO RELATION DETECTION, [Paper], [Code]
(arXiv 2023.2) mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video, [Paper], [Code]
(arXiv 2023.2) Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization, [Paper]
(arXiv 2023.2) ADAPT: Action-aware Driving Caption Transformer, [Paper], [Code]

2023.1

(arXiv 2023.1) AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers, [Paper], [Code]
(arXiv 2023.1) EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata, [Paper], [Project]
(arXiv 2023.1) Head-Free Lightweight Semantic Segmentation with Linear Transformer, [Paper], [Code]
(arXiv 2023.1) Geometry-biased Transformers for Novel View Synthesis, [Paper], [Project]
(arXiv 2023.1) Continual Few-Shot Learning Using HyperTransformers, [Paper]
(arXiv 2023.1) SEMPPL: PREDICTING PSEUDO-LABELS FOR BETTER CONTRASTIVE REPRESENTATIONS, [Paper]
(arXiv 2023.1) Learning to Summarize Videos by Contrasting Clips, [Paper]
(arXiv 2023.1) Guiding Text-to-Image Diffusion Model Towards Grounded Generation, [Paper], [Project]
(arXiv 2023.1) Domain Expansion of Image Generators, [Paper], [Code]
(arXiv 2023.1) Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study, [Paper]
(arXiv 2023.1) Tracr: Compiled Transformers as a Laboratory for Interpretability, [Paper], [Code]
(arXiv 2023.1) CLIP the Gap: A Single Domain Generalization Approach for Object Detection, [Paper]
(arXiv 2023.1) Text to Point Cloud Localization with Relation-Enhanced Transformer, [Paper]
(arXiv 2023.1) GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer, [Paper]
(arXiv 2023.1) Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks, [Paper]
(arXiv 2023.1) ViTs for SITS: Vision Transformers for Satellite Image Time Series, [Paper], [Code]
(arXiv 2023.1) CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP, [Paper]
(arXiv 2023.1) A Large-Scale Outdoor Multi-modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction, [Paper], [Project]
(arXiv 2023.1) USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval, [Paper], [Code]
(arXiv 2023.1) SAT: Size-Aware Transformer for 3D Point Cloud Semantic Segmentation, [Paper]
(arXiv 2023.1) Masked Visual Reconstruction in Language Semantic Space, [Paper], [Code]
(arXiv 2023.1) Vision Learners Meet Web Image-Text Pairs, [Paper], [Code]
(arXiv 2023.1) GLIGEN: Open-Set Grounded Text-to-Image Generation, [Paper], [Project]
(arXiv 2023.1) Learning Customized Visual Models with Retrieval-Augmented Knowledge, [Paper], [Project]
(arXiv 2023.1) UATVR: Uncertainty-Adaptive Text-Video Retrieval, [Paper]
(arXiv 2023.1) Learning Aligned Cross-modal Representations for Referring Image Segmentation, [Paper]
(arXiv 2023.1) T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations, [Paper], [Project]
(arXiv 2023.1) DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets, [Paper], [Code]
(arXiv 2023.1) CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition, [Paper]
(arXiv 2023.1) Generating Templated Caption for Video Grounding, [Paper]
(arXiv 2023.1) Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes, [Paper]
(arXiv 2023.1) SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Network, [Paper]
(arXiv 2023.1) CLIPTER: Looking at the Bigger Picture in Scene Text Recognition, [Paper]
(arXiv 2023.1) Temporal Perceiving Video-Language Pre-training, [Paper]
(arXiv 2023.1) Joint Representation Learning for Text and 3D Point Cloud, [Paper], [Code]
(arXiv 2023.1) Effective End-to-End Vision Language Pretraining with Semantic Visual Loss, [Paper]
(arXiv 2023.1) PTA-Det: Point Transformer Associating Point cloud and Image for 3D Object Detection, [Paper]
(arXiv 2023.1) Face Recognition in the age of CLIP & Billion image datasets, [Paper]
(arXiv 2023.1) HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2023.1) Towards Models that Can See and Read, [Paper]
(arXiv 2023.1) Embodied Agents for Efficient Exploration and Smart Scene Description, [Paper]
(arXiv 2023.1) Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture, [Paper]
(arXiv 2023.1) Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition, [Paper]
(arXiv 2023.1) Multimodal Video Adapter for Parameter Efficient Video Text Retrieval, [Paper]
(arXiv 2023.1) Self Supervision Does Not Help Natural Language Supervision at Scale, [Paper]
(arXiv 2023.1) MULTI-TARGET MULTI-CAMERA VEHICLE TRACKING USING TRANSFORMER-BASED CAMERA LINK MODEL AND SPATIAL-TEMPORAL INFORMATION, [Paper]
(arXiv 2023.1) ATMAN: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation, [Paper]
(arXiv 2023.1) DDS: Decoupled Dynamic Scene-Graph Generation Network, [Paper], [Code]
(arXiv 2023.1) Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences, [Paper]
(arXiv 2023.1) Image Memorability Prediction with Vision Transformers, [Paper]
(arXiv 2023.1) HOLISTICALLY EXPLAINABLE VISION TRANSFORMERS, [Paper]
(arXiv 2023.1) FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer, [Paper]
(arXiv 2023.1) LEGO-Net: Learning Regular Rearrangements of Objects in Rooms, [Paper], [Project]
(arXiv 2023.1) Zorro: the masked multimodal transformer, [Paper]
(arXiv 2023.1) Towards Robust Video Instance Segmentation with Temporal-Aware Transformer, [Paper]
(arXiv 2023.1) Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision, [Paper], [Project]
(arXiv 2023.1) Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation, [Paper], [Code]
(arXiv 2023.1) Combined Use of Federated Learning and Image Encryption for Privacy-Preserving Image Classification with Vision Transformer, [Paper]
(arXiv 2023.1) Slice Transformer and Self-supervised Learning for 6DoF Localization in 3D Point Cloud Maps, [Paper]
(arXiv 2023.1) IMPROVING ACCURACY OF ZERO-SHOT ACTION RECOGNITION WITH HANDCRAFTED FEATURES, [Paper]
(arXiv 2023.1) Learning to View: Decision Transformers for Active Object Detection, [Paper]
(arXiv 2023.1) Visual Semantic Relatedness Dataset for Image Captioning, [Paper], [Code]
(arXiv 2023.1) VERSATILE NEURAL PROCESSES FOR LEARNING IMPLICIT NEURAL REPRESENTATIONS, [Paper], [Code]
(arXiv 2023.1) RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving, [Paper], [Code]
(arXiv 2023.1) Exploiting Optical Flow Guidance for Transformer-Based Video Inpainting, [Paper]
(arXiv 2023.1) Image Super-Resolution using Efficient Striped Window Transformer, [Paper], [Code]
(arXiv 2023.1) Out of Distribution Performance of State of Art Vision Model, [Paper], [Code]
(arXiv 2023.1) Compact Transformer Tracker with Correlative Masked Modeling, [Paper], [Code]
(arXiv 2023.1) Vision-Language Models Performing Zero-Shot Tasks Exhibit Gender-based Disparities, [Paper]
(arXiv 2023.1) Cut and Learn for Unsupervised Object Detection and Instance Segmentation, [Paper], [Code]
(arXiv 2023.1) Explaining Visual Biases as Words by Generating Captions, [Paper], [Code]
(arXiv 2023.1) Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring, [Paper], [Code]
(arXiv 2023.1) Multi-video Moment Ranking with Multimodal Clue, [Paper]
(arXiv 2023.1) SDF-FORMER: MONOCULAR SCENE RECONSTRUCTION WITH 3D SDF TRANSFORMERS, [Paper], [Project]
(arXiv 2023.1) Grounding Language Models to Images for Multimodal Generation, [Paper]
(arXiv 2023.1) Pseudo 3D Perception Transformer with Multi-level Confidence Optimization for Visual Commonsense Reasoning, [Paper]
(arXiv 2023.1) A Modular Multi-stage Lightweight Graph Transformer Network for Human Pose and Shape Estimation from 2D Human Pose, [Paper]
(arXiv 2023.1) Priors are Powerful: Improving a Transformer for Multi-camera 3D Detection with 2D Priors, [Paper]
(arXiv 2023.1) UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers, [Paper]
(arXiv 2023.1) Fairness-aware Vision Transformer via Debiased Self-Attention, [Paper]
(arXiv 2023.1) Anchor-Based Adversarially Robust Zero-Shot Learning Driven by Language, [Paper]
(arXiv 2023.1) Distilling Internet-Scale Vision-Language Models into Embodied Agents, [Paper]
(arXiv 2023.1) 6-DoF Robotic Grasping with Transformer, [Paper]
(arXiv 2023.1) Do Embodied Agents Dream of Pixelated Sheep?: Embodied Decision Making using Language Guided World Modelling, [Paper], [Project]
(arXiv 2023.1) GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis, [Paper], [Code]
(arXiv 2023.1) STAIR: Learning Sparse Text and Image Representation in Grounded Tokens, [Paper]
(arXiv 2023.1) Aerial Image Object Detection With Vision Transformer Detector (ViTDet), [Paper]
(arXiv 2023.1) Towards Vision Transformer Unrolling Fixed-Point Algorithm: a Case Study on Image Restoration, [Paper]
(arXiv 2023.1) Debiased Fine-Tuning for Vision-language Models by Prompt Regularization, [Paper], [Code]
(arXiv 2023.1) BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, [Paper], [Code]
(arXiv 2023.1) Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval, [Paper]
(arXiv 2023.1) SEAFORMER: SQUEEZE-ENHANCED AXIAL TRANSFORMER FOR MOBILE SEMANTIC SEGMENTATION, [Paper], [Code]
(arXiv 2023.1) Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding, [Paper], [Project]
(arXiv 2023.1) Multimodal Event Transformer for Image-guided Story Ending Generation, [Paper]
(arXiv 2023.1) Style-Aware Contrastive Learning for Multi-Style Image Captioning, [Paper]
(arXiv 2023.1) 3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models, [Paper]
(arXiv 2023.1) Semi-Parametric Video-Grounded Text Generation, [Paper]
(arXiv 2023.1) Robust Transformer with Locality Inductive Bias and Feature Normalization, [Paper]
(arXiv 2023.1) LEVERAGING THE THIRD DIMENSION IN CONTRASTIVE LEARNING, [Paper]
(arXiv 2023.1) Understanding Self-Supervised Pretraining with Part-Aware Representation Learning, [Paper]
(arXiv 2023.1) Hypergraph Transformer for Skeleton-based Action Recognition, [Paper]
(arXiv 2023.1) CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers, [Paper]
(arXiv 2023.1) InstructPix2Pix: Learning to Follow Image Editing Instructions, [Paper], [Code]
(arXiv 2023.1) OvarNet: Towards Open-vocabulary Object Attribute Recognition, [Paper], [Project]
(arXiv 2023.1) Token Transformer: Can class token help window-based transformer build better long-range interactions? [Paper]
(arXiv 2023.1) Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering, [Paper], [Code]
(arXiv 2023.1) FGAHOI: Fine-Grained Anchors for Human-Object Interaction Detection, [Paper], [Code]
(arXiv 2023.1) Parallel Reasoning Network for Human-Object Interaction Detection, [Paper]
(arXiv 2023.1) In Defense of Structural Symbolic Representation for Video Event-Relation Prediction, [Paper]
(arXiv 2023.1) Scene Synthesis from Human Motion, [Paper], [Project]

2022.12

(arXiv 2022.12) EVA: Exploring the Limits of Masked Visual Representation Learning at Scale, [Paper], [Code]
(arXiv 2022.12) OneFormer: One Transformer to Rule Universal Image Segmentation, [Paper], [Code]
(arXiv 2022.12) MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation, [Paper], [Project]
(arXiv 2022.12) Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality, [Paper], [Code]
(arXiv 2022.12) Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations, [Paper], [Code]
(arXiv 2022.12) CLIP-FLOW: CONTRASTIVE LEARNING BY SEMISUPERVISED ITERATIVE PSEUDO LABELING FOR OPTICAL FLOW ESTIMATION, [Paper]
(arXiv 2022.12) INSTRUCTION-FOLLOWING AGENTS WITH JOINTLY PRE-TRAINED VISION-LANGUAGE MODELS, [Paper], [Code]
(arXiv 2022.12) MetaFormer Baselines for Vision, [Paper], [Code]
(arXiv 2022.12) ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design, [Paper], [Code]
(arXiv 2022.12) FROM PLAY TO POLICY: CONDITIONAL BEHAVIOR GENERATION FROM UNCURATED ROBOT DATA, [Paper], [Project]
(arXiv 2022.12) Optimizing Prompts for Text-to-Image Generation, [Paper], [Code]
(arXiv 2022.12) Attentive Mask CLIP, [Paper]
(arXiv 2022.12) Rethinking Cooking State Recognition with Vision Transformers, [Paper]
(arXiv 2022.12) Enhancing Multi-modal and Multi-hop Question Answering via Structured Knowledge and Unified Retrieval-Generation, [Paper], [Code]
(arXiv 2022.12) MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks, [Paper], [Code]
(arXiv 2022.12) RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers, [Paper]
(arXiv 2022.12) WAVENHANCER: UNIFYING WAVELET AND TRANSFORMER FOR IMAGE ENHANCEMENT, [Paper]
(arXiv 2022.12) AUTOENCODERS AS CROSS-MODAL TEACHERS: CAN PRETRAINED 2D IMAGE TRANSFORMERS HELP 3D REPRESENTATION LEARNING?, [Paper], [Code]
(arXiv 2022.12) SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering, [Paper]
(arXiv 2022.12) Emergent Analogical Reasoning in Large Language Models, [Paper]
(arXiv 2022.12) Unleashing the Power of Visual Prompting At the Pixel Level, [Paper], [Code]
(arXiv 2022.12) Does CLIP Bind Concepts? Probing Compositionality in Large Image Models, [Paper]
(arXiv 2022.12) LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer, [Paper], [Code]
(arXiv 2022.12) Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?, [Paper]
(arXiv 2022.12) Benchmarking Spatial Relationships in Text-to-Image Generation, [Paper], [Project]
(arXiv 2022.12) MetaCLUE: Towards Comprehensive Visual Metaphors Research, [Paper], [Project]
(arXiv 2022.12) Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation, [Paper], [Code]
(arXiv 2022.12) Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment, [Paper]
(arXiv 2022.12) Does unsupervised grammar induction need pixels?, [Paper]
(arXiv 2022.12) Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble, [Paper]
(arXiv 2022.12) MAViC: Multimodal Active Learning for Video Captioning, [Paper]
(arXiv 2022.12) What Makes for Good Tokenizers in Vision Transformer?, [Paper]
(arXiv 2022.12) Not Just Pretty Pictures: Text-to-Image Generators Enable Interpretable Interventions for Robust Representations, [Paper], [Code]
(arXiv 2022.12) Generalized Decoding for Pixel, Image, and Language, [Paper], [Project]
(arXiv 2022.12) METEOR Guided Divergence for Video Captioning, [Paper], [Code]
(arXiv 2022.12) SLGTFORMER: AN ATTENTION-BASED APPROACH TO SIGN LANGUAGE RECOGNITION, [Paper], [Code]
(arXiv 2022.12) FROM IMAGES TO TEXTUAL PROMPTS: ZERO-SHOT VQA WITH FROZEN LARGE LANGUAGE MODELS, [Paper], [Code]
(arXiv 2022.12) 3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions, [Paper], [Code]
(arXiv 2022.12) Contrastive Language-Vision AI Models Pretrained on Web-Scraped Multimodal Data Exhibit Sexual Objectification Bias, [Paper]
(arXiv 2022.12) Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method, [Paper], [Code]
(arXiv 2022.12) Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation, [Paper], [Project]
(arXiv 2022.12) Beyond SOT: It’s Time to Track Multiple Generic Objects at Once, [Paper]
(arXiv 2022.12) KNOWLEDGE-DRIVEN SCENE PRIORS FOR SEMANTIC AUDIO-VISUAL EMBODIED NAVIGATION, [Paper]
(arXiv 2022.12) SegViT: Semantic Segmentation with Plain Vision Transformers, [Paper], [Code]
(arXiv 2022.12) Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features, [Paper]
(arXiv 2022.12) Point·E: A System for Generating 3D Point Clouds from Complex Prompts, [Paper], [Code]
(arXiv 2022.12) Inductive Attention for Video Action Anticipation, [Paper]
(arXiv 2022.12) Image-and-Language Understanding from Pixels Only, [Paper], [Code]
(arXiv 2022.12) FlexiViT: One Model for All Patch Sizes, [Paper], [Code]
(arXiv 2022.12) Unsupervised Object Localization: Observing the Background to Discover Objects, [Paper], [Code]
(arXiv 2022.12) Vision Transformers are Parameter-Efficient Audio-Visual Learners, [Paper], [Project]
(arXiv 2022.12) Full Contextual Attention for Multi-resolution Transformers in Semantic Segmentation, [Paper]
(arXiv 2022.12) DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention, [Paper]
(arXiv 2022.12) Enhanced Training of Query-Based Object Detection via Selective Query Recollection, [Paper], [Code]
(arXiv 2022.12) TEXT-GUIDED MASK-FREE LOCAL IMAGE RETOUCHING, [Paper]
(arXiv 2022.12) Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization, [Paper], [Code]
(arXiv 2022.12) One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers, [Paper]
(arXiv 2022.12) ConQueR: Query Contrast Voxel-DETR for 3D Object Detection, [Paper]
(arXiv 2022.12) Examining the Difference Among Transformers and CNNs with Explanation Methods, [Paper]
(arXiv 2022.12) Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding, [Paper], [Code]
(arXiv 2022.12) Dual-branch Cross-Patch Attention Learning for Group Affect Recognition, [Paper]
(arXiv 2022.12) Cross-Modal Similarity-Based Curriculum Learning for Image Captioning, [Paper]
(arXiv 2022.12) NLIP: Noise-robust Language-Image Pre-training, [Paper]
(arXiv 2022.12) LidarCLIP or: How I Learned to Talk to Point Clouds, [Paper], [Code]
(arXiv 2022.12) CLIPSEP: LEARNING TEXT-QUERIED SOUND SEPARATION WITH NOISY UNLABELED VIDEOS, [Paper]
(arXiv 2022.12) Reproducible scaling laws for contrastive language-image learning, [Paper], [Code]
(arXiv 2022.12) WHAT DO VISION TRANSFORMERS LEARN? A VISUAL EXPLORATION, [Paper]
(arXiv 2022.12) Self-Play and Self-Describe: Policy Adaptation with Vision-Language Foundation Models, [Paper], [Project]
(arXiv 2022.12) GPVIT: A HIGH RESOLUTION NON-HIERARCHICAL VISION TRANSFORMER WITH GROUP PROPAGATION, [Paper], [Code]
(arXiv 2022.12) Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders, [Paper], [Code]
(arXiv 2022.12) Parallel Queries for Human-Object Interaction Detection, [Paper]
(arXiv 2022.12) Structure-Guided Image Completion with Image-level and Object-level Semantic Discriminators, [Paper]
(arXiv 2022.12) Localized Latent Updates for Fine-Tuning Vision-Language Models, [Paper]
(arXiv 2022.12) CamoFormer: Masked Separable Attention for Camouflaged Object Detection, [Paper]
(arXiv 2022.12) FastMIM: Expediting Masked Image Modeling Pre-training for Vision, [Paper], [Code]
(arXiv 2022.12) OAMixer: Object-aware Mixing Layer for Vision Transformers, [Paper], [Code]
(arXiv 2022.12) Doubly Right Object Recognition: A Why Prompt for Visual Rationales, [Paper]
(arXiv 2022.12) RT-1: ROBOTICS TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE, [Paper], [Project]
(arXiv 2022.12) Egocentric Video Task Translation, [Paper]
(arXiv 2022.12) ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes, [Paper], [Project]
(arXiv 2022.12) Curriculum Learning Meets Weakly Supervised Modality Correlation Learning, [Paper]
(arXiv 2022.12) IMoS: Intent-Driven Full-Body Motion Synthesis for Human-Object Interactions, [Paper]
(arXiv 2022.12) MultiAct: Long-Term 3D Human Motion Generation from Multiple Action Labels, [Paper]
(arXiv 2022.12) A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning, [Paper]
(arXiv 2022.12) Beyond Object Recognition: A New Benchmark towards Object Concept Learning, [Paper], [Project]
(arXiv 2022.12) ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation, [Paper], [Code]
(arXiv 2022.12) Structured Vision-Language Pretraining for Computational Cooking, [Paper]
(arXiv 2022.12) MIME: Human-Aware 3D Scene Generation, [Paper], [Project]
(arXiv 2022.12) OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models, [Paper], [Code]
(arXiv 2022.12) Task Bias in Vision-Language Models, [Paper]
(arXiv 2022.12) Multi-Concept Customization of Text-to-Image Diffusion, [Paper], [Code]
(arXiv 2022.12) Few-View Object Reconstruction with Unknown Categories and Camera Poses, [Paper], [Project]
(arXiv 2022.12) Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning, [Paper], [Code]
(arXiv 2022.12) Learning Video Representations from Large Language Models, [Paper], [Project]
(arXiv 2022.12) Frozen CLIP Model is Efficient Point Cloud Backbone, [Paper]
(arXiv 2022.12) DialogCC: Large-scale Multi-Modal Dialogue Dataset, [Paper], [Project]
(arXiv 2022.12) Group Generalized Mean Pooling for Vision Transformer, [Paper]
(arXiv 2022.12) LEARNING DOMAIN INVARIANT PROMPT FOR VISION-LANGUAGE MODELS, [Paper]
(arXiv 2022.12) LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models, [Paper]
(arXiv 2022.12) Hyperbolic Contrastive Learning for Visual Representations beyond Objects, [Paper], [Code]

2022.11

(arXiv 2022.11) Texts as Images in Prompt Tuning for Multi-Label Image Recognition, [Paper], [Code]
(arXiv 2022.11) Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation, [Paper]
(arXiv 2022.11) InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images, [Paper]
(arXiv 2022.11) VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval, [Paper], [Code]
(arXiv 2022.11) Completing point cloud from few points by Wasserstein GAN and Transformers, [Paper], [Code]
(arXiv 2022.11) Integrally Pre-Trained Transformer Pyramid Networks, [Paper], [Code]
(arXiv 2022.11) Data Augmentation Vision Transformer for Fine-grained Image Classification, [Paper]
(arXiv 2022.11) DETRs with Collaborative Hybrid Assignments Training, [Paper], [Code]
(arXiv 2022.11) Open-vocabulary Attribute Detection, [Paper], [Project]
(arXiv 2022.11) Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation, [Paper], [Code]
(arXiv 2022.11) Inversion-Based Creativity Transfer with Diffusion Models, [Paper], [Code]
(arXiv 2022.11) CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning, [Paper]
(arXiv 2022.11) SVFormer: Semi-supervised Video Transformer for Action Recognition, [Paper], [Code]
(arXiv 2022.11) Generalizable Implicit Neural Representations via Instance Pattern Composers, [Paper]
(arXiv 2022.11) Improving Visual-textual Sentiment Analysis by Fusing Expert Features, [Paper]
(arXiv 2022.11) Self-Supervised Learning based on Heat Equation, [Paper]
(arXiv 2022.11) Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors, [Paper]
(arXiv 2022.11) Paint by Example: Exemplar-based Image Editing with Diffusion Models, [Paper], [Code]
(arXiv 2022.11) Human or Machine? Turing Tests for Vision and Language, [Paper], [Code]
(arXiv 2022.11) Teach-DETR: Better Training DETR with Teachers, [Paper], [Code]
(arXiv 2022.11) Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition, [Paper]
(arXiv 2022.11) X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks, [Paper], [Code]
(arXiv 2022.11) Aligning Source Visual and Target Language Domains for Unpaired Video Captioning, [Paper]
(arXiv 2022.11) On the Transferability of Visual Features in Generalized Zero-Shot Learning, [Paper], [Code]
(arXiv 2022.11) Generalizable Industrial Visual Anomaly Detection with Self-Induction Vision Transformer, [Paper]
(arXiv 2022.11) Transformer Based Multi-Grained Features for Unsupervised Person Re-Identification, [Paper], [Code]
(arXiv 2022.11) Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring, [Paper], [Code]
(arXiv 2022.11) Event Transformer+. A multi-purpose solution for efficient event data processing, [Paper]
(arXiv 2022.11) MagicPony: Learning Articulated 3D Animals in the Wild, [Paper], [Project]
(arXiv 2022.11) Gated Class-Attention with Cascaded Feature Drift Compensation for Exemplar-free Continual Learning of Vision Transformers, [Paper], [Code]
(arXiv 2022.11) Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations, [Paper], [Code]
(arXiv 2022.11) N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution, [Paper]
(arXiv 2022.11) Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models, [Paper], [Code]
(arXiv 2022.11) Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training, [Paper], [Code]
(arXiv 2022.11) Unifying Vision-Language Representation Space with Single-tower Transformer, [Paper]
(arXiv 2022.11) DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting, [Paper]
(arXiv 2022.11) Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference, [Paper]
(arXiv 2022.11) CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering, [Paper]
(arXiv 2022.11) Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics, [Paper]
(arXiv 2022.11) A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset, [Paper]
(arXiv 2022.11) Efficient Video Representation Learning via Masked Video Modeling with Motion-centric Token Selection, [Paper]
(arXiv 2022.11) DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization, [Paper]
(arXiv 2022.11) TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer, [Paper]
(arXiv 2022.11) Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models, [Paper], [Code]
(arXiv 2022.11) Are Out-of-Distribution Detection Methods Reliable?, [Paper]
(arXiv 2022.11) GLT-T: Global-Local Transformer Voting for 3D Single Object Tracking in Point Clouds, [Paper], [Code]
(arXiv 2022.11) CROSS-MODAL CONTRASTIVE LEARNING FOR ROBUST REASONING IN VQA, [Paper], [Code]
(arXiv 2022.11) LISA: Localized Image Stylization with Audio via Implicit Neural Representation, [Paper]
(arXiv 2022.11) MagicVideo: Efficient Video Generation With Latent Diffusion Models, [Paper], [Code]
(arXiv 2022.11) DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning, [Paper]
(arXiv 2022.11) Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation, [Paper]
(arXiv 2022.11) Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification, [Paper]
(arXiv 2022.11) Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation, [Paper]
(arXiv 2022.11) You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model, [Paper]
(arXiv 2022.11) Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers, [Paper]
(arXiv 2022.11) FlowLens: Seeing Beyond the FoV via Flow-guided Clip-Recurrent Transformer, [Paper], [Code]
(arXiv 2022.11) PS-Transformer: Learning Sparse Photometric Stereo Network using Self-Attention Mechanism, [Paper]
(arXiv 2022.11) On the Robustness, Generalization, and Forgetting of Shape-Texture Debiased Continual Learning, [Paper]
(arXiv 2022.11) Vision Transformer with Super Token Sampling, [Paper], [Code]
(arXiv 2022.11) Detect Only What You Specify: Object Detection with Linguistic Target, [Paper]
(arXiv 2022.11) Visual Programming: Compositional visual reasoning without training, [Paper], [Project]
(arXiv 2022.11) ClipCrop: Conditioned Cropping Driven by Vision-Language Model, [Paper]
(arXiv 2022.11) SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training, [Paper]
(arXiv 2022.11) Blur Interpolation Transformer for Real-World Motion from Blur, [Paper]
(arXiv 2022.11) Mean Shift Mask Transformer for Unseen Object Instance Segmentation, [Paper], [Code]
(arXiv 2022.11) PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning, [Paper], [Code]
(arXiv 2022.11) Exploring Discrete Diffusion Models for Image Captioning, [Paper], [Code]
(arXiv 2022.11) PERCEIVER-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention, [Paper], [Code]
(arXiv 2022.11) Multitask Vision-Language Prompt Tuning, [Paper], [Code]
(arXiv 2022.11) Teaching Structured Vision & Language Concepts to Vision & Language Models, [Paper]
(arXiv 2022.11) WEIGHTED ENSEMBLE SELF-SUPERVISED LEARNING, [Paper]
(arXiv 2022.11) BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision, [Paper]
(arXiv 2022.11) Task Residual for Tuning Vision-Language Models, [Paper], [Code]
(arXiv 2022.11) αDARTS Once More: Enhancing Differentiable Architecture Search by Masked Image Modeling, [Paper]
(arXiv 2022.11) Delving into Transformer for Incremental Semantic Segmentation, [Paper]
(arXiv 2022.11) DETRDistill: A Universal Knowledge Distillation Framework for DETR-families, [Paper]
(arXiv 2022.11) PromptCap: Prompt-Guided Task-Aware Image Captioning, [Paper]
(arXiv 2022.11) UNIFORMERV2: SPATIOTEMPORAL LEARNING BY ARMING IMAGE VITS WITH VIDEO UNIFORMER, [Paper], [Code]
(arXiv 2022.11) Masked Reconstruction Contrastive Learning with Information Bottleneck Principle, [Paper]
(arXiv 2022.11) Listen, denoise, action! Audio-driven motion synthesis with diffusion models, [Paper], [Project]
(arXiv 2022.11) ConStruct-VL: Data-Free Continual Structured VL Concepts Learning, [Paper]
(arXiv 2022.11) How to Fine-Tune Vision Models with SGD, [Paper]
(arXiv 2022.11) Progressive Tree-Structured Prototype Network for End-to-End Image Captioning, [Paper], [Code]
(arXiv 2022.11) CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge, [Paper], [Code]
(arXiv 2022.11) Visual Commonsense-aware Representation Network for Video Captioning, [Paper], [Code]
(arXiv 2022.11) Language Conditioned Spatial Relation Reasoning for 3D Object Grounding, [Paper], [Code]
(arXiv 2022.11) HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors, [Paper], [Code]
(arXiv 2022.11) Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information, [Paper], [Code]
(arXiv 2022.11) Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks, [Paper], [Code]
(arXiv 2022.11) D^3ETR: Decoder Distillation for Detection Transformer, [Paper]
(arXiv 2022.11) CAE v2: Context Autoencoder with CLIP Target, [Paper]
(arXiv 2022.11) Cross-Modal Adapter for Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.11) TOKEN TURING MACHINES, [Paper]
(arXiv 2022.11) WILL LARGE-SCALE GENERATIVE MODELS CORRUPT FUTURE DATASETS?, [Paper], [Code]
(arXiv 2022.11) Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application, [Paper]
(arXiv 2022.11) SATVSR: Scenario Adaptive Transformer for Cross Scenarios Video Super-Resolution, [Paper]
(arXiv 2022.11) TransCC: Transformer-based Multiple Illuminant Color Constancy Using Multitask Learning, [Paper]
(arXiv 2022.11) Stare at What You See: Masked Image Modeling without Reconstruction, [Paper], [Code]
(arXiv 2022.11) HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers, [Paper]
(arXiv 2022.11) Cross-domain Federated Adaptive Prompt Tuning for CLIP, [Paper]
(arXiv 2022.11) YORO - Lightweight End to End Visual Grounding, [Paper]
(arXiv 2022.11) Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling, [Paper]
(arXiv 2022.11) BiViT: Extremely Compressed Binary Vision Transformer, [Paper]
(arXiv 2022.11) ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations, [Paper]
(arXiv 2022.11) Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment, [Paper]
(arXiv 2022.11) Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding, [Paper], [Project]
(arXiv 2022.11) Enhancing Few-Shot Image Classification with Cosine Transformer, [Paper], [Code]
(arXiv 2022.11) SCOTCH and SODA: A Transformer Video Shadow Detection Framework, [Paper]
(arXiv 2022.11) AU-Aware Vision Transformers for Biased Facial Expression Recognition, [Paper]
(arXiv 2022.11) Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces, [Paper], [Code]
(arXiv 2022.11) Large-Scale Bidirectional Training for Zero-Shot Image Captioning, [Paper]
(arXiv 2022.11) Grafting Pre-trained Models for Multimodal Headline Generation, [Paper]
(arXiv 2022.11) CabViT: Cross Attention among Blocks for Vision Transformer, [Paper], [Code]
(arXiv 2022.11) Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization, [Paper]
(arXiv 2022.11) SSGVS: Semantic Scene Graph-to-Video Synthesis, [Paper]
(arXiv 2022.11) One-Time Model Adaptation to Heterogeneous Clients: An Intra-Client and Inter-Image Attention Design, [Paper]
(arXiv 2022.11) An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention, [Paper]
(arXiv 2022.11) Zero-shot Visual Commonsense Immorality Prediction, [Paper], [Code]
(arXiv 2022.11) Hyperbolic Cosine Transformer for LiDAR 3D Object Detection, [Paper]
(arXiv 2022.11) Training a Vision Transformer from scratch in less than 24 hours with 1 GPU, [Paper], [Code]
(arXiv 2022.11) ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention, [Paper]
(arXiv 2022.11) SimOn: A Simple Framework for Online Temporal Action Localization, [Paper], [Code]
(arXiv 2022.11) ERNIE-UNIX^2: A UNIFIED CROSS-LINGUAL CROSS-MODAL FRAMEWORK FOR UNDERSTANDING AND GENERATION, [Paper]
(arXiv 2022.11) SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation, [Paper]
(arXiv 2022.11) Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions, [Paper]
(arXiv 2022.11) VieCap4H - VLSP 2021: ObjectAoA - Enhancing performance of Object Relation Transformer with Attention on Attention for Vietnamese image captioning, [Paper]
(arXiv 2022.11) Watching the News: Towards VideoQA Models that can Read, [Paper], [Project]
(arXiv 2022.11) Efficient Joint Detection and Multiple Object Tracking with Spatially Aware Transformer, [Paper]
(arXiv 2022.11) Demystify Transformers & Convolutions in Modern Image Deep Networks, [Paper], [Code]
(arXiv 2022.11) InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions, [Paper], [Code]
(arXiv 2022.11) DEPTHFORMER: MULTIMODAL POSITIONAL ENCODINGS AND CROSS-INPUT ATTENTION FOR TRANSFORMER-BASED SEGMENTATION NETWORKS, [Paper]
(arXiv 2022.11) Sequential Transformer for End-to-End Person Search, [Paper]
(arXiv 2022.11) Prompting Large Pre-trained Vision-Language Models For Compositional Concept Learning, [Paper]
(arXiv 2022.11) CASA: Category-agnostic Skeletal Animal Reconstruction, [Paper]
(arXiv 2022.11) ViT-CX: Causal Explanation of Vision Transformers, [Paper]
(arXiv 2022.11) Disentangling Content and Motion for Text-Based Neural Video Manipulation, [Paper]
(arXiv 2022.11) Efficient Multi-order Gated Aggregation Network, [Paper]
(arXiv 2022.11) CLOP: Video-and-Language Pre-Training with Knowledge Regularizations, [Paper]
(arXiv 2022.11) MSMG-Net: Multi-scale Multi-grained Supervised Networks for Multi-task Image Manipulation Detection and Localization, [Paper]
(arXiv 2022.11) Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models, [Paper], [Code]
(arXiv 2022.11) Zero-shot Video Moment Retrieval With Off-the-Shelf Models, [Paper]
(arXiv 2022.11) Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization, [Paper]
(arXiv 2022.11) A Transformer Architecture for Online Gesture Recognition of Mathematical Expressions, [Paper]
(arXiv 2022.11) Evaluating and Improving Factuality in Multimodal Abstractive Summarization, [Paper], [Code]
(arXiv 2022.11) RCDPT: RADAR-CAMERA FUSION DENSE PREDICTION TRANSFORMER, [Paper]
(arXiv 2022.11) Video Event Extraction via Tracking Visual States of Arguments, [Paper]
(arXiv 2022.11) The Lottery Ticket Hypothesis for Vision Transformers, [Paper]
(arXiv 2022.11) TEXTCRAFT: ZERO-SHOT GENERATION OF HIGH-FIDELITY AND DIVERSE SHAPES FROM TEXT, [Paper]
(arXiv 2022.11) PolyBuilding: Polygon Transformer for End-to-End Building Extraction, [Paper]
(arXiv 2022.11) RETHINKING HIERARCHIES IN PRE-TRAINED PLAIN VISION TRANSFORMER, [Paper], [Code]
(arXiv 2022.11) SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency, [Paper]
(arXiv 2022.11) Could Giant Pretrained Image Models Extract Universal Representations?, [Paper]
(arXiv 2022.11) MAEDAY: MAE for few and zero shot AnomalY-Detection, [Paper], [Code]
(arXiv 2022.11) Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations, [Paper]
(arXiv 2022.11) Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding, [Paper], [Code]
(arXiv 2022.11) SpaText: Spatio-Textual Representation for Controllable Image Generation, [Paper], [Project]
(arXiv 2022.11) Learning 3D Scene Priors with 2D Supervision, [Paper], [Project]
(arXiv 2022.11) PoET: Pose Estimation Transformer for Single-View, Multi-Object 6D Pose Estimation, [Paper], [Code]
(arXiv 2022.11) Spatial-Spectral Transformer for Hyperspectral Image Denoising, [Paper], [Code]
(arXiv 2022.11) Training Vision-Language Models with Less Bimodal Supervision, [Paper]
(arXiv 2022.11) Text-Only Training for Image Captioning using Noise-Injected CLIP, [Paper], [Code]
(arXiv 2022.11) Attention-based Neural Cellular Automata, [Paper]
(arXiv 2022.11) eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers, [Paper], [Code]
(arXiv 2022.11) Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese, [Paper], [Code]
(arXiv 2022.11) P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection, [Paper]
(arXiv 2022.11) tSF: Transformer-based Semantic Filter for Few-Shot Learning, [Paper]
(arXiv 2022.11) WITT: A WIRELESS IMAGE TRANSMISSION TRANSFORMER FOR SEMANTIC COMMUNICATIONS, [Paper], [Code]
(arXiv 2022.11) Pair DETR: Contrastive Learning Speeds Up DETR Training, [Paper]
(arXiv 2022.11) Interaction Visual Transformer for Egocentric Action Anticipation, [Paper]
(arXiv 2022.11) UDE: A Unified Driving Engine for Human Motion Generation, [Paper], [Code]
(arXiv 2022.11) Action-GPT: Leveraging Large-scale Language Models for Improved and Generalized Zero Shot Action Generation, [Paper], [Project]
(arXiv 2022.11) Knowledge Prompting for Few-shot Action Recognition, [Paper]
(arXiv 2022.11) UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance, [Paper], [Project]
(arXiv 2022.11) LVP-M^3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation, [Paper]
(arXiv 2022.11) PROCONTEXT: PROGRESSIVE CONTEXT TRANSFORMER FOR TRACKING, [Paper], [Code]
(arXiv 2022.11) Video based Object 6D Pose Estimation using Transformers, [Paper], [Code]
(arXiv 2022.11) S2WAT: Image Style Transfer via Hierarchical Vision Transformer using Strips Window Attention, [Paper], [Code]
(arXiv 2022.11) Holistic Interaction Transformer Network for Action Detection, [Paper], [Code]
(arXiv 2022.11) Learning and Retrieval from Prior Data for Skill-based Imitation Learning, [Paper], [Code]
(arXiv 2022.11) SimpleClick: Interactive Image Segmentation with Simple Vision Transformers, [Paper], [Code]
(arXiv 2022.11) TANGO: Text-driven Photorealistic and Robust 3D Stylization via Lighting Decomposition, [Paper], [Code]
(arXiv 2022.11) CPL: Counterfactual Prompt Learning for Vision and Language Models, [Paper], [Code]
(arXiv 2022.11) Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training, [Paper]
(arXiv 2022.11) Selective Query-guided Debiasing for Video Corpus Moment Retrieval, [Paper]
(arXiv 2022.11) Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning, [Paper], [Code]
(arXiv 2022.11) DENOISING MASKED AUTOENCODERS ARE CERTIFIABLE ROBUST VISION LEARNERS, [Paper], [Code]
(arXiv 2022.11) Token-Label Alignment for Vision Transformers, [Paper], [Code]
(arXiv 2022.11) CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory, [Paper], [Code]
(arXiv 2022.11) Multi-Scale Wavelet Transformer for Face Forgery Detection, [Paper]
(arXiv 2022.11) CLIP-PAE: PROJECTION-AUGMENTATION EMBEDDING TO EXTRACT RELEVANT FEATURES FOR A DISENTANGLED, INTERPRETABLE, AND CONTROLLABLE TEXT-GUIDED IMAGE MANIPULATION, [Paper]
(arXiv 2022.11) VISUAL PROMPT TUNING FOR TEST-TIME DOMAIN ADAPTATION, [Paper]
(arXiv 2022.11) FastCLIPstyler: Optimisation-free Text-based Image Style Transfer Using Style Representations, [Paper]
(arXiv 2022.11) PROGRESSIVE DENOISING MODEL FOR FINE-GRAINED TEXT-TO-IMAGE GENERATION, [Paper]
(arXiv 2022.11) DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics, [Paper], [Project]
(arXiv 2022.11) Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning, [Paper], [Code]
(arXiv 2022.11) ACCURATE IMAGE RESTORATION WITH ATTENTION RETRACTABLE TRANSFORMER, [Paper], [Code]
(arXiv 2022.11) Dilated Neighborhood Attention Transformer, [Paper], [Code]
(arXiv 2022.11) Unified Loss of Pair Similarity Optimization for Vision-Language Retrieval, [Paper]
(arXiv 2022.11) TVLT: Textless Vision-Language Transformer, [Paper], [Code]

2022.10

(arXiv 2022.10) DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention, [Paper]
(arXiv 2022.10) TFORMER: 3D TOOTH SEGMENTATION IN MESH SCANS WITH GEOMETRY GUIDED TRANSFORMER, [Paper]
(arXiv 2022.10) ON-THE-FLY OBJECT DETECTION USING STYLEGAN WITH CLIP GUIDANCE, [Paper]
(arXiv 2022.10) Image-free Domain Generalization via CLIP for 3D Hand Pose Estimation, [Paper]
(arXiv 2022.10) Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition, [Paper]
(arXiv 2022.10) A SIMPLE, EFFICIENT AND SCALABLE CONTRASTIVE MASKED AUTOENCODER FOR LEARNING VISUAL REPRESENTATIONS, [Paper]
(arXiv 2022.10) Time-rEversed diffusioN tEnsor Transformer: A new TENET of Few-Shot Object Detection, [Paper]
(arXiv 2022.10) Foreign Object Debris Detection for Airport Pavement Images based on Self-supervised Localization and Vision Transformer, [Paper]
(arXiv 2022.10) ViT-LSLA: Vision Transformer with Light Self-Limited-Attention, [Paper]
(arXiv 2022.10) Generative Negative Text Replay for Continual Vision-Language Pretraining, [Paper]
(arXiv 2022.10) PatchRot: A Self-Supervised Technique for Training Vision Transformers, [Paper]
(arXiv 2022.10) MULTIMODAL TRANSFORMER DISTILLATION FOR AUDIO-VISUAL SYNCHRONIZATION, [Paper]
(arXiv 2022.10) Multimodal Transformer for Parallel Concatenated Variational Autoencoders, [Paper]
(arXiv 2022.10) Differentially Private CutMix for Split Learning with Vision Transformer, [Paper]
(arXiv 2022.10) OHMG: ZERO-SHOT OPEN-VOCABULARY HUMAN MOTION GENERATION, [Paper]
(arXiv 2022.10) VLT: Vision-Language Transformer and Query Generation for Referring Segmentation, [Paper]
(arXiv 2022.10) PSFORMER: POINT TRANSFORMER FOR 3D SALIENT OBJECT DETECTION, [Paper]
(arXiv 2022.10) GRAFTING VISION TRANSFORMERS, [Paper]
(arXiv 2022.10) Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems, [Paper]
(arXiv 2022.10) FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning, [Paper]
(arXiv 2022.10) Masked Vision-Language Transformer in Fashion, [Paper], [Code]
(arXiv 2022.10) Learning Variational Motion Prior for Video-based Motion Capture, [Paper]
(arXiv 2022.10) Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models, [Paper], [Code]
(arXiv 2022.10) TEXT2MODEL: MODEL INDUCTION FOR ZERO-SHOT GENERALIZATION USING TASK DESCRIPTIONS, [Paper]
(arXiv 2022.10) Learning Joint Representation of Human Motion and Language, [Paper]
(arXiv 2022.10) ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts, [Paper]
(arXiv 2022.10) MSF3DDETR: Multi-Sensor Fusion 3D Detection Transformer for Autonomous Driving, [Paper]
(arXiv 2022.10) Li3DeTr: A LiDAR based 3D Detection Transformer, [Paper]
(arXiv 2022.10) Masked Transformer for image Anomaly Localization, [Paper]
(arXiv 2022.10) Discovering Design Concepts for CAD Sketches, [Paper]
(arXiv 2022.10) Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering, [Paper]
(arXiv 2022.10) End-to-End Multimodal Representation Learning for Video Dialog, [Paper]
(arXiv 2022.10) TPFNet: A Novel Text In-painting Transformer for Text Removal, [Paper], [Code]
(arXiv 2022.10) IMU2CLIP: MULTIMODAL CONTRASTIVE LEARNING FOR IMU MOTION SENSORS FROM EGOCENTRIC VIDEOS AND TEXT NARRATIONS, [Paper]
(arXiv 2022.10) Can Transformer Attention Spread Give Insights Into Uncertainty of Detected and Tracked Objects?, [Paper]
(arXiv 2022.10) SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.10) End-to-end Tracking with a Multi-query Transformer, [Paper]
(arXiv 2022.10) Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets, [Paper], [Code]
(arXiv 2022.10) TAMFORMER: MULTI-MODAL TRANSFORMER WITH LEARNED ATTENTION MASK FOR EARLY INTENT PREDICTION, [Paper]
(arXiv 2022.10) VISUAL ANSWER LOCALIZATION WITH CROSS-MODAL MUTUAL KNOWLEDGE TRANSFER, [Paper], [Code]
(arXiv 2022.10) Visual Semantic Parsing: From Images to Abstract Meaning Representation, [Paper]
(arXiv 2022.10) End-to-end Transformer for Compressed Video Quality Enhancement, [Paper]
(arXiv 2022.10) PlanT: Explainable Planning Transformers via Object-Level Representations, [Paper], [Project]
(arXiv 2022.10) Strong-TransCenter: Improved Multi-Object Tracking based on Transformers with Dense Representations, [Paper], [Code]
(arXiv 2022.10) GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction, [Paper]
(arXiv 2022.10) VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge, [Paper], [Code]
(arXiv 2022.10) Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision, [Paper]
(arXiv 2022.10) Learning Explicit Object-Centric Representations with Vision Transformers, [Paper]
(arXiv 2022.10) Abductive Action Inference, [Paper]
(arXiv 2022.10) Minutiae-Guided Fingerprint Embeddings via Vision Transformers, [Paper], [Code]
(arXiv 2022.10) 3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows, [Paper]
(arXiv 2022.10) COMPOSING ENSEMBLES OF PRE-TRAINED MODELS VIA ITERATIVE CONSENSUS, [Paper], [Code]
(arXiv 2022.10) Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?, [Paper]
(arXiv 2022.10) Boosting vision transformers for image retrieval, [Paper], [Code]
(arXiv 2022.10) LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling, [Paper]
(arXiv 2022.10) Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding, [Paper]
(arXiv 2022.10) Face Pyramid Vision Transformer, [Paper], [Code]
(arXiv 2022.10) Context-Enhanced Stereo Transformer, [Paper], [Code]
(arXiv 2022.10) CRT-6D: Fast 6D Object Pose Estimation with Cascaded Refinement Transformers, [Paper], [Code]
(arXiv 2022.10) Rethinking Learning Approaches for Long-Term Action Anticipation, [Paper], [Code]
(arXiv 2022.10) Extending Phrase Grounding with Pronouns in Visual Dialogues, [Paper], [Code]
(arXiv 2022.10) Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets, [Paper], [Code]
(arXiv 2022.10) Transformers For Recognition In Overhead Imagery: A Reality Check, [Paper]
(arXiv 2022.10) Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation, [Paper], [Code]
(arXiv 2022.10) UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection, [Paper]
(arXiv 2022.10) LCPFormer: Towards Effective 3D Point Cloud Analysis via Local Context Propagation in Transformers, [Paper]
(arXiv 2022.10) Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization, [Paper], [Code]
(arXiv 2022.10) Language-free Training for Zero-shot Video Grounding, [Paper], [Code]
(arXiv 2022.10) Foreground Guidance and Multi-Layer Feature Fusion for Unsupervised Object Discovery with Transformers, [Paper], [Code]
(arXiv 2022.10) Towards Unifying Reference Expression Generation and Comprehension, [Paper]
(arXiv 2022.10) Robust Self-Supervised Learning with Lie Groups, [Paper]
(arXiv 2022.10) VIOLA: Imitation Learning for Vision-Based Manipulation with Object Proposal Priors, [Paper], [Code]
(arXiv 2022.10) VTC: Improving Video-Text Retrieval with User Comments, [Paper], [Project]
(arXiv 2022.10) SOLVING REASONING TASKS WITH A SLOT TRANSFORMER, [Paper], [Code]
(arXiv 2022.10) Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models, [Paper]
(arXiv 2022.10) Grounded Video Situation Recognition, [Paper], [Project]
(arXiv 2022.10) Single Image Super-Resolution Using Lightweight Networks Based on Swin Transformer, [Paper]
(arXiv 2022.10) Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation, [Paper], [Code]
(arXiv 2022.10) MovieCLIP: Visual Scene Recognition in Movies, [Paper]
(arXiv 2022.10) PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points, [Paper], [Code]
(arXiv 2022.10) TOWARDS SUSTAINABLE SELF-SUPERVISED LEARNING, [Paper]
(arXiv 2022.10) Visual-Semantic Contrastive Alignment for Few-Shot Image Classification, [Paper]
(arXiv 2022.10) i-MAE: ARE LATENT REPRESENTATIONS IN MASKED AUTOENCODERS LINEARLY SEPARABLE?, [Paper], [Code]
(arXiv 2022.10) 2nd Place Solution to ECCV 2022 Challenge: Transformer-based Action recognition in hand-object interacting scenarios, [Paper]
(arXiv 2022.10) 1st Place Solution to ECCV 2022 Challenge on HBHA: Transformer-based Global 3D Hand Pose Estimation in Two Hands Manipulating Objects Scenarios, [Paper]
(arXiv 2022.10) DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models, [Paper]
(arXiv 2022.10) CLIP-Driven Fine-grained Text-Image Person Re-identification, [Paper]
(arXiv 2022.10) Dense but Efficient VideoQA for Intricate Compositional Reasoning, [Paper]
(arXiv 2022.10) Multi-view Gait Recognition based on Siamese Vision Transformer, [Paper]
(arXiv 2022.10) TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation, [Paper], [Code]
(arXiv 2022.10) CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion, [Paper], [Project]
(arXiv 2022.10) A Unified View of Masked Image Modeling, [Paper], [Code]
(arXiv 2022.10) Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval, [Paper], [Code]
(arXiv 2022.10) BOAT: Bilateral Local Attention Vision Transformer, [Paper]
(arXiv 2022.10) TOKEN MERGING: YOUR VIT BUT FASTER, [Paper], [Code]
(arXiv 2022.10) Using Language to Extend to Unseen Domains, [Paper], [Code]
(arXiv 2022.10) SWINV2-IMAGEN: HIERARCHICAL VISION TRANSFORMER DIFFUSION MODELS FOR TEXT-TO-IMAGE GENERATION, [Paper]
(arXiv 2022.10) HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes, [Paper], [Project]
(arXiv 2022.10) Transfer-learning for video classification: Video Swin Transformer on multiple domains, [Paper]
(arXiv 2022.10) PERCEPTUAL GROUPING IN VISION-LANGUAGE MODELS, [Paper]
(arXiv 2022.10) How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders, [Paper], [Code]
(arXiv 2022.10) LINEAR VIDEO TRANSFORMER WITH FEATURE FIXATION, [Paper], [Code]
(arXiv 2022.10) Transformer-based dimensionality reduction, [Paper]
(arXiv 2022.10) Bridging the Domain Gap for Multi-Agent Perception, [Paper]
(arXiv 2022.10) TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos, [Paper], [Code]
(arXiv 2022.10) SCRATCHING VISUAL TRANSFORMER’S BACK WITH UNIFORM ATTENTION, [Paper]
(arXiv 2022.10) Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective, [Paper]
(arXiv 2022.10) TLDW: Extreme Multimodal Summarisation of News Videos, [Paper], [Code]
(arXiv 2022.10) Character-Centric Story Visualization via Visual Planning and Token Alignment, [Paper], [Code]
(arXiv 2022.10) COFAR: Commonsense and Factual Reasoning in Image Search, [Paper], [Code]
(arXiv 2022.10) Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers, [Paper], [Code]
(arXiv 2022.10) Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows, [Paper]
(arXiv 2022.10) Forecasting Human Trajectory from Scene History, [Paper], [Code]
(arXiv 2022.10) SGRAM: Improving Scene Graph Parsing via Abstract Meaning Representation, [Paper]
(arXiv 2022.10) Contrastive Language-Image Pre-Training with Knowledge Graphs, [Paper]
(arXiv 2022.10) A Saccaded Visual Transformer for General Object Spotting, [Paper]
(arXiv 2022.10) Vision Transformers provably learn spatial structure, [Paper]
(arXiv 2022.10) oViT: An Accurate Second-Order Pruning Framework for Vision Transformers, [Paper]
(arXiv 2022.10) Fine-grained Category Discovery under Coarse-grained supervision with Hierarchical Weighted Self-contrastive Learning, [Paper], [Code]
(arXiv 2022.10) Non-Contrastive Learning Meets Language-Image Pre-Training, [Paper]
(arXiv 2022.10) Frame Mining: a Free Lunch for Learning Robotic Manipulation from 3D Point Clouds, [Paper], [Project]
(arXiv 2022.10) Pretrained Transformers Do not Always Improve Robustness, [Paper]
(arXiv 2022.10) Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training, [Paper]
(arXiv 2022.10) CONTRASTIVE AUDIO-VISUAL MASKED AUTOENCODER, [Paper]
(arXiv 2022.10) SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds, [Paper]
(arXiv 2022.10) Trailers12k: Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification, [Paper]
(arXiv 2022.10) AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments, [Paper]
(arXiv 2022.10) MOVE: Unsupervised Movable Object Segmentation and Detection, [Paper]
(arXiv 2022.10) IS SYNTHETIC DATA FROM GENERATIVE MODELS READY FOR IMAGE RECOGNITION?, [Paper], [Code]
(arXiv 2022.10) Towards Transformer-based Homogenization of Satellite Imagery for Landsat-8 and Sentinel-2, [Paper]
(arXiv 2022.10) MCTNET: A MULTI-SCALE CNN-TRANSFORMER NETWORK FOR CHANGE DETECTION IN OPTICAL REMOTE SENSING IMAGES, [Paper]
(arXiv 2022.10) Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?, [Paper], [Code]
(arXiv 2022.10) TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers, [Paper], [Code]
(arXiv 2022.10) SQA3D: SITUATED QUESTION ANSWERING IN 3D SCENES, [Paper]
(arXiv 2022.10) When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture, [Paper], [Code]
(arXiv 2022.10) STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition, [Paper]
(arXiv 2022.10) PedFormer: Pedestrian Behavior Prediction via Cross-Modal Attention Modulation and Gated Multitask Learning, [Paper]
(arXiv 2022.10) One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations, [Paper], [Code]
(arXiv 2022.10) IMAGINARYNET: LEARNING OBJECT DETECTORS WITHOUT REAL IMAGES AND ANNOTATIONS, [Paper], [Code]
(arXiv 2022.10) Feature-Proxy Transformer for Few-Shot Segmentation, [Paper], [Code]
(arXiv 2022.10) Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks, [Paper]
(arXiv 2022.10) UNIFIED VISION AND LANGUAGE PROMPT LEARNING, [Paper], [Code]
(arXiv 2022.10) Exploring Long-Sequence Masked Autoencoders, [Paper], [Code]
(arXiv 2022.10) MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting, [Paper]
(arXiv 2022.10) Interactive Language: Talking to Robots in Real Time, [Paper], [Project]
(arXiv 2022.10) RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer, [Paper], [Code]
(arXiv 2022.10) How to Train Vision Transformer on Small-scale Datasets?, [Paper], [Code]
(arXiv 2022.10) Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features, [Paper], [Code]
(arXiv 2022.10) Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers, [Paper]
(arXiv 2022.10) CURVED REPRESENTATION SPACE OF VISION TRANSFORMERS, [Paper]
(arXiv 2022.10) Foundation Transformers, [Paper], [Code]
(arXiv 2022.10) Underspecification in Scene Description-to-Depiction Tasks, [Paper]
(arXiv 2022.10) Continuous conditional video synthesis by neural processes, [Paper], [Code]
(arXiv 2022.10) SAIT: SPARSE VISION TRANSFORMERS THROUGH ADAPTIVE TOKEN PRUNING, [Paper]
(arXiv 2022.10) ZITS++: Image Inpainting by Improving the Incremental Transformer on Structural Priors, [Paper]
(arXiv 2022.10) SLOTFORMER: UNSUPERVISED VISUAL DYNAMICS SIMULATION WITH OBJECT-CENTRIC MODELS, [Paper], [Project]
(arXiv 2022.10) Learning by Asking Questions for Knowledge-based Novel Object Recognition, [Paper]
(arXiv 2022.10) Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets, [Paper], [Code]
(arXiv 2022.10) GGViT: Multistream Vision Transformer Network in Face2Face Facial Reenactment Detection, [Paper]
(arXiv 2022.10) Distilling Knowledge from Language Models for Video-based Action Anticipation, [Paper]
(arXiv 2022.10) Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning, [Paper], [Code]
(arXiv 2022.10) M3VIDEO: MASKED MOTION MODELING FOR SELF-SUPERVISED VIDEO REPRESENTATION LEARNING, [Paper]
(arXiv 2022.10) Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting Transformers, [Paper], [Code]
(arXiv 2022.10) FontTransformer: Few-shot High-resolution Chinese Glyph Image Synthesis via Stacked Transformers, [Paper]
(arXiv 2022.10) AISFormer: Amodal Instance Segmentation with Transformer, [Paper], [Code]
(arXiv 2022.10) ViewBirdiformer: Learning to recover ground-plane crowd trajectories and ego-motion from a single ego-centric view, [Paper]
(arXiv 2022.10) One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks, [Paper]
(arXiv 2022.10) PROMPT GENERATION NETWORKS FOR EFFICIENT ADAPTATION OF FROZEN VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2022.10) Generating Executable Action Plans with Environmentally-Aware Language Models, [Paper]
(arXiv 2022.10) AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization, [Paper]
(arXiv 2022.10) Improving Dense Contrastive Learning with Dense Negative Pairs, [Paper]
(arXiv 2022.10) Fine-Grained Image Style Transfer with Visual Transformers, [Paper], [Code]
(arXiv 2022.10) IT TAKES TWO: MASKED APPEARANCE-MOTION MODELING FOR SELF-SUPERVISED VIDEO TRANSFORMER PRE-TRAINING, [Paper]
(arXiv 2022.10) Contrastive Video-Language Learning with Fine-grained Frame Sampling, [Paper]
(arXiv 2022.10) Style-Guided Inference of Transformer for High-resolution Image Synthesis, [Paper]
(arXiv 2022.10) MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model, [Paper], [Code]
(arXiv 2022.10) LEARNING TO LOCATE VISUAL ANSWER IN VIDEO CORPUS USING QUESTION, [Paper], [Code]
(arXiv 2022.10) UNDERSTANDING EMBODIED REFERENCE WITH TOUCH-LINE TRANSFORMER, [Paper]
(arXiv 2022.10) Point Transformer V2: Grouped Vector Attention and Partition-based Pooling, [Paper], [Code]
(arXiv 2022.10) See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction, [Paper]
(arXiv 2022.10) USING BOTH DEMONSTRATIONS AND LANGUAGE INSTRUCTIONS TO EFFICIENTLY LEARN ROBOTIC TASKS, [Paper], [Project]
(arXiv 2022.10) Generating image captions with external encyclopedic knowledge, [Paper]
(arXiv 2022.10) LOCL: Learning Object-Attribute Composition using Localization, [Paper]
(arXiv 2022.10) SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models, [Paper], [Code]
(arXiv 2022.10) ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval, [Paper]
(arXiv 2022.10) Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling, [Paper], [Code]
(arXiv 2022.10) (Fusionformer): Exploiting the Joint Motion Synergy with Fusion Network Based On Transformer for 3D Human Pose Estimation, [Paper]
(arXiv 2022.10) Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs, [Paper], [Code]
(arXiv 2022.10) OGC: Unsupervised 3D Object Segmentation from Rigid Dynamics of Point Clouds, [Paper], [Code]
(arXiv 2022.10) Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing, [Paper], [Code]
(arXiv 2022.10) Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment, [Paper]
(arXiv 2022.10) VOLTA: VISION-LANGUAGE TRANSFORMER WITH WEAKLY-SUPERVISED LOCAL-FEATURE ALIGNMENT, [Paper]
(arXiv 2022.10) OPEN-VOCABULARY SEMANTIC SEGMENTATION WITH MASK-ADAPTED CLIP, [Paper], [Project]
(arXiv 2022.10) MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning, [Paper]
(arXiv 2022.10) SELF-SUPERVISED VIDEO REPRESENTATION LEARNING WITH MOTION-AWARE MASKED AUTOENCODERS, [Paper], [Code]
(arXiv 2022.10) LEARNING TO DECOMPOSE VISUAL FEATURES WITH LATENT TEXTUAL PROMPTS, [Paper]
(arXiv 2022.10) DCVQE: A Hierarchical Transformer for Video Quality Assessment, [Paper]
(arXiv 2022.10) Fine-grained Object Categorization for Service Robots, [Paper]
(arXiv 2022.10) CLIP-DIFFUSION-LM: APPLY DIFFUSION MODEL ON IMAGE CAPTIONING, [Paper], [Code]
(arXiv 2022.10) A Memory Transformer Network for Incremental Learning, [Paper]
(arXiv 2022.10) Bridging CLIP and StyleGAN through Latent Alignment for Image Editing, [Paper]
(arXiv 2022.10) LMQFormer: A Laplace-Prior-Guided Mask Query Transformer for Lightweight Snow Removal, [Paper]
(arXiv 2022.10) FS-DETR: FEW-SHOT DETECTION TRANSFORMER WITH PROMPTING AND WITHOUT RE-TRAINING, [Paper]
(arXiv 2022.10) Transformer-based Localization from Embodied Dialog with Large-scale Pre-training, [Paper]
(arXiv 2022.10) Turbo Training with Token Dropout, [Paper]
(arXiv 2022.10) Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks, [Paper]
(arXiv 2022.10) C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval, [Paper]
(arXiv 2022.10) Pose Guided Human Image Synthesis with Partially Decoupled GAN, [Paper]
(arXiv 2022.10) A Simple Plugin for Transforming Images to Arbitrary Scales, [Paper], [Project]
(arXiv 2022.10) Time-Space Transformers for Video Panoptic Segmentation, [Paper]
(arXiv 2022.10) MOAT: ALTERNATING MOBILE CONVOLUTION AND ATTENTION BRINGS STRONG VISION MODELS, [Paper], [Code]
(arXiv 2022.10) IMAGEN VIDEO: HIGH DEFINITION VIDEO GENERATION WITH DIFFUSION MODELS, [Paper], [Project]
(arXiv 2022.10) clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP, [Paper]
(arXiv 2022.10) FQDet: Fast-converging Query-based Detector, [Paper], [Code]
(arXiv 2022.10) VARIATIONAL PROMPT TUNING IMPROVES GENERALIZATION OF VISION-LANGUAGE MODELS, [Paper]
(arXiv 2022.10) Grounding Language with Visual Affordances over Unstructured Data, [Paper], [Project]
(arXiv 2022.10) Granularity-aware Adaptation for Image Retrieval over Multiple Tasks, [Paper]
(arXiv 2022.10) WHEN AND WHY VISION-LANGUAGE MODELS BEHAVE LIKE BAGS-OF-WORDS, AND WHAT TO DO ABOUT IT?, [Paper]
(arXiv 2022.10) Multi-view Human Body Mesh Translator, [Paper]
(arXiv 2022.10) EXPLORING THE ROLE OF MEAN TEACHERS IN SELF-SUPERVISED MASKED AUTO-ENCODERS, [Paper]
(arXiv 2022.10) Point Cloud Recognition with Position-to-Structure Attention Transformers, [Paper]
(arXiv 2022.10) TEMPORALLY CONSISTENT VIDEO TRANSFORMER FOR LONG-TERM VIDEO PREDICTION, [Paper], [Code]
(arXiv 2022.10) PHENAKI: VARIABLE LENGTH VIDEO GENERATION FROM OPEN DOMAIN TEXTUAL DESCRIPTIONS, [Paper]
(arXiv 2022.10) MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text, [Paper]
(arXiv 2022.10) Real-World Robot Learning with Masked Visual Pre-training, [Paper], [Project]
(arXiv 2022.10) BaseTransformers: Attention over base data-points for One Shot Learning, [Paper], [Code]
(arXiv 2022.10) Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition, [Paper]
(arXiv 2022.10) Vision Transformer Based Model for Describing a Set of Images as a Story, [Paper]
(arXiv 2022.10) Video Referring Expression Comprehension via Transformer with Content-aware Query, [Paper], [Code]
(arXiv 2022.10) EFFECTIVE SELF-SUPERVISED PRE-TRAINING ON LOW-COMPUTE NETWORKS WITHOUT DISTILLATION, [Paper]
(arXiv 2022.10) CLIP MODEL IS AN EFFICIENT CONTINUAL LEARNER, [Paper]
(arXiv 2022.10) Content-Based Search for Deep Generative Models, [Paper]
(arXiv 2022.10) MAPLE: MULTI-MODAL PROMPT LEARNING, [Paper], [Code]
(arXiv 2022.10) SYSTEMATIC GENERALIZATION AND EMERGENT STRUCTURES IN TRANSFORMERS TRAINED ON STRUCTURED TASKS, [Paper]
(arXiv 2022.10) WIDE ATTENTION IS THE WAY FORWARD FOR TRANSFORMERS?, [Paper]
(arXiv 2022.10) DARTFORMER: FINDING THE BEST TYPE OF ATTENTION, [Paper]
(arXiv 2022.10) MOBILEVITV3: MOBILE-FRIENDLY VISION TRANSFORMER WITH SIMPLE AND EFFECTIVE FUSION OF LOCAL, GLOBAL AND INPUT FEATURES, [Paper], [Code]
(arXiv 2022.10) Differentiable Parsing and Visual Grounding of Verbal Instructions for Object Placement, [Paper], [Project]
(arXiv 2022.10) EAPruning: Evolutionary Pruning for Vision Transformers and CNNs, [Paper]
(arXiv 2022.10) Motion-inductive Self-supervised Object Discovery in Videos, [Paper]
(arXiv 2022.10) Fully Transformer Network for Change Detection of Remote Sensing Images, [Paper], [Code]
(arXiv 2022.10) TOWARDS A UNIFIED VIEW ON VISUAL PARAMETER-EFFICIENT TRANSFER LEARNING, [Paper]
(arXiv 2022.10) Visual Prompt Tuning for Generative Transfer Learning, [Paper]
(arXiv 2022.10) A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers, [Paper]
(arXiv 2022.10) LPT: LONG-TAILED PROMPT TUNING FOR IMAGE CLASSIFICATION, [Paper]
(arXiv 2022.10) Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning, [Paper]
(arXiv 2022.10) CLIP2POINT: TRANSFER CLIP TO POINT CLOUD CLASSIFICATION WITH IMAGE-DEPTH PRE-TRAINING, [Paper]
(arXiv 2022.10) Dual-former: Hybrid Self-attention Transformer for Efficient Image Restoration, [Paper]
(arXiv 2022.10) LANGUAGE-AWARE SOFT PROMPTING FOR VISION & LANGUAGE FOUNDATION MODELS, [Paper]
(arXiv 2022.10) ASIF: COUPLED DATA TURNS UNIMODAL MODELS TO MULTIMODAL WITHOUT TRAINING, [Paper]
(arXiv 2022.10) ImmFusion: Robust mmWave-RGB Fusion for 3D Human Body Reconstruction in All Weather Conditions, [Paper]
(arXiv 2022.10) PROMPT LEARNING WITH OPTIMAL TRANSPORT FOR VISION-LANGUAGE MODELS, [Paper]
(arXiv 2022.10) Bridged Transformer for Vision and Point Cloud 3D Object Detection, [Paper]
(arXiv 2022.10) Dense Prediction Transformer for Scale Estimation in Monocular Visual Odometry, [Paper]
(arXiv 2022.10) HUMAN MOTION DIFFUSION MODEL, [Paper], [Project]
(arXiv 2022.10) TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval, [Paper]
(arXiv 2022.10) UniCLIP: Unified Framework for Contrastive Language–Image Pre-training, [Paper]
(arXiv 2022.10) CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection, [Paper], [Code]
(arXiv 2022.10) Multi-dataset Training of Transformers for Robust Action Recognition, [Paper], [Code]
(arXiv 2022.10) Multi-Scale Human-Object Interaction Detector, [Paper]
(arXiv 2022.10) LGDN: Language-Guided Denoising Network for Video-Language Modeling, [Paper]
(arXiv 2022.10) RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.10) Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation, [Paper], [Code]
(arXiv 2022.10) Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features, [Paper], [Code]
(arXiv 2022.10) Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer, [Paper], [Code]
(arXiv 2022.10) Prepended Domain Transformer: Heterogeneous Face Recognition without Bells and Whistles, [Paper]
(arXiv 2022.10) Visual Knowledge Graph for Human Action Reasoning in Videos, [Paper]
(arXiv 2022.10) Human Joint Kinematics Diffusion-Refinement for Stochastic Motion Prediction, [Paper]
(arXiv 2022.10) VIMA: GENERAL ROBOT MANIPULATION WITH MULTIMODAL PROMPTS, [Paper], [Project]
(arXiv 2022.10) What Should the System Do Next?: Operative Action Captioning for Estimating System Actions, [Paper]
(arXiv 2022.10) DMMGAN: Diverse Multi Motion Prediction of 3D Human Joints using Attention-Based Generative Adversarial Network, [Paper]
(arXiv 2022.10) PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to 6 DoF Tracking, [Paper], [Code]

2022.09

(arXiv 2022.09) SELF-DISTILLATION FOR FURTHER PRE-TRAINING OF TRANSFORMERS, [Paper]
(arXiv 2022.09) Visuo-Tactile Transformers for Manipulation, [Paper], [Project]
(arXiv 2022.09) UNDERSTANDING PURE CLIP GUIDANCE FOR VOXEL GRID NERF MODELS, [Paper], [Project]
(arXiv 2022.09) Dual Progressive Transformations for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.09) Transformers for Object Detection in Large Point Clouds, [Paper]
(arXiv 2022.09) DIFFUSION-BASED IMAGE TRANSLATION USING DISENTANGLED STYLE AND CONTENT REPRESENTATION, [Paper]
(arXiv 2022.09) ERNIE-VIL 2.0: MULTI-VIEW CONTRASTIVE LEARNING FOR IMAGE-TEXT PRE-TRAINING, [Paper], [Code]
(arXiv 2022.09) LEARNING TRANSFERABLE SPATIOTEMPORAL REPRESENTATIONS FROM NATURAL SCRIPT KNOWLEDGE, [Paper]
(arXiv 2022.09) SMALLCAP: Lightweight Image Captioning Prompted with Retrieval Augmentation, [Paper], [Code]
(arXiv 2022.09) SPIKFORMER: WHEN SPIKING NEURAL NETWORK MEETS TRANSFORMER, [Paper]
(arXiv 2022.09) F-VLM: OPEN-VOCABULARY OBJECT DETECTION UPON FROZEN VISION AND LANGUAGE MODELS, [Paper]
(arXiv 2022.09) CONTRASTIVE CORPUS ATTRIBUTION FOR EXPLAINING REPRESENTATIONS, [Paper]
(arXiv 2022.09) Alignment-guided Temporal Attention for Video Action Recognition, [Paper]
(arXiv 2022.09) EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning, [Paper], [Code]
(arXiv 2022.09) SPOTLIGHT: MOBILE UI UNDERSTANDING USING VISION-LANGUAGE MODELS WITH A FOCUS, [Paper]
(arXiv 2022.09) DREAMFUSION: TEXT-TO-3D USING 2D DIFFUSION, [Paper], [Project]
(arXiv 2022.09) REST: RETRIEVE & SELF-TRAIN FOR GENERATIVE ACTION RECOGNITION, [Paper]
(arXiv 2022.09) Effective Vision Transformer Training: A Data-Centric Perspective, [Paper]
(arXiv 2022.09) Human-in-the-loop Robotic Grasping using BERT Scene Representation, [Paper], [Project]
(arXiv 2022.09) Revisiting Few-Shot Learning from a Causal Perspective, [Paper]
(arXiv 2022.09) Attacking Compressed Vision Transformers, [Paper]
(arXiv 2022.09) Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention, [Paper]
(arXiv 2022.09) DeViT: Deformed Vision Transformers in Video Inpainting, [Paper]
(arXiv 2022.09) Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks, [Paper], [Code]
(arXiv 2022.09) Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding, [Paper]
(arXiv 2022.09) Motion Transformer for Unsupervised Image Animation, [Paper]
(arXiv 2022.09) Weighted Contrastive Hashing, [Paper], [Code]
(arXiv 2022.09) CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention, [Paper]
(arXiv 2022.09) Dialog Acts for Task-Driven Embodied Agents, [Paper]
(arXiv 2022.09) NEURAL MARIONETTE: A Transformer-based Multi-action Human Motion Synthesis System, [Paper], [Code]
(arXiv 2022.09) Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding, [Paper], [Code]
(arXiv 2022.09) Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval, [Paper]
(arXiv 2022.09) Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding, [Paper]
(arXiv 2022.09) Anomaly Detection in Aerial Videos with Transformers, [Paper], [Code]
(arXiv 2022.09) AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition, [Paper]
(arXiv 2022.09) Motion Transformer with Global Intention Localization and Local Movement Refinement, [Paper], [Code]
(arXiv 2022.09) FREESEG: FREE MASK FROM INTERPRETABLE CONTRASTIVE LANGUAGE-IMAGE PRETRAINING FOR SEMANTIC SEGMENTATION, [Paper]
(arXiv 2022.09) Learning State-Aware Visual Representations from Audible Interactions, [Paper], [Code]
(arXiv 2022.09) Towards Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline, [Paper]
(arXiv 2022.09) Leveraging Self-Supervised Training for Unintentional Action Recognition, [Paper]
(arXiv 2022.09) NeRF-Loc: Transformer-Based Object Localization Within Neural Radiance Fields, [Paper]
(arXiv 2022.09) All are Worth Words: a ViT Backbone for Score-based Diffusion Models, [Paper]
(arXiv 2022.09) Paraphrasing Is All You Need for Novel Object Captioning, [Paper]
(arXiv 2022.09) Collaboration of Pre-trained Models Makes Better Few-shot Learner, [Paper]
(arXiv 2022.09) Multi-modal Video Chapter Generation, [Paper], [Code]
(arXiv 2022.09) Best Prompts for Text-to-Image Models and How to Find Them, [Paper]
(arXiv 2022.09) Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration, [Paper], [Code]
(arXiv 2022.09) 3DPCT: 3D Point Cloud Transformer with Dual Self-attention, [Paper]
(arXiv 2022.09) LIGHTWEIGHT TRANSFORMERS FOR HUMAN ACTIVITY RECOGNITION ON MOBILE DEVICES, [Paper]
(arXiv 2022.09) PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training, [Paper]
(arXiv 2022.09) UniColor: A Unified Framework for Multi-Modal Colorization with Transformer, [Paper], [Code]
(arXiv 2022.09) Traffic Accident Risk Forecasting using Contextual Vision Transformers, [Paper]
(arXiv 2022.09) CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding, [Paper]
(arXiv 2022.09) Recipe Generation from Unsegmented Cooking Videos, [Paper]
(arXiv 2022.09) PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress Classification, [Paper], [Code]
(arXiv 2022.09) Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia, [Paper]
(arXiv 2022.09) RNGDet++: Road Network Graph Detection by Transformer with Instance Segmentation and Multi-scale Features Enhancement, [Paper], [Code]
(arXiv 2022.09) Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering, [Paper]
(arXiv 2022.09) I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification, [Paper]
(arXiv 2022.09) Integer Fine-tuning of Transformer-based Models, [Paper]
(arXiv 2022.09) Open-vocabulary Queryable Scene Representations for Real World Planning, [Paper], [Code]
(arXiv 2022.09) DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection, [Paper]
(arXiv 2022.09) Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos, [Paper]
(arXiv 2022.09) Graph Reasoning Transformer for Image Parsing, [Paper]
(arXiv 2022.09) Quantum Vision Transformers, [Paper]
(arXiv 2022.09) Active Visual Search in the Wild, [Paper]
(arXiv 2022.09) PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation, [Paper], [Code]
(arXiv 2022.09) Learning Distinct and Representative Modes for Image Captioning, [Paper], [Code]
(arXiv 2022.09) TODE-Trans: Transparent Object Depth Estimation with Transformer, [Paper], [Code]
(arXiv 2022.09) Tree-based Text-Vision BERT for Video Search in Baidu Video Advertising, [Paper]
(arXiv 2022.09) Integrative Feature and Cost Aggregation with Transformers for Dense Correspondence, [Paper]
(arXiv 2022.09) Axially Expanded Windows for Local-Global Interaction in Vision Transformers, [Paper]
(arXiv 2022.09) UNCERTAINTY AWARE MULTITASK PYRAMID VISION TRANSFORMER FOR UAV-BASED OBJECT RE-IDENTIFICATION, [Paper]
(arXiv 2022.09) TASKED: Transformer-based Adversarial learning for human activity recognition using wearable sensors via Self-KnowledgE Distillation, [Paper]
(arXiv 2022.09) EcoFormer: Energy-Saving Attention with Linear Complexity, [Paper], [[Code]](https://github.com/ziplab/EcoFormer)
(arXiv 2022.09) Panoramic Vision Transformer for Saliency Detection in 360° Videos, [Paper]
(arXiv 2022.09) THE BIASED ARTIST: EXPLOITING CULTURAL BIASES VIA HOMOGLYPHS IN TEXT-GUIDED IMAGE GENERATION MODELS, [Paper]
(arXiv 2022.09) Scene Graph Modification as Incremental Structure Expanding, [Paper], [Code]
(arXiv 2022.09) Discriminative Sampling of Proposals in Self-Supervised Transformers for Weakly Supervised Object Localization, [Paper], [Code]
(arXiv 2022.09) Real-time Online Video Detection with Temporal Smoothing Transformers, [Paper]
(arXiv 2022.09) ViT-DD: Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection, [Paper], [Code]
(arXiv 2022.09) Code as Policies: Language Model Programs for Embodied Control, [Paper], [Project]
(arXiv 2022.09) SQ-Swin: a Pretrained Siamese Quadratic Swin Transformer for Lettuce Browning Prediction, [Paper]
(arXiv 2022.09) Self-Attentive Pooling for Efficient Deep Learning, [Paper]
(arXiv 2022.09) Domain-Unified Prompt Representations for Source-Free Domain Generalization, [Paper], [Code]
(arXiv 2022.09) BRIDGING THE GAP TO REAL-WORLD OBJECT-CENTRIC LEARNING, [Paper]
(arXiv 2022.09) Prompt-guided Scene Generation for 3D Zero-Shot Learning, [Paper]
(arXiv 2022.09) RE-IMAGEN: RETRIEVAL-AUGMENTED TEXT-TO-IMAGE GENERATOR, [Paper]
(arXiv 2022.09) Distribution Aware Metrics for Conditional Natural Language Generation, [Paper]
(arXiv 2022.09) CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models, [Paper]
(arXiv 2022.09) Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering, [Paper]
(arXiv 2022.09) PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer, [Paper], [Code]
(arXiv 2022.09) Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer? [Paper], [Code]
(arXiv 2022.09) EXPLORING VISUAL INTERPRETABILITY FOR CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING, [Paper]
(arXiv 2022.09) OmniVL: One Foundation Model for Image-Language and Video-Language Tasks, [Paper]
(arXiv 2022.09) Test-Time Training with Masked Autoencoders, [Paper], [Code]
(arXiv 2022.09) VISUAL RECOGNITION WITH DEEP NEAREST CENTROIDS, [Paper], [Code]
(arXiv 2022.09) One-Shot Transfer of Affordance Regions? AffCorrs! [Paper], [Code]
(arXiv 2022.09) Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models, [Paper], [Code]
(arXiv 2022.09) A Light Recipe to Train Robust Vision Transformers, [Paper], [Code]
(arXiv 2022.09) On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition, [Paper]
(arXiv 2022.09) Number of Attention Heads vs. Number of Transformer-Encoders in Computer Vision, [Paper]
(arXiv 2022.09) Global Semantic Descriptors for Zero-Shot Action Recognition, [Paper], [Code]
(arXiv 2022.09) Revisiting Neural Scaling Laws in Language and Vision, [Paper]
(arXiv 2022.09) Small Transformers Compute Universal Metric Embeddings, [Paper]
(arXiv 2022.09) CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment, [Paper], [Code]
(arXiv 2022.09) CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer, [Paper]
(arXiv 2022.09) Transformers and CNNs both Beat Humans on SBIR, [Paper]
(arXiv 2022.09) PaLI: A Jointly-Scaled Multilingual Language-Image Model, [Paper]
(arXiv 2022.09) MUST-VQA: MUltilingual Scene-text VQA, [Paper], [Code]
(arXiv 2022.09) Leveraging Large Language Models for Robot 3D Scene Understanding, [Paper], [Code]
(arXiv 2022.09) A lightweight Transformer-based model for fish landmark detection, [Paper]
(arXiv 2022.09) PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers, [Paper], [Code]
(arXiv 2022.09) ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers, [Paper]
(arXiv 2022.09) Semantic2Graph: Graph-based Multi-modal Feature for Action Segmentation in Videos, [Paper]
(arXiv 2022.09) CenterFormer: Center-based Transformer for 3D Object Detection, [Paper], [Code]
(arXiv 2022.09) PreSTU: Pre-Training for Scene-Text Understanding, [Paper]
(arXiv 2022.09) OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training, [Paper]
(arXiv 2022.09) DMTNet: Dynamic Multi-scale Network for Dual-pixel Images Defocus Deblurring with Transformer, [Paper]
(arXiv 2022.09) SeRP: Self-Supervised Representation Learning Using Perturbed Point Clouds, [Paper]
(arXiv 2022.09) VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models, [Paper]
(arXiv 2022.09) StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation, [Paper], [Code]
(arXiv 2022.09) ON THE COMPUTATIONAL COMPLEXITY OF SELF-ATTENTION, [Paper]
(arXiv 2022.09) Instruction-driven history-aware policies for robotic manipulations, [Paper], [Code]
(arXiv 2022.09) Towards Multi-Lingual Visual Question Answering, [Paper]
(arXiv 2022.09) PERCEIVER-ACTOR: A Multi-Task Transformer for Robotic Manipulation, [Paper], [Project]
(arXiv 2022.09) GLOBAL PROTOTYPE ENCODING FOR INCREMENTAL VIDEO HIGHLIGHTS DETECTION, [Paper], [Code]
(arXiv 2022.09) Self-Supervised Multimodal Fusion Transformer for Passive Activity Recognition, [Paper]
(arXiv 2022.09) FETA: Towards Specializing Foundation Models for Expert Task Applications, [Paper]
(arXiv 2022.09) Prior Knowledge-Guided Attention in Self-Supervised Vision Transformers, [Paper]
(arXiv 2022.09) Exploring Target Representations for Masked Autoencoders, [Paper]
(arXiv 2022.09) ISS: IMAGE AS STEPPING STONE FOR TEXT-GUIDED 3D SHAPE GENERATION, [Paper]
(arXiv 2022.09) Towards Confidence-guided Shape Completion for Robotic Applications, [Paper], [Code]
(arXiv 2022.09) Pre-training image-language transformers for open-vocabulary tasks, [Paper]
(arXiv 2022.09) Improved Masked Image Generation with Token-Critic, [Paper]
(arXiv 2022.09) Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, [Paper], [Code]
(arXiv 2022.09) Uformer-ICS: A Specialized U-Shaped Transformer for Image Compressive Sensing, [Paper]
(arXiv 2022.09) An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling, [Paper]
(arXiv 2022.09) Spatial-Temporal Transformer for Video Snapshot Compressive Imaging, [Paper], [Code]
(arXiv 2022.09) MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition, [Paper]
(arXiv 2022.09) SEFormer: Structure Embedding Transformer for 3D Object Detection, [Paper]
(arXiv 2022.09) ADTR: Anomaly Detection Transformer with Feature Reconstruction, [Paper]
(arXiv 2022.09) Learning Canonical Embeddings for Unsupervised Shape Correspondence with Locally Linear Transformations, [Paper]
(arXiv 2022.09) Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students, [Paper]
(arXiv 2022.09) PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection, [Paper], [Code]
(arXiv 2022.09) VITKD: PRACTICAL GUIDELINES FOR VIT FEATURE KNOWLEDGE DISTILLATION, [Paper], [Code]
(arXiv 2022.09) DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation, [Paper]
(arXiv 2022.09) SkeletonMAE: Spatial-Temporal Masked Autoencoders for Self-supervised Skeleton Action Recognition, [Paper]
(arXiv 2022.09) What does a platypus look like? Generating customized prompts for zero-shot image classification, [Paper], [Code]
(arXiv 2022.09) AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation, [Paper], [Code]
(arXiv 2022.09) MimCo: Masked Image Modeling Pre-training with Contrastive Teacher, [Paper]
(arXiv 2022.09) Multi-modal Contrastive Representation Learning for Entity Alignment, [Paper]
(arXiv 2022.09) Zero-Shot Multi-Modal Artist-Controlled Retrieval and Exploration of 3D Object Sets, [Paper]
(arXiv 2022.09) Geometry Aligned Variational Transformer for Image-conditioned Layout Generation, [Paper]
(arXiv 2022.09) Real-time 3D Single Object Tracking with Transformer, [Paper], [Code]
(arXiv 2022.09) Video-Guided Curriculum Learning for Spoken Video Grounding, [Paper], [Code]
(arXiv 2022.09) FLAME: Free-form Language-based Motion Synthesis & Editing, [Paper]
(arXiv 2022.09) TOKENCUT: SEGMENTING OBJECTS IN IMAGES AND VIDEOS WITH SELF-SUPERVISED TRANSFORMER AND NORMALIZED CUT, [Paper], [Code]
(arXiv 2022.09) Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation, [Paper]
(arXiv 2022.09) MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition, [Paper], [Project]
(arXiv 2022.09) Visual Prompting via Image Inpainting, [Paper], [Project]
(arXiv 2022.09) RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection, [Paper], [Code]

2022.08

(arXiv 2022.08) On Grounded Planning for Embodied Tasks with Language Models, [Paper], [Project]
(arXiv 2022.08) Group Activity Recognition in Basketball Tracking Data - Neural Embeddings in Team Sports (NETS), [Paper]
(arXiv 2022.08) SWIN-TRANSFORMER-YOLOV5 FOR REAL-TIME WINE GRAPE BUNCH DETECTION, [Paper]
(arXiv 2022.08) SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization, [Paper], [Code]
(arXiv 2022.08) INJECTING IMAGE DETAILS INTO CLIP’S FEATURE SPACE, [Paper]
(arXiv 2022.08) Hierarchical Local-Global Transformer for Temporal Sentence Grounding, [Paper]
(arXiv 2022.08) EViT: Privacy-Preserving Image Retrieval via Encrypted Vision Transformer in Cloud Computing, [Paper]
(arXiv 2022.08) TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers, [Paper]
(arXiv 2022.08) ELMformer: Efficient Raw Image Restoration with a Locally Multiplicative Transformer, [Paper], [Code]
(arXiv 2022.08) SoMoFormer: Multi-Person Pose Forecasting with Transformers, [Paper]
(arXiv 2022.08) A Circular Window-based Cascade Transformer for Online Action Detection, [Paper]
(arXiv 2022.08) ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer, [Paper]
(arXiv 2022.08) Robust Sound-Guided Image Manipulation, [Paper]
(arXiv 2022.08) TrojViT: Trojan Insertion in Vision Transformers, [Paper]
(arXiv 2022.08) User-Controllable Latent Transformer for StyleGAN Image Layout Editing, [Paper]
(arXiv 2022.08) Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification, [Paper]
(arXiv 2022.08) JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents, [Paper]
(arXiv 2022.08) TFusion: Transformer based N-to-One Multimodal Fusion Block, [Paper]
(arXiv 2022.08) VMFormer: End-to-End Video Matting with Transformer, [Paper], [Code]
(arXiv 2022.08) LOGICRANK: Logic Induced Reranking for Generative Text-to-Image Systems, [Paper]
(arXiv 2022.08) CLUSTR: EXPLORING EFFICIENT SELF-ATTENTION VIA CLUSTERING FOR VISION TRANSFORMERS, [Paper]
(arXiv 2022.08) Federated Zero-Shot Learning with Mid-Level Semantic Knowledge Transfer, [Paper]
(arXiv 2022.08) Prompt Tuning with Soft Context Sharing for Vision-Language Models, [Paper]
(arXiv 2022.08) Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment, [Paper], [Code]
(arXiv 2022.08) CounTR: Transformer-based Generalised Visual Counting, [Paper], [Code]
(arXiv 2022.08) Open-Set Semi-Supervised Object Detection, [Paper]
(arXiv 2022.08) gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window, [Paper]
(arXiv 2022.08) Adaptive Perception Transformer for Temporal Action Localization, [Paper], [Code]
(arXiv 2022.08) Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task, [Paper], [Code]
(arXiv 2022.08) Masked Autoencoders Enable Efficient Knowledge Distillers, [Paper], [Code]
(arXiv 2022.08) LaTeRF: Label and Text Driven Object Radiance Fields, [Paper]
(arXiv 2022.08) Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling, [Paper]
(arXiv 2022.08) Pix4Point: Image Pretrained Transformers for 3D Point Cloud Understanding, [Paper], [Code]
(arXiv 2022.08) MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining, [Paper]
(arXiv 2022.08) Visual Subtitle Feature Enhanced Video Outline Generation, [Paper], [Code]
(arXiv 2022.08) CATS: COMPLEMENTARY CNN AND TRANSFORMER ENCODERS FOR SEGMENTATION, [Paper]
(arXiv 2022.08) Modeling Paragraph-Level Vision-Language Semantic Alignment for Multi-Modal Summarization, [Paper]
(arXiv 2022.08) FashionVQA: A Domain-Specific Visual Question Answering System, [Paper]
(arXiv 2022.08) K-ORDER GRAPH-ORIENTED TRANSFORMER WITH GRAATTENTION FOR 3D POSE AND SHAPE ESTIMATION, [Paper]
(arXiv 2022.08) Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors, [Paper], [Code]
(arXiv 2022.08) Improving video retrieval using multilingual knowledge transfer, [Paper]
(arXiv 2022.08) EFFICIENT SPARSELY ACTIVATED TRANSFORMERS, [Paper]
(arXiv 2022.08) M2HF: MULTI-LEVEL MULTI-MODAL HYBRID FUSION FOR TEXT-VIDEO RETRIEVAL, [Paper]
(arXiv 2022.08) Accelerating Vision Transformer Training via a Patch Sampling Schedule, [Paper], [Project]
(arXiv 2022.08) A Dual Modality Approach For (Zero-Shot) Multi-Label Classification, [Paper]
(arXiv 2022.08) Offline Handwritten Mathematical Recognition using Adversarial Learning and Transformers, [Paper]
(arXiv 2022.08) Semantic-enhanced Image Clustering, [Paper]
(arXiv 2022.08) DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection, [Paper]
(arXiv 2022.08) ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition, [Paper], [Code]
(arXiv 2022.08) Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks, [Paper], [Project]
(arXiv 2022.08) PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling, [Paper], [Code]
(arXiv 2022.08) EFFICIENT ATTENTION-FREE VIDEO SHIFT TRANSFORMERS, [Paper]
(arXiv 2022.08) Flat Multi-modal Interaction Transformer for Named Entity Recognition, [Paper]
(arXiv 2022.08) Dance Style Transfer with Cross-modal Transformer, [Paper]
(arXiv 2022.08) Improved Image Classification with Token Fusion, [Paper]
(arXiv 2022.08) VAuLT: Augmenting the Vision-and-Language Transformer with the Propagation of Deep Language Representations, [Paper], [Code]
(arXiv 2022.08) TEXT TO IMAGE GENERATION: LEAVING NO LANGUAGE BEHIND, [Paper]
(arXiv 2022.08) Aspect-based Sentiment Classification with Sequential Cross-modal Semantic Graph, [Paper]
(arXiv 2022.08) Diverse Video Captioning by Adaptive Spatio-temporal Attention, [Paper]
(arXiv 2022.08) VLMAE: Vision-Language Masked Autoencoder, [Paper]
(arXiv 2022.08) SoMoFormer: Social-Aware Motion Transformer for Multi-Person Motion Prediction, [Paper]
(arXiv 2022.08) ILLUME: Rationalizing Vision-Language Models by Interacting with their Jabber, [Paper]
(arXiv 2022.08) ViT-ReT: Vision and Recurrent Transformer Neural Networks for Human Activity Recognition in Videos, [Paper]
(arXiv 2022.08) UniLayout: Taming Unified Sequence-to-Sequence Transformers for Graphic Layout Generation, [Paper]
(arXiv 2022.08) InterTrack: Interaction Transformer for 3D Multi-Object Tracking, [Paper]
(arXiv 2022.08) Understanding Attention for Vision-and-Language Task, [Paper]
(arXiv 2022.08) Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning, [Paper]
(arXiv 2022.08) Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model, [Paper]
(arXiv 2022.08) Unifying Visual Perception by Dispersible Points Learning, [Paper], [Code]
(arXiv 2022.08) Text-to-Image Generation via Implicit Visual Guidance and Hypernetwork, [Paper]
(arXiv 2022.08) ConMatch: Semi-Supervised Learning with Confidence-Guided Consistency Regularization, [Paper], [Code]
(arXiv 2022.08) The 8-Point Algorithm as an Inductive Bias for Relative Pose Prediction by ViTs, [Paper]
(arXiv 2022.08) Open-Vocabulary Panoptic Segmentation with MaskCLIP, [Paper]
(arXiv 2022.08) Prompt Vision Transformer for Domain Generalization, [Paper]
(arXiv 2022.08) GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement, [Paper]
(arXiv 2022.08) CONVIFORMERS: CONVOLUTIONALLY GUIDED VISION TRANSFORMER, [Paper]
(arXiv 2022.08) Learning Spatial-Frequency Transformer for Visual Object Tracking, [Paper], [Code]
(arXiv 2022.08) Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis, [Paper]
(arXiv 2022.08) Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model, [Paper], [Code]
(arXiv 2022.08) LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, [Paper]
(arXiv 2022.08) ExpansionNet v2: Block Static Expansion in fast end to end training for Image Captioning, [Paper], [Code]
(arXiv 2022.08) Multi-modal Transformer Path Prediction for Autonomous Vehicle, [Paper]
(arXiv 2022.08) Flow-Guided Transformer for Video Inpainting, [Paper], [Code]
(arXiv 2022.08) TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency, [Paper], [Project]
(arXiv 2022.08) HoW-3D: Holistic 3D Wireframe Perception from a Single Image, [Paper], [Code]
(arXiv 2022.08) BEIT V2: Masked Image Modeling with Vector-Quantized Visual Tokenizers, [Paper], [Code]
(arXiv 2022.08) MILAN: Masked Image Pretraining on Language Assisted Representation, [Paper], [Code]
(arXiv 2022.08) Hybrid Transformer Network for Deepfake Detection, [Paper]
(arXiv 2022.08) Semi-supervised Vision Transformers at Scale, [Paper]
(arXiv 2022.08) PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding, [Paper], [Code]
(arXiv 2022.08) Exploring Anchor-based Detection for Ego4D Natural Language Query, [Paper]
(arXiv 2022.08) Language Supervised Training for Skeleton-based Action Recognition, [Paper], [Code]
(arXiv 2022.08) Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]
(arXiv 2022.08) Ghost-free High Dynamic Range Imaging with Context-aware Transformer, [Paper], [Code]
(arXiv 2022.08) CLIP-based Neural Neighbor Style Transfer for 3D Assets, [Paper]
(arXiv 2022.08) Sports Video Analysis on Large-Scale Data, [Paper], [Code]
(arXiv 2022.08) How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification, [Paper]
(arXiv 2022.08) In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation, [Paper], [Code]
(arXiv 2022.08) DALLE-URBAN: Capturing the urban design expertise of large text to image transformers, [Paper], [Code]
(arXiv 2022.08) PlaneFormers: From Sparse View Planes to 3D Reconstruction, [Paper], [Code]
(arXiv 2022.08) Boosting Video-Text Retrieval with Explicit High-Level Semantics, [Paper]
(arXiv 2022.08) Distinctive Image Captioning via CLIP Guided Group Optimization, [Paper]
(arXiv 2022.08) Understanding Masked Image Modeling via Learning Occlusion Invariant Feature, [Paper]
(arXiv 2022.08) GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training, [Paper], [Code]
(arXiv 2022.08) Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model, [Paper], [Code]
(arXiv 2022.08) Domain Randomization-Enhanced Depth Simulation and Restoration for Perceiving and Grasping Specular and Transparent Objects, [Paper], [Code]
(arXiv 2022.08) Jointformer: Single-Frame Lifting Transformer with Error Prediction and Refinement for 3D Human Pose Estimation, [Paper]
(arXiv 2022.08) Frozen CLIP Models are Efficient Video Learners, [Paper], [Code]
(arXiv 2022.08) MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer, [Paper], [Code]
(arXiv 2022.08) HaloAE: An HaloNet based Local Transformer Auto-Encoder for Anomaly Detection and Localization, [Paper], [Code]
(arXiv 2022.08) IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation, [Paper]
(arXiv 2022.08) A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch, [Paper], [Code]
(arXiv 2022.08) PointConvFormer: Revenge of the Point-based Convolution, [Paper]
(arXiv 2022.08) ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding, [Paper]
(arXiv 2022.08) LaTTe: Language Trajectory TransformEr, [Paper], [Code]
(arXiv 2022.08) Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution, [Paper], [Code]
(arXiv 2022.08) TransMatting: Enhancing Transparent Objects Matting with Transformers, [Paper], [Project]
(arXiv 2022.08) Word-Level Fine-Grained Story Visualization, [Paper]
(arXiv 2022.08) Fine-Grained Semantically Aligned Vision-Language Pre-Training, [Paper]
(arXiv 2022.08) Expanding Language-Image Pretrained Models for General Video Recognition, [Paper], [Code]
(arXiv 2022.08) P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting, [Paper], [Code]
(arXiv 2022.08) DropKey, [Paper]
(arXiv 2022.08) MVSFormer: Multi-View Stereo with Pre-trained Vision Transformers and Temperature-based Depth, [Paper]
(arXiv 2022.08) Per-Clip Video Object Segmentation, [Paper]
(arXiv 2022.08) XCon: Learning with Experts for Fine-grained Category Discovery, [Paper], [Code]
(arXiv 2022.08) Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition, [Paper]
(arXiv 2022.08) RE-ATTENTION TRANSFORMER FOR WEAKLY SUPERVISED OBJECT LOCALIZATION, [Paper], [Code]
(arXiv 2022.08) TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation, [Paper]
(arXiv 2022.08) Two-Stream Transformer Architecture for Long Form Video Understanding, [Paper]
(arXiv 2022.08) A Fast Text-Driven Approach for Generating Artistic Content, [Paper]
(arXiv 2022.08) DAHITRA: DAMAGE ASSESSMENT USING A NOVEL HIERARCHICAL TRANSFORMER ARCHITECTURE, [Paper]
(arXiv 2022.08) MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training, [Paper], [Code]
(arXiv 2022.08) Masked Vision and Language Modeling for Multi-modal Representation Learning, [Paper]
(arXiv 2022.08) SSformer: A Lightweight Transformer for Semantic Segmentation, [Paper], [Code]
(arXiv 2022.08) Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer, [Paper]
(arXiv 2022.08) Making the Best of Both Worlds: A Domain-Oriented Transformer for Unsupervised Domain Adaptation, [Paper], [Code]
(arXiv 2022.08) Unified Normalization for Accelerating and Stabilizing Transformers, [Paper]
(arXiv 2022.08) An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion, [Paper], [Project]
(arXiv 2022.08) Prompt-to-Prompt Image Editing with Cross Attention Control, [Paper]
(arXiv 2022.08) Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization, [Paper]
(arXiv 2022.08) Testing Relational Understanding in Text-Guided Image Generation, [Paper]
(arXiv 2022.08) UAVM: A Unified Model for Audio-Visual Learning, [Paper]
(arXiv 2022.08) Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation, [Paper], [Code]
(arXiv 2022.08) Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding, [Paper]
(arXiv 2022.08) One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning, [Paper]
(arXiv 2022.08) Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition, [Paper], [Code]
(arXiv 2022.08) SdAE: Self-distillated Masked Autoencoder, [Paper], [Code]
(arXiv 2022.08) Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics, [Paper]
(arXiv 2022.08) STrajNet: Occupancy Flow Prediction via Multi-modal Swin Transformer, [Paper]
(arXiv 2022.08) D^3Former: Debiased Dual Distilled Transformer for Incremental Learning, [Paper], [Code]
(arXiv 2022.08) Local Perception-Aware Transformer for Aerial Tracking, [Paper], [Code]
(arXiv 2022.08) SIAMIXFORMER: A SIAMESE TRANSFORMER NETWORK FOR BUILDING DETECTION AND CHANGE DETECTION FROM BI-TEMPORAL REMOTE SENSING IMAGES, [Paper]
(arXiv 2022.08) Transformers as Meta-Learners for Implicit Neural Representations, [Paper], [Code]
(arXiv 2022.08) Video Question Answering with Iterative Video-Text Co-Tokenization, [Paper], [Code]
(arXiv 2022.08) Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem, [Paper], [Code]

2022.07

(arXiv 2022.07) Pro-tuning: Unified Prompt Tuning for Vision Tasks, [Paper]
(arXiv 2022.07) ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval, [Paper], [Code]
(arXiv 2022.07) Curriculum Learning for Data-Efficient Vision-Language Alignment, [Paper]
(arXiv 2022.07) DnSwin: Toward Real-World Denoising via Continuous Wavelet Sliding-Transformer, [Paper]
(arXiv 2022.07) Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers, [Paper], [Code]
(arXiv 2022.07) AvatarPoser: Articulated Full-Body Pose Tracking from Sparse Motion Sensing, [Paper], [Project]
(arXiv 2022.07) Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion, [Paper], [Code]
(arXiv 2022.07) Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer, [Paper], [Code]
(arXiv 2022.07) Video Mask Transfiner for High-Quality Video Instance Segmentation, [Paper], [Project]
(arXiv 2022.07) A Variational AutoEncoder for Transformers with Nonparametric Variational Information Bottleneck, [Paper]
(arXiv 2022.07) Online Continual Learning with Contrastive Vision Transformer, [Paper]
(arXiv 2022.07) Retrieval-Augmented Transformer for Image Captioning, [Paper]
(arXiv 2022.07) Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition, [Paper], [Code]
(arXiv 2022.07) Is Attention All NeRF Needs? [Paper], [Code]
(arXiv 2022.07) Convolutional Embedding Makes Hierarchical Vision Transformer Stronger, [Paper]
(arXiv 2022.07) SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding, [Paper], [Code]
(arXiv 2022.07) Deep Clustering with Features from Self-Supervised Pretraining, [Paper]
(arXiv 2022.07) Contrastive Masked Autoencoders are Stronger Vision Learners, [Paper]
(arXiv 2022.07) VICTOR: VISUAL INCOMPATIBILITY DETECTION WITH TRANSFORMERS AND FASHION-SPECIFIC CONTRASTIVE PRE-TRAINING, [Paper]
(arXiv 2022.07) Compositional Human-Scene Interaction Synthesis with Semantic Control, [Paper], [Code]
(arXiv 2022.07) Static and Dynamic Concepts for Self-supervised Video Representation Learning, [Paper]
(arXiv 2022.07) Unsupervised Domain Adaptation for Video Transformers in Action Recognition, [Paper], [Code]
(arXiv 2022.07) LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection, [Paper]
(arXiv 2022.07) TransFiner: A Full-Scale Refinement Approach for Multiple Object Tracking, [Paper]
(arXiv 2022.07) S-Prompts Learning with Pre-trained Transformers: An Occam’s Razor for Domain Incremental Learning, [Paper]
(arXiv 2022.07) WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models, [Paper], [Project]
(arXiv 2022.07) Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering, [Paper]
(arXiv 2022.07) Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds, [Paper]
(arXiv 2022.07) Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training, [Paper], [Code]
(arXiv 2022.07) V^2L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval, [Paper], [Code]
(arXiv 2022.07) NewsStories: Illustrating articles with visual summaries, [Paper], [Project]
(arXiv 2022.07) DETRs with Hybrid Matching, [Paper], [Code]
(arXiv 2022.07) GROUP DETR: FAST TRAINING CONVERGENCE WITH DECOUPLED ONE-TO-MANY LABEL ASSIGNMENT, [Paper]
(arXiv 2022.07) Improved Super Resolution of MR Images Using CNNs and Vision Transformers, [Paper]
(arXiv 2022.07) Video Swin Transformers for Egocentric Video Understanding @ Ego4D Challenges 2022, [Paper], [Code]
(arXiv 2022.07) An Impartial Take to the CNN vs Transformer Robustness Contest, [Paper]
(arXiv 2022.07) Generative Artisan: A Semantic-Aware and Controllable CLIPstyler, [Paper]
(arXiv 2022.07) MAR: Masked Autoencoders for Efficient Action Recognition, [Paper], [Code]
(arXiv 2022.07) Object State Change Classification in Egocentric Videos using the Divided Space-Time Attention Mechanism, [Paper], [Code]
(arXiv 2022.07) Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for Panoramic Semantic Segmentation, [Paper], [Code]
(arXiv 2022.07) Reference-based Image Super-Resolution with Deformable Attention Transformer, [Paper], [Code]
(arXiv 2022.07) JIGSAW-VIT: LEARNING JIGSAW PUZZLES IN VISION TRANSFORMER, [Paper], [Code]
(arXiv 2022.07) TransCL: Transformer Makes Strong and Flexible Compressive Learning, [Paper], [Code]
(arXiv 2022.07) 3D Siamese Transformer Network for Single Object Tracking on Point Clouds, [Paper], [Code]
(arXiv 2022.07) Intention-Conditioned Long-Term Human Egocentric Action Forecasting @ EGO4D Challenge 2022, [Paper], [Code]
(arXiv 2022.07) IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition, [Paper]
(arXiv 2022.07) Is GPT-3 all you need for Visual Question Answering in Cultural Heritage? [Paper]
(arXiv 2022.07) Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers, [Paper]
(arXiv 2022.07) Action Quality Assessment using Transformers, [Paper]
(arXiv 2022.07) Self-Distilled Vision Transformer for Domain Generalization, [Paper], [Code]
(arXiv 2022.07) Exploring CLIP for Assessing the Look and Feel of Images, [Paper], [Code]
(arXiv 2022.07) Transformer with Implicit Edges for Particle-based Physics Simulation, [Paper], [Code]
(arXiv 2022.07) Auto-regressive Image Synthesis with Integrated Quantization, [Paper]
(arXiv 2022.07) Efficient Modeling of Future Context for Image Captioning, [Paper], [Code]
(arXiv 2022.07) Zero-Shot Video Captioning with Evolving Pseudo-Tokens, [Paper], [Code]
(arXiv 2022.07) Panoptic Scene Graph Generation, [Paper], [Project], [Code]
(arXiv 2022.07) Facial Expression Recognition using Vanilla ViT backbones with MAE Pretraining, [Paper]
(arXiv 2022.07) Target-Driven Structured Transformer Planner for Vision-Language Navigation, [Paper]
(arXiv 2022.07) Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? [Paper]
(arXiv 2022.07) Hybrid CNN-Transformer Model For Facial Affect Recognition In the ABAW4 Challenge, [Paper]
(arXiv 2022.07) MeshMAE: Masked Autoencoders for 3D Mesh Data Analysis, [Paper]
(arXiv 2022.07) SeedFormer: Patch Seeds based Point Cloud Completion with Upsample Transformer, [Paper], [Code]
(arXiv 2022.07) LocVTP: Video-Text Pre-training for Temporal Localization, [Paper], [Code]
(arXiv 2022.07) Temporal Saliency Query Network for Efficient Video Recognition, [Paper], [Code]
(arXiv 2022.07) Pose for Everything: Towards Category-Agnostic Pose Estimation, [Paper], [Code]
(arXiv 2022.07) Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration, [Paper], [Code]
(arXiv 2022.07) An Efficient Spatio-Temporal Pyramid Transformer for Action Detection, [Paper]
(arXiv 2022.07) Towards Efficient Adversarial Training on Vision Transformers, [Paper]
(arXiv 2022.07) TinyViT: Fast Pretraining Distillation for Small Vision Transformers, [Paper], [Code]
(arXiv 2022.07) Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning, [Paper], [Code]
(arXiv 2022.07) Explicit Image Caption Editing, [Paper], [Code]
(arXiv 2022.07) AiATrack: Attention in Attention for Transformer Visual Tracking, [Paper], [Code]
(arXiv 2022.07) Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification, [Paper], [Code]
(arXiv 2022.07) Single Frame Atmospheric Turbulence Mitigation: A Benchmark Study and A New Physics-Inspired Transformer Model, [Paper], [Code]
(arXiv 2022.07) HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers, [Paper]
(arXiv 2022.07) GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features, [Paper]
(arXiv 2022.07) OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos, [Paper]
(arXiv 2022.07) FaceFormer: Scale-aware Blind Face Restoration with Transformers, [Paper]
(arXiv 2022.07) Multimodal Transformer for Automatic 3D Annotation and Object Detection, [Paper], [Code]
(arXiv 2022.07) Temporal and cross-modal attention for audio-visual zero-shot learning, [Paper], [Code]
(arXiv 2022.07) Locality Guidance for Improving Vision Transformers on Tiny Datasets, [Paper], [Code]
(arXiv 2022.07) Is an Object-Centric Video Representation Beneficial for Transfer? [Paper]
(arXiv 2022.07) DUQIM-Net: Probabilistic Object Hierarchy Representation for Multi-View Manipulation, [Paper]
(arXiv 2022.07) RELATIONAL FUTURE CAPTIONING MODEL FOR EXPLAINING LIKELY COLLISIONS IN DAILY TASKS, [Paper]
(arXiv 2022.07) Conditional DETR V2: Efficient Detection Transformer with Box Queries, [Paper]
(arXiv 2022.07) Exploiting Unlabeled Data with Vision and Language Models for Object Detection, [Paper], [Code]
(arXiv 2022.07) TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation, [Paper], [Code]
(arXiv 2022.07) Time Is MattEr: Temporal Self-supervision for Video Transformers, [Paper]
(arXiv 2022.07) IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection, [Paper]
(arXiv 2022.07) Don’t Stop Learning: Towards Continual Learning for the CLIP Model, [Paper]
(arXiv 2022.07) Action Quality Assessment with Temporal Parsing Transformer, [Paper]
(arXiv 2022.07) Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective, [Paper], [Code]
(arXiv 2022.07) Structural Prior Guided Generative Adversarial Transformers for Low-Light Image Enhancement, [Paper]
(arXiv 2022.07) TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.07) Clover: Towards A Unified Video-Language Alignment and Fusion Model, [Paper], [Code]
(arXiv 2022.07) SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery, [Paper]
(arXiv 2022.07) FashionViL: Fashion-Focused Vision-and-Language Representation Learning, [Paper], [Code]
(arXiv 2022.07) Zero-Shot Temporal Action Detection via Vision-Language Prompting, [Paper], [Code]
(arXiv 2022.07) Rethinking Alignment in Video Super-Resolution Transformers, [Paper], [Code]
(arXiv 2022.07) Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding, [Paper]
(arXiv 2022.07) TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers, [Paper], [Code]
(arXiv 2022.07) Towards the Human Global Context: Does the Vision-Language Model Really Judge Like a Human Being? [Paper]
(arXiv 2022.07) Defect Transformer: An Efficient Hybrid Transformer Architecture for Surface Defect Detection, [Paper]
(arXiv 2022.07) Semantic Novelty Detection via Relational Reasoning, [Paper]
(arXiv 2022.07) Unifying Event Detection and Captioning as Sequence Generation via Pre-Training, [Paper], [Code]
(arXiv 2022.07) Multi-manifold Attention for Vision Transformers, [Paper]
(arXiv 2022.07) UniFormer: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird’s-Eye-View, [Paper]
(arXiv 2022.07) Position Prediction as an Effective Pretraining Strategy, [Paper]
(arXiv 2022.07) Lightweight Vision Transformer with Cross Feature Attention, [Paper]
(arXiv 2022.07) Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP, [Paper], [Code]
(arXiv 2022.07) X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval, [Paper]
(arXiv 2022.07) Learning Parallax Transformer Network for Stereo Image JPEG Artifacts Removal, [Paper]
(arXiv 2022.07) A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion, [Paper]
(arXiv 2022.07) Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning, [Paper]
(arXiv 2022.07) Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models, [Paper]
(arXiv 2022.07) Cross-Attention Transformer for Video Interpolation, [Paper]
(arXiv 2022.07) Towards Multimodal Vision-Language Models Generating Non-Generic Text, [Paper]
(arXiv 2022.07) QKVA grid: Attention in Image Perspective and Stacked DETR, [Paper], [Code]
(arXiv 2022.07) Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D Pose Estimation Tracking and Forecasting on a Video Snippet, [Paper], [Code]
(arXiv 2022.07) Horizontal and Vertical Attention in Transformers, [Paper]
(arXiv 2022.07) CoMER: Modeling Coverage for Transformer-based Handwritten Mathematical Expression Recognition, [Paper], [Code]
(arXiv 2022.07) DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer, [Paper], [Code]
(arXiv 2022.07) DEPTHFORMER: MULTISCALE VISION TRANSFORMER FOR MONOCULAR DEPTH ESTIMATION WITH GLOBAL LOCAL INFORMATION FUSION, [Paper], [Code]
(arXiv 2022.07) LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval, [Paper]
(arXiv 2022.07) Dual Vision Transformer, [Paper], [Code]
(arXiv 2022.07) Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning, [Paper], [Code]
(arXiv 2022.07) Scaling Novel Object Detection with Weakly Supervised Detection Transformers, [Paper]
(arXiv 2022.07) Hunting Group Clues with Transformers for Social Group Activity Recognition, [Paper]
(arXiv 2022.07) Outpainting by Queries, [Paper], [Code]
(arXiv 2022.07) IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training, [Paper]
(arXiv 2022.07) Video Graph Transformer for Video Question Answering, [Paper], [Code]
(arXiv 2022.07) Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios, [Paper]
(arXiv 2022.07) UniNet: Unified Architecture Search with Convolution, Transformer, and MLP, [Paper], [Code]
(arXiv 2022.07) Image and Model Transformation with Secret Key for Vision Transformer, [Paper]
(arXiv 2022.07) eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation, [Paper]
(arXiv 2022.07) Compound Prototype Matching for Few-shot Action Recognition, [Paper]
(arXiv 2022.07) Long-term Leap Attention, Short-term Periodic Shift for Video Classification, [Paper], [Code]
(arXiv 2022.07) LightViT: Towards Light-Weight Convolution-Free Vision Transformers, [Paper], [Code]
(arXiv 2022.07) Learning from Label Relationships in Human Affect, [Paper]
(arXiv 2022.07) MSP-Former: Multi-Scale Projection Transformer for Single Image Desnowing, [Paper]
(arXiv 2022.07) Tell Me the Evidence? Dual Visual-Linguistic Interaction for Answer Grounding, [Paper]
(arXiv 2022.07) Vision Transformer for NeRF-Based View Synthesis from a Single Input Image, [Paper], [Code]
(arXiv 2022.07) COSIM: Commonsense Reasoning for Counterfactual Scene Imagination, [Paper], [Code]
(arXiv 2022.07) Beyond Transfer Learning: Co-finetuning for Action Localisation, [Paper]
(arXiv 2022.07) RePFormer: Refinement Pyramid Transformer for Robust Facial Landmark Detection, [Paper]
(arXiv 2022.07) k-means Mask Transformer, [Paper], [Code]
(arXiv 2022.07) Training Transformers Together, [Paper], [Code]
(arXiv 2022.07) Improving Few-Shot Image Classification Using Machine- and User-Generated Natural Language Descriptions, [Paper]
(arXiv 2022.07) MaiT: Leverage Attention Masks for More Efficient Image Transformers, [Paper]
(arXiv 2022.07) Dual-Stream Transformer for Generic Event Boundary Captioning, [Paper], [Code]
(arXiv 2022.07) Softmax-free Linear Transformers, [Paper], [Code](https://github.com/fudan-zvg/SOFT)
(arXiv 2022.07) Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection, [Paper], [Code]
(arXiv 2022.07) Transformers are Adaptable Task Planners, [Paper], [Code]
(arXiv 2022.07) ARRAY CAMERA IMAGE FUSION USING PHYSICS-AWARE TRANSFORMERS, [Paper]
(arXiv 2022.07) OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers, [Paper], [Code]
(arXiv 2022.07) Weakly Supervised Grounding for VQA in Vision-Language Transformers, [Paper], [Code]
(arXiv 2022.07) PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning, [Paper]
(arXiv 2022.07) STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding, [Paper]
(arXiv 2022.07) Towards Counterfactual Image Manipulation via CLIP, [Paper]
(arXiv 2022.07) MatFormer: A Generative Model for Procedural Materials, [Paper]
(arXiv 2022.07) Multimodal Frame-Scoring Transformer for Video Summarization, [Paper]
(arXiv 2022.07) 3D Part Assembly Generation with Instance Encoded Transformer, [Paper]
(arXiv 2022.07) Scene-Aware Prompt for Multi-modal Dialogue Understanding and Generation, [Paper]
(arXiv 2022.07) Efficient Representation Learning via Adaptive Context Pooling, [Paper]
(arXiv 2022.07) Gaze Target Estimation inspired by Interactive Attention, [Paper], [Code]
(arXiv 2022.07) Generalizable Patch-Based Neural Rendering, [Paper], [Project]
(arXiv 2022.07) Interaction Transformer for Human Reaction Generation, [Paper]
(arXiv 2022.07) TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts, [Paper], [Project]
(arXiv 2022.07) FishFormer: Annulus Slicing-based Transformer for Fisheye Rectification with Efficacy Domain Exploration, [Paper]
(arXiv 2022.07) Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer, [Paper], [Code]
(arXiv 2022.07) Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases, [Paper], [Code]
(arXiv 2022.07) Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention, [Paper]
(arXiv 2022.07) MULTI-MODAL ROBUSTNESS ANALYSIS AGAINST LANGUAGE AND VISUAL PERTURBATIONS, [Paper], [Project]
(arXiv 2022.07) CoBEVT: Cooperative Bird’s Eye View Semantic Segmentation with Sparse Transformers, [Paper]
(arXiv 2022.07) Segmenting Moving Objects via an Object-Centric Layered Representation, [Paper]
(arXiv 2022.07) Counterfactually Measuring and Eliminating Social Bias in Vision-Language Pre-training Models, [Paper]
(arXiv 2022.07) Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval, [Paper]
(arXiv 2022.07) Learning Cross-Image Object Semantic Relation in Transformer for Few-Shot Fine-Grained Image Classification, [Paper], [Code]
(arXiv 2022.07) Memory-Based Label-Text Tuning for Few-Shot Class-Incremental Learning, [Paper]
(arXiv 2022.07) Exploiting Context Information for Generic Event Boundary Captioning, [Paper], [Code]
(arXiv 2022.07) You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers, [Paper], [Code]
(arXiv 2022.07) Divert More Attention to Vision-Language Tracking, [Paper], [Code]
(arXiv 2022.07) Can Language Understand Depth? [Paper], [Code]
(arXiv 2022.07) TANet: Transformer-based Asymmetric Network for RGB-D Salient Object Detection, [Paper], [Code]
(arXiv 2022.07) DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning, [Paper]
(arXiv 2022.07) Transferring Textual Knowledge for Visual Recognition, [Paper], [Code]
(arXiv 2022.07) R^2-VOS: Robust Referring Video Object Segmentation via Relational Cycle Consistency, [Paper]
(arXiv 2022.07) CRFormer: A Cross-Region Transformer for Shadow Removal, [Paper]
(arXiv 2022.07) Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks, [Paper], [Code]
(arXiv 2022.07) Back to MLP: A Simple Baseline for Human Motion Prediction, [Paper], [Code]
(arXiv 2022.07) I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference, [Paper]
(arXiv 2022.07) Rethinking Query-Key Pairwise Interactions in Vision Transformers, [Paper]
(arXiv 2022.07) LARGE-SCALE ROBUSTNESS ANALYSIS OF VIDEO ACTION RECOGNITION MODELS, [Paper], [Code]
(arXiv 2022.07) VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations, [Paper], [Code]
(arXiv 2022.07) Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds, [Paper]
(arXiv 2022.07) MotionMixer: MLP-based 3D Human Body Pose Forecasting, [Paper], [Code]
(arXiv 2022.07) DALG: Deep Attentive Local and Global Modeling for Image Retrieval, [Paper]
(arXiv 2022.07) PolarFormer: Multi-camera 3D Object Detection with Polar Transformers, [Paper], [Code]
(arXiv 2022.07) CTrGAN: Cycle Transformers GAN for Gait Transfer, [Paper]
(arXiv 2022.07) LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action, [Paper]
(arXiv 2022.07) Bootstrapped Masked Autoencoders for Vision BERT Pretraining, [Paper], [Code]
(arXiv 2022.07) ReAct: Temporal Action Detection with Relational Queries, [Paper], [Code]
(arXiv 2022.07) Benchmarking Omni-Vision Representation through the Lens of Visual Realms, [Paper], [Project]
(arXiv 2022.07) Convolutional Bypasses Are Better Vision Transformer Adapters, [Paper]
(arXiv 2022.07) LANGUAGE MODELLING WITH PIXELS, [Paper]
(arXiv 2022.07) Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection, [Paper]
(arXiv 2022.07) DEEPFAKE VIDEO DETECTION WITH SPATIOTEMPORAL DROPOUT TRANSFORMER, [Paper]
(arXiv 2022.07) iColoriT: Towards Propagating Local Hint to the Right Region in Interactive Colorization by Leveraging Vision Transformer, [Paper]
(arXiv 2022.07) Imaging through the Atmosphere using Turbulence Mitigation Transformer, [Paper]
(arXiv 2022.07) Symmetry-Aware Transformer-based Mirror Detection, [Paper], [Code]
(arXiv 2022.07) Pyramid Transformer for Traffic Sign Detection, [Paper]
(arXiv 2022.07) Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning, [Paper], [Code]
(arXiv 2022.07) DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation, [Paper]
(arXiv 2022.07) Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers, [Paper], [Code]
(arXiv 2022.07) Entry-Flipped Transformer for Inference and Prediction of Participant Behavior, [Paper]
(arXiv 2022.07) Wayformer: Motion Forecasting via Simple & Efficient Attention Networks, [Paper]
(arXiv 2022.07) Diverse Dance Synthesis via Keyframes with Transformer Controllers, [Paper]
(arXiv 2022.07) Learning to Estimate External Forces of Human Motion in Video, [Paper]
(arXiv 2022.07) Vision Transformer for Contrastive Clustering, [Paper], [Code]
(arXiv 2022.07) Pose2Room: Understanding 3D Scenes from Human Activities, [Paper]
(arXiv 2022.07) Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection, [Paper], [Code]
(arXiv 2022.07) Cross-Architecture Knowledge Distillation, [Paper]
(arXiv 2022.07) Distance Matters in Human-Object Interaction Detection, [Paper]

2022.06

(arXiv 2022.06) TENET: Transformer Encoding Network for Effective Temporal Flow on Motion Prediction, [Paper]
(arXiv 2022.06) GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation, [Paper], [Code]
(arXiv 2022.06) GSCLIP: A Framework for Explaining Distribution Shifts in Natural Language, [Paper]
(arXiv 2022.06) Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition, [Paper]
(arXiv 2022.06) Causality for Inherently Explainable Transformers: CAT-XPLAIN, [Paper], [Code]
(arXiv 2022.06) A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA, [Paper]
(arXiv 2022.06) Continual Learning with Transformers for Image Classification, [Paper]
(arXiv 2022.06) ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition, [Paper]
(arXiv 2022.06) Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment, [Paper], [Code]
(arXiv 2022.06) Leveraging Language for Accelerated Learning of Tool Manipulation, [Paper]
(arXiv 2022.06) RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval, [Paper]
(arXiv 2022.06) VLCAP: VISION-LANGUAGE WITH CONTRASTIVE LEARNING FOR COHERENT VIDEO PARAGRAPH CAPTIONING, [Paper], [Code]
(arXiv 2022.06) Video2StyleGAN: Encoding Video in Latent Space for Manipulation, [Paper]
(arXiv 2022.06) Text-Driven Stylization of Video Objects, [Paper], [Project]
(arXiv 2022.06) Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization, [Paper], [Code]
(arXiv 2022.06) CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation, [Paper]
(arXiv 2022.06) Towards Adversarial Attack on Vision-Language Pre-training Models, [Paper]
(arXiv 2022.06) CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2022.06) VISUALIZING AND UNDERSTANDING SELF-SUPERVISED VISION LEARNING, [Paper], [Code]
(arXiv 2022.06) VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection, [Paper]
(arXiv 2022.06) Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution, [Paper]
(arXiv 2022.06) DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection, [Paper]
(arXiv 2022.06) REVECA – Rich Encoder-decoder framework for Video Event CAptioner, [Paper], [Code]
(arXiv 2022.06) SAViR-T: Spatially Attentive Visual Reasoning with Transformers, [Paper]
(arXiv 2022.06) EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm, [Paper], [Code]
(arXiv 2022.06) DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations, [Paper]
(arXiv 2022.06) Capturing and Inferring Dense Full-Body Human-Scene Contact, [Paper], [Project]
(arXiv 2022.06) M&M Mix: A Multimodal Multiview Transformer Ensemble, [Paper]
(arXiv 2022.06) DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment, [Paper]
(arXiv 2022.06) Voxel-MAE: Masked Autoencoders for Pre-training Large-scale Point Clouds, [Paper], [Code]
(arXiv 2022.06) Global Context Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Counting Varying Density Crowds Through Density Guided Adaptive Selection CNN and Transformer Estimation, [Paper]
(arXiv 2022.06) One-stage Action Detection Transformer, [Paper]
(arXiv 2022.06) SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders, [Paper]
(arXiv 2022.06) TRANSFORMER-BASED MULTI-MODAL PROPOSAL AND RE-RANK FOR WIKIPEDIA IMAGE-CAPTION MATCHING, [Paper], [Code]
(arXiv 2022.06) Vicinity Vision Transformer, [Paper], [Code]
(arXiv 2022.06) EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications, [Paper], [Code]
(arXiv 2022.06) Temporally Consistent Semantic Video Editing, [Paper]
(arXiv 2022.06) VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation, [Paper]
(arXiv 2022.06) MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge, [Paper], [Project]
(arXiv 2022.06) IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes, [Paper], [Code]
(arXiv 2022.06) Backdoor Attacks on Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Rectify ViT Shortcut Learning by Visual Saliency, [Paper]
(arXiv 2022.06) Learning Using Privileged Information for Zero-Shot Action Recognition, [Paper]
(arXiv 2022.06) Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning, [Paper], [Code]
(arXiv 2022.06) CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer, [Paper], [Project]
(arXiv 2022.06) SimA: Simple Softmax-free Attention for Vision Transformers, [Paper], [Code]
(arXiv 2022.06) UNIFIED-IO: A UNIFIED MODEL FOR VISION, LANGUAGE, AND MULTI-MODAL TASKS, [Paper], [Project]
(arXiv 2022.06) VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix, [Paper], [Code]
(arXiv 2022.06) ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022, [Paper]
(arXiv 2022.06) Video + CLIP Baseline for Ego4D Long-term Action Anticipation, [Paper], [Code]
(arXiv 2022.06) What makes domain generalization hard? [Paper]
(arXiv 2022.06) SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos, [Paper], [Code]
(arXiv 2022.06) Disentangling visual and written concepts in CLIP, [Paper], [Project]
(arXiv 2022.06) Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos, [Paper]
(arXiv 2022.06) Patch-level Representation Learning for Self-supervised Vision Transformers, [Paper]
(arXiv 2022.06) Zero-Shot Video Question Answering via Frozen Bidirectional Language Models, [Paper], [Code]
(arXiv 2022.06) OmniMAE: Single Model Masked Pretraining on Images and Videos, [Paper], [Code]
(arXiv 2022.06) Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency, [Paper], [Code]
(arXiv 2022.06) LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling, [Paper], [Code]
(arXiv 2022.06) Multimodal Event Graphs: Towards Event Centric Understanding of Multimodal World, [Paper]
(arXiv 2022.06) Rethinking Generalization in Few-Shot Classification, [Paper], [Code]
(arXiv 2022.06) VCT: A Video Compression Transformer, [Paper]
(arXiv 2022.06) Forecasting of depth and ego-motion with transformers and self-supervision, [Paper]
(arXiv 2022.06) Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone, [Paper], [Code]
(arXiv 2022.06) SP-ViT: Learning 2D Spatial Priors for Vision Transformers, [Paper]
(arXiv 2022.06) A Simple Data Mixing Prior for Improving Self-Supervised Learning, [Paper], [Code]
(arXiv 2022.06) Prefix Language Models are Unified Modal Learners, [Paper], [Code]
(arXiv 2022.06) Masked Frequency Modeling for Self-Supervised Visual Pre-Training, [Paper], [Code](https://www.mmlab-ntu.com/project/mfm/index.html)
(arXiv 2022.06) Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer, [Paper]
(arXiv 2022.06) A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training, [Paper]
(arXiv 2022.06) Learning to Estimate Shapley Values with Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Graph-based Spatial Transformer with Memory Replay for Multi-future Pedestrian Trajectory Prediction, [Paper], [Code]
(arXiv 2022.06) GLIPv2: Unifying Localization and VL Understanding, [Paper], [Code]
(arXiv 2022.06) INDIGO: Intrinsic Multimodality for Domain Generalization, [Paper]
(arXiv 2022.06) TRANSDUCTIVE CLIP WITH CLASS-CONDITIONAL CONTRASTIVE LEARNING, [Paper]
(arXiv 2022.06) SILVER-BULLET-3D AT MANISKILL 2021: LEARNING-FROM-DEMONSTRATIONS AND HEURISTIC RULE-BASED METHODS FOR OBJECT MANIPULATION, [Paper], [Code]
(arXiv 2022.06) MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing, [Paper], [Code]
(arXiv 2022.06) Visual Transformer for Object Detection, [Paper]
(arXiv 2022.06) Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens, [Paper], [Code]
(arXiv 2022.06) TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer, [Paper]
(arXiv 2022.06) ReCo: Retrieve and Co-segment for Zero-shot Transfer, [Paper], [Project]
(arXiv 2022.06) MAREO: MEMORY- AND ATTENTION- BASED VISUAL REASONING, [Paper]
(arXiv 2022.06) Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis, [Paper]
(arXiv 2022.06) Object Scene Representation Transformer, [Paper]
(arXiv 2022.06) Comprehending and Ordering Semantics for Image Captioning, [Paper], [Code]
(arXiv 2022.06) Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO, [Paper]
(arXiv 2022.06) Peripheral Vision Transformer, [Paper], [Code]
(arXiv 2022.06) Efficient Decoder-free Object Detection with Transformers, [Paper], [Code]
(arXiv 2022.06) Prototypical Contrastive Language Image Pretraining, [Paper], [Code]
(arXiv 2022.06) SpA-Former: Transformer image shadow detection and removal via spatial attention, [Paper], [Code]
(arXiv 2022.06) A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers, [Paper]
(arXiv 2022.06) Can Foundation Models Talk Causality? [Paper]
(arXiv 2022.06) Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space, [Paper], [Code]
(arXiv 2022.06) MaskViT: Masked Visual Pre-Training for Video Prediction, [Paper]
(arXiv 2022.06) PromptPose: Language Prompt Helps Animal Pose Estimation, [Paper]
(arXiv 2022.06) Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos, [Paper]
(arXiv 2022.06) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]
(arXiv 2022.06) Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation, [Paper]
(arXiv 2022.06) Position Labels for Self-Supervised Vision Transformer, [Paper]
(arXiv 2022.06) Exploring Feature Self-relation for Self-supervised Transformer, [Paper]
(arXiv 2022.06) Patch-based Object-centric Transformers for Efficient Video Generation, [Paper], [Code]
(arXiv 2022.06) Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners, [Paper], [Code]
(arXiv 2022.06) VN-Transformer: Rotation-Equivariant Attention for Vector Neurons, [Paper]
(arXiv 2022.06) CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes, [Paper], [Code]
(arXiv 2022.06) OOD Augmentation May Be at Odds with Open-Set Recognition, [Paper]
(arXiv 2022.06) Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer, [Paper]
(arXiv 2022.06) cycle text2face: cycle text-to-face gan via transformers, [Paper]
(arXiv 2022.06) Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer, [Paper], [Code]
(arXiv 2022.06) Transformer based Urdu Handwritten Text Optical Character Reader, [Paper]
(arXiv 2022.06) Spatial Entropy Regularization for Vision Transformers, [Paper]
(arXiv 2022.06) On Data Scaling in Masked Image Modeling, [Paper]
(arXiv 2022.06) Extreme Masking for Learning Instance and Distributed Visual Representations, [Paper]
(arXiv 2022.06) GateHUB: Gated History Unit with Background Suppression for Online Action Detection, [Paper]
(arXiv 2022.06) Anomaly detection in surveillance videos using transformer based attention model, [Paper], [Code]
(arXiv 2022.06) ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences, [Paper], [Code]
(arXiv 2022.06) EAANet: Efficient Attention Augmented Convolutional Networks, [Paper]
(arXiv 2022.06) Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning, [Paper]
(arXiv 2022.06) Recurrent Video Restoration Transformer with Guided Deformable Attention, [Paper], [Code]
(arXiv 2022.06) Rethinking the Openness of CLIP, [Paper]
(arXiv 2022.06) OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression, [Paper]
(arXiv 2022.06) Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval, [Paper]
(arXiv 2022.06) CONTRASTIVE GRAPH MULTIMODAL MODEL FOR TEXT CLASSIFICATION IN VIDEOS, [Paper]
(arXiv 2022.06) Separable Self-attention for Mobile Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation, [Paper], [Code]
(arXiv 2022.06) Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts, [Paper]
(arXiv 2022.06) cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation, [Paper]
(arXiv 2022.06) Masked Unsupervised Self-training for Zero-shot Image Classification, [Paper], [Code]
(arXiv 2022.06) DETR++: Taming Your Multi-Scale Detection Transformer, [Paper]
(arXiv 2022.06) Structured Context Transformer for Generic Event Boundary Detection, [Paper]
(arXiv 2022.06) Revealing Single Frame Bias for Video-and-Language Learning, [Paper], [Code]
(arXiv 2022.06) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
(arXiv 2022.06) Can CNNs Be More Robust Than Transformers? [Paper], [Code]
(arXiv 2022.06) Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding, [Paper]
(CVPR 2022) Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation, [Paper]
(arXiv 2022.06) A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge, [Paper], [Project]
(arXiv 2022.06) Revisiting the “Video” in Video-Language Understanding, [Paper], [Project]
(arXiv 2022.06) Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction, [Paper]
(arXiv 2022.06) Modeling Image Composition for Complex Scene Generation, [Paper], [Code]
(arXiv 2022.06) Unified Recurrence Modeling for Video Action Anticipation, [Paper]
(arXiv 2022.06) Prefix Conditioning Unifies Language and Label Supervision, [Paper]
(arXiv 2022.06) Optimizing Relevance Maps of Vision Transformers Improves Robustness, [Paper], [Code]
(arXiv 2022.06) VL-BEIT: Generative Vision-Language Pretraining, [Paper], [Code]
(arXiv 2022.06) EfficientFormer: Vision Transformers at MobileNet Speed, [Paper], [Code]
(arXiv 2022.06) REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering, [Paper]
(arXiv 2022.06) Siamese Image Modeling for Self-Supervised Vision Representation Learning, [Paper]
(CVPR 2022) Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection, [Paper], [Code]
(CVPR 2022) Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection, [Paper], [Code]
(CVPR 2022) Human Trajectory Prediction with Momentary Observation, [Paper]
(arXiv 2022.06) Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer, [Paper]
(arXiv 2022.06) Unifying Voxel-based Representation with Transformer for 3D Object Detection, [Paper], [Code]
(arXiv 2022.06) Extreme Floorplan Reconstruction by Structure-Hallucinating Transformer Cascades, [Paper]
(arXiv 2022.06) Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training, [Paper]
(arXiv 2022.06) VALHALLA: Visual Hallucination for Machine Translation, [Paper], [Code]
(arXiv 2022.06) Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation, [Paper]
(arXiv 2022.06) CLIP4IDC: CLIP for Image Difference Captioning, [Paper], [Code]
(arXiv 2022.06) Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment, [Paper]
(arXiv 2022.06) Vision GNN: An Image is Worth Graph of Nodes, [Paper], [Code]
(arXiv 2022.06) Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction, [Paper], [Code]
(arXiv 2022.06) TubeFormer-DeepLab: Video Mask Transformer, [Paper]
(arXiv 2022.06) Video-based Human-Object Interaction Detection from Tubelet Tokens, [Paper]

2022.05

(arXiv 2022.05) HeatER: An Efficient and Unified Network for Human Reconstruction via Heatmap-based TransformER, [Paper]
(arXiv 2022.05) Robotic grasp detection based on Transformer, [Paper]
(arXiv 2022.05) Multimodal Masked Autoencoders Learn Transferable Representations, [Paper]
(arXiv 2022.05) Multimodal Fake News Detection via CLIP-Guided Learning, [Paper]
(arXiv 2022.05) WT-MVSNet: Window-based Transformers for Multi-view Stereo, [Paper]
(arXiv 2022.05) Object-wise Masked Autoencoders for Fast Pre-training, [Paper]
(arXiv 2022.05) A Closer Look at Self-supervised Lightweight Vision Transformers, [Paper]
(arXiv 2022.05) Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning, [Paper]
(arXiv 2022.05) CYCLIP: Cyclic Contrastive Language-Image Pretraining, [Paper], [Code]
(arXiv 2022.05) MDMLP: Image Classification from Scratch on Small Datasets with MLP, [Paper], [Code]
(arXiv 2022.05) SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners, [Paper], [Code]
(arXiv 2022.05) 3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction, [Paper]
(arXiv 2022.05) Prompt-aligned Gradient for Prompt Tuning, [Paper], [Code]
(arXiv 2022.05) Illumination Adaptive Transformer, [Paper], [Code]
(arXiv 2022.05) HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling, [Paper]
(arXiv 2022.05) GMML is All you Need, [Paper], [Code]
(arXiv 2022.05) COMPLETEDT: POINT CLOUD COMPLETION WITH DENSE AUGMENT INFERENCE TRANSFORMERS, [Paper]
(arXiv 2022.05) Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks, [Paper]
(arXiv 2022.05) VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models, [Paper], [Benchmark], [Code]
(arXiv 2022.05) Architecture-Agnostic Masked Image Modeling – From ViT back to CNN, [Paper]
(arXiv 2022.05) Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation, [Paper], [Code]
(arXiv 2022.05) GIT: A Generative Image-to-text Transformer for Vision and Language, [Paper]
(arXiv 2022.05) 3DILG: Irregular Latent Grids for 3D Generative Modeling, [Paper]
(arXiv 2022.05) Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos, [Paper], [Code]
(arXiv 2022.05) Future Transformer for Long-term Action Anticipation, [Paper], [Project]
(arXiv 2022.05) X-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
(arXiv 2022.05) Knowledge Distillation via the Target-aware Transformer, [Paper]
(arXiv 2022.05) Dynamic Query Selection for Fast Visual Perceiver, [Paper]
(arXiv 2022.05) MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers, [Paper]
(arXiv 2022.05) PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models, [Paper], [Code]
(arXiv 2022.05) Supporting Vision-Language Model Inference with Causality-pruning Knowledge Prompt, [Paper]
(arXiv 2022.05) Super Vision Transformer, [Paper], [Code]
(arXiv 2022.05) mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections, [Paper]
(arXiv 2022.05) VQA-GNN: Reasoning with Multimodal Semantic Graph for Visual Question Answering, [Paper]
(arXiv 2022.05) UMSNet: An Universal Multi-sensor Network for Human Activity Recognition, [Paper]
(arXiv 2022.05) Privacy-Preserving Image Classification Using Vision Transformer, [Paper]
(arXiv 2022.05) HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval, [Paper]
(arXiv 2022.05) ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions, [Paper], [Code]
(arXiv 2022.05) HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory Prediction via Scene Encoding, [Paper]
(arXiv 2022.05) Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning, [Paper]
(arXiv 2022.05) Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging, [Paper]
(arXiv 2022.05) Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality, [Paper], [Code]
(arXiv 2022.05) Visual Concepts Tokenization, [Paper]
(arXiv 2022.05) MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion, [Paper]
(arXiv 2022.05) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, [Paper], [Code]
(arXiv 2022.05) Evidence for Hypodescent in Visual Semantic AI, [Paper]
(arXiv 2022.05) Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer, [Paper], [Code]
(arXiv 2022.05) muNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-tuning Multitask Systems, [Paper]
(arXiv 2022.05) Large Language Models are Zero-Shot Reasoners, [Paper]
(arXiv 2022.05) AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition, [Paper], [Code]
(arXiv 2022.05) Green Hierarchical Vision Transformer for Masked Image Modeling, [Paper], [Code]
(arXiv 2022.05) Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation, [Paper]
(arXiv 2022.05) Cross-Architecture Self-supervised Video Representation Learning, [Paper], [Code]
(arXiv 2022.05) Prompt-based Learning for Unpaired Image Captioning, [Paper]
(arXiv 2022.05) MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning, [Paper], [Code]
(arXiv 2022.05) Fast Vision Transformers with HiLo Attention, [Paper], [Code]
(arXiv 2022.05) Fine-grained Image Captioning with CLIP Reward, [Paper], [Code]
(arXiv 2022.05) Mutual Information Divergence: A Unified Metric for Multimodal Generative Models, [Paper]
(arXiv 2022.05) MoCoViT: Mobile Convolutional Vision Transformer, [Paper]
(arXiv 2022.05) AO2-DETR: Arbitrary-Oriented Object Detection Transformer, [Paper]
(arXiv 2022.05) Inception Transformer, [Paper], [Code]
(arXiv 2022.05) VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation, [Paper]
(arXiv 2022.05) UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, [Paper]
(arXiv 2022.05) Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners, [Paper], [Code]
(arXiv 2022.05) Training Vision-Language Transformers from Captions Alone, [Paper], [Code]
(arXiv 2022.05) Voxel-informed Language Grounding, [Paper], [Code]
(arXiv 2022.05) Cross-Enhancement Transformer for Action Segmentation, [Paper]
(arXiv 2022.05) TRT-ViT: TensorRT-oriented Vision Transformer, [Paper]
(arXiv 2022.05) Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection, [Paper]
(arXiv 2022.05) A graph-transformer for whole slide image classification, [Paper]
(arXiv 2022.05) VNT-Net: Rotational Invariant Vector Neuron Transformers, [Paper]
(arXiv 2022.05) Masked Image Modeling with Denoising Contrast, [Paper]
(arXiv 2022.05) Cross-subject Action Unit Detection with Meta Learning and Transformer-based Relation Modeling, [Paper]
(arXiv 2022.05) Masked Autoencoders As Spatiotemporal Learners, [Paper]
(arXiv 2022.05) BodyMap: Learning Full-Body Dense Correspondence Map, [Paper], [Code]
(arXiv 2022.05) Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers, [Paper]
(arXiv 2022.05) AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars, [Paper]
(arXiv 2022.05) Vision Transformer Adapter for Dense Predictions, [Paper], [Code]
(arXiv 2022.05) Demo: Real-Time Semantic Communications with a Vision Transformer, [Paper]
(arXiv 2022.05) MulT: An End-to-End Multitask Learning Transformer, [Paper], [Code]
(arXiv 2022.05) A CLIP-Hitchhiker’s Guide to Long Video Retrieval, [Paper]
(arXiv 2022.05) Video Frame Interpolation with Transformer, [Paper], [Code]
(arXiv 2022.05) Dense residual Transformer for Image Denoising, [Paper]
(arXiv 2022.05) Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT, [Paper]
(arXiv 2022.05) Robot Cooking with Stir-fry: Bimanual Non-prehensile Manipulation of Semi-fluid Objects, [Paper]
(arXiv 2022.05) Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos, [Paper], [Code]
(arXiv 2022.05) Learning to Retrieve Videos by Asking Questions, [Paper]
(arXiv 2022.05) One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code, [Paper]
(arXiv 2022.05) Simple Open-Vocabulary Object Detection with Vision Transformers, [Paper], [Code]
(arXiv 2022.05) AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation, [Paper], [Code]
(arXiv 2022.05) An Empirical Study of Self-supervised Learning Approaches for Object Detection with Transformers, [Paper], [Code-DETR], [Code-Deform-DETR]
(arXiv 2022.05) Reduce Information Loss in Transformers for Pluralistic Image Inpainting, [Paper], [Code]
(arXiv 2022.05) Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training, [Paper]
(arXiv 2022.05) Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild, [Paper]
(arXiv 2022.05) Generalizable Task Planning through Representation Pretraining, [Paper], [Project]
(arXiv 2022.05) EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers, [Paper]
(arXiv 2022.05) Activating More Pixels in Image Super-Resolution Transformer, [Paper], [Code]
(arXiv 2022.05) Row-wise Accelerator for Vision Transformer, [Paper]
(arXiv 2022.05) SparseTT: Visual Tracking with Sparse Transformers, [Paper], [Code]
(arXiv 2022.05) RoViST: Learning Robust Metrics for Visual Storytelling, [Paper], [Code]
(arXiv 2022.05) Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection, [Paper]
(arXiv 2022.05) Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering, [Paper]
(arXiv 2022.05) Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning, [Paper]
(arXiv 2022.05) ConvMAE: Masked Convolution Meets Masked Autoencoders, [Paper], [Code]
(arXiv 2022.05) Cross-lingual Adaptation for Recipe Retrieval with Mixup, [Paper]
(arXiv 2022.05) Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework, [Paper]
(arXiv 2022.05) Transformer Tracking with Cyclic Shifting Window Attention, [Paper], [Code]
(arXiv 2022.05) Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning, [Paper]
(arXiv 2022.05) Prompt Distribution Learning, [Paper]
(arXiv 2022.05) CLIP-CLOP: CLIP-Guided Collage and Photomontage, [Paper]
(arXiv 2022.05) Dual-Level Decoupled Transformer for Video Captioning, [Paper]
(arXiv 2022.05) Declaration-based Prompt Tuning for Visual Question Answering, [Paper], [Code]
(arXiv 2022.05) P^3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision, [Paper]
(arXiv 2022.05) Language Models Can See: Plugging Visual Controls in Text Generation, [Paper], [Code]
(arXiv 2022.05) YOLOPose: Transformer-based Multi-Object 6D Pose Estimation using Keypoint Regression, [Paper]
(arXiv 2022.05) Cross-view Transformers for real-time Map-view Semantic Segmentation, [Paper], [Code]
(arXiv 2022.05) i-Code: An Integrative and Composable Multimodal Learning Framework, [Paper]
(arXiv 2022.05) Visual Commonsense in Pretrained Unimodal and Multimodal Models, [Paper], [Project]
(arXiv 2022.05) Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification, [Paper]
(arXiv 2022.05) RecipeSnap - a lightweight image to recipe model, [Paper], [Code]
(arXiv 2022.05) CoCa: Contrastive Captioners are Image-Text Foundation Models, [Paper]
(arXiv 2022.05) Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP), [Paper]
(arXiv 2022.05) Cross-modal Representation Learning for Zero-shot Action Recognition, [Paper], [Code]
(arXiv 2022.05) Cross-Domain Object Detection with Mean-Teacher Transformer, [Paper]
(arXiv 2022.05) Better plain ViT baselines for ImageNet-1k, [Paper], [Code]
(arXiv 2022.05) Reinforced Swin-Convs Transformer for Underwater Image Enhancement, [Paper]
(arXiv 2022.05) UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog, [Paper]
(arXiv 2022.05) Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering, [Paper]
(arXiv 2022.05) CenterCLIP: Token Clustering for Efficient Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.05) Arbitrary Shape Text Detection via Boundary Transformer, [Paper], [Code]
(arXiv 2022.05) HULC: 3D Human Motion Capture with Pose Manifold Sampling and Dense Contact Guidance, [Paper], [Project]

2022.04

(arXiv 2022.04) Learn to Understand Negation in Video Retrieval, [Paper]
(arXiv 2022.04) LayoutBERT: Masked Language Layout Model for Object Insertion, [Paper]
(arXiv 2022.04) Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, [Paper], [Code]
(arXiv 2022.04) Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer, [Paper]
(arXiv 2022.04) SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation, [Paper]
(arXiv 2022.04) Where in the World is this Image? Transformer-based Geo-localization in the Wild, [Paper]
(arXiv 2022.04) Depth Estimation with Simplified Transformer, [Paper]
(arXiv 2022.04) A very preliminary analysis of DALL-E 2, [Paper]
(arXiv 2022.04) CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, [Paper], [Code]
(arXiv 2022.04) CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification, [Paper], [Code]
(arXiv 2022.04) TEMOS: Generating diverse human motions from textual descriptions, [Paper], [Project]
(arXiv 2022.04) PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining, [Paper]
(arXiv 2022.04) Symmetric Transformer-based Network for Unsupervised Image Registration, [Paper], [Code]
(arXiv 2022.04) Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos, [Paper], [Code]
(arXiv 2022.04) CapOnImage: Context-driven Dense-Captioning on Image, [Paper]
(arXiv 2022.04) Self-Supervised Learning of Object Parts for Semantic Segmentation, [Paper], [Code]
(arXiv 2022.04) DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers, [Paper]
(arXiv 2022.04) CATrans: Context and Affinity Transformer for Few-Shot Segmentation, [Paper]
(arXiv 2022.04) Self-Driving Car Steering Angle Prediction: Let Transformer Be a Car Again, [Paper], [Code]
(arXiv 2022.04) ClothFormer: Taming Video Virtual Try-on in All Module, [Paper]
(arXiv 2022.04) Deeper Insights into ViTs Robustness towards Common Corruptions, [Paper]
(arXiv 2022.04) ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation, [Paper], [Code]
(arXiv 2022.04) Understanding The Robustness in Vision Transformers, [Paper], [Code]
(arXiv 2022.04) MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval, [Paper]
(arXiv 2022.04) Contrastive Language-Action Pre-training for Temporal Localization, [Paper]
(arXiv 2022.04) Boosting Adversarial Transferability of MLP-Mixer, [Paper]
(arXiv 2022.04) Adaptive Split-Fusion Transformer, [Paper], [Code]
(arXiv 2022.04) Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?, [Paper], [Project]
(arXiv 2022.04) RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning, [Paper]
(arXiv 2022.04) VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic Retail Checkout, [Paper], [Code]
(arXiv 2022.04) CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks, [Paper]
(arXiv 2022.04) Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers, [Paper]
(arXiv 2022.04) SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images, [Paper], [Code]
(arXiv 2022.04) OCFormer: One-Class Transformer Network for Image Classification, [Paper]
(arXiv 2022.04) DRT: A Lightweight Single Image Deraining Recursive Transformer, [Paper], [Code]
(arXiv 2022.04) Hypergraph Transformer: Weakly-Supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering, [Paper], [Code]
(arXiv 2022.04) ParkPredict+: Multimodal Intent and Motion Prediction for Vehicles in Parking Lots with CNN and Transformer, [Paper]
(arXiv 2022.04) iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition, [Paper], [Code]
(arXiv 2022.04) Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition, [Paper]
(arXiv 2022.04) Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds, [Paper], [Code]
(arXiv 2022.04) DFAM-DETR: Deformable feature based attention mechanism DETR on slender object detection, [Paper]
(arXiv 2022.04) NFormer: Robust Person Re-identification with Neighbor Transformer, [Paper], [Code]
(arXiv 2022.04) Video Moment Retrieval from Text Queries via Single Frame Annotation, [Paper]
(arXiv 2022.04) GIMO: Gaze-Informed Human Motion Prediction in Context, [Paper]
(arXiv 2022.04) VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance, [Paper]
(arXiv 2022.04) Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments, [Paper]
(arXiv 2022.04) Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer, [Paper], [Code]
(arXiv 2022.04) Multimodal Token Fusion for Vision Transformers, [Paper]
(arXiv 2022.04) Self-Calibrated Efficient Transformer for Lightweight Super-Resolution, [Paper], [Code]
(arXiv 2022.04) Searching Intrinsic Dimensions of Vision Transformers, [Paper]
(arXiv 2022.04) Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks, [Paper]
(arXiv 2022.04) Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting, [Paper]
(arXiv 2022.04) Multi-Frame Self-Supervised Depth with Transformers, [Paper], [Code]
(arXiv 2022.04) MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction, [Paper], [Code]
(arXiv 2022.04) Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis, [Paper], [Code]
(arXiv 2022.04) An Extendable, Efficient and Effective Transformer-based Object Detector, [Paper], [Code]
(arXiv 2022.04) VDTR: Video Deblurring with Transformer, [Paper], [Code]
(arXiv 2022.04) BSRT: Improving Burst Super-Resolution with Swin Transformer and Flow-Guided Deformable Alignment, [Paper], [Code]
(arXiv 2022.04) Temporally Efficient Vision Transformer for Video Instance Segmentation, [Paper], [Code]
(arXiv 2022.04) VSA: Learning Varied-Size Window Attention in Vision Transformers, [Paper], [Code]
(arXiv 2022.04) XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding, [Paper]
(arXiv 2022.04) Improving Cross-Modal Understanding in Visual Dialog via Contrastive Learning, [Paper]
(arXiv 2022.04) MVSTER: Epipolar Transformer for Efficient Multi-View Stereo, [Paper], [Code]
(arXiv 2022.04) Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer, [Paper]
(arXiv 2022.04) Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference, [Paper]
(arXiv 2022.04) COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval, [Paper]
(arXiv 2022.04) Image Captioning In the Transformer Age, [Paper], [Code]
(arXiv 2022.04) ResT V2: Simpler, Faster and Stronger, [Paper], [Code]
(arXiv 2022.04) Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer, [Paper], [Code]
(arXiv 2022.04) Temporal Progressive Attention for Early Action Prediction, [Paper], [Code]
(arXiv 2022.04) Keep the Caption Information: Preventing Shortcut Learning in Contrastive Image-Caption Retrieval, [Paper]
(arXiv 2022.04) Flamingo: a Visual Language Model for Few-Shot Learning, [Paper]
(arXiv 2022.04) Unsupervised Human Action Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints Invariance, [Paper], [Code]
(arXiv 2022.04) Learning Future Object Prediction with a Spatiotemporal Detection Transformer, [Paper]
(arXiv 2022.04) R^2-Trans: Fine-Grained Visual Categorization with Redundancy Reduction, [Paper], [Code]
(arXiv 2022.04) A New Dataset and Transformer for Stereoscopic Video Super-Resolution, [Paper], [Code]
(arXiv 2022.04) Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization, [Paper]
(arXiv 2022.04) Multi-Scale Features and Parallel Transformers Based Image Quality Assessment, [Paper], [Code]
(arXiv 2022.04) BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training, [Paper]
(arXiv 2022.04) Human-Object Interaction Detection via Disentangled Transformer, [Paper]
(arXiv 2022.04) ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models, [Paper]
(arXiv 2022.04) Interactiveness Field in Human-Object Interactions, [Paper], [Code]
(arXiv 2022.04) DeiT III: Revenge of the ViT, [Paper]
(arXiv 2022.04) Residual Swin Transformer Channel Attention Network for Image Demosaicing, [Paper]
(arXiv 2022.04) Neighborhood Attention Transformer, [Paper], [Code]
(arXiv 2022.04) MiniViT: Compressing Vision Transformers with Weight Multiplexing, [Paper], [Code]
(arXiv 2022.04) ViTOL: Vision Transformer for Weakly Supervised Object Localization, [Paper], [Code]
(arXiv 2022.04) What Matters in Language Conditioned Robotic Imitation Learning, [Paper], [Code]
(arXiv 2022.04) Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes, [Paper]
(arXiv 2022.04) ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension, [Paper]
(arXiv 2022.04) Are Multimodal Transformers Robust to Missing Modality?, [Paper]
(arXiv 2022.04) TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation, [Paper], [Code]
(arXiv 2022.04) X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks, [Paper]
(arXiv 2022.04) Event Transformer, [Paper]
(arXiv 2022.04) Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels, [Paper]
(arXiv 2022.04) ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation, [Paper], [Code]
(arXiv 2022.04) Multimodal Transformer for Nursing Activity Recognition, [Paper], [Code]
(arXiv 2022.04) Robust Cross-Modal Representation Learning with Progressive Self-Distillation, [Paper]
(arXiv 2022.04) Stripformer: Strip Transformer for Fast Image Deblurring, [Paper]
(arXiv 2022.04) No Token Left Behind: Explainability-Aided Image Classification and Generation, [Paper]
(arXiv 2022.04) Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition, [Paper], [Code]
(arXiv 2022.04) Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation, [Paper], [Code]
(arXiv 2022.04) DILEMMA: Self-Supervised Shape and Texture Learning with Transformers, [Paper]
(arXiv 2022.04) Learning Trajectory-Aware Transformer for Video Super-Resolution, [Paper], [Code]
(arXiv 2022.04) Learning to Induce Causal Structure, [Paper]
(arXiv 2022.04) Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection, [Paper], [Code]
(arXiv 2022.04) Category-Aware Transformer Network for Better Human-Object Interaction Detection, [Paper]
(arXiv 2022.04) Does Robustness on ImageNet Transfer to Downstream Tasks?, [Paper]
(arXiv 2022.04) POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition, [Paper], [Code]
(arXiv 2022.04) Vision Transformers for Single Image Dehazing, [Paper], [Code]
(arXiv 2022.04) Underwater Image Enhancement Using Pre-trained Transformer, [Paper]
(arXiv 2022.04) Event Transformer. A sparse-aware solution for efficient event data processing, [Paper], [Code]
(arXiv 2022.04) PSTR: End-to-End One-Step Person Search With Transformers, [Paper], [Code]
(arXiv 2022.04) Adapting CLIP For Phrase Localization Without Further Training, [Paper], [Code]
(arXiv 2022.04) FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment, [Paper], [Project]
(arXiv 2022.04) DaViT: Dual Attention Vision Transformers, [Paper], [Code]
(arXiv 2022.04) Unsupervised Prompt Learning for Vision-Language Models, [Paper], [Code]
(arXiv 2022.04) Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer, [Paper], [Project]
(arXiv 2022.04) Unified Contrastive Learning in Image-Text-Label Space, [Paper], [Code]
(arXiv 2022.04) HunYuan_tvr for Text-Video Retrivial, [Paper]
(arXiv 2022.04) Learning to Compose Soft Prompts for Compositional Zero-Shot Learning, [Paper]
(arXiv 2022.04) End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation, [Paper], [Code]
(arXiv 2022.04) Temporal Alignment Networks for Long-term Video, [Paper], [Code]
(arXiv 2022.04) Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection, [Paper], [Code]
(arXiv 2022.04) MixFormer: Mixing Features across Windows and Dimensions, [Paper], [Code]
(arXiv 2022.04) CM3: A Causal Masked Multimodal Model of the Internet, [Paper]
(arXiv 2022.04) Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, [Paper], [Project]
(arXiv 2022.04) TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization, [Paper], [Code]
(arXiv 2022.04) Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, [Paper], [Project]
(arXiv 2022.04) Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition, [Paper]
(arXiv 2022.04) Learning Audio-Video Modalities from Image Captions, [Paper]
(arXiv 2022.04) Improving Vision Transformers by Revisiting High-frequency Components, [Paper]
(arXiv 2022.04) POS-BERT: Point Cloud One-Stage BERT Pre-Training, [Paper], [Code]
(arXiv 2022.04) BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation, [Paper], [Code]
(arXiv 2022.04) BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning, [Paper]
(arXiv 2022.04) TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting, [Paper]
(arXiv 2022.04) Long Movie Clip Classification with State-Space Video Models, [Paper], [Code]
(arXiv 2022.04) TALLFormer: Temporal Action Localization with Long-memory Transformer, [Paper], [Code]
(arXiv 2022.04) MultiMAE: Multi-modal Multi-task Masked Autoencoders, [Paper], [Project]
(arXiv 2022.04) “This is my unicorn, Fluffy”: Personalizing frozen vision-language representations, [Paper]
(arXiv 2022.04) SE(3)-Equivariant Attention Networks for Shape Reconstruction in Function Space, [Paper]
(arXiv 2022.04) Multi-View Transformer for 3D Visual Grounding, [Paper], [Code]
(arXiv 2022.04) Vision Transformer Equipped with Neural Resizer on Facial Expression Recognition Task, [Paper]
(arXiv 2022.04) Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition, [Paper], [Project]
(arXiv 2022.04) Detector-Free Weakly Supervised Group Activity Recognition, [Paper]
(arXiv 2022.04) Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos, [Paper], [Project]
(arXiv 2022.04) What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions, [Paper]
(arXiv 2022.04) MaxViT: Multi-Axis Vision Transformer, [Paper]

2022.03

(arXiv 2022.03) A ConvNet for the 2020s, [Paper], [Code]
(arXiv 2022.03) DeepNet: Scaling Transformers to 1,000 Layers, [Paper]
(arXiv 2022.03) Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation, [Paper]
(arXiv 2022.03) ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval, [Paper]
(arXiv 2022.03) ReSTR: Convolution-free Referring Image Segmentation Using Transformers, [Paper], [Project]
(arXiv 2022.03) CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation, [Paper]
(arXiv 2022.03) Deformable Video Transformer, [Paper]
(arXiv 2022.03) End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps, [Paper]
(arXiv 2022.03) CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow, [Paper], [Code]
(arXiv 2022.03) VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, [Paper], [App]
(arXiv 2022.03) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing, [Paper], [Code]
(arXiv 2022.03) BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, [Paper], [Code]
(arXiv 2022.03) Visual Prompting: Modifying Pixel Space to Adapt Pre-trained Models, [Paper], [Code]
(arXiv 2022.03) Bringing Old Films Back to Life, [Paper], [Code]
(arXiv 2022.03) Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model, [Paper], [Code]
(arXiv 2022.03) SeqTR: A Simple yet Universal Network for Visual Grounding, [Paper], [Code]
(arXiv 2022.03) InstaFormer: Instance-Aware Image-to-Image Translation with Transformer, [Paper]
(arXiv 2022.03) Omni-DETR: Omni-Supervised Object Detection with Transformers, [Paper], [Code]
(arXiv 2022.03) Learning Program Representations for Food Images and Cooking Recipes, [Paper], [Project]
(arXiv 2022.03) ITTR: Unpaired Image-to-Image Translation with Transformers, [Paper]
(arXiv 2022.03) VPTR: Efficient Transformers for Video Prediction, [Paper], [Code]
(arXiv 2022.03) Parameter-efficient Fine-tuning for Vision Transformers, [Paper]
(arXiv 2022.03) TubeDETR: Spatio-Temporal Video Grounding with Transformers, [Paper], [Code]
(arXiv 2022.03) Exploring Plain Vision Transformer Backbones for Object Detection, [Paper]
(arXiv 2022.03) PromptDet: Expand Your Detector Vocabulary with Uncurated Images, [Paper], [Code]
(arXiv 2022.03) Few-Shot Object Detection with Fully Cross-Transformer, [Paper]
(arXiv 2022.03) Unified Transformer Tracker for Object Tracking, [Paper]
(arXiv 2022.03) X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.03) Fine-tuning Image Transformers using Learnable Memory, [Paper]
(arXiv 2022.03) MAT: Mask-Aware Transformer for Large Hole Image Inpainting, [Paper], [Code]
(arXiv 2022.03) mc-BEiT: Multi-choice Discretization for Image BERT Pre-training, [Paper]
(arXiv 2022.03) End-to-End Transformer Based Model for Image Captioning, [Paper]
(arXiv 2022.03) Hybrid Routing Transformer for Zero-Shot Learning, [Paper]
(arXiv 2022.03) Treatment Learning Transformer for Noisy Image Classification, [Paper]
(arXiv 2022.03) Do Vision-Language Pretrained Models Learn Primitive Concepts?, [Paper]
(arXiv 2022.03) Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs, [Paper]
(arXiv 2022.03) SepViT: Separable Vision Transformer, [Paper]
(arXiv 2022.03) MatteFormer: Transformer-Based Image Matting via Prior-Tokens, [Paper], [Code]
(arXiv 2022.03) Feature Selective Transformer for Semantic Image Segmentation, [Paper]
(arXiv 2022.03) Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos, [Paper], [Code]
(arXiv 2022.03) RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution, [Paper], [Code]
(arXiv 2022.03) Single-Stream Multi-Level Alignment for Vision-Language Pretraining, [Paper]
(arXiv 2022.03) Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers, [Paper], [Code]
(arXiv 2022.03) Collaborative Transformers for Grounded Situation Recognition, [Paper], [Code]
(arXiv 2022.03) Object Memory Transformer for Object Goal Navigation, [Paper]
(arXiv 2022.03) Brain-inspired Multilayer Perceptron with Spiking Neurons, [Paper], [Code]
(arXiv 2022.03) HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network, [Paper], [Code]
(arXiv 2022.03) REGTR: End-to-end Point Cloud Correspondences with Transformers, [Paper], [Code]
(arXiv 2022.03) Automated Progressive Learning for Efficient Training of Vision Transformers, [Paper]
(arXiv 2022.03) Stratified Transformer for 3D Point Cloud Segmentation, [Paper], [Code]
(arXiv 2022.03) NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge, [Paper]
(arXiv 2022.03) Facial Expression Recognition with Swin Transformer, [Paper]
(arXiv 2022.03) Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness, [Paper]
(arXiv 2022.03) Efficient Visual Tracking via Hierarchical Cross-Attention Transformer, [Paper], [Code]
(arXiv 2022.03) High-Performance Transformer Tracking, [Paper], [Code]
(arXiv 2022.03) RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers, [Paper]
(arXiv 2022.03) Multi-modal Multi-label Facial Action Unit Detection with Transformer, [Paper]
(arXiv 2022.03) MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection, [Paper], [Code]
(arXiv 2022.03) Text to Mesh Without 3D Supervision Using Limit Subdivision, [Paper], [Project]
(arXiv 2022.03) GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection, [Paper], [Code]
(arXiv 2022.03) CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation, [Paper]
(arXiv 2022.03) FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks, [Paper], [Code]
(arXiv 2022.03) Vision Transformer Compression with Structured Pruning and Low Rank Approximation, [Paper]
(arXiv 2022.03) Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers, [Paper]
(arXiv 2022.03) MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection, [Paper]
(arXiv 2022.03) Learning Patch-to-Cluster Attention in Vision Transformer, [Paper]
(arXiv 2022.03) Visual Prompt Tuning, [Paper]
(arXiv 2022.03) Training-free Transformer Architecture Search, [Paper]
(arXiv 2022.03) VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, [Paper], [Code]
(arXiv 2022.03) METAMORPH: LEARNING UNIVERSAL CONTROLLERS WITH TRANSFORMERS, [Paper], [Project]
(arXiv 2022.03) A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning, [Paper]
(arXiv 2022.03) Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers, [Paper], [Project]
(arXiv 2022.03) Associating Objects with Scalable Transformers for Video Object Segmentation, [Paper], [[Project]](https://github.com/z-x-yang/AOT)
(arXiv 2022.03) HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation, [Paper], [Code]
(arXiv 2022.03) Learning to generate line drawings that convey geometry and semantics, [Paper], [Project]
(arXiv 2022.03) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection, [Paper], [Code]
(arXiv 2022.03) AIMusicGuru: Music Assisted Human Pose Correction, [Paper]
(arXiv 2022.03) What to Hide from Your Students: Attention-Guided Masked Image Modeling, [Paper]
(arXiv 2022.03) Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer, [Paper]
(arXiv 2022.03) ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator, [Paper]
(arXiv 2022.03) Keypoints Tracking via Transformer Networks, [Paper], [Code]
(arXiv 2022.03) Beyond Fixation: Dynamic Window Visual Transformer, [Paper], [Code]
(arXiv 2022.03) Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors, [Paper]
(arXiv 2022.03) Self-supervised Video-centralised Transformer for Video Face Clustering, [Paper]
(arXiv 2022.03) Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization, [Paper]
(arXiv 2022.03) Global Tracking Transformers, [Paper], [Code]
(arXiv 2022.03) Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer, [Paper], [Code]
(arXiv 2022.03) QS-Craft: Learning to Quantize, Scrabble and Craft for Conditional Human Motion Animation, [Paper]
(arXiv 2022.03) Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos, [Paper], [Project]
(arXiv 2022.03) GradViT: Gradient Inversion of Vision Transformers, [Paper], [Code]
(arXiv 2022.03) Mask Usage Recognition using Vision Transformer with Transfer Learning and Data Augmentation, [Paper]
(arXiv 2022.03) Under the Hood of Transformer Networks for Trajectory Forecasting, [Paper]
(arXiv 2022.03) Open-Vocabulary DETR with Conditional Matching, [Paper]
(arXiv 2022.03) Meta-attention for ViT-backed Continual Learning, [Paper], [Code]
(arXiv 2022.03) CNNs and Transformers Perceive Hybrid Images Similar to Humans, [Paper], [Code]
(arXiv 2022.03) Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory, [Paper], [Code]
(arXiv 2022.03) Affective Feedback Synthesis Towards Multimodal Text and Image Data, [Paper]
(arXiv 2022.03) ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers, [Paper]
(arXiv 2022.03) CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration, [Paper]
(arXiv 2022.03) Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds, [Paper], [Code]
(arXiv 2022.03) HIPA: Hierarchical Patch Transformer for Single Image Super Resolution, [Paper]
(arXiv 2022.03) DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition, [Paper], [Code]
(arXiv 2022.03) MixFormer: End-to-End Tracking with Iterative Mixed Attention, [Paper], [Code]
(arXiv 2022.03) PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark, [Paper], [Code]
(arXiv 2022.03) Relationformer: A Unified Framework for Image-to-Graph Generation, [Paper], [Code]
(arXiv 2022.03) CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning, [Paper], [Code]
(arXiv 2022.03) Hyperbolic Vision Transformers: Combining Improvements in Metric Learning, [Paper], [Code]
(arXiv 2022.03) MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer, [Paper], [Code]
(arXiv 2022.03) Transformer-based HTR for Historical Documents, [Paper]
(arXiv 2022.03) simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers, [Paper], [Code]
(arXiv 2022.03) End-to-End Human-Gaze-Target Detection with Transformers, [Paper]
(arXiv 2022.03) End-to-End Video Text Spotting with Transformer, [Paper], [Code]
(arXiv 2022.03) Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation, [Paper], [Code]
(arXiv 2022.03) V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer, [Paper]
(arXiv 2022.03) LocATe: End-to-end Localization of Actions in 3D with Transformers, [Paper]
(arXiv 2022.03) AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder, [Paper]
(arXiv 2022.03) ViM: Out-Of-Distribution with Virtual-logit Matching, [Paper], [Code]
(arXiv 2022.03) ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer, [Paper]
(arXiv 2022.03) Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows, [Paper]
(arXiv 2022.03) Vision Transformer with Convolutions Architecture Search, [Paper]
(arXiv 2022.03) Cascade Transformers for End-to-End Person Search, [Paper], [Code]
(arXiv 2022.03) CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance, [Paper]
(arXiv 2022.03) MatchFormer: Interleaving Attention in Transformers for Feature Matching, [Paper], [Code]
(arXiv 2022.03) Local-Global Context Aware Transformer for Language-Guided Video Segmentation, [Paper], [Code]
(arXiv 2022.03) Three things everyone should know about Vision Transformers, [Paper]
(arXiv 2022.03) Are Vision Transformers Robust to Spurious Correlations? [Paper], [Code]
(arXiv 2022.03) MUTUAL GENERATIVE TRANSFORMER LEARNING FOR CROSS-VIEW GEO-LOCALIZATION, [Paper]
(arXiv 2022.03) DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training, [Paper]
(arXiv 2022.03) Semantic-aligned Fusion Transformer for One-shot Object Detection, [Paper]
(arXiv 2022.03) UNIMO-2: End-to-End Unified Vision-Language Grounded Learning, [Paper], [Code]
(arXiv 2022.03) Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning, [Paper], [Code]
(arXiv 2022.03) One-Shot Adaptation of GAN in Just One CLIP, [Paper]
(arXiv 2022.03) PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation, [Paper]
(arXiv 2022.03) PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer, [Paper]
(arXiv 2022.03) Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image, [Paper], [Code]
(arXiv 2022.03) Transframer: Arbitrary Frame Prediction with Generative Models, [Paper]
(arXiv 2022.03) Towards Data-Efficient Detection Transformers, [Paper], [Code]
(arXiv 2022.03) Bi-directional Object-Context Prioritization Learning for Saliency Ranking, [Paper], [Code]
(arXiv 2022.03) PATCH-FOOL: ARE VISION TRANSFORMERS ALWAYS ROBUST AGAINST ADVERSARIAL PERTURBATIONS? [Paper], [Code]
(arXiv 2022.03) WegFormer: Transformers for Weakly Supervised Semantic Segmentation, [Paper]
(arXiv 2022.03) Open Set Recognition using Vision Transformer with an Additional Detection Head, [Paper], [Code]
(arXiv 2022.03) UNIFIED VISUAL TRANSFORMER COMPRESSION, [Paper], [Code]
(arXiv 2022.03) Towards Practical Certifiable Patch Defense with Vision Transformer, [Paper]
(arXiv 2022.03) EDTER: Edge Detection with Transformer, [Paper], [Code]
(arXiv 2022.03) ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation, [Paper]
(arXiv 2022.03) Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution, [Paper]
(arXiv 2022.03) Revitalize Region Feature for Democratizing Video-Language Pre-training, [Paper], [Code]
(arXiv 2022.03) Inverted Pyramid Multi-task Transformer for Dense Scene Understanding, [Paper]
(arXiv 2022.03) Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Style Transformer for Image Inversion and Editing, [Paper], [Code]
(arXiv 2022.03) MotionCLIP: Exposing Human Motion Generation to CLIP Space, [Paper], [Project]
(arXiv 2022.03) The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy, [Paper], [Code]
(arXiv 2022.03) Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation, [Paper]
(arXiv 2022.03) Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning, [Paper], [Code]
(arXiv 2022.03) Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting, [Paper]
(arXiv 2022.03) DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection, [Paper]
(arXiv 2022.03) DATR: Domain-adaptive transformer for multi-domain landmark detection, [Paper]
(arXiv 2022.03) EventFormer: AU Event Transformer for Facial Action Unit Event Detection, [Paper]
(arXiv 2022.03) Accelerating DETR Convergence via Semantic-Aligned Matching, [Paper], [Code]
(arXiv 2022.03) All in One: Exploring Unified Video-Language Pre-training, [Paper], [Code]
(arXiv 2022.03) CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment, [Paper]
(arXiv 2022.03) EIT: Efficiently Lead Inductive Biases to ViT, [Paper], [Code]
(arXiv 2022.03) Self-Promoted Supervision for Few-Shot Transformer, [Paper], [Code]
(arXiv 2022.03) MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization, [Paper]
(arXiv 2022.03) Disentangled Representation Learning for Text-Video Retrieval, [Paper]
(arXiv 2022.03) TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding, [Paper], [Dataset]
(arXiv 2022.03) Visualizing and Understanding Patch Interactions in Vision Transformer, [Paper]
(arXiv 2022.03) ANTI-OVERSMOOTHING IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS: FROM THEORY TO PRACTICE, [Paper], [Code]
(arXiv 2022.03) Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision, [Paper], [Code]
(arXiv 2022.03) ActiveMLP: An MLP-like Architecture with Active Token Mixer, [Paper], [Code]
(arXiv 2022.03) Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding, [Paper]
(arXiv 2022.03) TrueType Transformer: Character and Font Style Recognition in Outline Format, [Paper]
(arXiv 2022.03) LOOPITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval, [Paper]
(arXiv 2022.03) MVP: Multimodality-guided Visual Pre-training, [Paper]
(arXiv 2022.03) DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting, [Paper]
(arXiv 2022.03) Multi-Modal Mixup for Robust Fine-tuning, [Paper]
(arXiv 2022.03) AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant, [Paper], [Project]
(arXiv 2022.03) Coarse-to-Fine Vision Transformer, [Paper], [Code]
(arXiv 2022.03) Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers, [Paper]
(arXiv 2022.03) WAVEMIX: RESOURCE-EFFICIENT TOKEN MIXING FOR IMAGES, [Paper]
(arXiv 2022.03) VOVIT: LOW LATENCY GRAPH-BASED AUDIO-VISUAL VOICE SEPARATION TRANSFORMER, [Paper], [Code]
(arXiv 2022.03) Graph Attention Transformer Network for Multi-Label Image Classification, [Paper]
(arXiv 2022.03) EDGEFORMER: IMPROVING LIGHT-WEIGHT CONVNETS BY LEARNING FROM VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2022.03) Skating-Mixer: Multimodal MLP for Scoring Figure Skating, [Paper]
(arXiv 2022.03) Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention, [Paper]
(arXiv 2022.03) CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction, [Paper]
(arXiv 2022.03) Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning, [Paper]
(arXiv 2022.03) ChiTransformer: Towards Reliable Stereo from Cues, [Paper]
(arXiv 2022.03) A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection, [Paper], [Code]
(arXiv 2022.03) Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction, [Paper]
(arXiv 2022.03) CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2022.03) Multiscale Transformer for Hyperspectral Image Classification, [Paper]
(arXiv 2022.03) Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning, [Paper], [Code]
(arXiv 2022.03) Autoregressive Image Generation using Residual Quantization, [Paper]
(arXiv 2022.03) CONTEXTFORMER: A TRANSFORMER WITH SPATIO-CHANNEL ATTENTION FOR CONTEXT MODELING IN LEARNED IMAGE COMPRESSION, [Paper]
(arXiv 2022.03) Patch Similarity Aware Data-Free Quantization for Vision Transformers, [Paper]
(arXiv 2022.03) ViT-P: Rethinking Data-efficient Vision Transformers from Locality, [Paper]
(arXiv 2022.03) DIT: SELF-SUPERVISED PRE-TRAINING FOR DOCUMENT IMAGE TRANSFORMER, [Paper]
(arXiv 2022.03) Towards Efficient and Scalable Sharpness-Aware Minimization, [Paper]
(arXiv 2022.03) HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening, [Paper], [Code]
(arXiv 2022.03) UVCGAN: UNET VISION TRANSFORMER CYCLE-CONSISTENT GAN FOR UNPAIRED IMAGE-TO-IMAGE TRANSLATION, [Paper], [Code]
(arXiv 2022.03) Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, [Paper], [Code]
(arXiv 2022.03) PANFORMER: A TRANSFORMER BASED MODEL FOR PAN-SHARPENING, [Paper], [Code]
(arXiv 2022.03) Multi-class Token Transformer for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Cross Language Image Matching for Weakly Supervised Semantic Segmentation, [Paper]
(arXiv 2022.03) Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2022.03) DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, [Paper], [Code]
(arXiv 2022.03) MetaFormer: A Unified Meta Framework for Fine-Grained Recognition, [Paper], [Code]
(arXiv 2022.03) Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language, [Paper]
(arXiv 2022.03) Knowledge Amalgamation for Object Detection with Transformers, [Paper]
(arXiv 2022.03) Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos, [Paper]
(arXiv 2022.03) Modeling Coreference Relations in Visual Dialog, [Paper], [Code]
(arXiv 2022.03) VITRANSPAD: VIDEO TRANSFORMER USING CONVOLUTION AND SELF-ATTENTION FOR FACE PRESENTATION ATTACK DETECTION, [Paper]
(arXiv 2022.03) Multi-Tailed Vision Transformer for Efficient Inference, [Paper]
(arXiv 2022.03) Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology, [Paper]
(arXiv 2022.03) LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network, [Paper], [Code]
(arXiv 2022.03) LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction, [Paper]
(arXiv 2022.03) DCT-Former: Efficient Self-Attention with Discrete Cosine Transform, [Paper], [Code]
(arXiv 2022.03) Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment, [Paper]
(arXiv 2022.03) Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in Point Cloud, [Paper]
(arXiv 2022.03) CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP, [Paper]
(arXiv 2022.03) MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video, [Paper]
(arXiv 2022.03) X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning, [Paper]
(arXiv 2022.03) 3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification, [Paper]
(arXiv 2022.03) DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation, [Paper]
(arXiv 2022.03) D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention, [Paper]
(arXiv 2022.03) Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding, [Paper], [Code]
(arXiv 2022.03) Self-supervised Transformer for Deepfake Detection, [Paper]
(arXiv 2022.03) Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions, [Paper]
(arXiv 2022.03) TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration, [Paper], [Code]
(arXiv 2022.03) DN-DETR: Accelerate DETR Training by Introducing Query DeNoising, [Paper], [Code]
(arXiv 2022.03) Protecting Celebrities with Identity Consistency Transformer, [Paper]
(arXiv 2022.03) Masked Visual Pre-training for Motor Control, [Paper], [Project]
(arXiv 2022.03) NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, [Paper], [Code]
(arXiv 2022.03) Conditional Prompt Learning for Vision-Language Models, [Paper], [Code]
(arXiv 2022.03) Lane Detection with Versatile AtrousFormer and Local Semantic Guidance, [Paper]
(arXiv 2022.03) Forecasting Characteristic 3D Poses of Human Actions, [Paper], [Code]

2022.02

(arXiv 2022.02) Bayesian Structure Learning with Generative Flow Networks, [Paper]
(arXiv 2022.02) Towards Unsupervised Domain Adaptation via Domain-Transformer, [Paper]
(arXiv 2022.02) An End-to-End Transformer Model for Crowd Localization, [Paper]
(arXiv 2022.02) Instantaneous Physiological Estimation using Video Transformers, [Paper], [Code]
(arXiv 2022.02) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation, [Paper], [Code]
(arXiv 2022.02) ATTENTION ENABLES ZERO APPROXIMATION ERROR, [Paper]
(arXiv 2022.02) When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection, [Paper], [Code]
(arXiv 2022.02) AUTO-SCALING VISION TRANSFORMERS WITHOUT TRAINING, [Paper], [Code]
(arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation, [Paper], [Project]
(arXiv 2022.02) LEARNING TO MERGE TOKENS IN VISION TRANSFORMERS, [Paper]
(arXiv 2022.02) ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers, [Paper], [Code]
(arXiv 2022.02) SELF-SUPERVISED TRANSFORMERS FOR UNSUPERVISED OBJECT DISCOVERY USING NORMALIZED CUT, [Paper], [Project]
(arXiv 2022.02) Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis, [Paper]
(arXiv 2022.02) CaMEL: Mean Teacher Learning for Image Captioning, [Paper]
(arXiv 2022.02) Hierarchical Perceiver, [Paper]
(arXiv 2022.02) Movies2Scenes: Learning Scene Representations Using Movie Similarities, [Paper]
(arXiv 2022.02) GroupViT: Semantic Segmentation Emerges from Text Supervision, [Paper], [Code]
(arXiv 2022.02) Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer, [Paper], [Code]
(arXiv 2022.02) Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations, [Paper]
(arXiv 2022.02) ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond, [Paper]
(arXiv 2022.02) PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths, [Paper], [Code]
(arXiv 2022.02) DataMUX: Data Multiplexing for Neural Networks, [Paper], [Code]
(arXiv 2022.02) On Guiding Visual Attention with Language Specification, [Paper]
(arXiv 2022.02) SPATIO-TEMPORAL OUTDOOR LIGHTING AGGREGATION ON IMAGE SEQUENCES USING TRANSFORMER NETWORKS, [Paper]
(arXiv 2022.02) MISINFORMATION DETECTION IN SOCIAL MEDIA VIDEO POSTS, [Paper]
(arXiv 2022.02) Can Deep Learning be Applied to Model-Based Multi-Object Tracking? [Paper]
(arXiv 2022.02) NOT ALL PATCHES ARE WHAT YOU NEED: EXPEDITING VISION TRANSFORMERS VIA TOKEN REORGANIZATIONS, [Paper], [Code]
(arXiv 2022.02) ActionFormer: Localizing Moments of Actions with Transformers, [Paper], [Code]
(arXiv 2022.02) One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones, [Paper]
(arXiv 2022.02) XAI for Transformers: Better Explanations through Conservative Propagation, [Paper]
(arXiv 2022.02) MeshLeTemp: Leveraging the Learnable Vertex-Vertex Relationship to Generalize Human Pose and Mesh Reconstruction for In-the-Wild Scenes, [Paper]
(arXiv 2022.02) ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer, [Paper]
(arXiv 2022.02) Hyper-relationship Learning Network for Scene Graph Generation, [Paper]
(arXiv 2022.02) CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval, [Paper]
(arXiv 2022.02) Flowformer: Linearizing Transformers with Conservation Flows, [Paper]
(arXiv 2022.02) DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following, [Paper], [Code]
(arXiv 2022.02) CATs++: Boosting Cost Aggregation with Convolutions and Transformers, [Paper]
(arXiv 2022.02) Geometric Transformer for Fast and Robust Point Cloud Registration, [Paper], [Code]
(arXiv 2022.02) I-Tuning: Tuning Language Models with Image for Caption Generation, [Paper]
(arXiv 2022.02) Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval, [Paper], [Code]
(arXiv 2022.02) Visual Acoustic Matching, [Paper]
(arXiv 2022.02) LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling, [Paper]
(arXiv 2022.02) BViT: Broad Attention based Vision Transformer, [Paper], [Code]
(arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation, [Paper]
(arXiv 2022.02) Domain Adaptation via Prompt Learning, [Paper]
(arXiv 2022.02) Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs, [Paper], [Code]
(arXiv 2022.02) Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework, [Paper], [Project]
(arXiv 2022.02) HOW DO VISION TRANSFORMERS WORK? [Paper], [Code]
(arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning, [Paper], [Code]
(arXiv 2022.02) CLIPasso: Semantically-Aware Object Sketching, [Paper], [Code]
(arXiv 2022.02) Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer, [Paper]
(arXiv 2022.02) DEEP SOCCER CAPTIONING WITH TRANSFORMER: DATASET, SEMANTICS-RELATED LOSSES, AND MULTI-LEVEL EVALUATION, [Paper], [Project]
(arXiv 2022.02) ENTROFORMER: A TRANSFORMER-BASED ENTROPY MODEL FOR LEARNED IMAGE COMPRESSION, [Paper], [Code]
(arXiv 2022.02) Image Difference Captioning with Pre-training and Contrastive Learning, [Paper], [Code]
(arXiv 2022.02) MaskGIT: Masked Generative Image Transformer, [Paper]
(arXiv 2022.02) Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning, [Paper]
(arXiv 2022.02) Motion-Aware Transformer For Occluded Person Re-identification, [Paper]
(arXiv 2022.02) Conditional Motion In-betweening, [Paper], [Code]
(arXiv 2022.02) Memory-based gaze prediction in deep imitation learning for robot manipulation, [Paper]
(arXiv 2022.02) Spherical Transformer, [Paper]
(arXiv 2022.02) OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context, [Paper]
(arXiv 2022.02) The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning, [Paper], [Project]
(arXiv 2022.02) DALL-EVAL: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [Paper], [Code]
(arXiv 2022.02) Pre-Trained Language Models for Interactive Decision-Making, [Paper]
(arXiv 2022.02) TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer, [Paper]
(arXiv 2022.02) The devil is in the labels: Semantic segmentation from sentences, [Paper]
(arXiv 2022.02) Webly Supervised Concept Expansion for General Purpose Vision Models, [Paper], [Project]
(arXiv 2022.02) VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG, [Paper]
(arXiv 2022.02) UNIFYING ARCHITECTURES, TASKS, AND MODALITIES THROUGH A SIMPLE SEQUENCE-TO-SEQUENCE LEARNING FRAMEWORK, [Paper], [Code]
(arXiv 2022.02) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper]
(arXiv 2022.02) TRANSDREAMER: REINFORCEMENT LEARNING WITH TRANSFORMER WORLD MODELS, [Paper]
(arXiv 2022.02) Vision-Language Pre-Training with Triple Contrastive Learning, [Paper], [Code]
(arXiv 2022.02) Corrupted Image Modeling for Self-Supervised Visual Pre-Training, [Paper]
(arXiv 2022.02) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, [Paper], [Code]
(arXiv 2022.02) DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators, [Paper]
(arXiv 2022.02) Interactron: Embodied Adaptive Object Detection, [Paper]
(arXiv 2022.02) Local Feature Matching with Transformers for low-end devices LoFTR method adaptation approach, [Paper], [Code]
(arXiv 2022.02) Can Transformers be Strong Treatment Effect Estimators? [Paper]
(arXiv 2022.02) Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers, [Paper]
(arXiv 2022.02) Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics, [Paper], [Code]

2022.01

(arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [Paper]
(arXiv 2022.01) DynaMixer: A Vision MLP Architecture with Dynamic Mixing, [Paper]
(arXiv 2022.01) VRT: A Video Restoration Transformer, [Paper], [Code]
(arXiv 2022.01) DAB-DETR: DYNAMIC ANCHOR BOXES ARE BETTER QUERIES FOR DETR, [Paper], [Code]
(arXiv 2022.01) Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations, [Paper]
(arXiv 2022.01) MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment, [Paper]
(arXiv 2022.01) VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training, [Paper]
(arXiv 2022.01) BOAT: Bilateral Local Attention Vision Transformer, [Paper]
(arXiv 2022.01) GRAPH SELF-ATTENTION FOR LEARNING GRAPH REPRESENTATION WITH TRANSFORMER, [Paper]
(arXiv 2022.01) Aggregating Global Features into Local Vision Transformer, [Paper], [Code]
(arXiv 2022.01) Transformer Module Networks for Systematic Generalization in Visual Question Answering, [Paper]
(arXiv 2022.01) Generalised Image Outpainting with U-Transformer, [Paper]
(arXiv 2022.01) RelTR: Relation Transformer for Scene Graph Generation, [Paper]
(arXiv 2022.01) DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer, [Paper]
(arXiv 2022.01) Pre-Trained Language Transformers are Universal Image Classifiers, [Paper]
(arXiv 2022.01) Explore and Match: End-to-End Video Grounding with Transformer, [Paper]
(arXiv 2022.01) TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network, [Paper]
(arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals, [Paper]
(arXiv 2022.01) ShapeFormer: Transformer-based Shape Completion via Sparse Representation, [Paper], [Project]
(arXiv 2022.01) CONVOLUTIONAL XFORMERS FOR VISION, [Paper], [Code]
(arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [Paper], [Code]
(arXiv 2022.01) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [Paper]
(arXiv 2022.01) SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering, [Paper]
(arXiv 2022.01) DUAL-TASKS SIAMESE TRANSFORMER FRAMEWORK FOR BUILDING DAMAGE ASSESSMENT, [Paper]
(arXiv 2022.01) When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism, [Paper], [Code]
(arXiv 2022.01) Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation, [Paper]
(arXiv 2022.01) Training Vision Transformers with Only 2040 Images, [Paper]
(arXiv 2022.01) Learning To Recognize Procedural Activities with Distant Supervision, [Paper]
(arXiv 2022.01) EVALUATING LANGUAGE-BIASED IMAGE CLASSIFICATION BASED ON SEMANTIC REPRESENTATIONS, [Paper]
(arXiv 2022.01) A Comprehensive Study of Vision Transformers on Dense Prediction Tasks, [Paper]
(arXiv 2022.01) UniFormer: Unifying Convolution and Self-attention for Visual Recognition, [Paper], [Code]
(arXiv 2022.01) Patches Are All You Need? [Paper], [Code]
(arXiv 2022.01) Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval, [Paper]
(arXiv 2022.01) LEARNING TO ACT WITH AFFORDANCE-AWARE MULTIMODAL NEURAL SLAM, [Paper]
(arXiv 2022.01) Visual Information Guided Zero-Shot Paraphrase Generation, [Paper]
(arXiv 2022.01) TerViT: An Efficient Ternary Vision Transformer, [Paper]
(arXiv 2022.01) End-to-end Generative Pretraining for Multimodal Video Captioning, [Paper]
(arXiv 2022.01) OMNIVORE: A Single Model for Many Visual Modalities, [Paper], [Project]
(arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, [Paper]
(arXiv 2022.01) The CLEAR Benchmark: Continual LEArning on Real-World Imagery, [Paper], [Project]
(arXiv 2022.01) ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues, [Paper]
(arXiv 2022.01) Cross-modal Contrastive Distillation for Instructional Activity Anticipation, [Paper]
(arXiv 2022.01) Transformers in Action: Weakly Supervised Action Segmentation, [Paper]
(arXiv 2022.01) VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer, [Paper]
(arXiv 2022.01) CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks, [Paper]
(arXiv 2022.01) Domain Adaptation via Bidirectional Cross-Attention Transformer, [Paper]
(arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for Online Inference, [Paper]
(arXiv 2022.01) Motion Inbetweening via Deep ∆-Interpolator, [Paper]
(arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper]
(arXiv 2022.01) GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events, [Paper]
(arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [Paper]
(arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]
(arXiv 2022.01) Disentangled Latent Transformer for Interpretable Monocular Height Estimation, [Paper], [Project]
(arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]
(arXiv 2022.01) SWINUNET3D - A HIERARCHICAL ARCHITECTURE FOR DEEP TRAFFIC PREDICTION USING SHIFTED WINDOW TRANSFORMERS, [Paper], [Code]
(arXiv 2022.01) SWIN-POSE: SWIN TRANSFORMER BASED HUMAN POSE ESTIMATION, [Paper]
(arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Project]
(arXiv 2022.01) ViT2Hash: Unsupervised Information-Preserving Hashing, [Paper]
(arXiv 2022.01) LANGUAGE-DRIVEN SEMANTIC SEGMENTATION, [Paper], [Code]
(arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]
(arXiv 2022.01) ImageSubject: A Large-scale Dataset for Subject Detection, [Paper]
(arXiv 2022.01) Detecting Twenty-thousand Classes using Image-level Supervision, [Paper], [Code]
(arXiv 2022.01) Generalized Category Discovery, [Paper], [Code]
(arXiv 2022.01) Video Summarization Based on Video-text Modelling, [Paper]
(arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
(arXiv 2022.01) QUADTREE ATTENTION FOR VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2022.01) A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval, [Paper], [Project]
(arXiv 2022.01) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]
(arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]
(arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]
(arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]
(arXiv 2022.01) HYPERTRANSFORMER: MODEL GENERATION FOR SUPERVISED AND SEMI-SUPERVISED FEW-SHOT LEARNING, [Paper]
(arXiv 2022.01) UNIFORMER: UNIFIED TRANSFORMER FOR EFFICIENT SPATIOTEMPORAL REPRESENTATION LEARNING, [Paper], [Code]
(arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Project]
(arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]
(arXiv 2022.01) CLIP-Event: Connecting Text and Images with Event Structures, [Paper], [Code]
(arXiv 2022.01) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training, [Paper]
(arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]
(arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]
(arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [Paper]
(arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]
(arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]
(arXiv 2022.01) Stochastic Layers in Vision Transformers, [Paper]
(arXiv 2022.01) ERNIE-VILG: UNIFIED GENERATIVE PRE-TRAINING FOR BIDIRECTIONAL VISION-LANGUAGE GENERATION, [Paper]
(arXiv 2022.01) InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer, [Paper], [Code]
(arXiv 2022.01) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]
(arXiv 2022.01) Persformer: A Transformer Architecture for Topological Machine Learning, [Paper]
(arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]
(arXiv 2022.01) Language as Queries for Referring Video Object Segmentation, [Paper], [Code]
(arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]
(arXiv 2022.01) A TRANSFORMER-BASED SIAMESE NETWORK FOR CHANGE DETECTION, [Paper], [Code]
(arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]
(arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]
(arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

2021.12

(arXiv 2021.12) Multi-Dimensional Model Compression of Vision Transformer, [Paper]
(arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]
(arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]
(arXiv 2021.12) APRIL: Finding the Achilles’ Heel on Privacy for Vision Transformers, [Paper]
(arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper]
(arXiv 2021.12) Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?, [Paper]
(arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]
(arXiv 2021.12) A FISTFUL OF WORDS: LEARNING TRANSFERABLE VISUAL MODELS FROM BAG-OF-WORDS SUPERVISION, [Paper]
(arXiv 2021.12) StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2, [Paper], [Code]
(arXiv 2021.12) A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model, [Paper], [Code]
(arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]
(arXiv 2021.12) SIMVIT: EXPLORING A SIMPLE VISION TRANSFORMER WITH SLIDING WINDOWS, [Paper], [Code]
(arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]
(arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]
(arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]
(arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]
(arXiv 2021.12) ViR: the Vision Reservoir, [Paper]
(arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]
(arXiv 2021.12) Open-Vocabulary Image Segmentation, [Paper]
(arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]
(arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]
(arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]
(arXiv 2021.12) Fine-grained Multi-Modal Self-Supervised Learning, [Paper]
(arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]
(arXiv 2021.12) CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes, [Paper]
(arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input Adaptation, [Paper]
(arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.12) Contrastive Object Detection Using Knowledge Graph Embeddings, [Paper]
(arXiv 2021.12) RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality, [Paper], [Code]
(arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]
(arXiv 2021.12) MPViT: Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]
(arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]
(arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]
(arXiv 2021.12) LOCFORMER: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]
(arXiv 2021.12) Tell me what you see: A zero-shot action recognition method based on natural language descriptions, [Paper], [Code]
(arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]
(arXiv 2021.12) ScanQA: 3D Question Answering for Spatial Scene Understanding, [Paper]
(arXiv 2021.12) Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [Paper]
(arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]
(arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]
(arXiv 2021.12) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, [Paper], [Code]
(arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]
(arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]
(arXiv 2021.12) Align and Prompt: Video-and-Language Pre-training with Entity Prompts, [Paper], [Code]
(arXiv 2021.12) DATA EFFICIENT LANGUAGE-SUPERVISED ZEROSHOT RECOGNITION WITH OPTIMAL TRANSPORT DISTILLATION, [Paper]
(arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]
(arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]
(arXiv 2021.12) ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources, [Paper]
(arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper]
(arXiv 2021.12) How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation, [Paper]
(arXiv 2021.12) Learning to Prompt for Continual Learning, [Paper], [Code]
(arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]
(arXiv 2021.12) Dense Video Captioning Using Unsupervised Semantic Information, [Paper], [Code]
(arXiv 2021.12) Looking Outside the Box to Ground Language in 3D Scenes, [Paper], [Code]
(arXiv 2021.12) RegionCLIP: Region-based Language-Image Pretraining, [Paper], [Code]
(arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper]
(arXiv 2021.12) Masked Feature Prediction for Self-Supervised Visual Pre-Training, [Paper]
(arXiv 2021.12) SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning, [Paper]
(arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]
(arXiv 2021.12) Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos, [Paper], [Code]
(arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]
(arXiv 2021.12) QAHOI: Query-Based Anchors for Human-Object Interaction Detection, [Paper], [Code]
(arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]
(arXiv 2021.12) CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations, [Paper]
(arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]
(arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]
(arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]
(arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]
(arXiv 2021.12) COMPOSER: Compositional Learning of Group Activity in Videos, [Paper]
(arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]
(arXiv 2021.12) Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection, [Paper]
(arXiv 2021.12) SVIP: Sequence VerIfication for Procedures in Videos, [Paper]
(arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]
(arXiv 2021.12) VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]
(arXiv 2021.12) PartGlot: Learning Shape Part Segmentation from Language Reference Games, [Paper]
(arXiv 2021.12) Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network, [Paper]
(arXiv 2021.12) LEARNING SEMANTIC-ALIGNED FEATURE REPRESENTATION FOR TEXT-BASED PERSON SEARCH, [Paper]
(arXiv 2021.12) L-Verse: Bidirectional Generation Between Image and Text, [Paper]
(arXiv 2021.12) SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY, [Paper]
(arXiv 2021.12) Are Vision Transformers Robust to Patch Perturbations? [Paper]
(arXiv 2021.12) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]
(arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]
(arXiv 2021.12) MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning, [Paper]
(arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]
(arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]
(arXiv 2021.12) Rethinking the Two-Stage Framework for Grounded Situation Recognition, [Paper], [Code]
(arXiv 2021.12) CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions, [Paper]
(arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]
(arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]
(arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]
(arXiv 2021.12) Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training, [Paper], [Code]
(arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]
(arXiv 2021.12) Grounded Language-Image Pre-training, [Paper], [Code]
(arXiv 2021.12) U^2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper]
(arXiv 2021.12) ADAPTIVE CHANNEL ENCODING TRANSFORMER FOR POINT CLOUD ANALYSIS, [Paper]
(arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]
(arXiv 2021.12) VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts, [Paper]
(arXiv 2021.12) PointCLIP: Point Cloud Understanding by CLIP, [Paper], [Code]
(arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]
(arXiv 2021.12) DYNAMIC TOKEN NORMALIZATION IMPROVES VISION TRANSFORMER, [Paper], [Code]
(arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.12) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]
(arXiv 2021.12) Text2Mesh: Text-Driven Neural Stylization for Meshes, [Paper], [Project]
(arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]
(arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]
(arXiv 2021.12) FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization, [Paper], [Code]
(arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]
(arXiv 2021.12) Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer, [Paper], [Code]
(arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]
(arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper]
(arXiv 2021.12) Transformer based trajectory prediction, [Paper]
(arXiv 2021.12) Evaluating Transformers for Lightweight Action Recognition, [Paper]
(arXiv 2021.12) Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision, [Paper]
(arXiv 2021.12) CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification, [Paper]
(arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]
(arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]
(arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]
(arXiv 2021.12) Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning, [Paper]
(arXiv 2021.12) AUDIO-VISUAL SYNCHRONISATION IN THE WILD, [Paper], [Project]
(arXiv 2021.12) Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs, [Paper]
(arXiv 2021.12) Garment4D: Garment Reconstruction from Point Cloud Sequences, [Paper], [Code]
(arXiv 2021.12) Locally Shifted Attention With Early Global Integration, [Paper], [Code]
(arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]
(arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Project]
(arXiv 2021.12) HairCLIP: Design Your Hair by Text and Reference Image, [Paper], [Project]
(arXiv 2021.12) CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, [Paper], [Code]
(arXiv 2021.12) A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code], [Dataset]
(arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper], [Code]
(arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]
(arXiv 2021.12) Fast Point Transformer, [Paper]
(arXiv 2021.12) Assistive Tele-op: Leveraging Transformers to Collect Robotic Task Demonstrations, [Paper], [Project]
(arXiv 2021.12) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper]
(arXiv 2021.12) PatchFormer: An Efficient Point Transformer with Patch Attention, [Paper]
(arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]
(arXiv 2021.12) MLP Architectures for Vision-and-Language Modeling: An Empirical Study, [Paper], [Code]
(arXiv 2021.12) Everything at Once – Multi-modal Fusion Transformer for Video Retrieval, [Paper]
(arXiv 2021.12) Prompting Visual-Language Models for Efficient Video Understanding, [Paper], [Project]
(arXiv 2021.12) FLAVA: A Foundational Language And Vision Alignment Model, [Paper]
(arXiv 2021.12) Embedding Arithmetic for Text-driven Image Transformation, [Paper]
(arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper]
(arXiv 2021.12) Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos, [Paper], [Project]
(arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]
(arXiv 2021.12) DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting, [Paper], [Code]
(arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]
(arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]
(arXiv 2021.12) Zero-Shot Text-Guided Object Generation with Dream Fields, [Paper], [Project]
(arXiv 2021.12) Video-Text Pre-training with Learned Regions, [Paper], [Code]
(arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]
(arXiv 2021.12) TCTN: A 3D-Temporal Convolutional Transformer Network for Spatiotemporal Predictive Learning, [Paper]
(arXiv 2021.12) DenseCLIP: Extract Free Dense Labels from CLIP, [Paper]
(arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper]
(arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]
(arXiv 2021.12) Object-Centric Unsupervised Image Captioning, [Paper]
(arXiv 2021.12) Vision Pair Learning: An Efficient Training Framework for Image Classification, [Paper]
(arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]
(arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]
(arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]
(arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]
(arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]
(arXiv 2021.12) Human-Object Interaction Detection via Weak Supervision, [Paper]
(arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]
(arXiv 2021.12) CLIPstyler: Image Style Transfer with a Single Text Condition, [Paper]
(arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]
(arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]
(arXiv 2021.12) Object-aware Video-language Pre-training for Retrieval, [Paper], [Code]

2021.11

(arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]
(arXiv 2021.11) Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model, [Paper]
(arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]
(arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]
(arXiv 2021.11) ADAPTIVE FOURIER NEURAL OPERATORS: EFFICIENT TOKEN MIXERS FOR TRANSFORMERS, [Paper]
(arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]
(arXiv 2021.11) DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning, [Paper], [Code]
(arXiv 2021.11) Ice hockey player identification via transformers, [Paper]
(arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]
(arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]
(arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]
(arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]
(arXiv 2021.11) DISCRETE REPRESENTATIONS STRENGTHEN VISION TRANSFORMER ROBUSTNESS, [Paper]
(arXiv 2021.11) TRAVLR: Now You See It, Now You Don’t! Evaluating Cross-Modal Transfer of Visio-Linguistic Reasoning, [Paper]
(arXiv 2021.11) Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling, [Paper]
(arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]
(arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]
(arXiv 2021.11) ZERO-SHOT CERTIFIED DEFENSE AGAINST ADVERSARIAL PATCHES WITH VISION TRANSFORMERS, [Paper]
(arXiv 2021.11) PointMixer: MLP-Mixer for Point Cloud Understanding, [Paper]
(arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]
(arXiv 2021.11) Florence: A New Foundation Model for Computer Vision, [Paper]
(arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]
(arXiv 2021.11) Learning to Compose Visual Relations, [Paper], [Project]
(arXiv 2021.11) REFERENCE-BASED MAGNETIC RESONANCE IMAGE RECONSTRUCTION USING TEXTURE TRANSFORMER, [Paper]
(arXiv 2021.11) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval, [Paper]
(arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]
(arXiv 2021.11) SimMIM: A Simple Framework for Masked Image Modeling, [Paper], [Code]
(arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]
(arXiv 2021.11) Simple but Effective: CLIP Embeddings for Embodied AI, [Paper]
(arXiv 2021.11) ClipCap: CLIP Prefix for Image Captioning, [Paper], [Code]
(arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]
(arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]
(arXiv 2021.11) Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, [Paper], [Code]
(arXiv 2021.11) Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning, [Paper], [Code]
(arXiv 2021.11) Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement, [Paper], [Code]
(arXiv 2021.11) Tracking People with 3D Representations, [Paper], [Code]
(arXiv 2021.11) LiT: Zero-Shot Transfer with Locked-image Text Tuning, [Paper]
(arXiv 2021.11) FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING, [Paper]
(arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code]
(arXiv 2021.11) Attention Approximates Sparse Distributed Memory, [Paper]
(arXiv 2021.11) SLICED RECURSIVE TRANSFORMER, [Paper], [Code]
(arXiv 2021.11) HYBRID BYOL-VIT: EFFICIENT APPROACH TO DEAL WITH SMALL DATASETS, [Paper]
(arXiv 2021.11) Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, [Paper], [Code]
(arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]
(arXiv 2021.11) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis, [Paper], [Code]
(arXiv 2021.11) Revisiting spatio-temporal layouts for compositional action recognition, [Paper], [Code]
(arXiv 2021.11) PatchGame: Learning to Signal Mid-level Patches in Referential Games, [Paper], [Code]
(arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM CONVOLUTION? [Paper]
(arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]
(arXiv 2021.11) With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, [Paper], [Code]
(arXiv 2021.11) IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning, [Paper], [Project]
(arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]
(arXiv 2021.11) VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval, [Paper]
(arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]
(arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]
(arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]
(arXiv 2021.11) ML-Decoder: Scalable and Versatile Classification Head, [Paper], [Code]
(arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.11) SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]
(arXiv 2021.11) Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization, [Paper]
(arXiv 2021.11) Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation, [Paper]
(arXiv 2021.11) Sparse is Enough in Scaling Transformers, [Paper]
(arXiv 2021.11) An implementation of the “Guess who?” game using CLIP, [Paper], [Code]
(arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]
(arXiv 2021.11) A Unified Pruning Framework for Vision Transformers, [Paper]
(arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]
(arXiv 2021.11) AssistSR: Affordance-centric Question-driven Video Segment Retrieval, [Paper], [Code & Data]
(arXiv 2021.11) DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation, [Paper], [Code]
(arXiv 2021.11) , [Paper]
(arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]
(arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]
(arXiv 2021.11) CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning, [Paper]
(arXiv 2021.11) CRIS: CLIP-Driven Referring Image Segmentation, [Paper]
(arXiv 2021.11) Shunted Self-Attention via Multi-Scale Token Aggregation, [Paper], [Code]
(arXiv 2021.11) MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning, [Paper]
(arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]
(arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]
(arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]
(arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]
(arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]
(arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper], [Code]
(arXiv 2021.11) LAFITE: Towards Language-Free Training for Text-to-Image Generation, [Paper]
(arXiv 2021.11) SPARSE DETR: EFFICIENT END-TO-END OBJECT DETECTION WITH LEARNABLE SPARSITY, [Paper], [Code]
(arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]
(arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]
(arXiv 2021.11) Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic, [Paper], [Code]
(arXiv 2021.11) Blended Diffusion for Text-driven Editing of Natural Images, [Paper], [Code]
(arXiv 2021.11) Mask Transfiner for High-Quality Instance Segmentation, [Paper], [Code]
(arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]
(arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper], [Code]
(arXiv 2021.11) Towards Tokenized Human Dynamics Representation, [Paper], [Code]
(arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]
(arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]
(arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]
(arXiv 2021.11) MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video, [Paper]
(arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]
(arXiv 2021.11) Hierarchical Modular Network for Video Captioning, [Paper]
(arXiv 2021.11) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, [Paper], [Code]
(arXiv 2021.11) An Image Patch is a Wave: Phase-Aware Vision MLP, [Paper]
(arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper]
(arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]
(arXiv 2021.11) Scaling Up Vision-Language Pre-training for Image Captioning, [Paper]
(arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
(arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]
(arXiv 2021.11) RedCaps: Web-curated image-text data created by the people, for the people, [Paper], [Project]
(arXiv 2021.11) EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching, [Paper], [Code]
(arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper], [Code]
(arXiv 2021.11) Vis-TOP: Visual Transformer Overlay Processor, [Paper]
(arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]
(arXiv 2021.11) Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints, [Paper]
(arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]
(arXiv 2021.11) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, [Paper]
(arXiv 2021.11) Combined Scaling for Zero-shot Transfer Learning, [Paper]
(arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]
(arXiv 2021.11) IBOT: IMAGE BERT PRE-TRAINING WITH ONLINE TOKENIZER, [Paper], [Code]
(arXiv 2021.11) Masked Autoencoders Are Scalable Vision Learners, [Paper]
(arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]
(arXiv 2021.11) Are Transformers More Robust Than CNNs?, [Paper], [Code]
(arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]
(arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]
(arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]
(arXiv 2021.11) VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]
(arXiv 2021.11) LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, [Paper], [Project]
(arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]
(arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM CONVOLUTION? [Paper]
(arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]

2021.10

(arXiv 2021.10) Visual Keyword Spotting with Attention, [Paper], [Project]
(arXiv 2021.10) Learning Co-segmentation by Segment Swapping for Retrieval and Discovery, [Paper], [Data & Code]
(arXiv 2021.10) Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal Text-Video Retrieval, [Paper], [Code]
(arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.10) Scatterbrain: Unifying Sparse and Low-rank Attention Approximation, [Paper]
(arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]
(arXiv 2021.10) UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model, [Paper], [Data & Code]
(arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]
(arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]
(arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Project]
(arXiv 2021.10) TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation, [Paper]
(arXiv 2021.10) TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED EMOTION RECOGNITION, [Paper]
(arXiv 2021.10) Contextual Similarity Aggregation with Self-attention for Visual Re-ranking, [Paper], [Code]
(arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
(arXiv 2021.10) IMAGE-BASED CLIP-GUIDED ESSENCE TRANSFER, [Paper], [Code]
(arXiv 2021.10) Sinkformers: Transformers with Doubly Stochastic Attention, [Paper]
(arXiv 2021.10) ILLITERATE DALL·E LEARNS TO COMPOSE, [Paper], [Project], [Code]
(arXiv 2021.10) Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering, [Paper]
(arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]
(arXiv 2021.10) Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation, [Paper]
(arXiv 2021.10) TRANSFORMER ACCELERATION WITH DYNAMIC SPARSE ATTENTION, [Paper]
(arXiv 2021.10) CLOOB: MODERN HOPFIELD NETWORKS WITH INFOLOOB OUTPERFORM CLIP, [Paper], [Code]
(arXiv 2021.10) Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization, [Paper]
(arXiv 2021.10) StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects, [Paper], [Project]
(arXiv 2021.10) Gophormer: Ego-Graph Transformer for Node Classification, [Paper]
(arXiv 2021.10) STRANSGAN: AN EMPIRICAL STUDY ON TRANSFORMER IN GANS, [Paper], [Code]
(arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]
(arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper], [Code]
(arXiv 2021.10) WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP, [Paper], [Code]
(arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]
(arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]
(arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]
(arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]
(arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]
(arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]
(arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.10) Leveraging MoCap Data for Human Mesh Recovery, [Paper]
(arXiv 2021.10) A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models, [Paper]
(arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]
(arXiv 2021.10) Multimodal Dialogue Response Generation, [Paper]
(arXiv 2021.10) Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals, [Paper]
(arXiv 2021.10) COMPOSITIONAL ATTENTION: DISENTANGLING SEARCH AND RETRIEVAL, [Paper], [Code]
(arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]
(arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper]
(arXiv 2021.10) Transformer with a Mixture of Gaussian Keys, [Paper]
(arXiv 2021.10) DIFFUSIONCLIP: TEXT-GUIDED IMAGE MANIPULATION USING DIFFUSION MODELS, [Paper]
(arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]
(arXiv 2021.10) RIPPLE ATTENTION FOR VISUAL PERCEPTION WITH SUB-QUADRATIC COMPLEXITY, [Paper]
(arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]
(arXiv 2021.10) CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation, [Paper]
(arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]
(arXiv 2021.10) SPARSE MOES MEET EFFICIENT ENSEMBLES, [Paper]
(arXiv 2021.10) Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent? [Paper]
(arXiv 2021.10) SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition, [Paper]
(arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper]
(arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]
(arXiv 2021.10) SUPERVISION EXISTS EVERYWHERE: A DATA EFFICIENT CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING PARADIGM, [Paper], [Code]
(arXiv 2021.10) CLIP4Caption ++: Multi-CLIP for Video Caption, [Paper]
(arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper]
(arXiv 2021.10) VECTOR-QUANTIZED IMAGE MODELING WITH IMPROVED VQGAN, [Paper]
(arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.10) NVIT: VISION TRANSFORMER COMPRESSION AND PARAMETER REDISTRIBUTION, [Paper]
(arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]
(arXiv 2021.10) CLIP-Adapter: Better Vision-Language Models with Feature Adapters, [Paper], [Code]
(arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Code]
(arXiv 2021.10) MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER, [Paper]
(arXiv 2021.10) TOKEN POOLING IN VISION TRANSFORMERS, [Paper]
(arXiv 2021.10) VIDT: AN EFFICIENT AND EFFECTIVE FULLY TRANSFORMER-BASED OBJECT DETECTOR, [Paper], [Code]
(arXiv 2021.10) CLIP4Caption: CLIP for Video Caption, [Paper]
(arXiv 2021.10) OBJECT-REGION VIDEO TRANSFORMERS, [Paper], [Code]
(arXiv 2021.10) LEVERAGING REDUNDANCY IN ATTENTION WITH REUSE TRANSFORMERS, [Paper]
(arXiv 2021.10) Dynamic Inference with Neural Interpreters, [Paper]
(arXiv 2021.10) A CLIP-Enhanced Method for Video-Language Understanding, [Paper]
(arXiv 2021.10) Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries, [Paper]
(arXiv 2021.10) Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection, [Paper]
(arXiv 2021.10) Learning Structural Representations for Recipe Generation and Food Retrieval, [Paper]
(arXiv 2021.10) A FREE LUNCH FROM VIT: ADAPTIVE ATTENTION MULTI-SCALE FUSION TRANSFORMER FOR FINE-GRAINED VISUAL RECOGNITION, [Paper]

2021.09

(arXiv 2021.09) Joint Multimedia Event Extraction from Video and Article, [Paper]
(arXiv 2021.09) Long-Range Transformers for Dynamic Spatiotemporal Forecasting, [Paper]
(arXiv 2021.09) Visually Grounded Concept Composition, [Paper]
(arXiv 2021.09) CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation, [Paper]
(arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]
(arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
(arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]
(arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper], [Code]
(arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]
(arXiv 2021.09) VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, [Paper], [Code]
(arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]
(arXiv 2021.09) CLIP-It! Language-Guided Video Summarization, [Paper], [Project]
(arXiv 2021.09) MFEVIT: A ROBUST LIGHTWEIGHT TRANSFORMER-BASED NETWORK FOR MULTIMODAL 2D+3D FACIAL EXPRESSION RECOGNITION, [Paper]
(arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]
(arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]
(arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper]
(arXiv 2021.09) MLIM: VISION-AND-LANGUAGE MODEL PRE-TRAINING WITH MASKED LANGUAGE AND IMAGE MODELING, [Paper]
(arXiv 2021.09) Dense Contrastive Visual-Linguistic Pretraining, [Paper]
(arXiv 2021.09) CPT: COLORFUL PROMPT TUNING FOR PRE-TRAINED VISION-LANGUAGE MODELS, [Paper]
(arXiv 2021.09) Localizing ∞-shaped fishes: Sketch-guided object localization in the wild, [Paper], [Code]
(arXiv 2021.09) CLIPORT: What and Where Pathways for Robotic Manipulation, [Paper], [Project], [Code]
(arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]
(arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]
(arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]
(arXiv 2021.09) LOTR: Face Landmark Localization Using Localization Transformer, [Paper]
(arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]
(arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]
(arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]
(arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]
(arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]
(arXiv 2021.09) PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR OBJECT DETECTION, [Paper]
(arXiv 2021.09) ActionCLIP: A New Paradigm for Video Action Recognition, [Paper]
(arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]
(arXiv 2021.09) Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering, [Paper], [Code]
(arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]
(arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper], [Code]
(arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]
(arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]
(arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper]
(arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]
(arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]
(arXiv 2021.09) Learning to Ground Visual Objects for Visual Dialog, [Paper]
(arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]
(arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.09) IS ATTENTION BETTER THAN MATRIX DECOMPOSITION? [Paper], [Code]
(arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]
(arXiv 2021.09) Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization, [Paper]
(arXiv 2021.09) Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding, [Paper]
(arXiv 2021.09) LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation, [Paper], [Code]
(arXiv 2021.09) Panoptic Narrative Grounding, [Paper]
(arXiv 2021.09) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, [Paper]
(arXiv 2021.09) PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks, [Paper], [Project]
(arXiv 2021.09) EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, [Paper]
(arXiv 2021.09) Scaled ReLU Matters for Training Vision Transformers, [Paper]
(arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]
(arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper]
(arXiv 2021.09) WHYACT: Identifying Action Reasons in Lifestyle Vlogs, [Paper]
(arXiv 2021.09) Zero-Shot Open Set Detection by Extending CLIP, [Paper]
(arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]
(arXiv 2021.09) Learning to Prompt for Vision-Language Models, [Paper], [Code]
(arXiv 2021.09) Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss, [Paper], [Code]
(arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]
(arXiv 2021.09) ConvMLP: Hierarchical Convolutional MLPs for Vision, [Paper], [Code]
(arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]
(arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]
(arXiv 2021.09) Sparse-MLP: A Fully-MLP Architecture with Conditional Computation, [Paper]
(arXiv 2021.09) SORNet: Spatial Object-Centric Representations for Sequential Manipulation, [Paper], [Project]
(arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper]
(arXiv 2021.09) Weakly Supervised Relative Spatial Reasoning for Visual Question Answering, [Paper], [Code]
(arXiv 2021.09) FUSFORMER: A TRANSFORMER-BASED FUSION APPROACH FOR HYPERSPECTRAL IMAGE SUPER-RESOLUTION, [Paper]
(arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]
(arXiv 2021.09) Learning to Generate Scene Graph from Natural Language Supervision, [Paper], [Code]
(arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]
(arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]
(ICCV 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]
(arXiv 2021.09) Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation, [Paper], [Code]
(arXiv 2021.09) Joint Graph Learning and Matching for Semantic Feature Correspondence, [Paper]
(arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper], [Code]

2021.08

(arXiv 2021.08) SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation, [Paper]
(arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]
(arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]
(arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]
(arXiv 2021.08) Cross-category Video Highlight Detection via Set-based Learning, [Paper], [Code]
(arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]
(arXiv 2021.08) SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments, [Paper]
(arXiv 2021.08) LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision, [Paper], [Project]
(arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]
(arXiv 2021.08) SIMVLM: SIMPLE VISUAL LANGUAGE MODEL PRETRAINING WITH WEAK SUPERVISION, [Paper]
(arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]
