Graph Convolutions Enrich the Self-Attention in Transformers!


Jeongwhan Choi¹*, Hyowon Wi²*, Jayoung Kim², Yehjin Shin², Kookjin Lee³, Nathaniel Trask⁴, Noseong Park²
¹Yonsei University, ²KAIST, ³Arizona State University, ⁴University of Pennsylvania

Introduction

  • Graph Filter-based Self-Attention (GFSA) is a novel approach to enhance the self-attention mechanism in Transformers.
  • By redesigning self-attention from a graph signal processing (GSP) perspective, GFSA mitigates the oversmoothing problem and improves performance across various domains.
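
From the GSP view, the attention matrix A acts as a graph filter, and GFSA generalizes it to a simple polynomial filter (matching the pseudocode below):

GFSA(A) = w_0 * I + w_1 * A + w_K * A^K

where w_0, w_1, w_K are per-head weights and A^K is approximated with a first-order Taylor expansion to avoid repeated matrix products.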

Key Features:

  • Easily integrates with existing Transformer models
  • Improves performance with minimal computational overhead
  • Shows significant improvements across various tasks in multiple domains


Tasks and Directories

Detailed guidance is provided in the README.md of each subdirectory:

  1. 🖼️ Image Classification 👉 ./Image

  2. 📚 Natural Language Understanding 👉 ./NLP

  3. 🧠 Causal Language Modeling 👉 ./NLP

  4. 🌐 Graph Regression 👉 ./Graph

  5. 🎙️ Speech Recognition 👉 ./Speech

  6. 💻 Code Classification 👉 ./Code


Implementation Example with Pseudocode

GFSA's core implementation is shown in the following pseudocode:

import torch

def GFSA(att, K):
    """
    Graph Filter-based Self-Attention

    Args:
        att: original self-attention matrix, shape (batch, heads, n, n)
        K: order of the high-order term

    Notes:
        w_0, w_1 can be set in two ways:
        1) As learnable parameters
        2) Fixed as hyperparameters (w_0=0, w_1=1)

    Returns:
        gf_att: GFSA attention matrix, same shape as att
    """
    h, n = att.size(1), att.size(-1)

    # Initialize per-head weights (fixed here; learnable in practice)
    w_0 = torch.zeros(h, device=att.device)  # identity term weight
    w_1 = torch.ones(h, device=att.device)   # first-order term weight
    w_K = torch.zeros(h, device=att.device)  # high-order term weight
    I = torch.eye(n, device=att.device)[None, None, ...]

    # Approximate att^K with a first-order Taylor expansion:
    # att^K ≈ att + (K-1) * (att @ att - att), costing one extra matmul
    att_K = att + (K - 1) * (torch.matmul(att, att) - att)

    # Combine identity, first-order, and high-order terms per head
    gf_att = w_0[None, :, None, None] * I + \
             w_1[None, :, None, None] * att + \
             w_K[None, :, None, None] * att_K

    return gf_att
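
For the learnable-weight variant mentioned in the notes above, a minimal sketch wrapping the filter in a module with per-head parameters (the module name and wiring are illustrative, not the repository's actual API):

import torch
import torch.nn as nn

class GFSAModule(nn.Module):
    """Illustrative wrapper making the filter weights learnable per head."""
    def __init__(self, num_heads, K=3):
        super().__init__()
        self.K = K
        self.w_0 = nn.Parameter(torch.zeros(num_heads))  # identity term
        self.w_1 = nn.Parameter(torch.ones(num_heads))   # first-order term
        self.w_K = nn.Parameter(torch.zeros(num_heads))  # high-order term

    def forward(self, att):
        # att: (batch, heads, n, n)
        n = att.size(-1)
        I = torch.eye(n, device=att.device)[None, None, ...]
        att_K = att + (self.K - 1) * (torch.matmul(att, att) - att)
        return self.w_0[None, :, None, None] * I + \
               self.w_1[None, :, None, None] * att + \
               self.w_K[None, :, None, None] * att_K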

Key Implementation Features

  • Weight Initialization: w_0, w_1 can be either learnable parameters or fixed hyperparameters
  • High-order Term: Uses a first-order Taylor approximation of att^K to reduce computational cost (see the comparison sketch after this list)
  • Minimal Parameters: Adds only a small number of parameters compared to base models
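
To make the cost/accuracy trade-off concrete, here is a minimal sketch comparing the exact K-th power against the Taylor approximation; all shapes below are illustrative:

import torch

# Illustrative shapes
batch, heads, n, K = 2, 4, 16, 3
att = torch.softmax(torch.randn(batch, heads, n, n), dim=-1)

# Exact K-th power: K-1 matrix multiplications
att_K_exact = torch.linalg.matrix_power(att, K)

# GFSA's first-order Taylor approximation: a single extra matmul
att_K_approx = att + (K - 1) * (torch.matmul(att, att) - att)

print((att_K_exact - att_K_approx).abs().max())  # small for well-behaved att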

Integration Example

from models.attention import GFSA

# Replace original self-attention with GFSA
attention_output = GFSA(
    att=attention_scores,  # original attention matrix
    K=3                    # order of high-order term
)
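
In context, GFSA replaces the post-softmax attention matrix before the value aggregation. A minimal sketch, assuming standard scaled dot-product attention (the function and tensor names are illustrative):

import torch
import torch.nn.functional as F
from models.attention import GFSA

def attention_with_gfsa(q, k, v, K=3):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    att = F.softmax(scores, dim=-1)   # original self-attention matrix
    gf_att = GFSA(att, K=K)           # apply the graph filter
    return torch.matmul(gf_att, v)    # aggregate values as usual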

Citation

If you use this code for your research, please cite our paper:

@inproceedings{choi2024gfsa,
   title={Graph Convolutions Enrich the Self-Attention in Transformers!},
   author={Jeongwhan Choi and Hyowon Wi and Jayoung Kim and Yehjin Shin and Kookjin Lee and Nathaniel Trask and Noseong Park},
   booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
   year={2024},
   url={https://openreview.net/forum?id=ffNrpcBpi6}
}

Star History

[Star History Chart]
