potential bugs for training dynamics #2

hayasick · 2024-09-17T06:29:07Z

Hi,

I've been working with the dynamics model and noticed a couple of potential issues. I wanted to check with you to see if my observations are correct:

A causal mask is needed for the temporal attention in the ST-Transformer. The attention call below is made without a mask argument:

jafar/utils/nn.py

Lines 50 to 54 in b72f848

    z = nn.MultiHeadAttention(
        num_heads=self.num_heads,
        qkv_features=self.dim,
        dropout_rate=self.dropout,
    )(z)
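To make the first point concrete, here is a minimal sketch of what a causal mask does along the time axis. It does not use the repository's code: `causal_mask` and `temporal_attention` are hypothetical helpers written in plain `jax.numpy`, the attention is single-head, and the shapes are toy values chosen for illustration.

```python
import jax
import jax.numpy as jnp


def causal_mask(num_frames: int) -> jnp.ndarray:
    """Lower-triangular boolean mask: frame t may attend to frames <= t."""
    return jnp.tril(jnp.ones((num_frames, num_frames), dtype=bool))


def temporal_attention(q, k, v, mask):
    """Single-head scaled dot-product attention over the time axis.

    q, k, v: (batch, time, dim); mask: (time, time), True = may attend.
    """
    scores = jnp.einsum("btd,bsd->bts", q, k) / jnp.sqrt(q.shape[-1])
    # Masked positions get the most negative finite value, so softmax gives them weight 0.
    scores = jnp.where(mask, scores, jnp.finfo(scores.dtype).min)
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("bts,bsd->btd", weights, v)


key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (2, 5, 8))  # (batch, time, dim), toy sizes
mask = causal_mask(x.shape[1])
out = temporal_attention(x, x, x, mask)

# Perturb only the last frame. With the causal mask, outputs for earlier
# frames must not change, because they never attend to the last frame.
x_perturbed = x.at[:, -1].add(1.0)
out_perturbed = temporal_attention(x_perturbed, x_perturbed, x_perturbed, mask)
unchanged = bool(jnp.allclose(out[:, :-1], out_perturbed[:, :-1]))
```

In the Flax module quoted above, the equivalent change would presumably be to build such a mask (Flax provides `nn.make_causal_mask` for this) and pass it through the attention call's `mask` argument; the exact shape it needs depends on how the temporal axis is arranged in `jafar/utils/nn.py`, which is not shown here.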

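When computing the cross-entropy (CE) loss, the ground truth (GT) labels and the prediction logits need to be shifted by one frame relative to each other, so that the logits at frame t are compared with the tokens at frame t+1. For example: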

pred = outputs["token_logits"][:, :-1]
mask = outputs["mask"][:, :-1]
target = outputs["video_tokens"][:, 1:]
ce_loss = optax.softmax_cross_entropy_with_integer_labels(
    pred, target
)
ce_loss = (mask * ce_loss).sum() / mask.sum()
acc = pred.argmax(-1) == target
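The snippet above can be checked on toy data. The following is a sketch under stated assumptions, not the repository's training code: `shifted_masked_ce` is a hypothetical helper, the shapes `(batch, time, num_patches, vocab)` are assumed, and `jax.nn.log_softmax` stands in for the optax call so the example depends only on JAX.

```python
import jax
import jax.numpy as jnp


def shifted_masked_ce(token_logits, video_tokens, mask):
    """Next-frame cross-entropy: logits at frame t are scored against tokens at t+1.

    token_logits: (batch, time, num_patches, vocab)
    video_tokens: (batch, time, num_patches), integer token ids
    mask:         (batch, time, num_patches), 1 where the loss is counted
    """
    pred = token_logits[:, :-1]   # drop the last frame's predictions
    target = video_tokens[:, 1:]  # drop the first frame's labels
    m = mask[:, :-1]
    log_probs = jax.nn.log_softmax(pred, axis=-1)
    ce = -jnp.take_along_axis(log_probs, target[..., None], axis=-1)[..., 0]
    loss = (m * ce).sum() / m.sum()
    acc = (m * (pred.argmax(-1) == target)).sum() / m.sum()
    return loss, acc


# Toy data: 1 video, 3 frames, 2 patches per frame, vocabulary of 4 tokens.
tokens = jnp.array([[[0, 1], [2, 3], [1, 0]]])
# Build logits that confidently predict the NEXT frame's tokens at every frame.
next_tokens = jnp.roll(tokens, -1, axis=1)
logits = 10.0 * jax.nn.one_hot(next_tokens, 4)
mask = jnp.ones(tokens.shape)

loss, acc = shifted_masked_ce(logits, tokens, mask)

# Without the shift, the same logits are compared with the current frame's tokens.
unshifted_acc = (logits.argmax(-1) == tokens).mean()
```

With these toy inputs a model that predicts the next frame perfectly scores full accuracy under the shifted loss, while the unshifted comparison scores it as wrong at every position.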

jafar/train_dynamics.py

Lines 122 to 124 in b72f848

    ce_loss = optax.softmax_cross_entropy_with_integer_labels(
        outputs["token_logits"], outputs["video_tokens"]
    )

Please let me know if further clarification is needed.
