Data Concatenation: how to avoid sample contamination during training #25

Closed
xiabingquan opened this issue Sep 11, 2024 · 5 comments

@xiabingquan

Hello VITA team.

Thanks for this great work. After reading through the code and the preprint paper, I have a question about data concatenation.

The paper mentions that you use a "Data Concatenation" technique to concatenate different samples into one sequence. In this case, the causal mask should be modified so that different samples do not attend to each other.

For example, suppose a sequence of length 3 contains two samples with lengths 1 and 2.
The original causal mask is:

[
  [1, 0, 0],
  [1, 1, 0],
  [1, 1, 1],
]

And the modified mask should be:

[
  [1, 0, 0],
  [0, 1, 0],
  [0, 1, 1],
]
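
As a side note, here is a minimal sketch of how such a block-diagonal causal mask could be built in PyTorch. The helper name build_packed_causal_mask and its interface are just my own illustration, not anything from the VITA codebase:

```python
import torch

def build_packed_causal_mask(sample_lengths):
    """Causal mask for a packed sequence: tokens attend only within their own sample.

    sample_lengths: per-sample token counts of the concatenated sequence.
    Returns a (total_len, total_len) bool tensor where True means "may attend".
    """
    total_len = sum(sample_lengths)
    mask = torch.zeros(total_len, total_len, dtype=torch.bool)
    start = 0
    for length in sample_lengths:
        end = start + length
        # Lower-triangular (causal) block restricted to this sample's span.
        mask[start:end, start:end] = torch.tril(
            torch.ones(length, length, dtype=torch.bool)
        )
        start = end
    return mask

# For the example above, build_packed_causal_mask([1, 2]) gives
# [[1, 0, 0],
#  [0, 1, 0],
#  [0, 1, 1]]
```

A boolean mask like this can be passed to the eager or SDPA implementations, but not to Flash Attention, which is exactly the problem below.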

I noticed that this codebase supports three attention implementations: eager, SDPA, and Flash Attention.
However, a customized attention mask is not supported in Flash Attention (see issue).

My questions are:
Which attention implementation do you use for training? Do you take the sample contamination problem into account when using data concatenation? Thanks!

@xiabingquan xiabingquan changed the title How to avoid sample contamination during training Data Concatenation: how to avoid sample contamination during training Sep 11, 2024
@xiabingquan
Author

I noticed that Flash Attention is used when training mixtral-8x7b:

VITA/vita/train/train.py

Lines 255 to 263 in b546cd0

if model_args.model_type == "mixtral-8x7b":
    torch_dtype = torch.float16 if training_args.fp16 else torch.bfloat16
    model = VITAMixtralForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
        torch_dtype=torch_dtype,
        attn_implementation="flash_attention_2",
        **bnb_model_from_pretrained_args,
    )

Is this setting also used in training smaller models?

Currently, the model checkpoints are not immediately available to me. Any reply would help, thanks!

@linhaojia13
Collaborator

Thanks for your attention. It is a good idea to modify the attention mask to prevent different samples from attending to each other. However, we have not implemented such a mechanism, because we found the negative impact of simple concatenation to be relatively small in our experiments.

We still believe that modifying the attention mask is necessary. If you come up with a solution for Flash Attention, a pull request is welcome!

@xiabingquan
Author

Got it. Thanks for your quick reply. A workaround is to use customized flash attention kernels based on Triton, such as flashattention2-custom-mask, but its correctness has not been tested.

@xiabingquan
Author

@linhaojia13 The function flash_attn_varlen_func in Flash Attention can meet our needs (with some modifications to cu_seqlens).

I've verified its correctness. Hope it helps! 😀
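
For anyone landing here later, a minimal sketch of what this looks like; the tensor sizes below are made up for illustration, and flash_attn_varlen_func requires the flash-attn package plus fp16/bf16 CUDA tensors:

```python
import itertools

import torch
from flash_attn import flash_attn_varlen_func

# Two samples of lengths 1 and 2 packed into one sequence of 3 tokens
# (same example as in the first post).
sample_lengths = [1, 2]
total_tokens = sum(sample_lengths)
nheads, headdim = 8, 64  # illustrative sizes

# flash_attn_varlen_func expects (total_tokens, nheads, headdim) tensors.
q = torch.randn(total_tokens, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Cumulative sequence lengths mark the sample boundaries: [0, 1, 3].
cu_seqlens = torch.tensor([0] + list(itertools.accumulate(sample_lengths)),
                          dtype=torch.int32, device="cuda")
max_seqlen = max(sample_lengths)

# With causal=True, attention is causal within each sample and never crosses
# the boundaries given by cu_seqlens, i.e. the block-diagonal mask discussed above.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
    causal=True,
)
print(out.shape)  # torch.Size([3, 8, 64])
```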

@linhaojia13
Collaborator

Thank you!
