
Allow PaddingFree to work with DataCollatorForCompletionOnlyLM #78

Merged: 6 commits merged into main from non-pretok-pf on Sep 5, 2024

Conversation

fabianlim (Contributor) commented Aug 30, 2024

Description

Currently, the padding-free plugin only works with DataCollatorWithFlattening. This PR makes it also work with DataCollatorForCompletionOnlyLM, which is used for the non-pretokenized use case (a minimal collator sketch follows the list below).

  • in scenarios.yaml, we should be able to use the chat_templates, but set tokenize=False and remove the null settings on the required fields for tokenization.
  • verified that the full-FT + padding-free improvement remains consistent at 22-23% on the Orca benchmarks
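
As a rough illustration of the non-pretokenized setup (not the exact benchmark configuration; the tokenizer name and exact response template string are assumptions), the completion-only collator can be constructed along these lines:

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

# Illustrative tokenizer only; the benchmarks in this PR use their own model configs.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # causal-LM tokenizers often lack a pad token

# With tokenize=False the dataset stays as formatted text. SFTTrainer tokenizes it on
# the fly, and this collator then masks every label token that appears before the
# response template, so loss is only computed on the completion.
collator = DataCollatorForCompletionOnlyLM(
    response_template="RESPONSE:",  # keyword injected via the chat template
    tokenizer=tokenizer,
)
```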

Note:

Tests on Flan Subset (6000 samples)

  1. Verified that the dataset in data_cache.json is only formatted, not tokenized. To ensure that the loss is masked, added the keyword 'RESPONSE:' to the chat template and used it as the response template that DataCollatorForCompletionOnlyLM uses to mask the loss (a quick sanity-check sketch follows the example below).

    Example extracted from dataset['train']['output'][0]

    'Write the response. A 2 person conversation: -- Who was selected with the 5th pick in the 1974 NBA draft?. --  RESPONSE: Five other players from this draft, 6th pick Scott Wedman, 8th pick Campy Russell , 12th pick Brian Winters, 21st pick Billy Knight and 25th pick John Drew, were also selected to at least one All-Star Game.'
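
A minimal sketch of that sanity check, assuming data_cache.json loads as a JSON dataset with an "output" text column and reusing the tokenizer/collator from the description sketch:

```python
from datasets import load_dataset

# Assumed: data_cache.json is the formatted (untokenized) dataset cache referenced
# above and loads as a JSON dataset with an "output" text column.
dataset = load_dataset("json", data_files="data_cache.json")
example = dataset["train"]["output"][0]

# Run one example through the completion-only collator (from the sketch above).
batch = collator([tokenizer(example)])

# Every label token before the "RESPONSE:" keyword should be -100,
# i.e. excluded from the loss.
labels = batch["labels"][0]
num_masked = (labels == -100).sum().item()
print(f"{num_masked} of {labels.numel()} label tokens are masked")
```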
    
  2. Verified that passing an untokenized dataset to SFTTrainer matches the previous padding-free performance with a pretokenized dataset (a usage sketch follows the tables below).

    Untokenized FLAN Dataset

    | Framework Config         | Num Devices | Per Device Batch Size | Train Runtime (secs) | Speedup  |
    |--------------------------|-------------|-----------------------|----------------------|----------|
    | full-FT                  | 2           | 4                     | 1516                 | baseline |
    | padding-free             | 2           | 4                     | 848                  | 1.78x    |
    | padding-free + multipack | 2           | 4                     | 747                  | 2.02x    |

    Tokenized FLAN Dataset

    | Framework Config         | Num Devices | Per Device Batch Size | Train Runtime (secs) | Speedup  |
    |--------------------------|-------------|-----------------------|----------------------|----------|
    | full-FT                  | 2           | 4                     | 1537                 | baseline |
    | padding-free             | 2           | 4                     | 859                  | 1.79x    |
    | padding-free + multipack | 2           | 4                     | 751                  | 2.05x    |
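
A hedged sketch of item 2 for trl versions in the 0.9-0.10 range (not the actual benchmark harness): the model name, column name, and hyperparameters are assumptions, and the padding-free plugin itself is enabled separately through the acceleration framework config, which is not shown here.

```python
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Illustrative model only.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

training_args = SFTConfig(
    output_dir="./results",
    per_device_train_batch_size=4,   # matches the benchmark tables above
    dataset_text_field="output",     # untokenized, pre-formatted text column
    max_seq_length=4096,             # assumed sequence length
    packing=False,                   # required with DataCollatorForCompletionOnlyLM
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],  # untokenized dataset from the check above
    data_collator=collator,          # completion-only collator from the description sketch
    tokenizer=tokenizer,
)
trainer.train()
```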

fabianlim marked this pull request as draft on August 30, 2024 02:25
fabianlim (Contributor, Author)

@achew010 can you do a sanity check and open data_cache.json to ensure it was not tokenized when tokenize=False?

Signed-off-by: 1000850000 user <[email protected]>
Signed-off-by: 1000850000 user <[email protected]>
Signed-off-by: 1000850000 user <[email protected]>
achew010 force-pushed the non-pretok-pf branch 3 times, most recently from f54550d to 895900b, on September 5, 2024 04:33
Signed-off-by: 1000850000 user <[email protected]>
fabianlim merged commit 6028250 into main on Sep 5, 2024
6 checks passed
fabianlim deleted the non-pretok-pf branch on September 5, 2024 05:53
achew010 added a commit that referenced this pull request Sep 6, 2024
* allow for padding_free logic in LM data collator

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

* minor fixes to support non-pretok benchmarks

Signed-off-by: 1000850000 user <[email protected]>

* addressed code review

Signed-off-by: 1000850000 user <[email protected]>

* added trl dependency

Signed-off-by: 1000850000 user <[email protected]>

* fixes to installation of aadp

Signed-off-by: 1000850000 user <[email protected]>

* updated orca pf benchmarks

Signed-off-by: 1000850000 user <[email protected]>

---------

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: 1000850000 user <[email protected]>
Co-authored-by: 1000850000 user <[email protected]>