Introduce Stateful Callbacks #29666

muellerzr · 2024-03-15T04:56:01Z

What does this PR do?

This PR builds a foundation for stateful callbacks inside Trainer. For now, stateful behavior is limited to the existing core callbacks, but it can be extended to others as needed.

Specifically, a user must set restore_callback_states in the TrainingArguments to turn this behavior on.

System Design:

To avoid dealing with pickles while still interacting properly with the TrainerState (which is likely where callback data should be stored), we need a way to recreate the exact state of a callback. To do so, a callback should implement a save_state function that returns a dict of the args and attributes that should be set.

From this dict we can call the constructor of an existing callback again with the saved args, then set the saved attributes. (The constructor is re-run in case a callback's __init__ contains logic that is needed.)

For example, here is the one for EarlyStoppingCallback:

        def save(self) -> dict:
            return {
                "args": {
                    "early_stopping_patience": self.early_stopping_patience,
                    "early_stopping_threshold": self.early_stopping_threshold,
                },
                "attributes": {
                    "early_stopping_patience_counter": self.early_stopping_patience_counter,
                }
            }
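The round trip described above can be sketched as follows. This is only an illustration under assumptions: `EarlyStoppingLike` and `restore_callback` are hypothetical stand-ins and not the Trainer's code. The method keeps the name `save` from the example (the description calls it save_state, and the review diff later in this thread calls `callback.state()`).

```python
class EarlyStoppingLike:
    """Hypothetical stand-in for a stateful callback (not the real EarlyStoppingCallback)."""

    def __init__(self, early_stopping_patience=1, early_stopping_threshold=0.0):
        self.early_stopping_patience = early_stopping_patience
        self.early_stopping_threshold = early_stopping_threshold
        # Runtime counter that has to survive a checkpoint/resume cycle.
        self.early_stopping_patience_counter = 0

    def save(self) -> dict:
        # Same layout as the example above: constructor args plus attributes.
        return {
            "args": {
                "early_stopping_patience": self.early_stopping_patience,
                "early_stopping_threshold": self.early_stopping_threshold,
            },
            "attributes": {
                "early_stopping_patience_counter": self.early_stopping_patience_counter,
            },
        }


def restore_callback(callback_cls, saved: dict):
    """Re-run the constructor with the saved args, then set the saved attributes."""
    callback = callback_cls(**saved.get("args", {}))
    for name, value in saved.get("attributes", {}).items():
        setattr(callback, name, value)
    return callback


original = EarlyStoppingLike(early_stopping_patience=3, early_stopping_threshold=0.01)
original.early_stopping_patience_counter = 2  # pretend two evaluations did not improve

restored = restore_callback(EarlyStoppingLike, original.save())
```

Because the constructor runs again, any derived values computed in `__init__` are rebuilt before the saved attributes overwrite the runtime counters.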

What about Callbacks that I forget to add back in?

As this relies on a similar state, I chose to keep aligned with how we deal with states/checkpointing in Accelerate, wherein users must recreate the exact same initial scenario when resuming training.

Or, in other words:

If you included the EarlyStoppingCallback initially, but did not include it when you call resume_from_checkpoint, we do not magically initialize that callback and resume its state. Instead, training simply continues and you get a warning saying that some callback states were not restored.
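A rough sketch of that behavior, assuming states are keyed by class name as in the review diff later in this thread. The classes and the `restore_states` helper are hypothetical, and this simplified version does not handle several callbacks of the same class.

```python
import warnings


class EarlyStoppingLike:
    """Hypothetical stateful callback."""

    def __init__(self):
        self.early_stopping_patience_counter = 0


class LoggerLike:
    """Hypothetical callback with no saved state."""


def restore_states(callbacks, stored_states):
    """Apply stored attributes to the callback with the same class name.

    Returns the names of stored states that had no matching callback.
    """
    by_name = {type(cb).__name__: cb for cb in callbacks}
    not_found = []
    for stored_name, data in stored_states.items():
        callback = by_name.get(stored_name)
        if callback is None:
            # The callback was present when the checkpoint was written but not now.
            not_found.append(stored_name)
            continue
        for attr, value in data.get("attributes", {}).items():
            setattr(callback, attr, value)
    if not_found:
        warnings.warn(
            "Checkpoint has state for callbacks that were not passed on resume: "
            + ", ".join(not_found)
        )
    return not_found


stored = {
    "EarlyStoppingLike": {"args": {}, "attributes": {"early_stopping_patience_counter": 2}}
}

# Resuming with the callback present restores its counter.
cb = EarlyStoppingLike()
missing_when_present = restore_states([cb], stored)

# Resuming without it only warns; nothing is created on the user's behalf.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    missing_when_absent = restore_states([LoggerLike()], stored)
```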

Limitations

As we're maintaining that these exist in the TrainerState (which again, makes sense for us to use this here), items should be JSON-serializable.
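To illustrate the constraint, here is a small check using only the standard `json` module. The state dicts and the `is_json_serializable` helper are made up for the example; they are not part of the PR.

```python
import json


def is_json_serializable(state: dict) -> bool:
    """Return True if `state` can be written with the standard json module."""
    try:
        json.dumps(state)
        return True
    except (TypeError, ValueError):
        return False


# Numbers, strings, lists and dicts are fine.
good_state = {"args": {"patience": 3}, "attributes": {"counter": 2, "history": [0.9, 0.8]}}

# A set has no JSON equivalent, so this state could not be stored as-is.
bad_state = {"args": {}, "attributes": {"seen_steps": {10, 20}}}

# Converting the set to a sorted list keeps the information in a storable form.
fixed_state = {"args": {}, "attributes": {"seen_steps": sorted({10, 20})}}
```

A callback that tracks richer objects would need to convert them to plain types like this when exporting its state, and convert them back when restoring.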

The expectation is that few callbacks actually need this, so support can be added to individual callbacks as the need arises, without much extra complexity.

Fixes #28544

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@amyeroberts

HuggingFaceDocBuilderDev · 2024-03-15T05:13:57Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

muellerzr · 2024-03-20T12:11:38Z

@amyeroberts let me know if we need more docs or if I should rewrite the code a bit if it feels too complex/magical :)

amyeroberts

Thanks so much for working on this, and for the detailed description in the PR!

I understand the choice to match the Accelerate patterns, but I think we might want to iterate on that. The behaviour in the tests surprised me, especially the case where one can pass in a callback with default arguments and its previous state is loaded in: that seems to go against needing to fully specify the same training setup in order to resume.

I'm also not sure about the loading of args vs. attributes. Attributes we'd definitely want to store, but args we might want to be able to overwrite?


amyeroberts

Thanks for iterating on this! I like this state control + warnings


amyeroberts · 2024-04-04T14:29:06Z

src/transformers/trainer.py

+            return
+        # Callback states are stored in stateful_callbacks
+        not_found = []
+        for stored_callback, data in self.state.stateful_callbacks.items():


Can we ever have more than one callback of the same type, e.g. different metric loggers? My only concern with the logic below is that we might load state from one class into another.

Good call Amy. Just in case we can, I've gone ahead and added logic for it, plus a test to verify :)

muellerzr · 2024-04-04T16:21:53Z

@amyeroberts I'd like one more review please, just to make sure the multiple-versions logic makes sense to you!

amyeroberts

Thanks for iterating and for all the tests - looks great!

amyeroberts · 2024-04-18T19:35:28Z

tests/trainer/test_trainer_callback.py

+        assert len(cbs) == 2
+        assert cbs[0].my_test_state == "first"
+        assert cbs[1].my_test_state == "second"


amyeroberts · 2024-04-18T19:54:21Z

src/transformers/trainer_callback.py

+                        stateful_callbacks[name] = [stateful_callbacks[name]]
+                    stateful_callbacks[name].append(callback.state())
+                else:
+                    stateful_callbacks[name] = callback.state()


Is there a reason for not just always adding as a list directly? If we use a defaultdict we can just do:

    from collections import defaultdict

    # Saveable callbacks get stored as dict of kwargs
    stateful_callbacks = defaultdict(list)
    for callback in self.stateful_callbacks:
        if not isinstance(callback, ExportableState):
            raise TypeError(
                f"All callbacks passed to be saved must inherit `ExportableState`, but received {type(callback)}"
            )
        name = callback.__class__.__name__
        stateful_callbacks[name].append(callback.state())

Python serialization does not like defaultdict, come to find out 😢
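The thread does not say which serialization step rejected the defaultdict. As a hedged sketch, the grouping can be done with a plain dict instead, mirroring the promote-to-list logic in the diff above. `CallbackLike`, `OtherCallbackLike`, and `collect_states` are hypothetical names used only for illustration.

```python
import json


class CallbackLike:
    """Hypothetical stand-in for a callback that exposes `state()`."""

    def __init__(self, tag):
        self.tag = tag

    def state(self) -> dict:
        return {"args": {"tag": self.tag}, "attributes": {}}


class OtherCallbackLike(CallbackLike):
    """A second hypothetical callback class, used once."""


def collect_states(callbacks) -> dict:
    """Group callback states by class name using a plain dict (no defaultdict)."""
    states = {}
    for callback in callbacks:
        name = type(callback).__name__
        if name in states:
            # Second instance of the same class: promote the entry to a list.
            if not isinstance(states[name], list):
                states[name] = [states[name]]
            states[name].append(callback.state())
        else:
            states[name] = callback.state()
    return states


states = collect_states(
    [CallbackLike("first"), OtherCallbackLike("solo"), CallbackLike("second")]
)

# The result is an ordinary dict, so it survives a JSON round trip unchanged.
round_tripped = json.loads(json.dumps(states))
```

The trade-off is that a consumer has to handle both shapes: a single dict when a class appears once, and a list of dicts when it appears several times.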


pedrobrs · 2024-07-19T18:08:45Z

Hi, it looks like the states of the stateful callbacks are not updated before the state is saved to trainer_state.json in the _save_checkpoint method:

transformers/src/transformers/trainer.py

Lines 2917 to 2920 in fe008d6

    
    if self.args.should_save:
        # Update the `TrainerControl` state to where we are currently
        self.state.stateful_callbacks["TrainerControl"] = self.control.state()
        self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))

Only TrainerControl state is saved, instead of getting all ExportableState callbacks like here:

transformers/src/transformers/trainer.py

Lines 674 to 675 in fe008d6

    
    stateful_callbacks=[
        cb for cb in self.callback_handler.callbacks + [self.control] if isinstance(cb, ExportableState)

So it should be:

if self.args.should_save:
    # Refresh the state of every exportable callback, not only `TrainerControl`
    for cb in self.callback_handler.callbacks + [self.control]:
        if isinstance(cb, ExportableState):
            self.state.stateful_callbacks[cb.__class__.__name__] = cb.state()
    self.state.save_to_json(os.path.join(output_dir, TRAINER_STATE_NAME))

Do you agree?

muellerzr requested a review from amyeroberts March 15, 2024 04:56

muellerzr mentioned this pull request Mar 15, 2024

Early stopping patience does not work when resuming from checkpoint #28544

Closed

amyeroberts reviewed Mar 20, 2024

muellerzr requested review from ArthurZucker and amyeroberts and removed request for ArthurZucker March 25, 2024 15:07

amyeroberts approved these changes Apr 4, 2024

muellerzr requested a review from amyeroberts April 4, 2024 16:21

amyeroberts approved these changes Apr 18, 2024

muellerzr and others added 15 commits April 25, 2024 09:11

86d265e Introduce saveable callbacks
ad7917f Add note
d7f1d2e Test for non-present and flag
085073d Support early stopping and refusing to train further
2bf324b Update docstring
793254f More saving
5208e3c Import oopsie
75fa5f9 Apply suggestions from code review (Co-authored-by: amyeroberts)
508c24b Make it go through TrainerArguments
f6823fa Document
33b178d Fix test
de4c603 Apply suggestions from code review (Co-authored-by: amyeroberts)
5548add Rework to allow for duplicates
805daed CLean
7e10394 Fix failing tests

muellerzr force-pushed the muellerzr-checkpoint-callbacks branch from e46a42e to 7e10394 Compare April 25, 2024 13:11

muellerzr merged commit ad697f1 into main Apr 25, 2024
21 checks passed

muellerzr deleted the muellerzr-checkpoint-callbacks branch April 25, 2024 15:00
