KD trainer w/ logprobs #2202
base: main
Conversation
Force-pushed from 9cc1a77 to a952e84
        return super()._save_checkpoint(model, trial, **kwargs)


class AxolotlMambaTrainer(AxolotlTrainer):
Not for this PR, but how would we feel about moving each of these trainers to their own file?
Having an integrations/trl/ folder would be neat too.
Force-pushed from ab49180 to 4a0ab11
fix loader default
@@ -13,6 +13,12 @@ class PreprocessCliArgs:
     debug_num_examples: int = field(default=1)
     prompter: Optional[str] = field(default=None)
     download: Optional[bool] = field(default=True)
+    iterable: Optional[bool] = field(
Is it worth documenting somewhere (or raising an issue as a reminder) to show users that we offer support for iterable datasets?
@@ -39,6 +39,8 @@ def preprocess(config: str, **kwargs) -> None:
        kwargs: Additional keyword arguments which correspond to CLI args or `axolotl`
            config options.
    """
+    kwargs = {k: v for k, v in kwargs.items() if v is not None}
Shouldn't using `@filter_none_kwargs` address this?
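For reference, a decorator with that name would presumably look something like this (a minimal sketch; the actual `filter_none_kwargs` implementation in the codebase may differ):

import functools

def filter_none_kwargs(func):
    """Drop keyword arguments whose value is None before calling `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        filtered = {k: v for k, v in kwargs.items() if v is not None}
        return func(*args, **filtered)
    return wrapper

If `preprocess` were decorated with it, the inline dict comprehension would be redundant.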
@@ -0,0 +1,58 @@
+### AXOLOTL COMMUNITY LICENSE AGREEMENT
Is it worth surfacing this community license somewhere in our docs/README and how it would be used? Fine to leave as a follow up/issue.
    Input args for knowledge distillation.
    """

    kd_trainer: Optional[bool] = None  # whether to use KD trainer
Slightly confused - what happens when `kd_trainer=False`?
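For illustration, if the flag simply gates trainer selection, `False` and `None` are both falsy and would behave identically (a hypothetical reading, not necessarily the PR's actual wiring):

from typing import Optional

def use_kd_trainer(kd_trainer: Optional[bool]) -> bool:
    # Hypothetical gating: kd_trainer=False and kd_trainer=None both fall
    # through to the default trainer, making the two equivalent here.
    return bool(kd_trainer)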
if "input_ids" not in sample: | ||
# If there's no "input_ids", just return sample unchanged | ||
return sample | ||
|
||
input_ids = sample["input_ids"] | ||
|
||
# Detect if it's a single example or a batch | ||
if not input_ids: | ||
# Edge case: empty | ||
return sample |
nit
if "input_ids" not in sample: | |
# If there's no "input_ids", just return sample unchanged | |
return sample | |
input_ids = sample["input_ids"] | |
# Detect if it's a single example or a batch | |
if not input_ids: | |
# Edge case: empty | |
return sample | |
# Return sample unchanged if "input_ids" is not present, or is empty | |
if "input_ids" not in sample or not sample["input_ids"]: | |
return sample | |
input_ids = sample["input_ids"] |
input_ids = sample["input_ids"]

# Edge case: if input_ids is empty
if not input_ids:
Should you do the same check above for `not sample["input_ids"]`?
@@ -172,10 +209,31 @@ def add_length(sample):


def drop_long_seq(sample, sequence_len=2048, min_sequence_len=2):
nit: this is more of a filtering function with the signature `def filter_sequence_length(...) -> Union[bool, List[bool]]`, right?
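For what it's worth, that predicate reading would look roughly like this (a sketch under the reviewer's proposed signature; the bounds and batch handling are assumed from the surrounding diff, not taken from the PR's code):

from typing import List, Union

def filter_sequence_length(
    sample, sequence_len: int = 2048, min_sequence_len: int = 2
) -> Union[bool, List[bool]]:
    """Return True for rows whose token count falls within [min_sequence_len, sequence_len]."""
    def in_bounds(ids):
        return min_sequence_len <= len(ids) <= sequence_len

    ids = sample["input_ids"]
    if ids and isinstance(ids[0], list):  # batched sample: one bool per row
        return [in_bounds(row) for row in ids]
    return in_bounds(ids)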
    max_input_len = np.max(get_dataset_lengths(train_dataset))
    LOG.debug(f"max_input_len: {max_input_len}", main_process_only=True)
except AttributeError:
    pass
Is there anything informative worth logging here?
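Something like this could work, reusing the `LOG.debug` call style from the snippet above (a sketch only):

try:
    max_input_len = np.max(get_dataset_lengths(train_dataset))
    LOG.debug(f"max_input_len: {max_input_len}", main_process_only=True)
except AttributeError:
    # Iterable/streaming datasets don't expose lengths up front; say so
    # rather than passing silently.
    LOG.debug("max_input_len unavailable (iterable dataset?)", main_process_only=True)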
try:
    prior_len = len(train_dataset)
except TypeError:
    # handle iterable datasets case
How come an `isinstance` check like above wouldn't work here?
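Presumably that means something like the following, assuming the Hugging Face `datasets.IterableDataset` type is what makes `len()` raise here:

from datasets import IterableDataset

if isinstance(train_dataset, IterableDataset):
    prior_len = None  # streaming datasets have no known length
else:
    prior_len = len(train_dataset)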
# If it's a list, we assume we're dealing with a batch
if isinstance(labels[0], int):
    # Single example: return a single bool
    return np.sum(np.array(labels) != -100) > 0
nit
Suggested change:

return np.any(labels != -100)
results = []
for row_labels in labels:
    # Each row_labels is a list[int]
    results.append(np.sum(np.array(row_labels) != -100) > 0)
nit
Suggested change:

results = [np.any(row_labels != -100) for row_labels in labels]
"dataloader_prefetch_factor": 8, | ||
"dataloader_num_workers": 4, | ||
"dataloader_pin_memory": True, | ||
# "dataset_prepared_path": str(Path(temp_dir) / "last_run_prepared"), |
# "dataset_prepared_path": str(Path(temp_dir) / "last_run_prepared"), |
might as well chop while we're here
@@ -29,7 +29,9 @@ def get_ds_type(config_dataset: DictDefault):
     return ds_type


-def load_dataset_w_config(config_dataset, auth_token):
+def load_dataset_w_config(
+    config_dataset, auth_token, streaming=False
Worth folding `streaming` into `config_dataset` so you're doing `config_dataset.streaming`?
teacher_seq_len = target_token_ids.shape[1]

# Slice student logits to match teacher-provided sequence length
student_logits_for_kd = student_logits[
When would the student be predicting a longer sequence length?
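One plausible scenario (an assumption, not confirmed in the PR): the teacher's logprobs were precomputed over a shorter window than the student's inputs, so the student logits get sliced down to match, e.g.:

import torch

# Shapes are illustrative only.
student_logits = torch.randn(2, 512, 32000)                # (batch, student_seq_len, vocab)
target_token_ids = torch.zeros(2, 384, dtype=torch.long)   # teacher covered only 384 positions

teacher_seq_len = target_token_ids.shape[1]
student_logits_for_kd = student_logits[:, :teacher_seq_len, :]
assert student_logits_for_kd.shape[1] == teacher_seq_len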
I've taken a first pass and it looks pretty good overall. I think the KD logic checks out. I'll make another pass tomorrow.
    kd_temperature: float = 1.0,
) -> torch.Tensor:
    """
    A KD loss function that is TorchScript-friendly.
Could you add docs for what the parameters are? I'd mainly like to clarify what `target_mask` is used for, but might as well doc them all : )
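Something along these lines is presumably what's being asked for (parameter names other than `kd_temperature` and `target_mask` are assumed for illustration, not taken from the PR):

import torch

def kd_loss(
    student_logits: torch.Tensor,    # assumed name
    target_logprobs: torch.Tensor,   # assumed name
    target_mask: torch.Tensor,
    kd_temperature: float = 1.0,
) -> torch.Tensor:
    """A KD loss function that is TorchScript-friendly.

    Args:
        student_logits: Raw student scores, shape (batch, seq_len, vocab).
        target_logprobs: Teacher log-probabilities for the positions being distilled.
        target_mask: 1 where a position contributes to the loss, 0 where it is
            ignored (e.g. prompt or padding tokens).
        kd_temperature: Softmax temperature applied before comparing the two
            distributions; values > 1 soften the targets.

    Returns:
        Scalar loss tensor.
    """
    ...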
Description
Motivation and Context
How has this been tested?
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)