Add instruction and conversational data support #211

Merged
21 commits merged from artek0chumak/add_format_check into main on Nov 14, 2024

Conversation

artek0chumak (Contributor):

Issue #ENG-10912

Add support for instruction and conversational training data for fine-tuning (FT) jobs.
The support comes in two parts:

  • File checks to make sure the data is formatted correctly
  • A train_on_inputs flag to train only on completion/assistant phrases
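For reference, the dataset formats involved look roughly like this (a hypothetical illustration; field contents are invented, field names follow the formats discussed in this PR):

```python
import json

# Hypothetical examples of the three dataset formats discussed in this PR.
general = {"text": "Once upon a time, there was a brave knight."}
instruction = {"prompt": "Summarize the text.", "completion": "OpenAI creates advanced AI."}
conversational = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the game last night?"},
        {"role": "assistant", "content": "The home team won 3-1."},
    ]
}

# Training files are JSON Lines: one JSON object per line.
jsonl = "\n".join(json.dumps(item) for item in (general, instruction, conversational))
print(len(jsonl.splitlines()))  # 3
```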

src/together/cli/api/finetune.py (resolved, outdated)
src/together/cli/api/finetune.py (resolved)
src/together/cli/api/utils.py (resolved, outdated)
src/together/cli/api/finetune.py (resolved, outdated)
src/together/constants.py (resolved, outdated)
tests/unit/test_files_checks.py (resolved, outdated)
tests/unit/test_files_checks.py (resolved, outdated)
tests/unit/test_files_checks.py (resolved, outdated)
tests/unit/test_files_checks.py (resolved)
tests/unit/test_files_checks.py (resolved, outdated)
src/together/cli/api/finetune.py (resolved)
"--train-on-inputs",
type=BOOL_WITH_AUTO,
default="auto",
help="Whether to mask the user messages in conversational data or prompts in instruction data",
Member:

Nit: might be good to explain what happens in the default case and how auto is handled

Member:

Same as above (you can just add: "auto" will automatically determine whether to mask the inputs based on the data format.)

@@ -154,6 +157,8 @@ def create(
Defaults to False.
model_limits (FinetuneTrainingLimits, optional): Limits for the hyperparameters the model in Fine-tuning.
Defaults to None.
train_on_inputs (bool, optional): Whether to mask the user messages in conversational data or prompts in instruction data.
Member:

bool or "auto"

src/together/resources/finetune.py (resolved)
@@ -465,6 +472,7 @@ async def create(
Defaults to False.
model_limits (FinetuneTrainingLimits, optional): Limits for the hyperparameters the model in Fine-tuning.
Defaults to None.
train_on_inputs (bool, optional): Whether to mask the inputs in conversational data. Defaults to "auto".
Member:

Looks like this docstring is outdated

{"prompt": "Summarize the text.", "completion": "OpenAI creates advanced AI."},
]
with file.open("w") as f:
f.write("\n".join([json.dumps(item) for item in content]))
Member:

Suggested change
f.write("\n".join([json.dumps(item) for item in content]))
f.write("\n".join(json.dumps(item) for item in content))

},
]
with file.open("w") as f:
f.write("\n".join([json.dumps(item) for item in content]))
Member:

Same here and onwards

for column in REQUIRED_COLUMNS_MESSAGE:
if column not in turn:
raise InvalidFileFormatError(
message=f"Field '{column}' is missing for a turn `{turn}` on line {idx + 1} "
Member:

Suggested change
message=f"Field '{column}' is missing for a turn `{turn}` on line {idx + 1} "
message=f"Field `{column}` is missing for a turn `{turn}` on line {idx + 1} "


if role not in POSSIBLE_ROLES_CONVERSATION:
raise InvalidFileFormatError(
message=f"Found invalid role '{role}' in the messages on the line {idx + 1}. "
Member:

Suggested change
message=f"Found invalid role '{role}' in the messages on the line {idx + 1}. "
message=f"Found invalid role `{role}` in the messages on the line {idx + 1}. "

if previous_role == role:
raise InvalidFileFormatError(
message=f"Invalid role turns on line {idx + 1} of the input file. "
"'user' and 'assistant' roles must alternate user/assistant/user/assistant/...",
Member:

Suggested change
"'user' and 'assistant' roles must alternate user/assistant/user/assistant/...",
"`user` and `assistant` roles must alternate user/assistant/user/assistant/...",
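The role checks under discussion can be sketched as follows. This is a simplified, hypothetical version: the real validator also verifies required fields and reports line numbers, and the handling of a leading system message is an assumption here.

```python
POSSIBLE_ROLES_CONVERSATION = ("system", "user", "assistant")

def check_conversation(messages):
    """Validate one conversation: only known roles, and `user`/`assistant`
    strictly alternating. Simplified sketch of the checks quoted above."""
    previous_role = None
    for turn in messages:
        role = turn["role"]
        if role not in POSSIBLE_ROLES_CONVERSATION:
            raise ValueError(f"Found invalid role `{role}` in the messages")
        if role == "system":
            # Assumption: an optional leading system message does not
            # participate in the user/assistant alternation check.
            continue
        if previous_role == role:
            raise ValueError(
                "`user` and `assistant` roles must alternate "
                "user/assistant/user/assistant/..."
            )
        previous_role = role
```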

"--train-on-inputs",
type=BOOL_WITH_AUTO,
default="auto",
help="Whether to mask the user messages in conversational data or prompts in instruction data",
Member:

Same as above (you can just add: "auto" will automatically determine whether to mask the inputs based on the data format.)

Comment on lines 161 to 164
"auto" will automatically determine whether to mask the inputs based on the data format.
Dataset with "text" (General format) field will not mask the inputs by default.
Dataset with "messages" (Conversational format) or "prompt" and "completion" (Instruction format)
fields will mask the inputs by default.
Member:

Suggested change
"auto" will automatically determine whether to mask the inputs based on the data format.
Dataset with "text" (General format) field will not mask the inputs by default.
Dataset with "messages" (Conversational format) or "prompt" and "completion" (Instruction format)
fields will mask the inputs by default.
"auto" will automatically determine whether to mask the inputs based on the data format.
For datasets with the "text" field (general format), inputs will not be masked.
For datasets with "messages" (conversational format) or "prompt" and "completion" (instruction format)
fields, inputs will be masked.
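The default described above could be resolved with a small helper. This is a hypothetical sketch; the function and enum names are assumptions, not taken from this PR:

```python
from enum import Enum

class DatasetFormat(Enum):
    GENERAL = "general"            # {"text": ...}
    INSTRUCTION = "instruction"    # {"prompt": ..., "completion": ...}
    CONVERSATION = "conversation"  # {"messages": [...]}

def resolve_train_on_inputs(train_on_inputs, dataset_format):
    """Resolve "auto" per the rules above: general-format data trains on the
    full text (inputs are not masked), while instruction and conversational
    data mask the inputs by default. Hypothetical helper."""
    if train_on_inputs != "auto":
        return train_on_inputs
    return dataset_format is DatasetFormat.GENERAL
```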

src/together/resources/finetune.py (resolved)
)

report_dict["is_check_passed"] = False
previous_role = ""
Member:

Suggested change
previous_role = ""
previous_role = None

if current_format == DatasetFormat.CONVERSATION:
message_column = JSONL_REQUIRED_COLUMNS_MAP[
DatasetFormat.CONVERSATION
][0]
Member:

Nit: this is dangerous, we implicitly assume that the first item in the list is the message column name
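One way to address this nit is to name the column explicitly instead of indexing into the required-columns list. A hypothetical sketch (the constant name is an assumption):

```python
from enum import Enum

class DatasetFormat(Enum):
    GENERAL = "general"
    CONVERSATION = "conversation"

# Instead of JSONL_REQUIRED_COLUMNS_MAP[DatasetFormat.CONVERSATION][0],
# which silently assumes list ordering, an explicit mapping states the intent:
MESSAGE_COLUMN_BY_FORMAT = {DatasetFormat.CONVERSATION: "messages"}

message_column = MESSAGE_COLUMN_BY_FORMAT[DatasetFormat.CONVERSATION]
```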


zainhas commented Nov 13, 2024

When I pass in train_on_inputs="true" or train_on_inputs="false", it should give me an error; however, it submits and completes the job. This is probably because non-empty strings like "true" and "false" are all truthy?

We should be opinionated about which values we accept for train_on_inputs, error out on anything else, and have detailed documentation.

Should we change the fact that users can pass in train_on_inputs=None, which defaults to the same behavior as train_on_inputs="auto"? It's not obvious that None should be equivalent to "auto".
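A strict check along these lines would reject truthy strings instead of letting them slip through. A hypothetical sketch (the function name is an assumption; this is the API-side validation, separate from the CLI's BOOL_WITH_AUTO parameter type, which parses strings):

```python
def validate_train_on_inputs(value):
    """Accept only True, False, or the literal string "auto"; reject truthy
    strings like "true"/"false" that bool() would silently accept."""
    if isinstance(value, bool) or value == "auto":
        return value
    raise ValueError(
        f'train_on_inputs must be True, False, or "auto", got {value!r}'
    )
```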


zainhas commented Nov 13, 2024

Reviewed new changes and lgtm👍

@artek0chumak artek0chumak requested a review from zainhas November 13, 2024 16:21
)

model_limits: FinetuneTrainingLimits = client.fine_tuning.get_model_limits(
model=model
)

if lora:
log_warn_once(
"LoRA rank default has been changed from 8 to 64 as the maximum available for each model."
Member:

Does this mean that every model supported by the API has 64 as the maximum possible rank? I think the current wording is confusing if this is not true

Comment on lines 86 to 87
if train_on_inputs is None:
raise ValueError("train_on_inputs cannot be None")
Member:

This is not allowed by the typing annotation itself; I believe we should not check for inputs that are not documented as permissible for the function.

if train_on_inputs is None:
raise ValueError("train_on_inputs cannot be None")

train_on_inputs_bool = train_on_inputs if train_on_inputs != "auto" else None
Member:

It is not of boolean type if it gets mapped to None in one case
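One way to keep the naming honest here is to give the mapped value a name and type that reflect its optionality. A hypothetical sketch:

```python
from typing import Optional, Union

def to_request_value(train_on_inputs: Union[bool, str]) -> Optional[bool]:
    """Map the user-facing value to the API payload: "auto" is sent as None so
    the backend applies its default; explicit booleans pass through unchanged.
    Hypothetical helper; the name avoids a misleading `_bool` suffix, since
    the result may be None."""
    return None if train_on_inputs == "auto" else train_on_inputs
```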

Comment on lines +89 to +90
"messages": [
{"role": "user", "content": "Who won the game last night?"},
Member:

Can we add a system message at the beginning of this conversation to ensure that these three roles are handled properly?
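A fixture covering all three roles, as requested, might look like this (hypothetical test data):

```python
# Hypothetical test fixture: a conversation that starts with a system message,
# so the validator is exercised against all three roles.
content = [
    {
        "messages": [
            {"role": "system", "content": "You are a sports reporter."},
            {"role": "user", "content": "Who won the game last night?"},
            {"role": "assistant", "content": "The home team won 3-1."},
        ]
    }
]

roles = [turn["role"] for turn in content[0]["messages"]]
```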

src/together/cli/api/finetune.py (resolved, outdated)
@@ -230,6 +231,7 @@ class FinetuneResponse(BaseModel):
# training file metadata
training_file_num_lines: int | None = Field(None, alias="TrainingFileNumLines")
training_file_size: int | None = Field(None, alias="TrainingFileSize")
train_on_inputs: StrictBool | Literal["auto"] | None = "auto"
Member:

Is None still possible as a response, because older jobs don't have that attribute?

Contributor (Author):

Yup, it's for retrieve or any other command that can see old data.
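Given the `StrictBool | Literal["auto"] | None` annotation, a consumer can normalize the legacy None explicitly. A hypothetical sketch (the helper name is an assumption):

```python
from typing import Literal, Optional, Union

# The response field's shape, per the diff: StrictBool | Literal["auto"] | None.
TrainOnInputs = Optional[Union[bool, Literal["auto"]]]

def normalize_train_on_inputs(value: TrainOnInputs) -> Union[bool, Literal["auto"]]:
    """Older jobs predate the attribute and deserialize as None; treat that
    the same as the "auto" default. Hypothetical normalization helper."""
    return "auto" if value is None else value
```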

@orangetin orangetin merged commit e157fcd into main Nov 14, 2024
9 of 13 checks passed
@orangetin orangetin deleted the artek0chumak/add_format_check branch November 14, 2024 18:14
4 participants