Add instruction and conversational data support #211

Merged
21 commits merged from artek0chumak/add_format_check into main on Nov 14, 2024

Conversation

artek0chumak (Contributor):

Issue #ENG-10912

Add support for instruction and conversational training data for fine-tuning (FT) jobs.
The support comes in two parts:

  • File checks to make sure the data is formatted correctly
  • A train_on_inputs flag to train only on completion/assistant phrases
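For reference, the dataset formats involved look roughly like this (a hypothetical illustration; field contents are invented, field names follow the formats discussed in this PR):

```python
import json

# Hypothetical examples of the three dataset formats discussed in this PR.
general = {"text": "Once upon a time, there was a brave knight."}
instruction = {"prompt": "Summarize the text.", "completion": "OpenAI creates advanced AI."}
conversational = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the game last night?"},
        {"role": "assistant", "content": "The home team won 3-1."},
    ]
}

# Training files are JSON Lines: one JSON object per line.
jsonl = "\n".join(json.dumps(item) for item in (general, instruction, conversational))
print(len(jsonl.splitlines()))  # 3
```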

src/together/cli/api/finetune.py (resolved, outdated)
src/together/cli/api/finetune.py (resolved)
src/together/cli/api/utils.py (resolved, outdated)
src/together/cli/api/finetune.py (resolved, outdated)
src/together/constants.py (resolved, outdated)
tests/unit/test_files_checks.py (resolved, outdated)
tests/unit/test_files_checks.py (resolved, outdated)
tests/unit/test_files_checks.py (resolved, outdated)
tests/unit/test_files_checks.py (resolved)
tests/unit/test_files_checks.py (resolved, outdated)
src/together/cli/api/finetune.py (resolved)
"--train-on-inputs",
type=BOOL_WITH_AUTO,
default="auto",
help="Whether to mask the user messages in conversational data or prompts in instruction data",
Member:

Nit: might be good to explain what happens in the default case and how auto is handled

Member:

Same as above (you can just add: "auto" will automatically determine whether to mask the inputs based on the data format.)

@@ -154,6 +157,8 @@ def create(
Defaults to False.
model_limits (FinetuneTrainingLimits, optional): Limits for the hyperparameters the model in Fine-tuning.
Defaults to None.
train_on_inputs (bool, optional): Whether to mask the user messages in conversational data or prompts in instruction data.
Member:

bool or "auto"

src/together/resources/finetune.py (resolved)
@@ -465,6 +472,7 @@ async def create(
Defaults to False.
model_limits (FinetuneTrainingLimits, optional): Limits for the hyperparameters the model in Fine-tuning.
Defaults to None.
train_on_inputs (bool, optional): Whether to mask the inputs in conversational data. Defaults to "auto".
Member:

Looks like this docstring is outdated

{"prompt": "Summarize the text.", "completion": "OpenAI creates advanced AI."},
]
with file.open("w") as f:
f.write("\n".join([json.dumps(item) for item in content]))
Member:

Suggested change
f.write("\n".join([json.dumps(item) for item in content]))
f.write("\n".join(json.dumps(item) for item in content))

},
]
with file.open("w") as f:
f.write("\n".join([json.dumps(item) for item in content]))
Member:

Same here and onwards

for column in REQUIRED_COLUMNS_MESSAGE:
if column not in turn:
raise InvalidFileFormatError(
message=f"Field '{column}' is missing for a turn `{turn}` on line {idx + 1} "
Member:

Suggested change
message=f"Field '{column}' is missing for a turn `{turn}` on line {idx + 1} "
message=f"Field `{column}` is missing for a turn `{turn}` on line {idx + 1} "


if role not in POSSIBLE_ROLES_CONVERSATION:
raise InvalidFileFormatError(
message=f"Found invalid role '{role}' in the messages on the line {idx + 1}. "
Member:

Suggested change
message=f"Found invalid role '{role}' in the messages on the line {idx + 1}. "
message=f"Found invalid role `{role}` in the messages on the line {idx + 1}. "

if previous_role == role:
raise InvalidFileFormatError(
message=f"Invalid role turns on line {idx + 1} of the input file. "
"'user' and 'assistant' roles must alternate user/assistant/user/assistant/...",
Member:

Suggested change
"'user' and 'assistant' roles must alternate user/assistant/user/assistant/...",
"`user` and `assistant` roles must alternate user/assistant/user/assistant/...",
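The role checks under discussion can be sketched as follows. This is a simplified, hypothetical version: the real validator also verifies required fields and reports line numbers, and the handling of a leading system message is an assumption here.

```python
POSSIBLE_ROLES_CONVERSATION = ("system", "user", "assistant")

def check_conversation(messages):
    """Validate one conversation: only known roles, and `user`/`assistant`
    strictly alternating. Simplified sketch of the checks quoted above."""
    previous_role = None
    for turn in messages:
        role = turn["role"]
        if role not in POSSIBLE_ROLES_CONVERSATION:
            raise ValueError(f"Found invalid role `{role}` in the messages")
        if role == "system":
            # Assumption: an optional leading system message does not
            # participate in the user/assistant alternation check.
            continue
        if previous_role == role:
            raise ValueError(
                "`user` and `assistant` roles must alternate "
                "user/assistant/user/assistant/..."
            )
        previous_role = role
```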

"--train-on-inputs",
type=BOOL_WITH_AUTO,
default="auto",
help="Whether to mask the user messages in conversational data or prompts in instruction data",
Member:

Same as above (you can just add: "auto" will automatically determine whether to mask the inputs based on the data format.)

Comment on lines 161 to 164
"auto" will automatically determine whether to mask the inputs based on the data format.
Dataset with "text" (General format) field will not mask the inputs by default.
Dataset with "messages" (Conversational format) or "prompt" and "completion" (Instruction format)
fields will mask the inputs by default.
Member:

Suggested change
"auto" will automatically determine whether to mask the inputs based on the data format.
Dataset with "text" (General format) field will not mask the inputs by default.
Dataset with "messages" (Conversational format) or "prompt" and "completion" (Instruction format)
fields will mask the inputs by default.
"auto" will automatically determine whether to mask the inputs based on the data format.
For datasets with the "text" field (general format), inputs will not be masked.
For datasets with "messages" (conversational format) or "prompt" and "completion" (instruction format)
fields, inputs will be masked.
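The default described above could be resolved with a small helper. This is a hypothetical sketch; the function and enum names are assumptions, not taken from this PR:

```python
from enum import Enum

class DatasetFormat(Enum):
    GENERAL = "general"            # {"text": ...}
    INSTRUCTION = "instruction"    # {"prompt": ..., "completion": ...}
    CONVERSATION = "conversation"  # {"messages": [...]}

def resolve_train_on_inputs(train_on_inputs, dataset_format):
    """Resolve "auto" per the rules above: general-format data trains on the
    full text (inputs are not masked), while instruction and conversational
    data mask the inputs by default. Hypothetical helper."""
    if train_on_inputs != "auto":
        return train_on_inputs
    return dataset_format is DatasetFormat.GENERAL
```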

src/together/resources/finetune.py (resolved)
)

report_dict["is_check_passed"] = False
previous_role = ""
Member:

Suggested change
previous_role = ""
previous_role = None

if current_format == DatasetFormat.CONVERSATION:
message_column = JSONL_REQUIRED_COLUMNS_MAP[
DatasetFormat.CONVERSATION
][0]
Member:

Nit: this is dangerous, we implicitly assume that the first item in the list is the message column name
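One way to address this nit is to name the column explicitly instead of indexing into the required-columns list. A hypothetical sketch (the constant name is an assumption):

```python
from enum import Enum

class DatasetFormat(Enum):
    GENERAL = "general"
    CONVERSATION = "conversation"

# Instead of JSONL_REQUIRED_COLUMNS_MAP[DatasetFormat.CONVERSATION][0],
# which silently assumes list ordering, an explicit mapping states the intent:
MESSAGE_COLUMN_BY_FORMAT = {DatasetFormat.CONVERSATION: "messages"}

message_column = MESSAGE_COLUMN_BY_FORMAT[DatasetFormat.CONVERSATION]
```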


zainhas commented Nov 13, 2024

When I pass in train_on_inputs="true" or train_on_inputs="false", it should give me an error; however, it submits and completes the job. This is probably because non-empty strings like "true" and "false" are all truthy?

We should be opinionated about which values we accept for train_on_inputs, error out on anything else, and have detailed documentation.

Should we change the fact that users can pass in train_on_inputs=None, which defaults to the same behavior as train_on_inputs="auto"? It's not obvious that None should be equivalent to "auto".
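A strict check along these lines would reject truthy strings instead of letting them slip through. A hypothetical sketch (the function name is an assumption; this is the API-side validation, separate from the CLI's BOOL_WITH_AUTO parameter type, which parses strings):

```python
def validate_train_on_inputs(value):
    """Accept only True, False, or the literal string "auto"; reject truthy
    strings like "true"/"false" that bool() would silently accept."""
    if isinstance(value, bool) or value == "auto":
        return value
    raise ValueError(
        f'train_on_inputs must be True, False, or "auto", got {value!r}'
    )
```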


zainhas commented Nov 13, 2024

Reviewed new changes and lgtm👍

@artek0chumak artek0chumak requested a review from zainhas November 13, 2024 16:21
)

model_limits: FinetuneTrainingLimits = client.fine_tuning.get_model_limits(
model=model
)

if lora:
log_warn_once(
"LoRA rank default has been changed from 8 to 64 as the maximum available for each model."
Member:

Does this mean that every model supported by the API has 64 as the maximum possible rank? I think the current wording is confusing if this is not true

Comment on lines 86 to 87
if train_on_inputs is None:
raise ValueError("train_on_inputs cannot be None")
Member:

This is not allowed by the typing annotation itself; I believe we should not check for inputs that are not documented as permissible for the function.

if train_on_inputs is None:
raise ValueError("train_on_inputs cannot be None")

train_on_inputs_bool = train_on_inputs if train_on_inputs != "auto" else None
Member:

It is not of boolean type if it gets mapped to None in one case
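One way to keep the naming honest here is to give the mapped value a name and type that reflect its optionality. A hypothetical sketch:

```python
from typing import Optional, Union

def to_request_value(train_on_inputs: Union[bool, str]) -> Optional[bool]:
    """Map the user-facing value to the API payload: "auto" is sent as None so
    the backend applies its default; explicit booleans pass through unchanged.
    Hypothetical helper; the name avoids a misleading `_bool` suffix, since
    the result may be None."""
    return None if train_on_inputs == "auto" else train_on_inputs
```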

Comment on lines +89 to +90
"messages": [
{"role": "user", "content": "Who won the game last night?"},
Member:

Can we add a system message at the beginning of this conversation to ensure that these three roles are handled properly?
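A fixture covering all three roles, as requested, might look like this (hypothetical test data):

```python
# Hypothetical test fixture: a conversation that starts with a system message,
# so the validator is exercised against all three roles.
content = [
    {
        "messages": [
            {"role": "system", "content": "You are a sports reporter."},
            {"role": "user", "content": "Who won the game last night?"},
            {"role": "assistant", "content": "The home team won 3-1."},
        ]
    }
]

roles = [turn["role"] for turn in content[0]["messages"]]
```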

src/together/cli/api/finetune.py (resolved, outdated)
@@ -230,6 +231,7 @@ class FinetuneResponse(BaseModel):
# training file metadata
training_file_num_lines: int | None = Field(None, alias="TrainingFileNumLines")
training_file_size: int | None = Field(None, alias="TrainingFileSize")
train_on_inputs: StrictBool | Literal["auto"] | None = "auto"
Member:

Is None still possible as a response, because older jobs don't have that attribute?

Contributor (Author):

Yup, it's for retrieve or any other command that can see old data.
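Given the `StrictBool | Literal["auto"] | None` annotation, a consumer can normalize the legacy None explicitly. A hypothetical sketch (the helper name is an assumption):

```python
from typing import Literal, Optional, Union

# The response field's shape, per the diff: StrictBool | Literal["auto"] | None.
TrainOnInputs = Optional[Union[bool, Literal["auto"]]]

def normalize_train_on_inputs(value: TrainOnInputs) -> Union[bool, Literal["auto"]]:
    """Older jobs predate the attribute and deserialize as None; treat that
    the same as the "auto" default. Hypothetical normalization helper."""
    return "auto" if value is None else value
```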

@orangetin orangetin merged commit e157fcd into main Nov 14, 2024
9 of 13 checks passed
@orangetin orangetin deleted the artek0chumak/add_format_check branch November 14, 2024 18:14
4 participants