Add instruction and conversational data support #211
Conversation
src/together/cli/api/finetune.py
Outdated
"--train-on-inputs", | ||
type=BOOL_WITH_AUTO, | ||
default="auto", | ||
help="Whether to mask the user messages in conversational data or prompts in instruction data", |
Nit: might be good to explain what happens in the default case and how `auto` is handled.
Same as above (you can just put '"auto" will automatically determine whether to mask the inputs based on the data format.' here).
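For context, a minimal sketch of how a Click parameter type like `BOOL_WITH_AUTO` could accept either a boolean or the literal string `"auto"`; this illustrates the option's semantics and is not necessarily how the SDK implements it:

```python
# Hypothetical sketch of a bool-or-"auto" Click parameter type.
# The real BOOL_WITH_AUTO in the together CLI may differ.
from typing import Literal, Union

import click


class BoolWithAutoParamType(click.ParamType):
    name = "bool_or_auto"

    def convert(self, value, param, ctx) -> Union[bool, Literal["auto"]]:
        if isinstance(value, bool):
            return value
        norm = str(value).strip().lower()
        if norm == "auto":
            return "auto"
        if norm in {"true", "t", "1", "yes", "y"}:
            return True
        if norm in {"false", "f", "0", "no", "n"}:
            return False
        self.fail(f"{value!r} is not a valid boolean or 'auto'", param, ctx)


BOOL_WITH_AUTO = BoolWithAutoParamType()
```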
src/together/resources/finetune.py
Outdated
@@ -154,6 +157,8 @@ def create(
            Defaults to False.
        model_limits (FinetuneTrainingLimits, optional): Limits for the hyperparameters the model in Fine-tuning.
            Defaults to None.
        train_on_inputs (bool, optional): Whether to mask the user messages in conversational data or prompts in instruction data.
bool or "auto"
src/together/resources/finetune.py
Outdated
@@ -465,6 +472,7 @@ async def create(
            Defaults to False.
        model_limits (FinetuneTrainingLimits, optional): Limits for the hyperparameters the model in Fine-tuning.
            Defaults to None.
        train_on_inputs (bool, optional): Whether to mask the inputs in conversational data. Defaults to "auto".
Looks like this docstring is outdated
tests/unit/test_files_checks.py
Outdated
{"prompt": "Summarize the text.", "completion": "OpenAI creates advanced AI."}, | ||
] | ||
with file.open("w") as f: | ||
f.write("\n".join([json.dumps(item) for item in content])) |
f.write("\n".join([json.dumps(item) for item in content])) | |
f.write("\n".join(json.dumps(item) for item in content)) |
tests/unit/test_files_checks.py
Outdated
    },
]
with file.open("w") as f:
    f.write("\n".join([json.dumps(item) for item in content]))
Same here and onwards
src/together/utils/files.py
Outdated
for column in REQUIRED_COLUMNS_MESSAGE:
    if column not in turn:
        raise InvalidFileFormatError(
            message=f"Field '{column}' is missing for a turn `{turn}` on line {idx + 1} "
message=f"Field '{column}' is missing for a turn `{turn}` on line {idx + 1} " | |
message=f"Field `{column}` is missing for a turn `{turn}` on line {idx + 1} " |
src/together/utils/files.py
Outdated
if role not in POSSIBLE_ROLES_CONVERSATION:
    raise InvalidFileFormatError(
        message=f"Found invalid role '{role}' in the messages on the line {idx + 1}. "
message=f"Found invalid role '{role}' in the messages on the line {idx + 1}. " | |
message=f"Found invalid role `{role}` in the messages on the line {idx + 1}. " |
src/together/utils/files.py
Outdated
if previous_role == role:
    raise InvalidFileFormatError(
        message=f"Invalid role turns on line {idx + 1} of the input file. "
        "'user' and 'assistant' roles must alternate user/assistant/user/assistant/...",
"'user' and 'assistant' roles must alternate user/assistant/user/assistant/...", | |
"`user` and `assistant` roles must alternate user/assistant/user/assistant/...", |
src/together/resources/finetune.py
Outdated
"auto" will automatically determine whether to mask the inputs based on the data format. | ||
Dataset with "text" (General format) field will not mask the inputs by default. | ||
Dataset with "messages" (Conversational format) or "prompt" and "completion" (Instruction format) | ||
fields will mask the inputs by default. |
"auto" will automatically determine whether to mask the inputs based on the data format. | |
Dataset with "text" (General format) field will not mask the inputs by default. | |
Dataset with "messages" (Conversational format) or "prompt" and "completion" (Instruction format) | |
fields will mask the inputs by default. | |
"auto" will automatically determine whether to mask the inputs based on the data format. | |
For datasets with the "text" field (general format), inputs will not be masked. | |
For datasets with "messages" (conversational format) or "prompt" and "completion" (instruction format) | |
fields, inputs will be masked. |
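The suggested wording implies a simple resolution rule once the dataset format is known. A hypothetical helper capturing those semantics (not the SDK's actual code):

```python
# Hypothetical resolution of "auto": text-format data trains on everything,
# while conversational/instruction data masks the inputs.
from typing import Literal, Union


def resolve_train_on_inputs(
    train_on_inputs: Union[bool, Literal["auto"]],
    dataset_format: Literal["text", "messages", "prompt_completion"],
) -> bool:
    if train_on_inputs != "auto":
        return train_on_inputs  # explicit True/False wins
    # "auto": general (text) data keeps inputs in the loss;
    # the other formats mask user messages / prompts.
    return dataset_format == "text"


assert resolve_train_on_inputs("auto", "text") is True
assert resolve_train_on_inputs("auto", "messages") is False
assert resolve_train_on_inputs(False, "text") is False
```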
src/together/utils/files.py
Outdated
)

report_dict["is_check_passed"] = False
previous_role = ""
previous_role = "" | |
previous_role = None |
if current_format == DatasetFormat.CONVERSATION:
    message_column = JSONL_REQUIRED_COLUMNS_MAP[
        DatasetFormat.CONVERSATION
    ][0]
Nit: this is dangerous, we implicitly assume that the first item in the list is the message column name
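One way to address this nit would be to name the column explicitly instead of relying on list position; a sketch under assumed constants (the map contents here are illustrative, not the adopted fix):

```python
# Illustrative constants mirroring the diff above; not the SDK's actual fix.
from enum import Enum


class DatasetFormat(Enum):
    GENERAL = "general"
    CONVERSATION = "conversation"
    INSTRUCTION = "instruction"


JSONL_REQUIRED_COLUMNS_MAP = {
    DatasetFormat.GENERAL: ["text"],
    DatasetFormat.CONVERSATION: ["messages"],
    DatasetFormat.INSTRUCTION: ["prompt", "completion"],
}

# Fragile: silently depends on "messages" being listed first.
message_column = JSONL_REQUIRED_COLUMNS_MAP[DatasetFormat.CONVERSATION][0]

# More explicit alternative: a named constant, checked against the map.
MESSAGE_COLUMN = "messages"
assert MESSAGE_COLUMN in JSONL_REQUIRED_COLUMNS_MAP[DatasetFormat.CONVERSATION]
```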
We should just be opinionated on what we accept as values for train_on_inputs, error out for anything else, and have detailed documentation. Should we change the fact that users can pass in arbitrary values?
Reviewed new changes and lgtm 👍
src/together/cli/api/finetune.py
Outdated
)

model_limits: FinetuneTrainingLimits = client.fine_tuning.get_model_limits(
    model=model
)

if lora:
    log_warn_once(
        "LoRA rank default has been changed from 8 to 64 as the maximum available for each model."
Does this mean that every model supported by the API has 64 as the maximum possible rank? I think the current wording is confusing if this is not true
src/together/resources/finetune.py
Outdated
if train_on_inputs is None:
    raise ValueError("train_on_inputs cannot be None")
This is not allowed by the typing annotation itself; I believe we should not check for inputs that are not documented as permissible for the function.
src/together/resources/finetune.py
Outdated
if train_on_inputs is None:
    raise ValueError("train_on_inputs cannot be None")

train_on_inputs_bool = train_on_inputs if train_on_inputs != "auto" else None
It is not of boolean type if it gets mapped to None in one case
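For illustration, once `"auto"` is mapped to `None` the value is `Optional[bool]` rather than `bool`; a sketch of a signature that makes this explicit (names are illustrative):

```python
# Sketch: "auto" is normalized to None, meaning "let the server decide
# based on the data format". The result is Optional[bool], not bool.
from typing import Literal, Optional, Union


def normalize_train_on_inputs(
    train_on_inputs: Union[bool, Literal["auto"]],
) -> Optional[bool]:
    return train_on_inputs if train_on_inputs != "auto" else None
```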
"messages": [ | ||
{"role": "user", "content": "Who won the game last night?"}, |
Can we add a `system` message at the beginning of this conversation to ensure that these three roles are handled properly?
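For example, a fixture with all three roles might look like this (content invented for illustration):

```python
# Illustrative conversational sample with a leading "system" message.
content = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful sports assistant."},
            {"role": "user", "content": "Who won the game last night?"},
            {"role": "assistant", "content": "The home team won 3-1."},
        ]
    }
]
```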
@@ -230,6 +231,7 @@ class FinetuneResponse(BaseModel):
    # training file metadata
    training_file_num_lines: int | None = Field(None, alias="TrainingFileNumLines")
    training_file_size: int | None = Field(None, alias="TrainingFileSize")
    train_on_inputs: StrictBool | Literal["auto"] | None = "auto"
Is None still possible as a response, because older jobs don't have that attribute?
Yup, it's for retrieve or any other command that can see old data.
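A minimal sketch of the field's behavior, assuming the Pydantic model in the diff above (class name shortened for illustration):

```python
# The field admits True/False, "auto", and None (older jobs that predate
# the attribute keep None when retrieved).
from typing import Literal, Optional, Union

from pydantic import BaseModel, StrictBool


class FinetuneResponseSketch(BaseModel):
    train_on_inputs: Optional[Union[StrictBool, Literal["auto"]]] = "auto"


print(FinetuneResponseSketch().train_on_inputs)                      # "auto"
print(FinetuneResponseSketch(train_on_inputs=None).train_on_inputs)  # None
```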
Co-authored-by: Max Ryabinin <[email protected]>
Issue #ENG-10912
Add support for instruction and conversational training data for FT jobs.
The support goes in two ways:
- a `train_on_inputs` flag to train on completion/assistant phrases