Sample command #374
Conversation
mentat/parsers/git_parser.py
Outdated
@@ -72,6 +77,9 @@ def parse_string(self, git_diff: str) -> ParsedLLMResponse:
            end_line = start_line + 1
        else:
            end_line = start_line + int(a_b[1])
        # TODO: This breaks some of the parser tests, but is required to make
        # an insertion diff valid? Need to investigate.
        end_line += 1  # FileEdits use exclusive end_line
@jakethekoenig I had to add this to get my sample_eval's diff (insert 2 lines) to work. I wonder if we adjusted file_edits to use exclusive end_line and this class was missed? Or maybe I need to dig more 😄
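A minimal sketch (not the real FileEdit class, whose details are assumed here) of why the `+1` above matters: if FileEdits treat `end_line` as exclusive, a pure insertion that replaces zero existing lines must have `end_line == start_line`.

```python
def replaced_lines(lines: list[str], start_line: int, end_line: int) -> list[str]:
    # end_line is exclusive, mirroring the FileEdit semantics discussed above;
    # lines are 1-indexed as in a diff hunk
    return lines[start_line - 1:end_line - 1]

lines = ["a", "b", "c"]
assert replaced_lines(lines, 2, 2) == []     # pure insertion before line 2
assert replaced_lines(lines, 2, 3) == ["b"]  # replace line 2 only
```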
I'm a bit confused on this; so the sample creates a json file detailing the changes a model made to an exact commit on a repo? Kind of like a commit, except it's in a json file and has extra information (like user and assistant messages)?
mentat/sampler/README.md
Outdated
@@ -0,0 +1,23 @@
# Mentat Sampler API
The Sampler API is an open-source standard to capture interactions between a developer and an LLM-based AI Assistant. It's intended to facilitate sharing of benchmarks and fine-tuning data throughout the open-source community and industry.
I was a bit confused by this and I work on mentat; I think it would be good if the description was a bit clearer. Maybe we could give a description of what it actually is? (A sample contains the git diff of an edit made by an LLM with the user and assistant messages leading up to that edit)
Also, I think it would be a lot simpler to just point to a git commit on github with the specified diff instead of having diff_edit, merge_base, diff_merge_base, and other arguments. Is there a reason that wouldn't work as well?
I'm mostly concerned with being able to generate samples as I work, rather than having to sit down and 'make samples'. In practice I'm hardly ever working directly from a permanent commit because we squash-merge everything. They certainly can be blank, and maybe they usually will be.
I'm pretty sure the commits will still be there (even if they are squash-merged). Git squashes all the commits into one commit and puts it on main, but the commits are still valid and can still be accessed via their hash.
Oh you're right! That takes out a lot of cases.
Still though if you're working from a branch or commit that could get abandoned, and you want to create a sample, it'd be nice to have this option to make sure it'll always be valid.
I do think I can rearrange/rename some of the variables to make this less verbose and confusing.
Still though if you're working from a branch or commit that could get abandoned, and you want to create a sample, it'd be nice to have this option to make sure it'll always be valid.
Maybe, but as far as I know, git will only ever delete a commit if the user does it (via a command), which I don't think we should worry about. If the commit is taken down off of its origin (like github), then odds are all the other commits were too, in which case this entire sample is moot anyways.
By abandoned I mean never pushed to GH/remote in the first place. Then it'd only be available on my machine, right?
Oh, I see, yeah.
sample_repo: str | None = attr.field(
    default=None,
    metadata={
        "description": "A public url for a cloneable git repository to sample from."
We can get this for the user: `git remote get-url origin`.
Nice, didn't think of that! I'm going to leave this option available, and if it's missing, I'll grab the repo url and ask the user to confirm or enter a new one.
mentat/command/commands/sample.py
Outdated
samples_dir.mkdir(exist_ok=True)
fpath = samples_dir / fname
sample.save(str(fpath))
SESSION_CONTEXT.get().stream.send(f"Sample saved to {fname}.", color="green")
Maybe we should print the whole path? It was confusing to find it after I saved it. I expected it to appear in my working directory, possibly under samples.
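A minimal sketch of the suggested change, reporting the absolute path instead of just the filename (the helper name is hypothetical):

```python
from pathlib import Path

def saved_message(samples_dir: Path, fname: str) -> str:
    # Resolve to an absolute path so the saved sample is easy to find,
    # rather than printing only the bare filename.
    return f"Sample saved to {(samples_dir / fname).resolve()}."
```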
@@ -0,0 +1,45 @@
import argparse
I made one sample and tried to run this script and got the following error:
git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
cmdline: git checkout f5057f1658b9c7edb5e45a2fa8c2198ded5b5c00
stderr: 'fatal: reference is not a tree: f5057f1658b9c7edb5e45a2fa8c2198ded5b5c00'
That is a correct hexsha for our project though.
Edit: the problem is that I had a stale mentat repo in benchmarking_repos. Maybe we should pull before evaluating.
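The "pull before evaluating" fix could look something like this (a hedged sketch; the function name and the `benchmarking_repos` layout are assumed from the discussion, not the PR code):

```python
import subprocess
from pathlib import Path

def refresh_cached_repo(repo_dir: Path) -> None:
    # Fetch before checking out a sample's commit, so a stale clone in
    # benchmarking_repos still knows about commits made after it was cloned.
    subprocess.run(["git", "fetch", "--all"], cwd=repo_dir, check=True)
```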
mentat/git_handler.py
Outdated
@@ -199,3 +214,81 @@ def get_default_branch() -> str:
    except subprocess.CalledProcessError:
        # Handle error if needed or raise an exception
        raise Exception("Unable to determine the default branch.")

def get_merge_base() -> str | None:
I get the following error when I don't set --sample-merge-base-target myself:
Unhandled Exception: Traceback (most recent call last):
File "/home/jake/Development/mentat/mentat/session.py", line 193, in run_main
await self._main()
File "/home/jake/Development/mentat/mentat/session.py", line 133, in _main
message = await collect_input_with_commands()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jake/Development/mentat/mentat/session_input.py", line 62, in collect_input_with_commands
await command.apply(*arguments[1:])
File "/home/jake/Development/mentat/mentat/command/commands/sample.py", line 12, in apply
sample = await Sample.from_context()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jake/Development/mentat/mentat/sampler/sample.py", line 170, in from_context
raise HistoryError("EditHistory.merge_base was not set.")
mentat.errors.HistoryError: EditHistory.merge_base was not set.
But it looks like this function is trying to set a reasonable default?
Aah ya these are not coordinated. Good catch.
mentat/code_file_manager.py
Outdated
@@ -72,6 +72,9 @@ async def write_changes_to_files(
        if not file_edits:
            return []

        # Set pre-edit context in case /sample is called later
        self.history.set_sample_diffs()
Why does this have to be done here? Would it not work if it was called in the sample command?
It really only needs to be the `diff_active`. It's here so that, later, we can determine what changes were made in response to the last interaction. It requires a bit of diff algebra, which I also need to solve, so I'll write it out here.
We want to:
1. Set `diff_active` here, i.e. `git diff HEAD`
2. Mentat updates the code
3. You update the code
4. Set `diff_edit` to the diff between the current code and `diff_active`
In order to do #4, we need to call `git diff HEAD` again and 'subtract' the outputs from #1.
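One way to sidestep the diff 'subtraction' entirely, sketched below under the assumption that we snapshot file contents at step 1 rather than storing raw git diffs: `diff_edit` then becomes a plain diff of the current text against the snapshot.

```python
import difflib

def diff_edit(pre_edit: dict[str, str], current: dict[str, str]) -> str:
    # pre_edit: {path: contents} captured before Mentat's edit (step 1);
    # current: the same mapping after the user's corrections (step 4).
    chunks: list[str] = []
    for path in sorted(set(pre_edit) | set(current)):
        old = pre_edit.get(path, "").splitlines(keepends=True)
        new = current.get(path, "").splitlines(keepends=True)
        chunks.extend(difflib.unified_diff(old, new, f"a/{path}", f"b/{path}"))
    return "".join(chunks)
```

This avoids comparing two `git diff HEAD` outputs, at the cost of holding a copy of each touched file in memory.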
def save(self, fname: str) -> None:
    with open(fname, "w") as f:
        json.dump(attr.asdict(self), f, indent=4)
Is the plan to rename and check into version control the benchmarks we like?
Ya, we should somehow. Or set up a public repo that anyone can contribute to, and curate lists of IDs for different purposes.
mentat/sampler/sample.py
Outdated
os.remove(f".sample_{temp_id}.diff")

def setup_repo(sample: Sample, path_to_repo: Path | str | None) -> Path:
I think this and the following method should be Sample instance methods.
mentat/sampler/README.md
Outdated
| title | `str` | plaintext by creator |
| description | `str` | plaintext by creator |
| id | `uuid` | |
| parent_id | `uuid` | id of sample immediately before this |
`parent_id` is set if you make multiple samples in the same conversation? What's the point of tracking that information?
I'm imagining if you have 3 interactions in a row, you could use it to benchmark or fine-tune an agent LLM.
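The multi-interaction use case could be illustrated like this (a hypothetical sketch; samples are treated as plain dicts keyed by id, which is an assumption about the storage format):

```python
def sample_chain(samples: dict[str, dict], leaf_id: str) -> list[dict]:
    # Walk parent_id links from the last sample back to the first, then
    # reverse, yielding the consecutive interactions in order for
    # multi-turn benchmarking or fine-tuning.
    chain: list[dict] = []
    sid: str | None = leaf_id
    while sid is not None:
        sample = samples[sid]
        chain.append(sample)
        sid = sample.get("parent_id")
    return list(reversed(chain))
```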
mentat/sampler/README.md
Outdated
|------------------|---------------|-------------|
| title | `str` | plaintext by creator |
| description | `str` | plaintext by creator |
| id | `uuid` | |
Why do we need an `id` at all? Isn't a title enough? I guess we plan to make so many (eventually) that we won't have time to name them?
I wouldn't want to rely on names being unique - esp if it's public.
mentat/sampler/sample.py
Outdated
    text=True,
)
# TODO: Run the LLM evaluations from benchmark
# TODO: Save the results
I assume these TODOs are for another PR? Doesn't seem necessary to nail down how samples will be turned into benchmarks in this one.
if response == "y":
    repo = remote_url
else:
    repo = response
The text says the default is yes, so if the response is "y", "Y" or "" you should use `remote_url`.
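The suggested fix, as a small sketch (the function name is made up; the real code operates inline in the command):

```python
def choose_repo(response: str, remote_url: str) -> str:
    # Default is yes: "y", "Y", and an empty response all accept remote_url;
    # anything else is treated as a replacement URL.
    if response.strip().lower() in ("y", ""):
        return remote_url
    return response
```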
repo = response
config.sample_repo = repo

stream.send("Sample Title:")
I don't like this UX. But I guess if we're the only ones using it, it's fine.
What don't you like about it?
def parse_message(message: ChatCompletionMessageParam) -> dict[str, str]:
    ctx = SESSION_CONTEXT.get()
    content = message.get("content")
    text, code = "", ""
`code` is never reassigned.
This was intentional - I mean to convert code messages to diff format, but we don't have an easy way of doing that (per your comment below), so it's a placeholder for now.
text = content
output = list[str]()
in_special = False
for line in content.splitlines():
The point of this for loop is to extract the LLM's non-code-editing message, correct? I don't like how it uses `_starts_special` and `_ends_special` because it assumes some things about the parser that aren't necessarily true. For instance, I don't think this would work if the user was using the json parser. Part of the reason we have to do this is because the parser only exposes a `stream_and_parse_llm_response` that acts on an `AsyncIterator[ChatCompletionChunk]` as opposed to one that acts on a string. Maybe it would be better to add a `parse_string` method to the parser that wraps the string in an async iterator and then calls `stream_and_parse_llm_response`, and then this method could use that? Maybe @PCSwingle has thoughts.
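The wrapping idea could be sketched as follows. This is a hedged illustration only: `stream_and_parse_llm_response` and `ChatCompletionChunk` are from the surrounding mentat code and are not reproduced here, so the consumer below is a stand-in that simply reassembles the chunks.

```python
import asyncio
from typing import AsyncIterator

async def _as_chunks(text: str, size: int = 64) -> AsyncIterator[str]:
    # Wrap a complete response string in an async iterator so the existing
    # streaming entry point can consume it unchanged.
    for i in range(0, len(text), size):
        yield text[i:i + size]

async def parse_string(text: str) -> str:
    # Real code would pass _as_chunks(text) to stream_and_parse_llm_response;
    # here we just reassemble to show the iterator round-trips the string.
    return "".join([chunk async for chunk in _as_chunks(text)])
```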
Agreed - I nearly took a crack at this.
I think a parse_string method would be good; we'd only need to implement it in the parser superclass, so it wouldn't add any overhead to new parsers, which would be nice. I agree with jake that we shouldn't really be using _starts_special and _ends_special; those are really just meant to assist with streaming. The best method would be to add a parse_string method, call that, and then extract the code from the output FileEdits. We also might need to add a switch to turn off streaming for the stream-and-parse function if you don't want to stream the output.
Given this is for our own use, I guess it's good enough. I'd really like to see the `parse_message` function improved though. Another idea would be to have the parser return the non-code message separately and have the conversation hold onto that data.
Maybe we should hide the sample command and the config settings related to sampling? I think they'd confuse users.
I definitely think that until we change it to a parse_string function we shouldn't have this command be visible to users (since it's going to be super buggy if we use _starts_special and a few other parser-specific functions in here which aren't meant to be used outside the parser).
This adds a new command `/sample`, which bundles the last interaction together with active code (after you correct it) into a generic 'sample'. See mentat/sampler/README.md for more info. This is a rebasing of #352.
Steps to complete:
- a `/sample` command and an integration test of the Sample class
- a `Sample.evaluate` method