
Sample command #374
Merged 43 commits into main on Dec 20, 2023

Conversation

@granawkins (Member) commented Dec 9, 2023

This adds a new command /sample, which bundles the last interaction together with active code (after you correct it) into a generic 'sample'. See mentat/sampler/README.md for more info.

This is a rebasing of #352.

Steps to complete:

  • Add git commands to create required diffs/hexsha
  • Add a sampler class, which builds itself from context, with tests
  • Add a /sample command and an integration test of the Sample class
  • Add Sample.evaluate method
  • Add script for model-benchmarking with samples.
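The exact schema isn't pinned down in this thread, but pulling together the fields discussed below (title, description, id, parent_id, sample_repo, merge_base, diff_merge_base, diff_active, diff_edit, messages), a saved sample might look roughly like this sketch; all field names and values here are illustrative, not the final format:

```python
import json
import uuid

# Hypothetical sketch only: field names and values are assembled from
# the fields discussed in this PR thread, not the final schema.
sample = {
    "title": "example sample",                      # plaintext by creator
    "description": "plaintext by creator",
    "id": str(uuid.uuid4()),
    "parent_id": None,    # id of the sample immediately before this one
    "repo": "https://github.com/AbanteAI/mentat",   # cloneable repo url
    "merge_base": "",     # hexsha of the starting commit
    "diff_merge_base": "",  # diff from the merge base to the starting code
    "diff_active": "",      # uncommitted changes present before the edit
    "diff_edit": "",        # the edit made in response to the interaction
    "messages": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
}
print(json.dumps(sample, indent=4))
```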

@granawkins marked this pull request as draft December 9, 2023 06:02
@@ -72,6 +77,9 @@ def parse_string(self, git_diff: str) -> ParsedLLMResponse:
    end_line = start_line + 1
else:
    end_line = start_line + int(a_b[1])
# TODO: This breaks some of the parser tests, but is required to make
# an insertion diff valid? Need to investigate.
end_line += 1  # FileEdits use exclusive end_line

Member Author:

@jakethekoenig I had to add this to get my sample_eval's diff (insert 2 lines) to work. I wonder if we adjusted file_edits to use exclusive end_line and this class was missed? Or maybe I need to dig more 😄
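For anyone following along, here is a minimal illustration (not mentat's actual FileEdit code) of the exclusive end_line convention and why a pure insertion needs the `+ 1` adjustment: with exclusive ranges, `lines[start:end]` is the span being replaced, so an insertion is the degenerate case `start == end`.

```python
def apply_edit(lines: list[str], start_line: int, end_line: int,
               new_lines: list[str]) -> list[str]:
    # end_line is exclusive: lines[start_line:end_line] is the replaced span.
    # start_line == end_line therefore means a pure insertion.
    return lines[:start_line] + new_lines + lines[end_line:]

lines = ["a", "b", "c"]
# Insert two lines before "b" (start == end == 1):
print(apply_edit(lines, 1, 1, ["x", "y"]))  # ['a', 'x', 'y', 'b', 'c']
# Replace "b" (inclusive end 1 becomes exclusive end 2):
print(apply_edit(lines, 1, 2, ["z"]))       # ['a', 'z', 'c']
```

With inclusive end_line there is no way to express an empty replaced span, which is why an insertion diff forces the off-by-one.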

@granawkins marked this pull request as ready for review December 10, 2023 15:12
mentat/sampler/README.md (outdated; resolved)
@PCSwingle (Member) commented Dec 12, 2023

I'm a bit confused on this; so the sample creates a json file detailing the changes a model made to an exact commit on a repo? Kind of like a commit except it's in a json file and has extra information (like user and assistant messages)? Also, what happens if there are already diffs when you /sample; do they get lumped in with the model changes? EDIT: I see the diff_active field now, I assume that's where they go.

@@ -0,0 +1,23 @@
# Mentat Sampler API
The Sampler API is an open-source standard for capturing interactions between a developer and an LLM-based AI assistant. It's intended to facilitate sharing of benchmarks and fine-tuning data throughout the open-source community and industry.

Member:

I was a bit confused by this and I work on mentat; I think it would be good if the description was a bit clearer. Maybe we could give a description of what it actually is? (A sample contains the git diff of an edit made by an LLM with the user and assistant messages leading up to that edit)

Member:

Also, I think it would be a lot simpler to just point to a git commit on github with the specified diff instead of having diff_edit, merge_base, diff_merge_base, and other arguments. Is there a reason that wouldn't work as well?

Member Author:

I'm mostly concerned with being able to generate samples as I work, rather than having to sit down and 'make samples'. In practice I'm hardly ever working directly from a permanent commit because we squash-merge everything. They certainly can be blank, and maybe they usually will be.

Member:

I'm pretty sure the commits will still be there (even if they are squash-merged). Git squashes all the commits into one commit and puts it on main, but the commits are still valid and can still be accessed via their hash.
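A quick way to verify this locally (a sketch, not part of the PR): `git cat-file -e` reports whether an object is still in the object database, even after its branch was squash-merged and deleted. The caveat is that a commit that is unreachable from any ref can eventually be garbage-collected.

```python
import subprocess

def commit_exists(hexsha: str, repo_path: str = ".") -> bool:
    """Check whether a commit object is still accessible by hash,
    even if its branch was squash-merged and deleted."""
    result = subprocess.run(
        ["git", "cat-file", "-e", f"{hexsha}^{{commit}}"],
        cwd=repo_path,
        capture_output=True,
    )
    return result.returncode == 0
```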

Member Author:

Oh you're right! That takes out a lot of cases.

Still though if you're working from a branch or commit that could get abandoned, and you want to create a sample, it'd be nice to have this option to make sure it'll always be valid.

I do think I can rearrange/rename some of the variables to make this less verbose and confusing.

Member:

Still though if you're working from a branch or commit that could get abandoned, and you want to create a sample, it'd be nice to have this option to make sure it'll always be valid.

Maybe, but as far as I know, git will only ever delete a commit if the user does it (via a command), which I don't think we should worry about. If the commit is taken down from its origin (like GitHub), then odds are all the other commits were too, in which case this entire sample is moot anyway.

Member Author:

By abandoned I mean never pushed to GH/remote in the first place. Then it'd only be available on my machine, right?

Member:

Oh, I see, yeah.

sample_repo: str | None = attr.field(
    default=None,
    metadata={
        "description": "A public url for a cloneable git repository to sample from."

Member:

We can get this for the user: git remote get-url origin.

Member Author:

Nice, didn't think of that! I'm going to leave this option available, and if it's missing, I'll grab the repo url and ask the user to confirm or enter a new one.

samples_dir.mkdir(exist_ok=True)
fpath = samples_dir / fname
sample.save(str(fpath))
SESSION_CONTEXT.get().stream.send(f"Sample saved to {fname}.", color="green")

Member:

Maybe we should print the whole path? It was confusing to find it after I saved it. I expected it to appear in my working directory, possibly under samples.
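One way to address this, as a sketch (`samples_dir` and `fname` here are hypothetical stand-ins for the values the /sample command uses): resolve the path to an absolute one before sending it.

```python
from pathlib import Path

# Hypothetical stand-ins for the command's actual values:
samples_dir = Path("samples")
samples_dir.mkdir(exist_ok=True)
fname = "sample_123.json"
fpath = samples_dir / fname

# Sending the absolute path makes the file easy to find afterwards:
message = f"Sample saved to {fpath.resolve()}."
print(message)
```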

@@ -0,0 +1,45 @@
import argparse

Member:

I made one sample and tried to run this script and got the following error:

git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git checkout f5057f1658b9c7edb5e45a2fa8c2198ded5b5c00
  stderr: 'fatal: reference is not a tree: f5057f1658b9c7edb5e45a2fa8c2198ded5b5c00'

That is a correct hexsha for our project, though.

Edit: the problem is that I had a stale mentat repo in benchmarking_repos. Maybe we should pull before evaluating.
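One way to guard against a stale clone (a sketch using plain git subprocess calls, not the script's actual code): try the checkout first, and fetch only when it fails, so the common case stays fast.

```python
import subprocess

def checkout_sample_commit(repo_path: str, hexsha: str) -> None:
    """Check out a sample's commit, fetching first if the local clone
    in benchmarking_repos is stale and doesn't have it yet."""
    try:
        subprocess.run(["git", "checkout", hexsha], cwd=repo_path,
                       check=True, capture_output=True)
    except subprocess.CalledProcessError:
        # The commit may be missing from a stale clone: fetch and retry.
        subprocess.run(["git", "fetch", "--all"], cwd=repo_path,
                       check=True, capture_output=True)
        subprocess.run(["git", "checkout", hexsha], cwd=repo_path,
                       check=True, capture_output=True)
```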

@@ -199,3 +214,81 @@ def get_default_branch() -> str:
except subprocess.CalledProcessError:
# Handle error if needed or raise an exception
raise Exception("Unable to determine the default branch.")


def get_merge_base() -> str | None:

Member:

I get the following error when I don't set --sample-merge-base-target myself:


Unhandled Exception: Traceback (most recent call last):
  File "/home/jake/Development/mentat/mentat/session.py", line 193, in run_main
    await self._main()
  File "/home/jake/Development/mentat/mentat/session.py", line 133, in _main
    message = await collect_input_with_commands()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jake/Development/mentat/mentat/session_input.py", line 62, in collect_input_with_commands
    await command.apply(*arguments[1:])
  File "/home/jake/Development/mentat/mentat/command/commands/sample.py", line 12, in apply
    sample = await Sample.from_context()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jake/Development/mentat/mentat/sampler/sample.py", line 170, in from_context
    raise HistoryError("EditHistory.merge_base was not set.")
mentat.errors.HistoryError: EditHistory.merge_base was not set.

But it looks like this function is trying to set a reasonable default?

Member Author:

Aah ya these are not coordinated. Good catch.

@@ -72,6 +72,9 @@ async def write_changes_to_files(
if not file_edits:
return []

# Set pre-edit context in case /sample is called later
self.history.set_sample_diffs()

Member:

Why does this have to be done here? Would it not work if it was called in the sample command?

Member Author (@granawkins) commented Dec 13, 2023:

It really only needs to be the diff_active. It's here so that, later, we can determine what changes were made in response to the last interaction. It requires a bit of diff algebra, which I also need to solve so I'll write it out here.

We want to do:

  1. Set diff_active here, i.e. git diff HEAD
  2. Mentat updates the code
  3. You update the code
  4. Set diff_edit to the diff between the current code and diff_active

In order to do #4, we need to call git diff HEAD again, and 'subtract' the outputs from #1.
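An alternative to subtracting two `git diff HEAD` outputs (a sketch, not the PR's implementation): snapshot the pre-edit working tree as a git tree object at step 1, snapshot again at step 4, and diff the two trees directly, which sidesteps the diff algebra entirely.

```python
import subprocess

def snapshot_tree(repo_path: str = ".") -> str:
    """Record the current working tree as a git tree object and return
    its hash. Stages everything temporarily, writes the tree, unstages.
    Assumes the repo has at least one commit."""
    subprocess.run(["git", "add", "-A"], cwd=repo_path, check=True)
    tree = subprocess.check_output(
        ["git", "write-tree"], cwd=repo_path, text=True
    ).strip()
    subprocess.run(["git", "reset", "-q"], cwd=repo_path, check=True)
    return tree

def diff_between(tree_a: str, tree_b: str, repo_path: str = ".") -> str:
    """Diff two recorded snapshots; e.g. diff_edit = diff_between(t1, t2)."""
    return subprocess.check_output(
        ["git", "diff", tree_a, tree_b], cwd=repo_path, text=True
    )
```

Tree objects written this way live in the object database without needing a commit, so the snapshot doesn't disturb the user's history.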


def save(self, fname: str) -> None:
    with open(fname, "w") as f:
        json.dump(attr.asdict(self), f, indent=4)
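For reference, a hypothetical round-trip counterpart (`load_sample` is not part of this PR; a real implementation would likely rebuild a `Sample` instance via attrs rather than return a plain dict):

```python
import json

def load_sample(fname: str) -> dict:
    """Load a saved sample back as a plain dict."""
    with open(fname) as f:
        return json.load(f)
```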

Member:

Is the plan to rename and check into version control the benchmarks we like?

Member Author:

Ya, we should somehow. Or set up a public repo that anyone can contribute to, and curate lists of IDs for different purposes.

os.remove(f".sample_{temp_id}.diff")


def setup_repo(sample: Sample, path_to_repo: Path | str | None) -> Path:

Member:

I think this and the following method should be Sample instance methods.

| title | `str` | plaintext by creator |
| description | `str` | plaintext by creator |
| id | `uuid` | |
| parent_id | `uuid` | id of sample immediately before this |

Member:

parent_id is set if you make multiple samples in the same conversation? What's the point of tracking that information?

Member Author:

I'm imagining if you have 3 interactions in a row, you could use it to benchmark or fine-tune an agent LLM.

|------------------|---------------|-------------|
| title | `str` | plaintext by creator |
| description | `str` | plaintext by creator |
| id | `uuid` | |

Member:

Why do we need an id at all? Isn't a title enough? I guess we plan to make so many (eventually) that we won't have time to name them?

Member Author:

I wouldn't want to rely on names being unique - esp if it's public.

text=True,
)
# TODO: Run the LLM evaluations from benchmark
# TODO: Save the results

Member:

I assume these TODOs are for another PR? Doesn't seem necessary to nail down how samples will be turned into benchmarks in this one.

@jakethekoenig (Member):

I get the following exception when I run the evaluate samples script with a dirty git repo:
[screenshot of the exception]

if response == "y":
    repo = remote_url
else:
    repo = response

Member:

The text says the default is yes so if the response is "y", "Y" or "" you should use remote_url.
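A small helper along those lines (a sketch; `resolve_repo` is a hypothetical name, not from the PR):

```python
def resolve_repo(response: str, remote_url: str) -> str:
    """Treat 'y', 'Y', 'yes', or an empty reply as accepting the default
    remote_url; anything else is taken as a repo url the user typed in."""
    response = response.strip()
    if response.lower() in ("y", "yes", ""):
        return remote_url
    return response
```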

repo = response
config.sample_repo = repo

stream.send("Sample Title:")

Member:

I don't like this UX. But I guess if we're the only ones using it, it's fine.

Member Author:

what don't you like about it?

def parse_message(message: ChatCompletionMessageParam) -> dict[str, str]:
    ctx = SESSION_CONTEXT.get()
    content = message.get("content")
    text, code = "", ""

Member:

code is never reassigned.

Member Author:

This was intentional - I mean to convert code messages to diff format, but we don't have an easy way of doing that (per your comment below), so it's a placeholder for now.

text = content
output = list[str]()
in_special = False
for line in content.splitlines():

Member:

The point of this for loop is to extract the LLM's non-code-editing message, correct? I don't like how it uses _starts_special and _ends_special because it assumes some things about the parser that aren't necessarily true. For instance, I don't think this would work if the user was using the json parser. Part of the reason we have to do this is that the parser only exposes a stream_and_parse_llm_response that acts on an AsyncIterator[ChatCompletionChunk], as opposed to one that acts on a string. Maybe it would be better to add a parse_string method to the parser that wraps the string in an async iterator and then calls stream_and_parse_llm_response, and then this method could use that? Maybe @PCSwingle has thoughts.

Member Author:

Agreed - I nearly took a crack at this.

Member:

I think a parse_string method would be good; we'd only need to implement it in the parser superclass, so it wouldn't add any overhead to new parsers, which would be nice. I agree with Jake that we shouldn't really be using _starts_special and _ends_special; those are really just meant to assist with streaming. The best approach would be to add a parse_string method, call that, and then extract the code from the output FileEdits. We also might need a switch to turn off streaming in the stream-and-parse function for callers that don't want to stream the output.
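A sketch of the adapter shape being proposed (names and signatures are assumed; the real stream_and_parse_llm_response consumes an AsyncIterator[ChatCompletionChunk], and the chunk wrapping is omitted here for brevity):

```python
import asyncio
from typing import AsyncIterator

async def string_to_stream(text: str, chunk_size: int = 64) -> AsyncIterator[str]:
    """Yield a complete response string in chunks, mimicking a streamed
    LLM response so the existing streaming parser can consume it."""
    for i in range(0, len(text), chunk_size):
        yield text[i : i + chunk_size]

async def parse_string(text: str) -> str:
    # A real implementation would wrap each chunk in a ChatCompletionChunk
    # and pass the iterator to stream_and_parse_llm_response; here we just
    # reassemble the chunks to show the adapter shape round-trips.
    return "".join([chunk async for chunk in string_to_stream(text)])
```

This keeps all parsing logic in one place: both the streaming path and the string path would go through the same parser code.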

@granawkins (Member Author):

I get the following exception when I run the evaluate samples script with a dirty git repo: [screenshot]

I'm checking for text files now, so hopefully this won't happen.

@jakethekoenig (Member) left a comment:

Given this is for our own use I guess it's good enough. I'd really like to see the parse_message function improved though. Another idea would be to have the parser return the non code message separately and have the conversation hold onto that data.

Maybe we should hide the sample command and the config settings related to sampling? I think they'd confuse users.

@PCSwingle (Member):

I definitely think that until we change it to a parse_string function we shouldn't have this command be visible to users (since it's going to be super buggy if we use _starts_special and a few other parser specific functions in here which aren't meant to be used outside the parser).

@granawkins merged commit cae4930 into main on Dec 20, 2023
16 checks passed