
Sample command #374
Merged 43 commits into main on Dec 20, 2023

Conversation

@granawkins (Member) commented Dec 9, 2023

This adds a new command /sample, which bundles the last interaction together with active code (after you correct it) into a generic 'sample'. See mentat/sampler/README.md for more info.

This is a rebasing of #352.

Steps to complete:

  • Add git commands to create required diffs/hexsha
  • Add a sampler class, which builds itself from context, with tests
  • Add a /sample command and an integration test of the Sample class
  • Add Sample.evaluate method
  • Add script for model-benchmarking with samples.
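The exact schema isn't pinned down in this thread, but pulling together the fields discussed below (title, description, id, parent_id, sample_repo, merge_base, diff_merge_base, diff_active, diff_edit, messages), a saved sample might look roughly like this sketch; all field names and values here are illustrative, not the final format:

```python
import json
import uuid

# Hypothetical sketch only: field names and values are assembled from
# the fields discussed in this PR thread, not the final schema.
sample = {
    "title": "example sample",                      # plaintext by creator
    "description": "plaintext by creator",
    "id": str(uuid.uuid4()),
    "parent_id": None,    # id of the sample immediately before this one
    "repo": "https://github.com/AbanteAI/mentat",   # cloneable repo url
    "merge_base": "",     # hexsha of the starting commit
    "diff_merge_base": "",  # diff from the merge base to the starting code
    "diff_active": "",      # uncommitted changes present before the edit
    "diff_edit": "",        # the edit made in response to the interaction
    "messages": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
}
print(json.dumps(sample, indent=4))
```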

@granawkins marked this pull request as draft December 9, 2023 06:02
@@ -72,6 +77,9 @@ def parse_string(self, git_diff: str) -> ParsedLLMResponse:
    end_line = start_line + 1
else:
    end_line = start_line + int(a_b[1])
# TODO: This breaks some of the parser tests, but is required to make
# an insertion diff valid? Need to investigate.
end_line += 1  # FileEdits use exclusive end_line

Member Author:

@jakethekoenig I had to add this to get my sample_eval's diff (insert 2 lines) to work. I wonder if we adjusted file_edits to use exclusive end_line and this class was missed? Or maybe I need to dig more 😄
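For anyone following along, here is a minimal illustration (not mentat's actual FileEdit code) of the exclusive end_line convention and why a pure insertion needs the `+ 1` adjustment: with exclusive ranges, `lines[start:end]` is the span being replaced, so an insertion is the degenerate case `start == end`.

```python
def apply_edit(lines: list[str], start_line: int, end_line: int,
               new_lines: list[str]) -> list[str]:
    # end_line is exclusive: lines[start_line:end_line] is the replaced span.
    # start_line == end_line therefore means a pure insertion.
    return lines[:start_line] + new_lines + lines[end_line:]

lines = ["a", "b", "c"]
# Insert two lines before "b" (start == end == 1):
print(apply_edit(lines, 1, 1, ["x", "y"]))  # ['a', 'x', 'y', 'b', 'c']
# Replace "b" (inclusive end 1 becomes exclusive end 2):
print(apply_edit(lines, 1, 2, ["z"]))       # ['a', 'z', 'c']
```

With inclusive end_line there is no way to express an empty replaced span, which is why an insertion diff forces the off-by-one.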

@granawkins marked this pull request as ready for review December 10, 2023 15:12
mentat/sampler/README.md (outdated; resolved)
@PCSwingle (Member) commented Dec 12, 2023

I'm a bit confused on this; so the sample creates a json file detailing the changes a model made to an exact commit on a repo? Kind of like a commit except it's in a json file and has extra information (like user and assistant messages)? Also, what happens if there are already diffs when you /sample; do they get lumped in with the model changes? EDIT: I see the diff_active field now, I assume that's where they go.

@@ -0,0 +1,23 @@
# Mentat Sampler API
The Sampler API is an open-source standard for capturing interactions between a developer and an LLM-based AI assistant. It's intended to facilitate sharing of benchmarks and fine-tuning data throughout the open-source community and industry.

Member:

I was a bit confused by this and I work on mentat; I think it would be good if the description was a bit clearer. Maybe we could give a description of what it actually is? (A sample contains the git diff of an edit made by an LLM with the user and assistant messages leading up to that edit)

Member:

Also, I think it would be a lot simpler to just point to a git commit on github with the specified diff instead of having diff_edit, merge_base, diff_merge_base, and other arguments. Is there a reason that wouldn't work as well?

Member Author:

I'm mostly concerned with being able to generate samples as I work, rather than having to sit down and 'make samples'. In practice I'm hardly ever working directly from a permanent commit because we squash-merge everything. They certainly can be blank, and maybe they usually will be.

Member:

I'm pretty sure the commits will still be there (even if they are squash-merged). Git squashes all the commits into one commit and puts it on main, but the commits are still valid and can still be accessed via their hash.
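A quick way to verify this locally (a sketch, not part of the PR): `git cat-file -e` reports whether an object is still in the object database, even after its branch was squash-merged and deleted. The caveat is that a commit that is unreachable from any ref can eventually be garbage-collected.

```python
import subprocess

def commit_exists(hexsha: str, repo_path: str = ".") -> bool:
    """Check whether a commit object is still accessible by hash,
    even if its branch was squash-merged and deleted."""
    result = subprocess.run(
        ["git", "cat-file", "-e", f"{hexsha}^{{commit}}"],
        cwd=repo_path,
        capture_output=True,
    )
    return result.returncode == 0
```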

Member Author:

Oh you're right! That takes out a lot of cases.

Still though if you're working from a branch or commit that could get abandoned, and you want to create a sample, it'd be nice to have this option to make sure it'll always be valid.

I do think I can rearrange/rename some of the variables to make this less verbose and confusing.

Member:

Still though if you're working from a branch or commit that could get abandoned, and you want to create a sample, it'd be nice to have this option to make sure it'll always be valid.

Maybe, but as far as I know, git will only ever delete a commit if the user does it (via a command), which I don't think we should worry about. If the commit is taken down from its origin (like GitHub), then odds are all the other commits were too, in which case this entire sample is moot anyway.

Member Author:

By abandoned I mean never pushed to GH/remote in the first place. Then it'd only be available on my machine, right?

Member:

Oh, I see, yeah.

sample_repo: str | None = attr.field(
    default=None,
    metadata={
        "description": "A public url for a cloneable git repository to sample from."

Member:

We can get this for the user: git remote get-url origin.

Member Author:

Nice, didn't think of that! I'm going to leave this option available, and if it's missing, I'll grab the repo url and ask the user to confirm or enter a new one.

samples_dir.mkdir(exist_ok=True)
fpath = samples_dir / fname
sample.save(str(fpath))
SESSION_CONTEXT.get().stream.send(f"Sample saved to {fname}.", color="green")

Member:

Maybe we should print the whole path? It was confusing to find it after I saved it. I expected it to appear in my working directory, possibly under samples.
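One way to address this, as a sketch (`samples_dir` and `fname` here are hypothetical stand-ins for the values the /sample command uses): resolve the path to an absolute one before sending it.

```python
from pathlib import Path

# Hypothetical stand-ins for the command's actual values:
samples_dir = Path("samples")
samples_dir.mkdir(exist_ok=True)
fname = "sample_123.json"
fpath = samples_dir / fname

# Sending the absolute path makes the file easy to find afterwards:
message = f"Sample saved to {fpath.resolve()}."
print(message)
```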

@@ -0,0 +1,45 @@
import argparse

Member:

I made one sample and tried to run this script and got the following error:

git.exc.GitCommandError: Cmd('git') failed due to: exit code(128)
  cmdline: git checkout f5057f1658b9c7edb5e45a2fa8c2198ded5b5c00
  stderr: 'fatal: reference is not a tree: f5057f1658b9c7edb5e45a2fa8c2198ded5b5c00'

That is a correct hexsha for our project, though.

Edit: the problem is that I had a stale mentat repo in benchmarking_repos. Maybe we should pull before evaluating.
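One way to guard against a stale clone (a sketch using plain git subprocess calls, not the script's actual code): try the checkout first, and fetch only when it fails, so the common case stays fast.

```python
import subprocess

def checkout_sample_commit(repo_path: str, hexsha: str) -> None:
    """Check out a sample's commit, fetching first if the local clone
    in benchmarking_repos is stale and doesn't have it yet."""
    try:
        subprocess.run(["git", "checkout", hexsha], cwd=repo_path,
                       check=True, capture_output=True)
    except subprocess.CalledProcessError:
        # The commit may be missing from a stale clone: fetch and retry.
        subprocess.run(["git", "fetch", "--all"], cwd=repo_path,
                       check=True, capture_output=True)
        subprocess.run(["git", "checkout", hexsha], cwd=repo_path,
                       check=True, capture_output=True)
```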

@@ -199,3 +214,81 @@ def get_default_branch() -> str:
except subprocess.CalledProcessError:
# Handle error if needed or raise an exception
raise Exception("Unable to determine the default branch.")


def get_merge_base() -> str | None:

Member:

I get the following error when I don't set --sample-merge-base-target myself:


Unhandled Exception: Traceback (most recent call last):
  File "/home/jake/Development/mentat/mentat/session.py", line 193, in run_main
    await self._main()
  File "/home/jake/Development/mentat/mentat/session.py", line 133, in _main
    message = await collect_input_with_commands()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jake/Development/mentat/mentat/session_input.py", line 62, in collect_input_with_commands
    await command.apply(*arguments[1:])
  File "/home/jake/Development/mentat/mentat/command/commands/sample.py", line 12, in apply
    sample = await Sample.from_context()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jake/Development/mentat/mentat/sampler/sample.py", line 170, in from_context
    raise HistoryError("EditHistory.merge_base was not set.")
mentat.errors.HistoryError: EditHistory.merge_base was not set.

But it looks like this function is trying to set a reasonable default?

Member Author:

Aah ya these are not coordinated. Good catch.

@@ -72,6 +72,9 @@ async def write_changes_to_files(
if not file_edits:
return []

# Set pre-edit context in case /sample is called later
self.history.set_sample_diffs()

Member:

Why does this have to be done here? Would it not work if it was called in the sample command?

Member Author (@granawkins) commented Dec 13, 2023:

It really only needs to be the diff_active. It's here so that, later, we can determine what changes were made in response to the last interaction. It requires a bit of diff algebra, which I also need to solve so I'll write it out here.

We want to do:

  1. Set diff_active here, i.e. git diff HEAD
  2. Mentat updates the code
  3. You update the code
  4. Set diff_edit to the diff between the current code and diff_active

In order to do #4, we need to call git diff HEAD again, and 'subtract' the outputs from #1.
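An alternative to subtracting two `git diff HEAD` outputs (a sketch, not the PR's implementation): snapshot the pre-edit working tree as a git tree object at step 1, snapshot again at step 4, and diff the two trees directly, which sidesteps the diff algebra entirely.

```python
import subprocess

def snapshot_tree(repo_path: str = ".") -> str:
    """Record the current working tree as a git tree object and return
    its hash. Stages everything temporarily, writes the tree, unstages.
    Assumes the repo has at least one commit."""
    subprocess.run(["git", "add", "-A"], cwd=repo_path, check=True)
    tree = subprocess.check_output(
        ["git", "write-tree"], cwd=repo_path, text=True
    ).strip()
    subprocess.run(["git", "reset", "-q"], cwd=repo_path, check=True)
    return tree

def diff_between(tree_a: str, tree_b: str, repo_path: str = ".") -> str:
    """Diff two recorded snapshots; e.g. diff_edit = diff_between(t1, t2)."""
    return subprocess.check_output(
        ["git", "diff", tree_a, tree_b], cwd=repo_path, text=True
    )
```

Tree objects written this way live in the object database without needing a commit, so the snapshot doesn't disturb the user's history.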


def save(self, fname: str) -> None:
    with open(fname, "w") as f:
        json.dump(attr.asdict(self), f, indent=4)
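For reference, a hypothetical round-trip counterpart (`load_sample` is not part of this PR; a real implementation would likely rebuild a `Sample` instance via attrs rather than return a plain dict):

```python
import json

def load_sample(fname: str) -> dict:
    """Load a saved sample back as a plain dict."""
    with open(fname) as f:
        return json.load(f)
```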

Member:

Is the plan to rename and check into version control the benchmarks we like?

Member Author:

Ya, we should somehow. Or set up a public repo that anyone can contribute to, and curate lists of IDs for different purposes.

os.remove(f".sample_{temp_id}.diff")


def setup_repo(sample: Sample, path_to_repo: Path | str | None) -> Path:

Member:

I think this and the following method should be Sample instance methods.

| title | `str` | plaintext by creator |
| description | `str` | plaintext by creator |
| id | `uuid` | |
| parent_id | `uuid` | id of sample immediately before this |

Member:

parent_id is set if you make multiple samples in the same conversation? What's the point of tracking that information?

Member Author:

I'm imagining if you have 3 interactions in a row, you could use it to benchmark or fine-tune an agent LLM.

|------------------|---------------|-------------|
| title | `str` | plaintext by creator |
| description | `str` | plaintext by creator |
| id | `uuid` | |

Member:

Why do we need an id at all? Isn't a title enough? I guess we plan to make so many (eventually) that we won't have time to name them?

Member Author:

I wouldn't want to rely on names being unique - esp if it's public.

text=True,
)
# TODO: Run the LLM evaluations from benchmark
# TODO: Save the results

Member:

I assume these TODOs are for another PR? Doesn't seem necessary to nail down how samples will be turned into benchmarks in this one.

@jakethekoenig (Member):

I get the following exception when I run the evaluate samples script with a dirty git repo:
[screenshot of the exception]

if response == "y":
    repo = remote_url
else:
    repo = response

Member:

The text says the default is yes so if the response is "y", "Y" or "" you should use remote_url.
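A small helper along those lines (a sketch; `resolve_repo` is a hypothetical name, not from the PR):

```python
def resolve_repo(response: str, remote_url: str) -> str:
    """Treat 'y', 'Y', 'yes', or an empty reply as accepting the default
    remote_url; anything else is taken as a repo url the user typed in."""
    response = response.strip()
    if response.lower() in ("y", "yes", ""):
        return remote_url
    return response
```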

repo = response
config.sample_repo = repo

stream.send("Sample Title:")

Member:

I don't like this UX. But I guess if we're the only ones using it, it's fine.

Member Author:

what don't you like about it?

def parse_message(message: ChatCompletionMessageParam) -> dict[str, str]:
    ctx = SESSION_CONTEXT.get()
    content = message.get("content")
    text, code = "", ""

Member:

code is never reassigned.

Member Author:

This was intentional - I mean to convert code messages to diff format, but we don't have an easy way of doing that (per your comment below), so it's a placeholder for now.

text = content
output = list[str]()
in_special = False
for line in content.splitlines():

Member:

The point of this for loop is to extract the LLM's non-code-editing message, correct? I don't like how it uses _starts_special and _ends_special because it assumes some things about the parser that aren't necessarily true. For instance, I don't think this would work if the user was using the json parser. Part of the reason we have to do this is that the parser only exposes a stream_and_parse_llm_response that acts on an AsyncIterator[ChatCompletionChunk], as opposed to one that acts on a string. Maybe it would be better to add a parse_string method to the parser that wraps the string in an async iterator and then calls stream_and_parse_llm_response, and then this method could use that? Maybe @PCSwingle has thoughts.

Member Author:

Agreed - I nearly took a crack at this.

Member:

I think a parse_string method would be good; we'd only need to implement it in the parser superclass, so it wouldn't add any overhead to new parsers, which would be nice. I agree with Jake that we shouldn't really be using _starts_special and _ends_special; those are really just meant to assist with streaming. The best approach would be to add a parse_string method, call that, and then extract the code from the output FileEdits. We also might need a switch to turn off streaming in the stream-and-parse function for callers that don't want to stream the output.
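A sketch of the adapter shape being proposed (names and signatures are assumed; the real stream_and_parse_llm_response consumes an AsyncIterator[ChatCompletionChunk], and the chunk wrapping is omitted here for brevity):

```python
import asyncio
from typing import AsyncIterator

async def string_to_stream(text: str, chunk_size: int = 64) -> AsyncIterator[str]:
    """Yield a complete response string in chunks, mimicking a streamed
    LLM response so the existing streaming parser can consume it."""
    for i in range(0, len(text), chunk_size):
        yield text[i : i + chunk_size]

async def parse_string(text: str) -> str:
    # A real implementation would wrap each chunk in a ChatCompletionChunk
    # and pass the iterator to stream_and_parse_llm_response; here we just
    # reassemble the chunks to show the adapter shape round-trips.
    return "".join([chunk async for chunk in string_to_stream(text)])
```

This keeps all parsing logic in one place: both the streaming path and the string path would go through the same parser code.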

@granawkins (Member Author):

I get the following exception when I run the evaluate samples script with a dirty git repo: [screenshot]

I'm checking for text files now, so hopefully this won't happen.

@jakethekoenig (Member) left a comment:

Given this is for our own use I guess it's good enough. I'd really like to see the parse_message function improved though. Another idea would be to have the parser return the non code message separately and have the conversation hold onto that data.

Maybe we should hide the sample command and the config settings related to sampling? I think they'd confuse users.

@PCSwingle (Member):

I definitely think that until we change it to a parse_string function we shouldn't have this command be visible to users (since it's going to be super buggy if we use _starts_special and a few other parser specific functions in here which aren't meant to be used outside the parser).

@granawkins merged commit cae4930 into main on Dec 20, 2023
16 checks passed