
Mutate sample #454

Merged: 8 commits merged into main on Jan 5, 2024
Conversation

granawkins (Member):

This PR refactors and adds to the Sampler utility scripts. Now we have:

  • validate: check the repo/context/diffs of each item in your samples dir
  • evaluate: run mentat on the sample and return the resulting diff
  • add-context: create n duplicates, each with lots of extra (useless) context
  • remove-context: create n duplicates, each missing some context and carrying a rejection message (interactive)
  • finetune: generate fine-tuning examples from Samples.

Run them with e.g. python scripts/sampler -v or python scripts/sampler --add-context -n 10.
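For orientation, here is a minimal sketch of how such a dispatcher could be wired up with argparse. Only the flag names above come from this PR; the structure, defaults, and placeholder dispatch are illustrative, not the actual scripts/sampler implementation.

```python
# Illustrative sketch only; the real scripts/sampler may be organized differently.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Sampler utility scripts")
    parser.add_argument("--validate", "-v", action="store_true",
                        help="Validate samples instead of evaluating")
    parser.add_argument("--add-context", action="store_true",
                        help="Create duplicates with extra (useless) context")
    parser.add_argument("--remove-context", action="store_true",
                        help="Create duplicates missing some context (interactive)")
    parser.add_argument("--finetune", action="store_true",
                        help="Generate fine-tuning examples from Samples")
    parser.add_argument("-n", type=int, default=1,
                        help="Number of duplicates to create")
    args = parser.parse_args()
    # Dispatch to the selected operation here (omitted in this sketch);
    # with no flag given, evaluation would be the default.
    print(args)


if __name__ == "__main__":
    main()
```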

jakethekoenig (Member) left a comment:

All 5 operations worked great for me. Some small notes.

example = await generate_finetune_gpt(sample)
example_file = FINETUNE_DIR / f"finetune_{sample.id}.json"
with open(example_file, "w") as f:
    json.dump(example, f, indent=4)
jakethekoenig (Member):

Suddenly realizing there was no reason for me to have each fine-tuning example on its own line. Something about jsonl makes me think "json per line". Oops.

granawkins (Member, Author):

I've just finally read up on the actual difference between .json and .jsonl.

I think your interpretation was actually correct for jsonl: each line should be independently json-serializable. Theoretically you'd be able to read in and deserialize one record at a time, making it better for extremely large runs where the whole file is too big to read into memory.

I guess I'm more used to working with .json and don't expect we'll hit that limit soon, right? Any other thoughts?
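As a concrete illustration of the trade-off being discussed, a minimal JSONL sketch; the file name and record shape are made up for the example, not taken from the repo.

```python
# JSONL = one JSON document per line, so records can be appended and read
# back one at a time without loading the whole file into memory.
# File name and record contents are hypothetical.
import json

records = [{"id": i, "messages": []} for i in range(3)]

with open("examples.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

with open("examples.jsonl") as f:
    for line in f:  # streams line by line
        record = json.loads(line)
        print(record["id"])
```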

jakethekoenig (Member):

No other thoughts.

print(f"Generating fine-tuning example for sample {sample.id[:8]}")
try:
    example = await generate_finetune_gpt(sample)
    example_file = FINETUNE_DIR / f"finetune_{sample.id}.json"
jakethekoenig (Member):

Can we print this? It was hard to find.

granawkins (Member, Author) commented on Jan 5, 2024:

I'll print the FINETUNE_DIR path at the end with the results.

Just kidding, inline is great.
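A sketch of what that inline print might look like, reusing the names from the snippet above; generate_finetune_gpt, FINETUNE_DIR, and the sample object are assumed to be available from the surrounding script, and the wrapper function and message wording are hypothetical.

```python
# Hypothetical wrapper around the snippet above, with the suggested inline
# print of the generated file path. Assumes the imports/constants from the
# surrounding script.
import json


async def save_finetune_example(sample) -> None:
    example = await generate_finetune_gpt(sample)
    example_file = FINETUNE_DIR / f"finetune_{sample.id}.json"
    with open(example_file, "w") as f:
        json.dump(example, f, indent=4)
    print(f"Fine-tuning example saved to {example_file}")
```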

SAMPLES_DIR = mentat_dir_path / "samples"
os.makedirs(SAMPLES_DIR, exist_ok=True)
FINETUNE_DIR = mentat_dir_path / "finetune"
os.makedirs(FINETUNE_DIR, exist_ok=True)
jakethekoenig (Member):

This is fine for now, but I wonder if we should think about the ergonomics of it if we want other people to use it. I'd kind of expect these things to be generated in the user's working directory. I guess if the goal is to collect them from many repos to put together a finetuning/benchmarking data set, it makes sense to put them all in one place.

4. If using a Coding Assistant tool, process the response to apply edits to codebase.
5. Return the text portion of the conversation and the git diff, corresponding to `message_edit` and `diff_edit`

We provide two implementations of this:
- Run `scripts/evaluate_samples.py [<id>...]` from the command line, in the mentat repo. Prints to terminal.
- Run `python scripts/samples [<id>...]` from the command line, in the mentat repo. Prints to terminal.
jakethekoenig (Member):

scripts/sampler, not scripts/samples

"--validate",
"-v",
action="store_true",
help="Validate samples instead of evaluating",
jakethekoenig (Member):

Can we change this help message to "Validate samples conform to spec"?

if not sample_file.exists():
warn(f"Sample file {sample_file} does not exist.")
continue
sample = Sample.load(sample_file)
jakethekoenig (Member):

Because we load samples before validating them, certain errors are thrown as exceptions instead of being logged, e.g. if the sample is not valid json or has extra fields.
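One way to surface those errors as validation warnings rather than exceptions, sketched under the assumption that Sample.load raises on malformed input; the surrounding loop and the warn helper are inferred from the snippet above.

```python
# Hypothetical sketch: treat load failures (invalid JSON, unexpected fields)
# as validation errors instead of letting the exception propagate.
for sample_file in sample_files:
    if not sample_file.exists():
        warn(f"Sample file {sample_file} does not exist.")
        continue
    try:
        sample = Sample.load(sample_file)
    except Exception as e:
        warn(f"Could not load sample {sample_file}: {e}")
        continue
    # ...continue with the existing validation checks on `sample`
```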

continue
try:
    new_sample = await remove_context(sample)
    new_sample.save(SAMPLES_DIR / f"sample_{new_sample.id}.json")
jakethekoenig (Member):

Same comment here and above. Can we print the files generated?


from mentat.sampler.sample import Sample
from mentat.utils import mentat_dir_path
from tests.benchmarks.benchmark_runner import (
jakethekoenig (Member):

I couldn't run this until I added tests to find_packages in setup.py. Btw, what do you think about moving the benchmark code out of the tests directory into its own top-level benchmarking directory?

granawkins (Member, Author):

Ah makes sense, I'll add that to setup.py.

Ya I like the idea of moving benchmarks to their own directory, mostly because I always forget the --benchmark flag.
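For context, a hedged sketch of the kind of setup.py change being described, assuming the project uses setuptools' find_packages; the include patterns are illustrative, not copied from the repo.

```python
# Hypothetical setup.py fragment: also package `tests` so that
# `from tests.benchmarks.benchmark_runner import ...` resolves after install.
from setuptools import find_packages, setup

setup(
    name="mentat",
    packages=find_packages(include=["mentat*", "tests*"]),
)
```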

diff_active=sample.diff_active,
)
cwd = Path(repo.working_dir)
python_client = PythonClient(cwd=Path("."), paths=[])
jakethekoenig (Member):

The python client needs to be shut down.
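A sketch of one way to guarantee that cleanup, assuming PythonClient exposes an async shutdown() method; the method name and import path are assumptions, not verified against the codebase.

```python
from pathlib import Path

from mentat.python_client.client import PythonClient  # import path assumed


async def run_with_client() -> None:
    # Ensure the client is shut down even if the evaluation raises.
    python_client = PythonClient(cwd=Path("."), paths=[])
    try:
        ...  # drive the sample conversation through the client here
    finally:
        await python_client.shutdown()  # assumed async shutdown method
```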

else:
print(f"Evaluating sample {sample.id[:8]}")
print(f" Prompt: {sample.message_prompt}")
diff_eval = await evaluate_sample(sample)
jakethekoenig (Member):

To me, evaluate_sample would imply running and grading the sample. Do you think run_sample or execute_sample would be better?

from mentat.session_context import SESSION_CONTEXT


async def evaluate_sample(sample, cwd: Path | str | None = None):
jakethekoenig (Member) commented on Jan 5, 2024:

What do you think about making this one an instance method of samples? I'd like to use it in the benchmark runner itself and it'd be easy to get there.

I changed my mind on this one because it leads to a hard-to-fix circular import, and I wanted to write it slightly differently for the benchmarks here.
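For readers unfamiliar with the issue being alluded to: if Sample gained an evaluate method that imports the evaluation code while the evaluation code already imports Sample, the two modules would depend on each other at import time. A common workaround is a local import inside the method; the module paths below are hypothetical stand-ins, not the repo's actual layout.

```python
# sample.py (hypothetical layout illustrating the circular-import workaround)
class Sample:
    ...

    async def evaluate(self, cwd=None):
        # Importing here, instead of at module level, avoids the cycle:
        # the evaluation module imports Sample, and a top-level import of
        # the evaluation module from this file would close the loop.
        from scripts.sampler.evaluate import evaluate_sample  # hypothetical path
        return await evaluate_sample(self, cwd=cwd)
```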

jakethekoenig added a commit that referenced this pull request Jan 5, 2024
If you put a sample in the tests/benchmarks/benchmarks directory it will
be picked up and evaluated by the benchmark runner. Test with:
```
pytest tests/benchmarks/benchmark_runner.py --benchmark -s --benchmarks \
Dummy clojure
```
Based off PR #454 to make use of setup_repo. There's some duplication of
sample evaluation.
granawkins merged commit 9560fa5 into main on Jan 5, 2024
16 checks passed