Mutate sample #454
Conversation
All 5 operations worked great for me. Some small notes.
```python
example = await generate_finetune_gpt(sample)
example_file = FINETUNE_DIR / f"finetune_{sample.id}.json"
with open(example_file, "w") as f:
    json.dump(example, f, indent=4)
```
Suddenly realizing there was no reason for me to have each fine-tuning example on its own line. Something about jsonl makes me think "json per line". Oops.
I've just finally read the actual difference between .json and .jsonl.
I think your interpretation was actually correct for jsonl: each line should be independently json-serializable. I guess theoretically you'd be able to read in and deserialize one line at a time, making it better for extremely large runs where the whole thing is too big to read into memory.
I guess I'm more used to working with .json, and don't expect we'll hit that limit soon, right? Any other thoughts?
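For reference, a minimal sketch of the difference, with hypothetical file names and records:

```python
import json

records = [{"id": 1, "prompt": "a"}, {"id": 2, "prompt": "b"}]  # hypothetical examples

# .json: one document containing the whole list; has to be read back in full.
with open("examples.json", "w") as f:
    json.dump(records, f, indent=4)

# .jsonl: one independently json-serialized object per line; can be streamed.
with open("examples.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading .jsonl back one record at a time, so the file never has to fit in memory.
with open("examples.jsonl") as f:
    for line in f:
        record = json.loads(line)
```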
No other thoughts.
print(f"Generating fine-tuning example for sample {sample.id[:8]}") | ||
try: | ||
example = await generate_finetune_gpt(sample) | ||
example_file = FINETUNE_DIR / f"finetune_{sample.id}.json" |
Can we print this? It was hard to find.
I'll print the FINETUNE_DIR path at the end with the results.
Just kidding, inline is great.
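For the record, the inline version is just something like this (a sketch with a stand-in directory; the message wording is made up):

```python
import json
from pathlib import Path

FINETUNE_DIR = Path("finetune")  # stand-in for the real mentat finetune dir
FINETUNE_DIR.mkdir(parents=True, exist_ok=True)


def save_finetune_example(example: dict, sample_id: str) -> Path:
    """Write one fine-tuning example and print where it landed."""
    example_file = FINETUNE_DIR / f"finetune_{sample_id}.json"
    with open(example_file, "w") as f:
        json.dump(example, f, indent=4)
    print(f"Wrote fine-tuning example to {example_file}")  # the inline print being discussed
    return example_file
```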
```python
SAMPLES_DIR = mentat_dir_path / "samples"
os.makedirs(SAMPLES_DIR, exist_ok=True)
FINETUNE_DIR = mentat_dir_path / "finetune"
os.makedirs(FINETUNE_DIR, exist_ok=True)
```
This is fine for now, but I wonder if we should think about the ergonomics if we want other people to use it. I'd kind of expect these things to be generated in the user's working directory. I guess if the goal is to collect them from many repos to put together a fine-tuning/benchmarking data set, it makes sense to put them all in one place.
mentat/sampler/README.md (Outdated)
```
4. If using a Coding Assistant tool, process the response to apply edits to codebase.
5. Return the text portion of the conversation and the git diff, corresponding to `message_edit` and `diff_edit`

We provide two implementations of this:
- Run `scripts/evaluate_samples.py [<id>...]` from the command line, in the mentat repo. Prints to terminal.
- Run `python scripts/samples [<id>...]` from the command line, in the mentat repo. Prints to terminal.
```
`scripts/sampler`, not `scripts/samples`.
scripts/sampler/__main__.py (Outdated)
"--validate", | ||
"-v", | ||
action="store_true", | ||
help="Validate samples instead of evaluating", |
Can we change this help message to "Validate samples conform to spec"?
scripts/sampler/__main__.py (Outdated)
```python
if not sample_file.exists():
    warn(f"Sample file {sample_file} does not exist.")
    continue
sample = Sample.load(sample_file)
```
Because we load samples before validating them, certain errors are thrown as exceptions instead of logged, e.g. if the sample is not valid JSON or has extra fields.
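A sketch of what I mean, logging load failures instead of letting them propagate (the exception types are a guess at how Sample.load fails, and warn here is a standard-library stand-in for whatever helper the script already uses):

```python
import json
from pathlib import Path
from warnings import warn  # stand-in for the script's own warn helper

from mentat.sampler.sample import Sample


def load_samples(sample_files: list[Path]) -> list[Sample]:
    """Load sample files, warning (rather than raising) on anything malformed."""
    samples: list[Sample] = []
    for sample_file in sample_files:
        if not sample_file.exists():
            warn(f"Sample file {sample_file} does not exist.")
            continue
        try:
            samples.append(Sample.load(sample_file))
        except (json.JSONDecodeError, TypeError, ValueError) as e:  # guessed failure modes
            warn(f"Skipping {sample_file}: not a valid sample ({e})")
    return samples
```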
```python
continue
try:
    new_sample = await remove_context(sample)
    new_sample.save(SAMPLES_DIR / f"sample_{new_sample.id}.json")
```
Same comment here and above. Can we print the files generated?
```python
from mentat.sampler.sample import Sample
from mentat.utils import mentat_dir_path
from tests.benchmarks.benchmark_runner import (
```
I couldn't run this until I added `tests` to `find_packages` in setup.py. Btw, what do you think about moving the benchmark code out of the tests directory into its own top-level directory, `benchmarking`?
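For reference, the setup.py change is roughly this (a sketch; the include filter and other arguments are guesses at what's already there):

```python
# setup.py (sketch)
from setuptools import find_packages, setup

setup(
    name="mentat",
    packages=find_packages(include=["mentat*", "tests*"]),  # "tests*" added so the benchmark imports resolve
)
```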
Ah makes sense, I'll add that to setup.py.
Ya, I like the idea of moving benchmarks to their own directory, mostly because I always forget the `--benchmark` flag.
```python
    diff_active=sample.diff_active,
)
cwd = Path(repo.working_dir)
python_client = PythonClient(cwd=Path("."), paths=[])
```
The python client needs to be shut down.
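Something along these lines, assuming PythonClient exposes an async shutdown() (the method name and import path are assumptions; use whatever the client actually provides):

```python
from pathlib import Path

from mentat.python_client.client import PythonClient  # import path assumed


async def run_with_client() -> None:
    python_client = PythonClient(cwd=Path("."), paths=[])
    try:
        ...  # drive the sample through the client here
    finally:
        await python_client.shutdown()  # assumed method name; the point is tear-down even on errors
```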
scripts/sampler/__main__.py (Outdated)
```python
else:
    print(f"Evaluating sample {sample.id[:8]}")
    print(f" Prompt: {sample.message_prompt}")
    diff_eval = await evaluate_sample(sample)
```
To me `evaluate_sample` would imply running and grading the sample. Do you think `run_sample` or `execute_sample` would be better?
scripts/sampler/evaluate.py (Outdated)
```python
from mentat.session_context import SESSION_CONTEXT


async def evaluate_sample(sample, cwd: Path | str | None = None):
```
What do you think about making this one an instance method of `Sample`? I'd like to use it in the benchmark runner itself, and it'd be easy to get there.
I changed my mind on this one because it leads to a hard-to-fix circular import, and I wanted to write it slightly differently for the benchmarks here.
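For context, the cycle is roughly: Sample would need to import the evaluation code, which already imports Sample. The usual escape hatch is a deferred, function-local import, sketched below with hypothetical module paths:

```python
# mentat/sampler/sample.py (hypothetical sketch)
class Sample:
    async def evaluate(self):
        # Imported inside the method rather than at module top level to break
        # the sample -> evaluate -> sample import cycle.
        from scripts.sampler.evaluate import evaluate_sample  # hypothetical module path

        return await evaluate_sample(self)
```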
If you put a sample in the tests/benchmarks/benchmarks directory it will be picked up and evaluated by the benchmark runner. Test with:

```
pytest tests/benchmarks/benchmark_runner.py --benchmark -s --benchmarks \
  Dummy clojure
```

Based off PR #454 to make use of setup_repo. There's some duplication of sample evaluation.
This PR refactors and adds to the Sampler utility scripts. Now we have:
Run them with e.g. `python scripts/sampler -v`, `python scripts/sampler --add-context -n 10`.