At present, it can be quite expensive to run full benchmarks in AutoGenBench. When developing or iterating quickly, it is often desirable to run on just a subset instead. HumanEval, MATH, and AutoGPT already provide a subset of the benchmark (e.g., HumanEval produces the r_human_eval.jsonl task specifications, which contain only 26 problems), but each of these is a special case. It is worth discussing how to generalize this.
@kevin666aa @qingyun-wu What are your thoughts on when subsampling should occur? As I see it, there are two options:
1. At task generation time (init_tasks.py), in which case we can output different task lists, including different subsamples.
2. At run time (autogenbench run), in which case we would read in all the tasks and subsample in memory.
The advantages of (1) are that AutoGenBench is already well-equipped to deal with this, and it is already implemented (more or less) for HumanEval at least. It is also easier to resume an interrupted run, to compare the same tasks between runs (without worrying about the random seeds used for sampling), to share subsamples across users, and to give the benchmark contributor control over how sampling is done (e.g., sampling evenly across difficulties). The disadvantages are that you need to re-run init_tasks.py to generate a new sample, and that each subset is treated as a separate experiment (making it harder to start small and then expand without re-running everything).
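For concreteness, here is a rough sketch of what (1) might look like inside a benchmark's init_tasks.py. The helper name and file paths are just illustrative, not an existing AutoGenBench API:

```python
import json
import random


def write_subsample(full_tasks_path, out_path, n, seed=42):
    """Write a fixed, reproducible subsample of task specs to a separate JSONL."""
    with open(full_tasks_path) as f:
        tasks = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)  # isolated PRNG so the subset is reproducible
    subsample = rng.sample(tasks, min(n, len(tasks)))
    with open(out_path, "w") as f:
        for task in subsample:
            f.write(json.dumps(task) + "\n")


# e.g., emitted alongside the full task list during init_tasks.py:
# write_subsample("human_eval_full.jsonl", "r_human_eval.jsonl", 26)
```

Because the subsample is materialized as its own JSONL, it can be checked in or shared, and every run over it sees exactly the same tasks.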
The advantages of (2) are that it's universal, so there is less work for the benchmark contributor to do when onboarding a new benchmark; new subsamples can be generated on the fly; and you can start a benchmark with a subsample, then continue it by increasing the sample size (without having to re-run everything). The disadvantages are that AutoGenBench is general and might not know the nuances of the dataset (so it would always be a uniform random sample), and that we would need to be very careful about managing seeds and about adding or removing any calls to the PRNG, or else it will be hard to run an experiment repeatedly over the same tasks.
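And a rough sketch of (2), i.e., subsampling in memory at autogenbench run time. Again, the function name, the id field, and the seed argument are hypothetical; the point is to isolate the PRNG and shuffle once so the subset is stable and can be grown later:

```python
import json
import random


def load_tasks(tasks_path, subsample_n=None, seed=0):
    """Read all task specs, then (optionally) take a seeded subsample in memory."""
    with open(tasks_path) as f:
        tasks = [json.loads(line) for line in f if line.strip()]
    if subsample_n is None:
        return tasks
    # Sort by a stable key so the result doesn't depend on file order, then shuffle
    # with an isolated PRNG (so unrelated random calls elsewhere can't shift the sample).
    tasks.sort(key=lambda t: str(t.get("id", "")))
    random.Random(seed).shuffle(tasks)
    # Taking the first n of a fixed shuffle means increasing n later only appends
    # new tasks, so results from an earlier, smaller run remain valid.
    return tasks[: min(subsample_n, len(tasks))]
```

Shuffling once and slicing (rather than drawing a fresh sample per run) is what would make "start small, then expand" cheap: the first k tasks of a size-n subsample are exactly the size-k subsample.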