At present, it can be quite expensive to run full benchmarks in AutoGenBench. When developing or iterating quickly, it is often desirable to run on just a subset instead. HumanEval, MATH, and AutoGPT already provide a subset of the benchmark (e.g., HumanEval produces the r_human_eval.jsonl task specifications, which contain only 26 problems), but each of these is a special case. It is worth discussing how to generalize this.
@kevin666aa @qingyun-wu What are your thoughts on when subsampling should occur? As I see it, there are two options:
1. At task generation time (init_tasks.py), in which case we can output different task lists, including different subsamples.
2. At run time (autogenbench run), in which case we would read in all the tasks and subsample in memory.
The advantages of (1) are that AutoGenBench is already well-equipped to deal with this, and it is already implemented (more or less) for HumanEval at least. It is also easier to resume an interrupted run, to compare the same tasks between runs (without worrying about the random seeds used for sampling), to share subsamples across users, and to give the benchmark contributor control over how sampling is done (e.g., sampling evenly across difficulties). The disadvantages are that you need to re-run init_tasks.py to generate a new sample, and that each subset is treated as a separate experiment (making it harder to start small and then expand without re-running everything).
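For concreteness, here is a rough sketch of what (1) might look like inside a benchmark's init_tasks.py. The helper name and file paths are just illustrative, not an existing AutoGenBench API:

```python
import json
import random


def write_subsample(full_tasks_path, out_path, n, seed=42):
    """Write a fixed, reproducible subsample of task specs to a separate JSONL."""
    with open(full_tasks_path) as f:
        tasks = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)  # isolated PRNG so the subset is reproducible
    subsample = rng.sample(tasks, min(n, len(tasks)))
    with open(out_path, "w") as f:
        for task in subsample:
            f.write(json.dumps(task) + "\n")


# e.g., emitted alongside the full task list during init_tasks.py:
# write_subsample("human_eval_full.jsonl", "r_human_eval.jsonl", 26)
```

Because the subsample is materialized as its own JSONL, it can be checked in or shared, and every run over it sees exactly the same tasks.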
The advantages of (2) are that it's universal, so there is less work for the benchmark contributor to do when onboarding a new benchmark; new subsamples can be generated on the fly; and you can start a benchmark with a subsample, then continue it by increasing the sample size (without having to re-run everything). The disadvantages are that AutoGenBench is general and might not know the nuances of the dataset (so it would always be a uniform random sample), and that we would need to be very careful about managing seeds and about adding or removing any calls to the PRNG, or else it will be hard to run an experiment repeatedly over the same tasks.
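And a rough sketch of (2), i.e., subsampling in memory at autogenbench run time. Again, the function name, the id field, and the seed argument are hypothetical; the point is to isolate the PRNG and shuffle once so the subset is stable and can be grown later:

```python
import json
import random


def load_tasks(tasks_path, subsample_n=None, seed=0):
    """Read all task specs, then (optionally) take a seeded subsample in memory."""
    with open(tasks_path) as f:
        tasks = [json.loads(line) for line in f if line.strip()]
    if subsample_n is None:
        return tasks
    # Sort by a stable key so the result doesn't depend on file order, then shuffle
    # with an isolated PRNG (so unrelated random calls elsewhere can't shift the sample).
    tasks.sort(key=lambda t: str(t.get("id", "")))
    random.Random(seed).shuffle(tasks)
    # Taking the first n of a fixed shuffle means increasing n later only appends
    # new tasks, so results from an earlier, smaller run remain valid.
    return tasks[: min(subsample_n, len(tasks))]
```

Shuffling once and slicing (rather than drawing a fresh sample per run) is what would make "start small, then expand" cheap: the first k tasks of a size-n subsample are exactly the size-k subsample.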