
Add the ability to run a subsample of a benchmark via the autogenbench run command. #1123

Closed
Tracked by #973
afourney opened this issue Jan 2, 2024 · 2 comments
afourney commented Jan 2, 2024

At present, it can be quite expensive to run full benchmarks in AutoGenBench. When developing or iterating quickly, it is often desirable to run just a subset instead. HumanEval, MATH, and AutoGPT already provide subsets of their benchmarks (e.g., HumanEval produces the r_human_eval.jsonl task specifications, which contain only 26 problems), but each of these is a special case. It is worth discussing how to generalize this.


afourney commented Jan 3, 2024

@kevin666aa @qingyun-wu What are your thoughts on when subsampling should occur? As I see it, there are two options:

  1. At task generation time (init_tasks.py), in which case we can output different task lists, including different subsamples.
  2. At run time (autogenbench run), in which case we would read in all the tasks, and subsample in memory.

The advantages of (1) are that autogenbench is already well-equipped to deal with this, and it is already implemented (more or less) for HumanEval at least. It is also easier to resume an interrupted run, compare the same tasks between runs (without worrying about the random seeds used for sampling), share subsamples across users, and allow the benchmark contributor to control how sampling is done (e.g., to sample evenly across difficulties). The disadvantages are that you need to re-run init_tasks.py to generate a new sample, and each subset is treated as a separate experiment (making it harder to start small and then expand without re-running everything).
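For concreteness, here is a minimal sketch of what an init_tasks.py-style script could do under option (1): emit a reduced task list alongside the full one, with the subsample chosen once at init time. The file names and the fixed seed below are illustrative, not the actual HumanEval script.

```python
import json
import random

def write_tasks(tasks, path):
    """Write a list of task dicts, one JSON object per line (.jsonl)."""
    with open(path, "wt") as fh:
        for task in tasks:
            fh.write(json.dumps(task) + "\n")

# Full task list, exactly as today (source file name is hypothetical).
all_tasks = [json.loads(line) for line in open("human_eval_source.jsonl")]
write_tasks(all_tasks, "Tasks/human_eval.jsonl")

# Subsample chosen once, at init time, with a seed baked into the script,
# so every user who runs init_tasks.py gets the same reduced task list.
rng = random.Random(42)
write_tasks(rng.sample(all_tasks, 26), "Tasks/r_human_eval.jsonl")
```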

The advantages of (2) are that it is universal, so there is less work for the benchmark contributor to do when onboarding a new benchmark; new subsamples can be generated on the fly; and you can start a benchmark with a subsample, then continue it by increasing the sample size (without having to re-run everything). The disadvantages are that AutoGenBench is general and might not know the nuances of the dataset (so it would always be a uniform random sample), and we would need to be very careful about managing seeds and about adding or removing calls to the PRNG, or else it will be hard to run an experiment repeatedly over the same tasks.
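To make the seed concern concrete, here is a rough sketch of what run-time subsampling inside `autogenbench run` might look like; the function name and the `subsample` semantics (fraction vs. count) are hypothetical. Using a dedicated Random instance, rather than the module-level PRNG, keeps the selection reproducible even if other code adds or removes PRNG calls.

```python
import json
import random

def subsample_tasks(jsonl_path, subsample, seed=0):
    """Read all task specifications, then deterministically pick a subset.

    `subsample` is interpreted as a fraction if <= 1, else an absolute count
    (hypothetical semantics, for illustration only).
    """
    with open(jsonl_path, "rt") as fh:
        tasks = [json.loads(line) for line in fh if line.strip()]

    n = int(len(tasks) * subsample) if subsample <= 1 else int(subsample)

    # A private Random instance isolates the selection from any other use of
    # the global PRNG, so the same (seed, task file) always yields the same subset.
    rng = random.Random(seed)
    return rng.sample(tasks, min(n, len(tasks)))

# e.g., run roughly 10% of the benchmark, reproducibly:
# selected = subsample_tasks("Tasks/human_eval_two_agents.jsonl", 0.1, seed=42)
```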

afourney commented

The CLI now supports a subsample option.
