First impressions porting the pandas repository #194

Closed
qwhelan opened this issue Jan 19, 2015 · 10 comments
Labels
question Triaged as a question, not requiring action after answering

Comments

@qwhelan (Contributor) commented Jan 19, 2015

@mdboom First of all, it's really impressive to see a v0.1 with this level of polish in presentation, especially given how even mature open source projects tend to be lacking in this regard.

Due to pandas-dev/pandas#8361, I decided to take a look at converting the pandas benchmarks from vbench (https://github.com/pydata/vbench) to asv, and ran into a few usability issues. Most of the following comments could be standalone issues, but I wanted to put everything in context first. I'm also willing to help fix many of these, but I'd like your feedback first.

The first snag was the relative location of asv.conf.json with regard to the git repo. I thought it would be something like this:

pandas
├── asv.conf.json
├── pandas
└── setup.py

But it instead seems to expect this:

asv.conf.json
pandas
├── pandas
└── setup.py

This poses some issues for distributing it with the repo, as well as conflicts between multiple projects in the same directory. The former fails due to the assumption that ./project is the project root and thus that project/setup.py exists; because I moved things down a level, asv.environment.install() tries to install from pandas/pandas and fails. Many of the following issues result from my patching asv locally to support the first layout, so please let me know if trying to support it is completely unreasonable.
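
For reference, the kind of asv.conf.json I'm starting from is roughly the following (the URL, Python version, and dependency matrix are just placeholders; key names as in the asv docs):

{
    "project": "pandas",
    "project_url": "http://pandas.pydata.org/",
    "repo": "https://github.com/pandas-dev/pandas.git",
    "pythons": ["2.7"],
    "matrix": {
        "numpy": []
    }
}

With the first layout above, I'd want "repo" to be able to point at the enclosing checkout itself rather than at a remote URL.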

A few other things I noticed:

  • asv doesn't create a secondary repo to run benchmarks in. Again, this is only an issue because of my desire to distribute asv.conf.json with the repo, and it is the direct root cause of many of the following issues (a rough sketch of the secondary-repo workflow I have in mind is after this list).
  • asv run does a git checkout -f master, which discards any uncommitted changes as well as any commits made on a detached HEAD (more on why this is a concern later)
    • This bit me several times while trying to just get asv run to work - I can guarantee this will cause a developer with a substantial change to lose their work.
    • Probably best to remove the -f and let git yell at the user
  • asv run does a git clean -fxd which forcibly discards any untracked files.
    • I realize the need to get into a consistent state, but the lack of a secondary repo means that works-in-progress or personal utility scripts will get nuked.
    • It took a couple tries to realize this was why asv.conf.json, benchmarks/ and env/ were being deleted after every asv run
    • This also nukes the envs/ directory, but that's consistent with everything else
  • asv run doesn't return the repo to the state it was in prior to the run
    • Before: On branch master
    • After: HEAD detached at 2ef4ba0
    • Combined with the git checkout -f, this leads to users losing commits (or at least digging into git reflog)
  • Since my layout change moved benchmarks/ into the repo, benchmarks cannot be run retrospectively. Again, totally my fault.
    • vbench solves this with a secondary repo
  • Might just be me, but asv publish didn't seem like the obvious choice to generate the html files. I think of publishing as making something widely available, but I might just be unaware of a common convention here.
  • asv run in the asv/tests directory fails due to the benchmarks/ directory being called benchmark/
  • Tests fail under OSX 10.10 with ccache from homebrew; works fine without ccache. This is a pretty common issue, so just FYI.
    • The following works: CC=/usr/bin/gcc py.test
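
To make the secondary-repo point above concrete, the workflow I have in mind (and roughly what vbench does) is something like the following; paths here are made up:

# one-time: make a scratch clone next to, not inside, the working tree
git clone /home/me/src/pandas /tmp/pandas-bench-workdir

# per benchmarked revision: all destructive operations stay in the scratch clone
cd /tmp/pandas-bench-workdir
git fetch origin
git checkout -f <some commit>
git clean -fxd

That way the checkout -f and clean -fxd steps can stay, but they can never touch the developer's own checkout.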

Once again, great job on asv!

@mdboom (Collaborator) commented Jan 20, 2015

I've been using the terms "project" to refer to the project being benchmarked and "benchmark suite" to refer to a collection of benchmarks. I'll use that terminology below to respond.

Most projects using asv thus far put the project and the benchmark suite in separate repositories. This is because, logically, the two don't need to follow the same history: the benchmark suite is designed to benchmark the project over time, in revisions both before and after the benchmark suite was written. When the docs say "the benchmark suite may live inside the project's repository", I probably should have actually tested that more. There are enough problems that you've noted in trying that out that I think I should just remove mention of it and instead insist on separate repositories. That's the logical thing to do anyway -- it doesn't really make sense to say "run the benchmarks in pandas 1.0 against pandas 2.0"; it makes more sense to say "run pandas-benchmarks 1.0 against pandas 2.0", where the history and versioning are completely separate.

Just to clarify a few things:

asv does check out a secondary repo of the project -- it just happens that this fails when a directory with the same name already exists (as you're seeing here). When you effectively have only one repository, the two cross-talk. There might be a way to fix that -- use a special directory name for the secondary repo, for example -- but I'd push more toward encouraging one way to use the tool, which is with separate repositories.

Might just be me, but asv publish didn't seem like the obvious choice to generate the html files. I think of publishing as making something widely available, but I might just be unaware of a common convention here.

Yeah -- perhaps generate or html would be better. I'm definitely open to a name change given the 0.x nature of the project at the moment.

Tests fail under OSX 10.10 with ccache from homebrew; works fine without ccache. This is a pretty common issue, so just FYI.

Any sense of why? I use ccache from MacPorts (and on Linux) and I haven't had an issue. I may set up a homebrew environment and try it out some time. In the meantime, I'll create an issue just for this.

Thanks for the valuable feedback, and I'm sorry the documentation sent you on a wild goose chase!

@qwhelan (Contributor, author) commented Jan 21, 2015

@mdboom Thanks for the response. I knew I was digging my own hole here but I think I might be able to get what I'm looking for by doing something like this:

pandas
├── pandas
└── bench
    ├── asv.conf.json
    └── pandas
        └── pandas

The reason for my insistence on getting a single-repo setup going is that it's very useful to benchmark performance fixes before they're merged. Having a single repo lets the default asv.conf.json point at the developer's repo (or the canonical one), while having two would require everyone to update it for their own setup (causing pointless merge conflicts, etc.).

I'm going to take another stab at getting this set up and hopefully put together a script to automatically convert vbench -> asv but I probably won't get to it until this weekend at best.

Thanks again!

@mdboom (Collaborator) commented Jan 23, 2015

The reason for my insistence on getting a single-repo setup going is that it's very useful to benchmark performance fixes before they're merged. Having a single repo lets the default asv.conf.json point at the developer's repo (or the canonical one), while having two would require everyone to update it for their own setup (causing pointless merge conflicts, etc.).

The usual mode of working to test the developer's repo (i.e. unchecked-in changes) is to install it to a virtualenv and then use asv dev. That, IMHO, works better when compiled code is involved because you don't have any of the issues with setup.py build --inplace etc.
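
Concretely, that workflow is roughly the following (exact commands will vary with your setup):

# from the pandas working copy, with uncommitted changes present
virtualenv ~/envs/pandas-dev
source ~/envs/pandas-dev/bin/activate
pip install -e .     # or: python setup.py develop
asv dev              # runs the benchmarks against whatever pandas this Python imports

asv dev skips the clone/build machinery and just runs the benchmarks against the interpreter it is invoked with, so uncommitted changes are picked up automatically.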

@qwhelan (Contributor, author) commented Jan 26, 2015

Yeah, that's a very manual process for something I might do half a dozen times for a single PR. Here's how the vbench infrastructure currently handles it:

  • I commit changes locally (so no attempt to deal with unchecked-in changes) and give a target commit of local HEAD
  • I give a baseline commit of remote HEAD
  • vbench clones/fetches the local repo to a working directory (rather than the remote one, as asv does)
  • It then checks out each commit, builds it, and runs the benchmarks (same as asv, so no setup.py build --inplace concerns).
  • Finally, vbench produces a table showing the results for each commit and the percent change between them.

There's no fundamental difference between the two tools here; it's just a matter of getting asv to clone a local repo (which I can get around by modifying the remotes post-clone).
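
Concretely, the workaround I mean is something along these lines (paths made up):

# after asv has cloned the canonical repo into its project directory
cd <asv's checkout of the project>
git remote set-url origin /home/me/src/pandas   # point at my local repo instead of GitHub
git fetch origin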

Also, asv dev errored out on me - I'll open a ticket explaining why. #197

@mdboom (Collaborator) commented Jan 26, 2015

@qwhelan: I can see the value in working that way. I'll tinker a bit to see what it would take to make that work.

@mdboom (Collaborator) commented Jan 26, 2015

So here are the moving pieces that I think would be involved. It might be worth discussing first to make sure I'm on the right track for this use case before doing too much implementation.

  • (a) For simple/stupid reasons, asv doesn't support cloning a repo from the local filesystem. That is easy enough to fix.
  • (b) asv doesn't support using commits from multiple remotes. How does vbench handle it? How are the local and remote HEAD commits specified in your example? The way I'd try to do this in asv is to allow the repo key in asv.conf.json to be a dictionary mapping git remote names to repo URLs (local and/or remote), which would be turned into git remotes in the workspace checkout of the project (sketched just after this list). The user could then refer to commits using the usual git remote/ref format, i.e. master vs. origin/master. To refer to a local checkout in which the benchmark directory lives (as in your example), you could use a relative path for the project, so that each user wouldn't have to edit asv.conf.json for their local environment. Does that make sense and seem doable?
  • (c) asv doesn't have a side-by-side comparison of two commits, though asv continuous comes awfully close. (It tests a commit against its immediate parent -- it should be trivial to optionally accept two commits instead).
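
To sketch what I mean in (b) (purely illustrative; this is not something asv supports today), the config could look something like:

"repo": {
    "origin": "https://github.com/pandas-dev/pandas.git",
    "local": ".."
}

and commits would then be named with the usual remote-qualified syntax, e.g. origin/master vs. local/my-perf-branch.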

@mdboom (Collaborator) commented Jan 26, 2015

Duh -- we actually already have (c) in asv compare. (A great contribution from @astrofrog).

@pv (Collaborator) commented Feb 26, 2015

(a)+(b) should work now, since asv now uses git --mirror. Setting asv.conf.json to point to the local repo, e.g. repo: "..", should make the above workflow possible; you can then refer to remote and local branches as usual (origin/master, somelocalbranch, etc.).

asv continuous could also be made to work on a commit range to make (c) nicer; currently it's limited to two commits. Aside from that, the name of asv publish, and ccache on Homebrew (gh-196), the issues above are probably all fixed now.
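
I.e., with a layout roughly like this (names illustrative):

pandas                    <- project working copy
└── benchmarks
    ├── asv.conf.json     <- contains "repo": ".."
    └── benchmarks        <- the benchmark files themselves

you can run, for example:

asv continuous origin/master some-local-branch
asv compare origin/master some-local-branch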

@jreback commented Mar 20, 2015

@qwhelan @mdboom

This looks very exciting. vbench has served us well in pandas, but an actively maintained project is a big, big +1.

@qwhelan (Contributor, author) commented Mar 23, 2015

@mdboom @pv Sorry for disappearing on this. I've got a short script that converts vbench benchmarks into asv ones, and it works on the subset of the pandas cases I've tried so far. At this point, I think it's mostly a matter of checking that the suites work and letting things run in parallel for a bit.

@pv pv added the question Triaged as a question, not requiring action after answering label Aug 20, 2015
@pv pv closed this as completed Sep 28, 2015