First impressions porting the pandas repository #194

Closed
qwhelan opened this issue Jan 19, 2015 · 10 comments
Labels
question Triaged as a question, not requiring action after answering

Comments

@qwhelan (Contributor) commented Jan 19, 2015

@mdboom First of all, it's really impressive to see a v0.1 with this level of polish in presentation, especially given how even mature open source projects tend to be lacking in this regard.

Due to pandas-dev/pandas#8361, I decided to take a look at converting the pandas benchmarks from vbench (https://github.com/pydata/vbench) to asv, and ran into a few usability issues. Most of the following comments could be standalone issues, but I wanted to put everything in context first. I'm also willing to help fix many of these, but I'd like your feedback first.

The first snag was the relative location of asv.conf.json with regard to the git repo. I thought it would be something like this:

pandas
├── asv.conf.json
├── pandas
└── setup.py

But it instead seems to expect this:

asv.conf.json
pandas
├── pandas
└── setup.py

This poses some issues for distributing it with the repo, as well as conflicts between multiple projects in the same directory. The former fails due to the assumption that ./project is the project root and thus that project/setup.py exists; because I moved things down a level, asv.environment.install() tries to install from pandas/pandas and fails. Many of the following issues result from my patching asv locally to support the first layout, so please let me know if trying to support it is completely unreasonable.
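
For reference, the kind of asv.conf.json I'm starting from is roughly the following (the URL, Python version, and dependency matrix are just placeholders; key names as in the asv docs):

{
    "project": "pandas",
    "project_url": "http://pandas.pydata.org/",
    "repo": "https://github.com/pandas-dev/pandas.git",
    "pythons": ["2.7"],
    "matrix": {
        "numpy": []
    }
}

With the first layout above, I'd want "repo" to be able to point at the enclosing checkout itself rather than at a remote URL.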

A few other things I noticed:

  • asv doesn't create a secondary repo to run benchmarks in. Again, this is only an issue because of my desire to distribute asv.conf.json with the repo, and it is the direct root cause of many of the following issues (a rough sketch of the secondary-repo workflow I have in mind is after this list).
  • asv run does a git checkout -f master, which discards any uncommitted changes as well as any commits made on a detached HEAD (more on why this is a concern later)
    • This bit me several times while trying to just get asv run to work - I can guarantee this will cause a developer with a substantial change to lose their work.
    • Probably best to remove the -f and let git yell at the user
  • asv run does a git clean -fxd which forcibly discards any untracked files.
    • I realize the need to get into a consistent state, but the lack of a secondary repo means that works-in-progress or personal utility scripts will get nuked.
    • It took a couple tries to realize this was why asv.conf.json, benchmarks/ and env/ were being deleted after every asv run
    • This also nukes the envs/ directory, but that's consistent with everything else
  • asv run doesn't return the repo to the state it was in prior to the run
    • Before: On branch master
    • After: HEAD detached at 2ef4ba0
    • Combined with the git checkout -f, this leads to users losing commits (or at least digging into git reflog)
  • Since my layout change moved benchmarks/ into the repo, benchmarks cannot be run retrospectively. Again, totally my fault.
    • vbench solves this with a secondary repo
  • Might just be me, but asv publish didn't seem like the obvious choice to generate the html files. I think of publishing as making something widely available, but I might just be unaware of a common convention here.
  • asv run in the asv/tests directory fails due to the benchmarks/ directory being called benchmark/
  • Tests fail under OSX 10.10 with ccache from homebrew; works fine without ccache. This is a pretty common issue, so just FYI.
    • The following works: CC=/usr/bin/gcc py.test
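
To make the secondary-repo point above concrete, the workflow I have in mind (and roughly what vbench does) is something like the following; paths here are made up:

# one-time: make a scratch clone next to, not inside, the working tree
git clone /home/me/src/pandas /tmp/pandas-bench-workdir

# per benchmarked revision: all destructive operations stay in the scratch clone
cd /tmp/pandas-bench-workdir
git fetch origin
git checkout -f <some commit>
git clean -fxd

That way the checkout -f and clean -fxd steps can stay, but they can never touch the developer's own checkout.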

Once again, great job on asv!

@mdboom (Collaborator) commented Jan 20, 2015

I've been using the terms "project" to refer to the project being benchmarked and "benchmark suite" to refer to a collection of benchmarks. I'll use that terminology below to respond.

Most projects using asv thus far put the project and the benchmark suite in separate repositories. This is because, logically, the two don't need to follow the same history: the benchmark suite is designed to benchmark the project over time, in revisions both before and after the benchmark suite was written. When the docs say "the benchmark suite may live inside the project's repository", I probably should have actually tested that more. There are enough problems that you've noted in trying that out that I think I should just remove mention of it and instead insist on separate repositories. That's the logical thing to do anyway -- it doesn't really make sense to say "run the benchmarks in pandas 1.0 against pandas 2.0"; it makes more sense to say "run pandas-benchmarks 1.0 against pandas 2.0", where the history and versioning are completely separate.

Just to clarify a few things:

asv does check out a secondary repo of the project -- it just happens that this fails when a directory with the same name already exists (as you're seeing here). When you effectively have only one repository, the two cross-talk. There might be a way to fix that -- use a special directory name for the secondary repo, for example -- but I'd push more toward encouraging one way to use the tool, which is with separate repositories.

Might just be me, but asv publish didn't seem like the obvious choice to generate the html files. I think of publishing as making something widely available, but I might just be unaware of a common convention here.

Yeah -- perhaps generate or html would be better. I'm definitely open to a name change given the 0.x nature of the project at the moment.

Tests fail under OSX 10.10 with ccache from homebrew; works fine without ccache. This is a pretty common issue, so just FYI.

Any sense of why? I use ccache from MacPorts (and on Linux) and I haven't had an issue. I may set up a homebrew environment and try it out some time. In the meantime, I'll create an issue just for this.

Thanks for the valuable feedback, and I'm sorry the documentation sent you on a wild goose chase!

@qwhelan (Contributor, author) commented Jan 21, 2015

@mdboom Thanks for the response. I knew I was digging my own hole here but I think I might be able to get what I'm looking for by doing something like this:

pandas
├── pandas
└── bench
    ├── asv.conf.json
    └── pandas
        └── pandas

The reason for my insistence on getting a single-repo setup going is that it's very useful to benchmark performance fixes before they're merged. Having a single repo lets the default asv.conf.json point at the developer's repo (or the canonical one), while having two would require everyone to update it for their own setup (causing pointless merge conflicts, etc.).

I'm going to take another stab at getting this set up and hopefully put together a script to automatically convert vbench -> asv but I probably won't get to it until this weekend at best.

Thanks again!

@mdboom (Collaborator) commented Jan 23, 2015

The reason for my insistence on getting a single-repo setup going is that it's very useful to benchmark performance fixes before they're merged. Having a single repo lets the default asv.conf.json point at the developer's repo (or the canonical one), while having two would require everyone to update it for their own setup (causing pointless merge conflicts, etc.).

The usual mode of working to test the developer's repo (i.e. unchecked-in changes) is to install it to a virtualenv and then use asv dev. That, IMHO, works better when compiled code is involved because you don't have any of the issues with setup.py build --inplace etc.
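
Concretely, that workflow is roughly the following (exact commands will vary with your setup):

# from the pandas working copy, with uncommitted changes present
virtualenv ~/envs/pandas-dev
source ~/envs/pandas-dev/bin/activate
pip install -e .     # or: python setup.py develop
asv dev              # runs the benchmarks against whatever pandas this Python imports

asv dev skips the clone/build machinery and just runs the benchmarks against the interpreter it is invoked with, so uncommitted changes are picked up automatically.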

@qwhelan (Contributor, author) commented Jan 26, 2015

Yeah, that's a very manual process for something I might do half a dozen times for a single PR. Here's how the vbench infrastructure currently handles it:

  • I commit changes locally (so no attempt to deal with unchecked-in changes) and give a target commit of local HEAD
  • I give a baseline commit of remote HEAD
  • vbench clones/fetches the local repo to a working directory (rather than the remote one, as asv does)
  • It then checks out each commit, builds it, and runs the benchmarks (same as asv, so no setup.py build --inplace concerns).
  • Finally, vbench produces a table showing the results for each commit and the percent change between them.

There's no fundamental difference between the two tools here; it's just a matter of getting asv to clone a local repo (which I can get around by modifying the remotes post-clone).
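
Concretely, the workaround I mean is something along these lines (paths made up):

# after asv has cloned the canonical repo into its project directory
cd <asv's checkout of the project>
git remote set-url origin /home/me/src/pandas   # point at my local repo instead of GitHub
git fetch origin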

Also, asv dev errored out on me - I'll open a ticket explaining why. #197

@mdboom (Collaborator) commented Jan 26, 2015

@qwhelan: I can see the value in working that way. I'll tinker a bit to see what it would take to make that work.

@mdboom (Collaborator) commented Jan 26, 2015

So here are the moving pieces that I think would be involved. It might be worth discussing first to make sure I'm on the right track for this use case before doing too much implementation.

  • (a) For simple/stupid reasons, asv doesn't support cloning a repo from the local filesystem. That is easy enough to fix.
  • (b) asv doesn't support using commits from multiple remotes. How does vbench handle it? How are the local and remote HEAD commits specified in your example? The way I'd try to do this in asv is to allow the repo key in asv.conf.json to be a dictionary mapping git remote names to repo URLs (local and/or remote), which would be turned into git remotes in the workspace checkout of the project (sketched just after this list). The user could then refer to commits using the usual git remote/ref format, i.e. master vs. origin/master. To refer to a local checkout in which the benchmark directory lives (as in your example), you could use a relative path for the project, so that each user wouldn't have to edit asv.conf.json for their local environment. Does that make sense and seem doable?
  • (c) asv doesn't have a side-by-side comparison of two commits, though asv continuous comes awfully close. (It tests a commit against its immediate parent -- it should be trivial to optionally accept two commits instead).
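
To sketch what I mean in (b) (purely illustrative; this is not something asv supports today), the config could look something like:

"repo": {
    "origin": "https://github.com/pandas-dev/pandas.git",
    "local": ".."
}

and commits would then be named with the usual remote-qualified syntax, e.g. origin/master vs. local/my-perf-branch.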

@mdboom (Collaborator) commented Jan 26, 2015

Duh -- we actually already have (c) in asv compare. (A great contribution from @astrofrog).

@pv (Collaborator) commented Feb 26, 2015

(a)+(b) should work now, since asv now uses git --mirror. Setting asv.conf.json to point to the local repo, e.g. repo: "..", should make the above workflow possible; you can then refer to remote and local branches as usual (origin/master, somelocalbranch, etc.).

asv continuous could also be made to work on a commit range to make (c) nicer; currently it's limited to two commits. Aside from that, the name of asv publish, and ccache on Homebrew (gh-196), the issues above are probably all fixed now.
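
I.e., with a layout roughly like this (names illustrative):

pandas                    <- project working copy
└── benchmarks
    ├── asv.conf.json     <- contains "repo": ".."
    └── benchmarks        <- the benchmark files themselves

you can run, for example:

asv continuous origin/master some-local-branch
asv compare origin/master some-local-branch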

@jreback commented Mar 20, 2015

@qwhelan @mdboom

This looks very exciting. vbench has served us well in pandas, but an actively maintained project is a big, big +1.

@qwhelan (Contributor, author) commented Mar 23, 2015

@mdboom @pv Sorry for disappearing on this. I've got a short script that converts vbench benchmarks into asv ones, and it works on the subset of the pandas cases I've tried so far. At this point, I think it's mostly a matter of checking that the suites work and letting things run in parallel for a bit.

@pv pv added the question Triaged as a question, not requiring action after answering label Aug 20, 2015
@pv pv closed this as completed Sep 28, 2015