
Catch performance regressions in CI #1560

Open
Sean-Der opened this issue Nov 30, 2020 · 11 comments

@Sean-Der
Member

Currently we have nothing to catch performance regressions. We have a few benchmarks spread across repos, but no one runs them consistently.

We should use benchstat and commit baseline reports in the .github folder. When we see a percentage change (5%?), we should fail the build. IPFS has already implemented this here; we just need to port it to GitHub Actions.

This also looks relevant
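
A rough sketch of what the gate could look like (a hypothetical helper, nothing like this exists in the repo yet; it assumes plain `go test -bench` output rather than benchstat's table format, and the 5% threshold is illustrative):

// benchgate: hypothetical helper that compares two `go test -bench` outputs
// and exits non-zero if any benchmark slowed down by more than a threshold.
// A sketch only; file names and the threshold are illustrative.
package main

import (
        "bufio"
        "fmt"
        "os"
        "strconv"
        "strings"
)

// parse extracts "benchmark name -> ns/op" from `go test -bench` output.
func parse(path string) (map[string]float64, error) {
        f, err := os.Open(path)
        if err != nil {
                return nil, err
        }
        defer f.Close()

        out := map[string]float64{}
        sc := bufio.NewScanner(f)
        for sc.Scan() {
                fields := strings.Fields(sc.Text())
                // Expected shape: BenchmarkFoo-2  1000000  932 ns/op [...]
                if len(fields) < 4 || !strings.HasPrefix(fields[0], "Benchmark") {
                        continue
                }
                nsPerOp, err := strconv.ParseFloat(fields[2], 64)
                if err != nil || fields[3] != "ns/op" {
                        continue
                }
                out[fields[0]] = nsPerOp
        }
        return out, sc.Err()
}

func main() {
        if len(os.Args) != 3 {
                fmt.Fprintln(os.Stderr, "usage: benchgate baseline.txt current.txt")
                os.Exit(2)
        }
        baseline, err := parse(os.Args[1])
        if err != nil {
                fmt.Fprintln(os.Stderr, err)
                os.Exit(2)
        }
        current, err := parse(os.Args[2])
        if err != nil {
                fmt.Fprintln(os.Stderr, err)
                os.Exit(2)
        }

        const threshold = 0.05 // fail on >5% slowdown
        failed := false
        for name, oldNs := range baseline {
                newNs, ok := current[name]
                if !ok {
                        continue
                }
                delta := (newNs - oldNs) / oldNs
                fmt.Printf("%s: %.0f ns/op -> %.0f ns/op (%+.1f%%)\n", name, oldNs, newNs, delta*100)
                if delta > threshold {
                        failed = true
                }
        }
        if failed {
                os.Exit(1)
        }
}

A single run's ns/op is noisy, so in practice we would probably still want benchstat-style repeated runs feeding something like this.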

@Sean-Der
Member Author

@at-wat @bshi does this sound like a good plan? Any mistakes I am running headfirst into? :)

@ashellunts @scorpionknifes if you are interested, this could be a really fun task to take on!

@Sean-Der
Member Author

I wrote this benchmark but didn't commit it:

func Benchmark_TrackLocalStaticRTP_SendOnly_WriteRTP(b *testing.B) {
       offer, answer, err := newPair()
       assert.NoError(b, err)

       track, err := NewTrackLocalStaticRTP(RTPCodecCapability{MimeType: "video/vp8"}, "video", "pion")
       assert.NoError(b, err)

       _, err = offer.AddTrack(track)
       assert.NoError(b, err)

       assert.NoError(b, signalPair(offer, answer))

       buffer := make([]byte, 1000)

       // Exclude the peer-connection setup above from the measured time.
       b.ResetTimer()

       for i := 0; i < b.N; i++ {
               assert.NoError(b, track.WriteRTP(rtp.Packet{Payload: buffer}))
       }

       // Exclude teardown as well.
       b.StopTimer()
       assert.NoError(b, offer.Close())
       assert.NoError(b, answer.Close())
}
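
To give benchstat something to compare, each side needs repeated runs, e.g. go test -run NONE -bench TrackLocalStaticRTP -benchmem -count 5 > new.txt on both the baseline and the candidate commit, then benchstat old.txt new.txt (commands from memory; adjust patterns and counts as needed).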

@at-wat
Member

at-wat commented Nov 30, 2020

@Sean-Der I think it's better to make it a pointer, not for performance but to make it less confusing, as I wrote in #1540 (review):

Before, we would keep a reference internally. Users would be surprised when values in the passed *rtp.Packet changed.

The contents stored as slices are not copied when the packet is passed by value:
https://github.com/pion/rtp/blob/c8b3d3ec7ff956797d8d29c5ab1cfb054c9e4028/packet.go#L17-L38
I think that is more confusing than passing a pointer, since the packet is partially copied and partially shared (for both users and developers).
IMO, it would be better to pass it as a pointer and clone the whole packet, including its slice members, wherever a copy is needed.
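
To make the "partially copied, partially shared" point concrete, here is a standalone toy example (not code from this repo; it only relies on rtp.Packet's exported Payload []byte field):

package main

import (
       "fmt"

       "github.com/pion/rtp"
)

func main() {
       pkt := rtp.Packet{Payload: []byte{1, 2, 3}}

       // Assigning (or passing) the packet by value copies the struct fields,
       // but the Payload slice header still points at the same backing array.
       byValue := pkt
       byValue.Payload[0] = 99

       fmt.Println(pkt.Payload[0]) // prints 99: mutating the "copy" changed the original
}

A clone that also copies Payload (and the header's slice fields) would avoid this, at the cost of an extra allocation per packet.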

@scorpionknifes
Member

scorpionknifes commented Dec 2, 2020

I was trying out benchstat on GitHub Actions. The results are sometimes inconsistent. I'm currently testing using the reference. I don't know if this is a worker issue, since the tests run in two different jobs on GitHub Actions.

The exact code I used is https://github.com/scorpionknifes/webrtc/blob/bench/.github/workflows/bench.yaml

Old time and new time are measured on the same code.

name                                      old time/op  new time/op   delta
DataChannelSend2-2                         932ns ±16%  3254ns ±141%  +249.33%  (p=0.016 n=5+5)
DataChannelSend4-2                        1.11µs ±23%   1.40µs ±16%      ~     (p=0.063 n=5+4)
DataChannelSend8-2                         885ns ±29%   1402ns ±28%   +58.37%  (p=0.016 n=5+5)
DataChannelSend16-2                        982ns ±29%   2365ns ±93%      ~     (p=0.151 n=5+5)
DataChannelSend32-2                        1.21s ± 0%    1.21s ± 0%      ~     (p=0.222 n=5+5)
_TrackLocalStaticRTP_SendOnly_WriteRTP-2  10.5µs ± 1%    8.9µs ± 1%   -15.04%  (p=0.008 n=5+5)

name                                      old time/op   new time/op  delta
DataChannelSend2-2                        4.02µs ±106%  2.41µs ±99%      ~     (p=0.222 n=5+5)
DataChannelSend4-2                          837ns ±29%  2476ns ±55%  +195.68%  (p=0.008 n=5+5)
DataChannelSend8-2                          881ns ±46%  3477ns ±78%      ~     (p=0.056 n=5+5)
DataChannelSend16-2                        1.53µs ±80%  2.98µs ±77%      ~     (p=0.151 n=5+5)
DataChannelSend32-2                         1.21s ± 0%   1.21s ± 0%      ~     (p=0.286 n=5+4)
_TrackLocalStaticRTP_SendOnly_WriteRTP-2   9.09µs ± 2%  8.37µs ± 1%    -8.02%  (p=0.008 n=5+5)

name                                      old time/op  new time/op  delta
DataChannelSend2-2                        1.00µs ±45%  1.29µs ±37%   ~     (p=0.056 n=5+5)
DataChannelSend4-2                        1.06µs ±33%  1.00µs ±35%   ~     (p=0.841 n=5+5)
DataChannelSend8-2                         824ns ±13%  1203ns ±83%   ~     (p=0.548 n=5+5)
DataChannelSend16-2                        709ns ±14%   684ns ±20%   ~     (p=0.841 n=5+4)
DataChannelSend32-2                        1.21s ± 0%   1.21s ± 0%   ~     (p=0.690 n=5+5)
_TrackLocalStaticRTP_SendOnly_WriteRTP-2  11.3µs ±10%  11.9µs ± 2%   ~     (p=0.095 n=5+5)

All we need to add is detection of the % change. I don't think we need to use benchstat; we could compare the overall time taken (below are the overall timings, in pairs).

ok  	github.com/pion/webrtc/v3	87.781s
ok  	github.com/pion/webrtc/v3	71.302s

ok  	github.com/pion/webrtc/v3	69.556s
ok  	github.com/pion/webrtc/v3	57.600s

ok  	github.com/pion/webrtc/v3	60.695s
ok  	github.com/pion/webrtc/v3	56.249s

I'm not sure about committing baseline reports, as different GitHub workers may produce different benchmark results.

@at-wat
Member

at-wat commented Dec 2, 2020

I think performance in the CI environment really depends on timing, since many CI jobs share a host machine. (Even if a virtual core is assigned to the job, memory and I/O bandwidth are limited by the host.)
I guess it's better to compare US working hours vs. the weekend to determine the tolerance.

@at-wat
Member

at-wat commented Dec 3, 2020

@Sean-Der any thoughts about #1560 (comment)?

@Sean-Der
Member Author

Sean-Der commented Dec 3, 2020

@at-wat sorry, I missed that!

I am 100% in support of leaving it as a pointer; I agree with your points. /v3 needs to go out, and I don't think changing it this late is a good idea either.

@Sean-Der
Member Author

Sean-Der commented Dec 3, 2020

@scorpionknifes that is a very good point. Maybe we just test the current commit against master? We can always make it optional; even if we get some false positives, I think this is a big step in the right direction :)
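
One possible shape for that job (a rough sketch; flags and refs are illustrative, and GITHUB_SHA is the commit Actions checks out for the PR):

       git fetch origin master
       git checkout origin/master
       go test -run NONE -bench . -count 5 | tee old.txt
       git checkout "$GITHUB_SHA"
       go test -run NONE -bench . -count 5 | tee new.txt
       benchstat old.txt new.txt

Running both sides in the same job at least keeps them on the same worker, which should remove some of the cross-job noise seen above.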

@Sean-Der
Member Author

I spent some time doing research today and found https://pythonspeed.com/articles/consistent-benchmarking-in-ci/

cachegrind looks promising! This might be a 'master only' thing, depending on how much longer it makes the tests run.
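
If I'm reading the article right, the rough shape would be: build the test binary once with go test -c -o webrtc.test, run it under valgrind --tool=cachegrind ./webrtc.test -test.run=NONE -test.bench=., and compare instruction counts instead of wall time, since instruction counts are far less sensitive to noisy neighbours on shared CI workers. (Commands are a guess at the shape, not something I've tested here.)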

@bshi

bshi commented Dec 10, 2020

I'm late to the party but have a couple of cents. The discussion has focused on how to track progress, with an emphasis on microbenchmarking, but I'd like to point out that this may be premature given we don't have much clarity on what we should be measuring (and, by extension, optimizing). The latter is, admittedly, a trickier problem, but my intuition is that we could be more effective if we could find a way to get good profiling data [a] from either real-world workloads (which would need to be volunteered) or synthetic ones (a.k.a. macrobenchmarks; these would require compute resources / $$$ that the project may not have access to).

So it seems worth (1) polling users to gather data about the top N bottlenecks [b] that exist in practice. What we learn could then (2) feed into efforts to build the appropriate micro/macro-benchmarks that reproduce said bottlenecks. And finally, after we have experience with what works well and what doesn't (to put it another way: what mix of micro and macro benchmarking is appropriate), we could build the appropriate tooling to track progress.

[a] In #1516, I volunteered data collected by adding a couple of lines of code to my synthetic load-test application. So the friction in enabling data collection is IMO low, but is it low enough that we can actually convince interested parties to contribute?
[b] To make an analogy: what are the current key scaling dimensions and limits of typical systems being built?

@jech
Member

jech commented Dec 19, 2020

I think Sean is merely speaking about catching regressions, not about general performance tuning. While microbenchmarks can be misleading, they are good at catching regressions, and every performance regression should be investigated, even if only so that a human can decide that it doesn't matter in practice.
Having realistic benchmarks would be great, but it's way more ambitious than what Sean is suggesting.
