Loadtesting metrics, updated #737
Conversation
As discussed offline:
itest/loadtest/load_test.go
pusher := push.New(pushURL, "load_test").
	Collector(testDuration).
	Grouping("test_case", tc.name)
Does the test name include factors such as the number of assets created, number before/after, etc? If not, we may want to add additional metrics, but as new time series rather than labels (don't want the labels to get too long).
I think this comment is still relevant, not necessarily blocking though.
The label was updated so that each gauge gets a unique one, in order to distinguish the pushed metrics. In PromQL it's fairly easy to aggregate everything into a single time series even though they are individual gauges.
The assets minted / sent are configured in the loadtest.conf file, so that number is always stable (for now).
We can rely on the tapd prom metrics to relate loadtest durations with tapd behavior.
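For illustration, a minimal sketch of how a per-test gauge like this ends up at the gateway. The metric name and helper function (`loadtest_duration_seconds`, `runAndPush`) are assumptions for the sketch, not the PR's exact code:

```go
package loadtest

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

// testDuration holds one gauge child per test name, so individual
// runs stay distinguishable but can still be aggregated in PromQL.
var testDuration = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "loadtest_duration_seconds", // assumed metric name
		Help: "Wall-clock duration of a load test run.",
	},
	[]string{"test_name"},
)

// runAndPush times a single test case and pushes the result to the
// configured Pushgateway.
func runAndPush(pushURL, testName string, run func()) {
	start := time.Now()
	run()

	testDuration.WithLabelValues(testName).Set(
		time.Since(start).Seconds(),
	)

	pusher := push.New(pushURL, "load_test").
		Collector(testDuration).
		Grouping("test_case", testName)

	if err := pusher.Push(); err != nil {
		log.Printf("failed to push metrics: %v", err)
	}
}
```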
}

// Construct the endpoint for Prometheus PushGateway.
cfg.PrometheusGateway.PushURL = fmt.Sprintf(
Perhaps this can be part of the config creation? Otherwise, something meant to validate is actually mutating the underlying config.
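One possible shape for that suggestion, sketched with hypothetical field and constructor names (Host, Port, NewPrometheusGatewayConfig) since the actual config struct isn't shown here: derive PushURL when the config is created, and keep validation side-effect free.

```go
package loadtest

import (
	"errors"
	"fmt"
)

// PrometheusGatewayConfig is a hypothetical stand-in for the real
// config struct in this PR.
type PrometheusGatewayConfig struct {
	Enabled bool
	Host    string
	Port    int

	// PushURL is derived once, at creation time.
	PushURL string
}

// NewPrometheusGatewayConfig builds the push endpoint up front so
// that later validation does not need to mutate the config.
func NewPrometheusGatewayConfig(host string, port int) *PrometheusGatewayConfig {
	return &PrometheusGatewayConfig{
		Enabled: true,
		Host:    host,
		Port:    port,
		PushURL: fmt.Sprintf("http://%s:%d", host, port),
	}
}

// Validate only checks the config and returns an error; it performs
// no mutation.
func (c *PrometheusGatewayConfig) Validate() error {
	if c.Enabled && c.PushURL == "" {
		return errors.New("prometheus gateway enabled but push URL is empty")
	}
	return nil
}
```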
Force-pushed from 6bfd6f4 to f14c531
itest/loadtest/load_test.go
	[]string{"test_name"},
)

memTotalAlloc = prometheus.NewGaugeVec(
Do we need this given the prom metrics already export memory related to the test? https://github.com/danielfm/prometheus-for-developers/blob/master/README.md#measuring-memorycpu-usage
Instead, I think we need to focus on metrics including (some of this is a matter of enabling the gRPC prom metrics and utilizing what we've already exported with new queries):
- time to execute the various components of the test (list asset, coin selection, etc)
  - this may require deeper telemetry within tapd itself, depending on the way the load tests are written
- proof size growth as a result of a test instance (size after compared to size before)
- total db size (for both sqlite and postgres)
- total number of assets created
- total number of assets sent

cc @calvinrzachman as he may already have some of this posted in the dashboard
> Do we need this given the prom metrics already export memory related to the test? https://github.com/danielfm/prometheus-for-developers/blob/master/README.md#measuring-memorycpu-usage
Probably not, will take a look at this
> time to execute the various components of the test (list asset, coin selection, etc)
> this may require deeper telemetry within tapd itself, depending on the way the load tests are written
I think it's better to fully rely on the long-running tapd prometheus metrics that were updated in tapd/pull/716. We could probably come up with some fancy collectors to retrieve everything.
Given that the tapd instances the loadtests run against are long-running, the metrics will persist across multiple loadtest runs.
So next steps:
- Strip any redundant changes from this PR: let's keep the push metrics introduced here only for tracking the duration of the test
- Enhance tapd itself with any type of collector we deem appropriate (gRPC, proof related, etc): decorate our grafana dashboards with the prom metrics of the long-running alice & bob nodes

Tracking extra things in this layer (itest/loadtest) is of limited use, as this is the client side and we don't really care about this code area; it would also be much more involved/complicated to extract the tapd metrics at this level and then push them to prom.
Force-pushed from f14c531 to 7c94e29
With this in mind, I'm marking the PR as ready for review. We will add more sophisticated tracking in tapd's prometheus monitoring, not in the loadtesting code.
"github.com/stretchr/testify/require" | ||
) | ||
|
||
var ( | ||
testDuration = prometheus.NewGaugeVec( | ||
prometheus.GaugeOpts{ |
We should make this a histogram metric. Then we'll be able to do percentile plots, and heat maps, etc.
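For reference, the histogram variant being suggested could look roughly like this; the metric name and bucket boundaries below are assumptions, not code from the PR:

```go
import "github.com/prometheus/client_golang/prometheus"

// A HistogramVec records each run into duration buckets, which is
// what percentile plots and heat maps are built from.
var testDurationHist = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "loadtest_duration_seconds", // assumed metric name
		Help: "Distribution of load test run durations.",
		// Illustrative buckets: 1s up to ~2048s, doubling each step.
		Buckets: prometheus.ExponentialBuckets(1, 2, 12),
	},
	[]string{"test_name"},
)

// Each run is recorded with Observe rather than Set:
// testDurationHist.WithLabelValues(tc.name).Observe(elapsed.Seconds())
```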
So I looked into this and a histogram won't persist over multiple runs (it will be overwritten across runs). There are some tricks we can try to get the gateway to persist it, but it's really not worth the time & diff.
Given the frequency at which we run the loadtests, we can produce a histogram from the GaugeVec with PromQL directly in Grafana. This is not as performant as directly using a histogram, but it will give us the same insights on percentiles etc.
Why wouldn't it persist? Isn't it the same as any other metric? I've never run into this restriction myself; is it just for push metrics? We have the histogram metrics for proof size in the other PR.
Are you referring to this?
IIUC that means that if the pushgateway is restarted before it's scraped, the metrics won't persist, but IIUC we have the system always running.
Yeah, in our setup pushgateway should always be alive. I was referring to the part where the client side restarts (by design) and we push a fresh instance of the "histogram". See here.
By pushing a fresh histogram to the pushgateway we're overwriting the old one, effectively only keeping the values of our last run (last-write-wins).
I'm not sure if a HistogramVec with a unique label per test run would be a fine workaround; it would be less performant than a simple histogram nevertheless.
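To illustrate the last-write-wins point: the Pushgateway only replaces metrics that arrive under the same grouping key, so one workaround is to make the grouping key unique per run. The run_id label and helper below are assumptions for the sketch, and the trade-off is that every run leaves behind its own metric group on the gateway:

```go
import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

// pushWithUniqueRun pushes the collector under a grouping key that
// includes a per-run identifier. Since the Pushgateway only replaces
// metrics within the same group, earlier runs are kept; the cost is
// one extra metric group per run, which grows without bound unless
// old groups are cleaned up.
func pushWithUniqueRun(pushURL, testCase string, c prometheus.Collector) error {
	runID := fmt.Sprintf("%d", time.Now().Unix())

	return push.New(pushURL, "load_test").
		Collector(c).
		Grouping("test_case", testCase).
		Grouping("run_id", runID).
		Push()
}
```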
There's also a community implementation of pushgateway that seems to solve this issue (haven't tested it): https://github.com/zapier/prom-aggregation-gateway
@GeorgeTsagk, remember to re-request review from reviewers when ready
Has some linter failures:
LGTM 🦜
g2g after linter fix
Equip the test orchestrator with the ability to push metrics on execution time to a configurable remote prometheus push gateway.
Force-pushed from 7c94e29 to 8f2499d
Pull Request Test Coverage Report for Build 10705406892
💛 - Coveralls
Description
This PR is based on #662 by @calvinrzachman. It is an updated / revisited version of our loadtesting setup and metrics collection.