Loadtesting metrics, updated #737

Merged

Conversation

GeorgeTsagk
Member

Description

This PR is based on #662 by @calvinrzachman. It is an updated, revisited version of our loadtesting setup and metrics collection.

@GeorgeTsagk GeorgeTsagk self-assigned this Dec 13, 2023
@GeorgeTsagk GeorgeTsagk marked this pull request as draft December 13, 2023 22:58
@GeorgeTsagk
Member Author

As discussed offline:
We will not include client-side metrics for resource usage, as they are not that insightful; the main load is placed on the universe server.


pusher := push.New(pushURL, "load_test").
	Collector(testDuration).
	Grouping("test_case", tc.name)
Member

Does the test name include factors such as the number of assets created, number before/after, etc? If not, we may want to add additional metrics, but as new time series rather than labels (don't want the labels to get too long).
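As an illustration of the "new time series rather than labels" idea, a minimal sketch using the same client_golang API as this PR; the metric and variable names below are hypothetical and not part of the change:

package loadtest

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical example: expose the asset count as its own gauge (a new time
// series) instead of encoding it as an extra label on the duration metric.
var numAssetsMinted = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "loadtest_num_assets_minted",
		Help: "Number of assets minted during the load test run.",
	},
	[]string{"test_name"},
)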

Member

I think this comment is still relevant, not necessarily blocking though.

Member Author

The label was updated so that each gauge gets a unique one, in order to distinguish the pushed metrics. In PromQL it's fairly easy to aggregate everything into a single time series, regardless of the metrics being individual gauges.

The number of assets minted/sent is configured in the loadtest.conf file, so that number is always stable (for now).

We can rely on the tapd Prometheus metrics to relate loadtest durations to tapd behavior.
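For illustration, a rough sketch of that approach, under the assumption that Prometheus scrapes the pushgateway between runs; the helper name, metric name, and label scheme here are assumptions, not the PR's actual code, and the PromQL in the comment is only an example query:

package loadtest

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

// pushTestDuration pushes the duration of one run under a unique label value,
// so runs end up as distinct series once Prometheus scrapes the gateway, even
// though the gateway itself only keeps the latest push per group. In Grafana
// the per-run series can be folded back together with a query such as
// max by (test_case) (test_duration_seconds).
func pushTestDuration(pushURL, testName string, d time.Duration) error {
	testDuration := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "test_duration_seconds",
			Help: "Duration of a single load test run.",
		},
		[]string{"test_name"},
	)

	// Unique per-run label value, e.g. "mint-1725465600".
	runLabel := fmt.Sprintf("%s-%d", testName, time.Now().Unix())
	testDuration.WithLabelValues(runLabel).Set(d.Seconds())

	return push.New(pushURL, "load_test").
		Collector(testDuration).
		Grouping("test_case", testName).
		Push()
}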

}

// Construct the endpoint for Prometheus PushGateway.
cfg.PrometheusGateway.PushURL = fmt.Sprintf(
Member

Perhaps this can be part of the config creation? Otherwise, something meant to validate is actually mutating the underlying config.
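A sketch of that suggestion, with hypothetical constructor and field names (the actual config layout in the PR may differ): derive the push URL when the config is built, so validation stays free of side effects.

package loadtest

import "fmt"

// PrometheusGatewayConfig mirrors the gateway settings referenced above; the
// Port field and the URL scheme are assumptions for this sketch.
type PrometheusGatewayConfig struct {
	Enabled bool
	Host    string
	Port    int
	PushURL string
}

// newPrometheusGatewayConfig builds the config and the push endpoint in one
// place, instead of mutating PushURL during validation.
func newPrometheusGatewayConfig(host string, port int) *PrometheusGatewayConfig {
	return &PrometheusGatewayConfig{
		Enabled: true,
		Host:    host,
		Port:    port,
		PushURL: fmt.Sprintf("http://%s:%d", host, port),
	}
}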

@GeorgeTsagk GeorgeTsagk force-pushed the loadtest-push-metrics-updated branch from 6bfd6f4 to f14c531 Compare July 19, 2024 07:34
[]string{"test_name"},
)

memTotalAlloc = prometheus.NewGaugeVec(
Member

Do we need this given the prom metrics already export memory related to the test? https://github.com/danielfm/prometheus-for-developers/blob/master/README.md#measuring-memorycpu-usage

Member

Instead, I think we need to focus on metrics including (some of this is a matter of enabling the gRPC prom metrics, and utilizing what we've already exported w/ new queries; a sketch of one example follows after this list):

  • time to execute the various components of the test (list asset, coin selection, etc)
    • this may require deeper telemetry within tapd itself, depending on the way the load tests are written
  • proof size growth as a result of test instance (size after compared to size before)
  • total db size (for both sqlite and postgres)
  • total number of assets created
  • total number of assets sent

cc @calvinrzachman as he may already have some of this posted in the dashboard
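As one concrete example from the list above, a rough sketch of the "total db size" metric for the sqlite case; the function, metric name, and path handling are hypothetical:

package loadtest

import (
	"os"

	"github.com/prometheus/client_golang/prometheus"
)

// dbSizeBytes would track the on-disk size of the sqlite database.
var dbSizeBytes = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "loadtest_db_size_bytes",
	Help: "Size of the tapd sqlite database file in bytes.",
})

// recordDBSize stats the database file and records its current size.
func recordDBSize(dbPath string) error {
	fi, err := os.Stat(dbPath)
	if err != nil {
		return err
	}
	dbSizeBytes.Set(float64(fi.Size()))
	return nil
}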

Member Author

Do we need this given the prom metrics already export memory related to the test? https://github.com/danielfm/prometheus-for-developers/blob/master/README.md#measuring-memorycpu-usage

Probably not, will take a look at this

Member Author

time to execute the various components of the test (list asset, coin selection, etc)
this may require deeper telemetry within tapd itself, depending on the way the load tests are written

I think it's better to fully rely on the long-running tapd prometheus metrics that were updated on tapd/pull/716. We could probably come up with some fancy collectors to retrieve everything.

Given that the tapd instances that the loadtests run against are long-running, the metrics will persist across multiple loadtest runs.

Member Author

@GeorgeTsagk GeorgeTsagk Jul 29, 2024

So next steps:

  • Strip any redundant changes from this PR: let's keep the push metrics introduced here only for tracking the duration of the test.
  • Enhance tapd itself with any type of collector we deem appropriate (gRPC, proof-related, etc.): decorate our Grafana dashboards with the prom metrics of the long-running alice & bob nodes.

Tracking extra things in this layer (itest/loadtest) is of limited use, since this is the client side and we don't really care about this code area; it would also be much more involved/complicated to extract the tapd metrics at this level and then push them to Prometheus.

@GeorgeTsagk GeorgeTsagk force-pushed the loadtest-push-metrics-updated branch from f14c531 to 7c94e29 Compare July 30, 2024 16:07
@GeorgeTsagk
Member Author

GeorgeTsagk commented Jul 30, 2024

With this in mind, I'm marking the PR as ready for review. We will add more sophisticated tracking in tapd's Prometheus monitoring, not in the loadtesting code.

@GeorgeTsagk GeorgeTsagk marked this pull request as ready for review July 30, 2024 16:09
"github.com/stretchr/testify/require"
)

var (
testDuration = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Member

We should make this a histogram metric. Then we'll be able to do percentile plots, heat maps, etc.
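For reference, a minimal sketch of the suggested histogram, assuming the same client_golang package; the metric name and bucket layout are placeholders:

package loadtest

import "github.com/prometheus/client_golang/prometheus"

// A histogram of run durations would enable percentile plots and heat maps in
// Grafana; the buckets here span roughly 1s to ~34min.
var testDurationHist = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "test_duration_seconds",
	Help:    "Distribution of load test run durations.",
	Buckets: prometheus.ExponentialBuckets(1, 2, 12),
})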

Member Author

So I looked into this and a histogram won't persist over multiple runs (it will be overwritten across runs). There are some tricks we can try to get the gateway to persist it, but it's really not worth the time & diff.

Given the frequency at which we run the loadtests, we can produce a histogram from the GaugeVec with PromQL directly in Grafana. This is not as performant as directly using a histogram, but it will give us the same insights on percentiles etc.

Member

Why wouldn't it persist? Isn't it the same as any other metric? I've never run into this restriction myself; is it just for push metrics? We have the histogram metrics for proof size in the other PR.

Member

Are you referring to this?

IIUC that means that if the pushgateway is restarted before it's scraped, the metrics won't persist, but IIUC we have the system always running.

Member Author

Yeah, in our setup the pushgateway should always be alive. I was referring to the part where the client side restarts (by design) and we push a fresh instance of the "histogram". See here.

By pushing a fresh histogram to the pushgateway we're overwriting the old one, effectively only keeping the values of our last run (last-write-wins).

I'm not sure whether a HistogramVec with a unique label per test run would be a fine workaround; it would nevertheless be less performant than a simple histogram.
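To make the trade-off concrete, a sketch (untested, all names are assumptions) of what that HistogramVec workaround could look like, using a per-run label plus a per-run grouping so pushes land in separate groups instead of overwriting each other:

package loadtest

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

// pushRunHistogram records one run into a fresh HistogramVec and pushes it
// under a per-run grouping, so earlier runs keep living in their own metric
// groups on the gateway.
func pushRunHistogram(pushURL, testName string, d time.Duration) error {
	hist := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "loadtest_duration_seconds",
			Help:    "Distribution of load test run durations.",
			Buckets: prometheus.ExponentialBuckets(1, 2, 12),
		},
		[]string{"run_id"},
	)

	runID := fmt.Sprintf("%d", time.Now().Unix())
	hist.WithLabelValues(runID).Observe(d.Seconds())

	return push.New(pushURL, "load_test").
		Collector(hist).
		Grouping("test_case", testName).
		Grouping("run_id", runID).
		Push()
}

The obvious caveat would be cardinality: every run adds a new series and a new group on the gateway, so old groups would need periodic cleanup.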

Member Author

There's also a community implementation of pushgateway that seems to solve this issue (I haven't tested it):
https://github.com/zapier/prom-aggregation-gateway

@dstadulis dstadulis added this to the v0.4.2 milestone Aug 13, 2024
@lightninglabs-deploy

@GeorgeTsagk, remember to re-request review from reviewers when ready

@Roasbeef
Member

Roasbeef commented Sep 3, 2024

Has some linter failures:

itest/loadtest/config.go:64: line is 102 characters (lll)
	Enabled bool   `long:"enabled" description:"Enable pushing metrics to Prometheus PushGateway"`
itest/loadtest/config.go:65: line is 86 characters (lll)
	Host    string `long:"host" description:"Prometheus PushGateway host address"`
itest/loadtest/config.go:110: line is 81 characters (lll)
	// PrometheusGateway is the configuration for the Prometheus PushGateway.
itest/loadtest/config.go:111: line is 161 characters (lll)
	PrometheusGateway *PrometheusGatewayConfig `group:"prometheus-gateway" namespace:"prometheus-gateway" description:"Prometheus PushGateway configuration"`
itest/loadtest/config.go:183: line is 83 characters (lll)
			return nil, fmt.Errorf("gateway hostname may not be empty")

@Roasbeef Roasbeef enabled auto-merge September 3, 2024 18:40
@guggero guggero disabled auto-merge September 3, 2024 18:46
Member

@Roasbeef Roasbeef left a comment

LGTM 🦜

g2g after linter fix

calvinrzachman and others added 3 commits September 4, 2024 18:03
Equip the test orchestrator with the ability to push metrics on execution time to a configurable remote prometheus push gateway.
@GeorgeTsagk GeorgeTsagk force-pushed the loadtest-push-metrics-updated branch from 7c94e29 to 8f2499d Compare September 4, 2024 16:05
@coveralls

Pull Request Test Coverage Report for Build 10705406892

Details

  • 0 of 22 (0.0%) changed or added relevant lines in 1 file are covered.
  • 36 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-0.04%) to 40.139%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
itest/loadtest/config.go | 0 | 22 | 0.0%

Files with Coverage Reduction | New Missed Lines | %
tappsbt/create.go | 2 | 53.22%
commitment/tap.go | 4 | 83.91%
asset/asset.go | 4 | 81.61%
tapdb/multiverse.go | 7 | 60.32%
universe/interface.go | 19 | 47.09%

Totals Coverage Status
Change from base Build 10628247677: -0.04%
Covered Lines: 23959
Relevant Lines: 59690

💛 - Coveralls

@Roasbeef Roasbeef disabled auto-merge September 4, 2024 20:13
@Roasbeef Roasbeef merged commit 72551b4 into lightninglabs:main Sep 4, 2024
18 checks passed