Loadtesting metrics, updated #737
Conversation
As discussed offline:
itest/loadtest/load_test.go
pusher := push.New(pushURL, "load_test").
	Collector(testDuration).
	Grouping("test_case", tc.name)
Does the test name include factors such as the number of assets created, number before/after, etc? If not, we may want to add additional metrics, but as new time series rather than labels (don't want the labels to get too long).
I think this comment is still relevant, not necessarily blocking though.
The label was updated so that each gauge gets a unique one, in order to distinguish the pushed metrics. In PromQL it's fairly easy to aggregate everything into a single time series even though they are individual gauges.
The assets minted / sent are configured in the loadtest.conf file, so that number is always stable (for now).
We can rely on the tapd prom metrics to relate loadtest durations with tapd behavior.
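For illustration, a minimal sketch of how a per-test gauge like this ends up at the gateway. The metric name and helper function (`loadtest_duration_seconds`, `runAndPush`) are assumptions for the sketch, not the PR's exact code:

```go
package loadtest

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

// testDuration holds one gauge child per test name, so individual
// runs stay distinguishable but can still be aggregated in PromQL.
var testDuration = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "loadtest_duration_seconds", // assumed metric name
		Help: "Wall-clock duration of a load test run.",
	},
	[]string{"test_name"},
)

// runAndPush times a single test case and pushes the result to the
// configured Pushgateway.
func runAndPush(pushURL, testName string, run func()) {
	start := time.Now()
	run()

	testDuration.WithLabelValues(testName).Set(
		time.Since(start).Seconds(),
	)

	pusher := push.New(pushURL, "load_test").
		Collector(testDuration).
		Grouping("test_case", testName)

	if err := pusher.Push(); err != nil {
		log.Printf("failed to push metrics: %v", err)
	}
}
```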
}

// Construct the endpoint for Prometheus PushGateway.
cfg.PrometheusGateway.PushURL = fmt.Sprintf(
Perhaps this can be part of the config creation? Otherwise, something meant to validate is actually mutating the underlying config.
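One possible shape for that suggestion, sketched with hypothetical field and constructor names (Host, Port, NewPrometheusGatewayConfig) since the actual config struct isn't shown here: derive PushURL when the config is created, and keep validation side-effect free.

```go
package loadtest

import (
	"errors"
	"fmt"
)

// PrometheusGatewayConfig is a hypothetical stand-in for the real
// config struct in this PR.
type PrometheusGatewayConfig struct {
	Enabled bool
	Host    string
	Port    int

	// PushURL is derived once, at creation time.
	PushURL string
}

// NewPrometheusGatewayConfig builds the push endpoint up front so
// that later validation does not need to mutate the config.
func NewPrometheusGatewayConfig(host string, port int) *PrometheusGatewayConfig {
	return &PrometheusGatewayConfig{
		Enabled: true,
		Host:    host,
		Port:    port,
		PushURL: fmt.Sprintf("http://%s:%d", host, port),
	}
}

// Validate only checks the config and returns an error; it performs
// no mutation.
func (c *PrometheusGatewayConfig) Validate() error {
	if c.Enabled && c.PushURL == "" {
		return errors.New("prometheus gateway enabled but push URL is empty")
	}
	return nil
}
```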
Force-pushed from 6bfd6f4 to f14c531
itest/loadtest/load_test.go
	[]string{"test_name"},
)

memTotalAlloc = prometheus.NewGaugeVec(
Do we need this given the prom metrics already export memory related to the test? https://github.com/danielfm/prometheus-for-developers/blob/master/README.md#measuring-memorycpu-usage
Instead, I think we need to focus on metrics including (some of this is a matter of enabling the gRPC prom metrics and utilizing what we've already exported with new queries):
- time to execute the various components of the test (list asset, coin selection, etc)
  - this may require deeper telemetry within tapd itself, depending on the way the load tests are written
- proof size growth as a result of a test instance (size after compared to size before)
- total db size (for both sqlite and postgres)
- total number of assets created
- total number of assets sent

cc @calvinrzachman as he may already have some of this posted in the dashboard
> Do we need this given the prom metrics already export memory related to the test? https://github.com/danielfm/prometheus-for-developers/blob/master/README.md#measuring-memorycpu-usage
Probably not, will take a look at this
> time to execute the various components of the test (list asset, coin selection, etc)
> this may require deeper telemetry within tapd itself, depending on the way the load tests are written
I think it's better to fully rely on the long-running tapd prometheus metrics that were updated in tapd/pull/716. We could probably come up with some fancy collectors to retrieve everything.
Given that the tapd instances the loadtests run against are long-running, the metrics will persist across multiple loadtest runs.
So next steps:
- Strip any redundant changes from this PR: let's keep the push metrics introduced here only for tracking the duration of the test
- Enhance tapd itself with any type of collector we deem appropriate (gRPC, proof related, etc): decorate our grafana dashboards with the prom metrics of the long-running alice & bob nodes

Tracking extra things in this layer (itest/loadtest) is of limited use, as this is the client side and we don't really care about this code area; it would also be much more involved/complicated to extract the tapd metrics at this level and then push them to prom.
Force-pushed from f14c531 to 7c94e29
With this in mind, I'm marking the PR as ready for review. We will add more sophisticated tracking in tapd's prometheus monitoring, not in the loadtesting code.
"github.com/stretchr/testify/require" | ||
) | ||
|
||
var ( | ||
testDuration = prometheus.NewGaugeVec( | ||
prometheus.GaugeOpts{ |
We should make this a histogram metric. Then we'll be able to do percentile plots, and heat maps, etc.
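For reference, the histogram variant being suggested could look roughly like this; the metric name and bucket boundaries below are assumptions, not code from the PR:

```go
import "github.com/prometheus/client_golang/prometheus"

// A HistogramVec records each run into duration buckets, which is
// what percentile plots and heat maps are built from.
var testDurationHist = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "loadtest_duration_seconds", // assumed metric name
		Help: "Distribution of load test run durations.",
		// Illustrative buckets: 1s up to ~2048s, doubling each step.
		Buckets: prometheus.ExponentialBuckets(1, 2, 12),
	},
	[]string{"test_name"},
)

// Each run is recorded with Observe rather than Set:
// testDurationHist.WithLabelValues(tc.name).Observe(elapsed.Seconds())
```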
So I looked into this and a histogram won't persist over multiple runs (it will be overwritten across runs). There are some tricks we can try to get the gateway to persist it, but it's really not worth the time & diff.
Given the frequency at which we run the loadtests, we can produce a histogram from the GaugeVec with PromQL directly in Grafana. This is not as performant as directly using a histogram, but it will give us the same insights on percentiles etc.
Why wouldn't it persist? Isn't it the same as any other metric? I've never run into this restriction myself; is it just for push metrics? We have the histogram metrics for proof size in the other PR.
Are you referring to this?
IIUC that means that if the pushgateway is restarted before it's scraped, the metrics won't persist, but IIUC we have the system always running.
Yeah, in our setup pushgateway should always be alive. I was referring to the part where the client side restarts (by design) and we push a fresh instance of the "histogram". See here.
By pushing a fresh histogram to the pushgateway we're overwriting the old one, effectively only keeping the values of our last run (last-write-wins).
I'm not sure if a HistogramVec with a unique label per test run would be a fine workaround; it would be less performant than a simple histogram nevertheless.
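To illustrate the last-write-wins point: the Pushgateway only replaces metrics that arrive under the same grouping key, so one workaround is to make the grouping key unique per run. The run_id label and helper below are assumptions for the sketch, and the trade-off is that every run leaves behind its own metric group on the gateway:

```go
import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

// pushWithUniqueRun pushes the collector under a grouping key that
// includes a per-run identifier. Since the Pushgateway only replaces
// metrics within the same group, earlier runs are kept; the cost is
// one extra metric group per run, which grows without bound unless
// old groups are cleaned up.
func pushWithUniqueRun(pushURL, testCase string, c prometheus.Collector) error {
	runID := fmt.Sprintf("%d", time.Now().Unix())

	return push.New(pushURL, "load_test").
		Collector(c).
		Grouping("test_case", testCase).
		Grouping("run_id", runID).
		Push()
}
```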
There's also a community implementation of pushgateway that seems to solve this issue (haven't tested it): https://github.com/zapier/prom-aggregation-gateway
@GeorgeTsagk, remember to re-request review from reviewers when ready
Has some linter failures:
LGTM 🦜
g2g after linter fix
Equip the test orchestrator with the ability to push metrics on execution time to a configurable remote prometheus push gateway.
Force-pushed from 7c94e29 to 8f2499d
Pull Request Test Coverage Report for Build 10705406892
💛 - Coveralls
Description
This PR is based on #662 by @calvinrzachman. It is an updated / revisited version of our loadtesting setup and metrics collection.