
backend: get raw values for the various dimensions and present them to the UI for rendering #14

Open
odeke-em opened this issue Dec 16, 2022 · 4 comments

@odeke-em
Member

Right now the informalsystems/tm-load-test code computes averages of the values and displays a summary, due to the limited user interface of a CLI. We need to be able to show percentiles like we do for QPounder.

@kirbyquerby
Contributor

kirbyquerby commented Dec 16, 2022

Of the current dimensions we return:

message RunLoadtestResponse {
// The total number of transactions sent.
// Corresponds to total_txs in tm-load-test.
int64 total_txs = 1;
// The total time taken to send `total_txs` transactions.
// Corresponds to total_time in tm-load-test.
google.protobuf.Duration total_time = 2;
// The cumulative number of bytes sent as transactions.
// Corresponds to total_bytes in tm-load-test.
int64 total_bytes = 3;
// The rate at which transactions were submitted (tx/sec).
// Corresponds to avg_tx_rate in tm-load-test.
double avg_txs_per_second = 4;
// The rate at which data was transmitted in transactions (bytes/sec).
// Corresponds to avg_data_rate in tm-load-test.
double avg_bytes_per_second = 5;
}

There are three totals and two average rates. QPounder only shows percentiles for request latency. What metrics are you looking for, and which ones do you want percentiles for? Do you want a new metric added to track percentiles for request latency? Do you want graphs over time for certain metrics as well?

If we want to track latency, does it still make sense to track it for broadcast_tx_async, which returns immediately?

@odeke-em
Member Author

So to load test an application, we need to ensure that users can see minimums and maximums for transactions processed as well as bytes processed, which then allows them to try out changes like removing or adding logging while still seeing the extremes. Averages add a lot of noise and distortion when there is a lot of variance in the values. As for new metrics, great question, but perhaps we can discuss those after just using the current ones. Please see this suggestion of a graph:
[screenshot: suggested graph]

How to extract those metrics

This is a perfect case for using OpenCensus and then our custom exporter. For transactions processed we can use a count aggregation, and for bytes processed and received we can use a distribution aggregation, as sketched below.
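
For illustration, here is a minimal sketch of that approach with OpenCensus in Go. The measure names, view names, and bucket bounds are hypothetical placeholders, not anything that exists in the codebase:

package metrics

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

var (
	// Hypothetical measures for the dimensions discussed above.
	mTxs   = stats.Int64("loadtest/txs", "Transactions sent", stats.UnitDimensionless)
	mBytes = stats.Int64("loadtest/tx_bytes", "Bytes sent per transaction", stats.UnitBytes)
)

var views = []*view.View{
	{
		Name:        "loadtest/txs_count",
		Description: "Count of transactions processed",
		Measure:     mTxs,
		Aggregation: view.Count(),
	},
	{
		Name:        "loadtest/tx_bytes_distribution",
		Description: "Distribution of bytes per transaction",
		Measure:     mBytes,
		// Bucket bounds are placeholders; tune them to expected tx sizes.
		Aggregation: view.Distribution(0, 64, 256, 1024, 4096, 16384),
	},
}

// Register wires the views into OpenCensus so a custom exporter can read them.
func Register() {
	if err := view.Register(views...); err != nil {
		log.Fatalf("failed to register views: %v", err)
	}
}

// Record would be called once per transaction sent.
func Record(ctx context.Context, txBytes int) {
	stats.Record(ctx, mTxs.M(1), mBytes.M(int64(txBytes)))
}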

odeke-em added a commit to orijtech/tm-load-test that referenced this issue Dec 18, 2022
This change returns more informative stats with p50, p75, p90, p95, p99
values for the latencies which massively help in seeing the actual
performance of a node. These values are useful to properly visualize
the processing power maximums and minimums plus load tapers.

Updates orijtech/cosmosloadtester#14
@odeke-em
Member Author

@kirbyquerby @willpoint before my flight to Canada today, while on a layover in Los Angeles, I sat down and explored a bunch of options like using OpenCensus/OpenTelemetry, but their APIs are now too convoluted, so I instead rolled out simple stats in https://github.com/orijtech/tm-load-test/releases/tag/vorijtech-1.0.0 per orijtech/tm-load-test@d2a5d18. Now, if we use this diff, we can see more informative stats like this:

diff --git a/go.mod b/go.mod
index 74a202f..ae6d99b 100644
--- a/go.mod
+++ b/go.mod
@@ -9,7 +9,7 @@ require (
 	github.com/informalsystems/tm-load-test v1.0.0
 	github.com/lib/pq v1.10.6
 	github.com/sirupsen/logrus v1.9.0
-	go.opencensus.io v0.23.0
+	go.opencensus.io v0.24.0
 	google.golang.org/genproto v0.0.0-20221207170731-23e4bf6bdc37
 	google.golang.org/grpc v1.51.0
 	google.golang.org/protobuf v1.28.2-0.20220831092852-f930b1dc76e8
@@ -96,7 +96,7 @@ require (
 	github.com/spf13/jwalterweatherman v1.1.0 // indirect
 	github.com/spf13/pflag v1.0.5 // indirect
 	github.com/spf13/viper v1.13.0 // indirect
-	github.com/stretchr/testify v1.8.0 // indirect
+	github.com/stretchr/testify v1.8.1 // indirect
 	github.com/subosito/gotenv v1.4.1 // indirect
 	github.com/syndtr/goleveldb v1.0.1-0.20210819022825-2ae1ddf74ef7 // indirect
 	github.com/tendermint/btcd v0.1.1 // indirect
@@ -121,4 +121,6 @@ require (
 
 replace github.com/gogo/protobuf => github.com/regen-network/protobuf v1.3.3-alpha.regen.1
 
+replace github.com/informalsystems/tm-load-test => github.com/orijtech/tm-load-test v1.0.1-0.20221218023019-d2a5d1861a00
+
 // replace github.com/informalsystems/tm-load-test => /home/nathan/Documents/tm-load-test
diff --git a/server/server.go b/server/server.go
index 621bd33..5c416d8 100644
--- a/server/server.go
+++ b/server/server.go
@@ -3,6 +3,7 @@ package server
 import (
 	"context"
 	"encoding/csv"
+	"encoding/json"
 	"fmt"
 	"os"
 	"strconv"
@@ -75,11 +76,19 @@ func (s *Server) RunLoadtest(ctx context.Context, req *loadtestpb.RunLoadtestReq
 		return nil, status.Errorf(codes.InvalidArgument, "invalid input: %v", err)
 	}
 
-	err = loadtest.ExecuteStandalone(cfg)
+	psL, err := loadtest.ExecuteStandaloneWithStats(cfg)
 	if err != nil {
 		return nil, err
 	}
 
+	// TODO: Send over the actual values of psL to the UI
+	// instead of the CSV parsing down below.
+	blob, err := json.MarshalIndent(psL, "", "  ")
+	if err != nil {
+		return nil, err
+	}
+	println(string(blob))
+
 	f, err := os.Open(statsOutputFilePath)
 	if err != nil {
 		return nil, fmt.Errorf("failed to open stats output file: %w", err)

UI request

[screenshot: UI request, 2022-12-17]

Result

[
  {
    "avg_bytes_per_sec": 11521.651142520457,
    "avg_tx_per_sec": 2880.4127856301143,
    "total_time": 39032947139,
    "total_bytes": 449724,
    "total_txs": 112431,
    "p50": {
      "at_ns": 10581297990,
      "at_str": "10.58129799s",
      "latency": 7037
    },
    "p75": {
      "at_ns": 28128359648,
      "at_str": "28.128359648s",
      "latency": 11295
    },
    "p90": {
      "at_ns": 20108712942,
      "at_str": "20.108712942s",
      "latency": 21267
    },
    "p95": {
      "at_ns": 31586764579,
      "at_str": "31.586764579s",
      "latency": 32603
    },
    "p99": {
      "at_ns": 4631386768,
      "at_str": "4.631386768s",
      "latency": 90098
    },
    "per_sec": [
      {
        "sec": 0,
        "qps": 8838,
        "bytes": 35352
      },
      {
        "sec": 1,
        "qps": 2911,
        "bytes": 11644
      },
      {
        "sec": 2,
        "qps": 5342,
        "bytes": 21368
      },
      {
        "sec": 3,
        "qps": 3074,
        "bytes": 12296
      },
      {
        "sec": 4,
        "qps": 9218,
        "bytes": 36872
      },
      {
        "sec": 5,
        "qps": 0,
        "bytes": 0
      },
      {
        "sec": 6,
        "qps": 0,
        "bytes": 0
      },
      {
        "sec": 7,
        "qps": 10001,
        "bytes": 40004
      },
      {
        "sec": 8,
        "qps": 5403,
        "bytes": 21612
      },
      {
        "sec": 9,
        "qps": 3244,
        "bytes": 12976
      },
      {
        "sec": 10,
        "qps": 3407,
        "bytes": 13628
      },
      {
        "sec": 11,
        "qps": 3730,
        "bytes": 14920
      },
      {
        "sec": 12,
        "qps": 2433,
        "bytes": 9732
      },
      {
        "sec": 13,
        "qps": 2921,
        "bytes": 11684
      },
      {
        "sec": 14,
        "qps": 2595,
        "bytes": 10380
      },
      {
        "sec": 15,
        "qps": 2433,
        "bytes": 9732
      },
      {
        "sec": 16,
        "qps": 2758,
        "bytes": 11032
      },
      {
        "sec": 17,
        "qps": 2596,
        "bytes": 10384
      },
      {
        "sec": 18,
        "qps": 2108,
        "bytes": 8432
      },
      {
        "sec": 19,
        "qps": 2271,
        "bytes": 9084
      },
      {
        "sec": 20,
        "qps": 2758,
        "bytes": 11032
      },
      {
        "sec": 21,
        "qps": 3568,
        "bytes": 14272
      },
      {
        "sec": 22,
        "qps": 2434,
        "bytes": 9736
      },
      {
        "sec": 23,
        "qps": 2109,
        "bytes": 8436
      },
      {
        "sec": 24,
        "qps": 2595,
        "bytes": 10380
      },
      {
        "sec": 25,
        "qps": 2271,
        "bytes": 9084
      },
      {
        "sec": 26,
        "qps": 1946,
        "bytes": 7784
      },
      {
        "sec": 27,
        "qps": 1947,
        "bytes": 7788
      },
      {
        "sec": 28,
        "qps": 1460,
        "bytes": 5840
      },
      {
        "sec": 29,
        "qps": 1135,
        "bytes": 4540
      },
      {
        "sec": 30,
        "qps": 1623,
        "bytes": 6492
      },
      {
        "sec": 31,
        "qps": 1459,
        "bytes": 5836
      },
      {
        "sec": 32,
        "qps": 1947,
        "bytes": 7788
      },
      {
        "sec": 33,
        "qps": 1622,
        "bytes": 6488
      },
      {
        "sec": 34,
        "qps": 2109,
        "bytes": 8436
      },
      {
        "sec": 35,
        "qps": 2271,
        "bytes": 9084
      },
      {
        "sec": 36,
        "qps": 1622,
        "bytes": 6488
      },
      {
        "sec": 37,
        "qps": 2272,
        "bytes": 9088
      }
    ]
  }
]

With this we can now graph time-series data as suggested above, showing both QPS and bytes/second, as well as the latency progression at the various percentiles.
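
For reference, the JSON above could be modeled with plain Go structs on the server or UI-facing side, which would also let us act on the TODO in the diff and forward psL directly instead of re-parsing the CSV. The type names here are hypothetical; only the field names follow the keys emitted by the fork:

package server

// PercentilePoint mirrors the p50..p99 entries in the stats output.
type PercentilePoint struct {
	AtNs    int64  `json:"at_ns"`
	AtStr   string `json:"at_str"`
	Latency int64  `json:"latency"`
}

// SecondSample mirrors one entry of the per_sec array.
type SecondSample struct {
	Sec   int   `json:"sec"`
	QPS   int64 `json:"qps"`
	Bytes int64 `json:"bytes"`
}

// LoadtestStats mirrors one element of the top-level JSON array.
type LoadtestStats struct {
	AvgBytesPerSec float64         `json:"avg_bytes_per_sec"`
	AvgTxPerSec    float64         `json:"avg_tx_per_sec"`
	TotalTime      int64           `json:"total_time"`
	TotalBytes     int64           `json:"total_bytes"`
	TotalTxs       int64           `json:"total_txs"`
	P50            PercentilePoint `json:"p50"`
	P75            PercentilePoint `json:"p75"`
	P90            PercentilePoint `json:"p90"`
	P95            PercentilePoint `json:"p95"`
	P99            PercentilePoint `json:"p99"`
	PerSec         []SecondSample  `json:"per_sec"`
}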

@odeke-em
Member Author

I've mailed out a much better modification in our fork of tm-load-test that now shows rankings per second and thus allows visualizing percentiles: https://github.com/orijtech/tm-load-test/releases/tag/vorijtech-1.1.0. For example:

[
  {
    "avg_bytes_per_sec": 422.15393825983676,
    "avg_tx_per_sec": 105.53848456495919,
    "total_time": 18002911524,
    "total_bytes": 7600,
    "total_txs": 1900,
    "p50": {
      "at_ns": 12000740850,
      "at_str": "12.00074085s",
      "latency": 16942,
      "size": 4
    },
    "p75": {
      "at_ns": 2001010940,
      "at_str": "2.00101094s",
      "latency": 25159,
      "size": 4
    },
    "p90": {
      "at_ns": 15002840984,
      "at_str": "15.002840984s",
      "latency": 40838,
      "size": 4
    },
    "p95": {
      "at_ns": 16000796152,
      "at_str": "16.000796152s",
      "latency": 54770,
      "size": 4
    },
    "p99": {
      "at_ns": 3000639229,
      "at_str": "3.000639229s",
      "latency": 88433,
      "size": 4
    },
    "per_sec": [
      {
        "sec": 0,
        "qps": 100,
        "bytes": 400,
        "bytes_rankings": {
          "p50": {
            "at_ns": 3025761,
            "at_str": "3.025761ms",
            "size": 4
          },
          "p75": {
            "at_ns": 3126893,
            "at_str": "3.126893ms",
            "size": 4
          },
          "p90": {
            "at_ns": 710362,
            "at_str": "710.362µs",
            "size": 4
          },
          "p95": {
            "at_ns": 2010957,
            "at_str": "2.010957ms",
            "size": 4
          },
          "p99": {
            "at_ns": 633480,
            "at_str": "633.48µs",
            "size": 4
          }
        },
        "latency_rankings": {
          "p50": {
            "at_ns": 3025761,
            "at_str": "3.025761ms",
            "latency": 17800
          },
          "p75": {
            "at_ns": 3126893,
            "at_str": "3.126893ms",
            "latency": 26296
          },
          "p90": {
            "at_ns": 710362,
            "at_str": "710.362µs",
            "latency": 42240
          },
          "p95": {
            "at_ns": 2010957,
            "at_str": "2.010957ms",
            "latency": 56953
          },
          "p99": {
            "at_ns": 633480,
            "at_str": "633.48µs",
            "latency": 242929
          }
        }
      },
      {
        "sec": 1,
        "qps": 100,
        "bytes": 400,
        "bytes_rankings": {
          "p50": {
            "at_ns": 1002111655,
            "at_str": "1.002111655s",
            "size": 4
          },
          "p75": {
            "at_ns": 1002353024,
            "at_str": "1.002353024s",
            "size": 4
          },
          "p90": {
            "at_ns": 1002326247,
            "at_str": "1.002326247s",
            "size": 4
          },
          "p95": {
            "at_ns": 1002397596,
            "at_str": "1.002397596s",
            "size": 4
          },
          "p99": {
            "at_ns": 1001288678,
            "at_str": "1.001288678s",
            "size": 4
          }
        },
        "latency_rankings": {
          "p50": {
            "at_ns": 1002111655,
            "at_str": "1.002111655s",
            "latency": 14099
          },
          "p75": {
            "at_ns": 1002353024,
            "at_str": "1.002353024s",
            "latency": 23419
          },
          "p90": {
            "at_ns": 1002326247,
            "at_str": "1.002326247s",
            "latency": 39388
          },
          "p95": {
            "at_ns": 1002397596,
            "at_str": "1.002397596s",
            "latency": 42895
          },
          "p99": {
            "at_ns": 1001288678,
            "at_str": "1.001288678s",
            "latency": 81521
          }
        }
      }
    ]
  }
]
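
Since each per-second sample in the v1.1.0 shape carries its own latency_rankings and bytes_rankings, a percentile-over-time series can be read straight out of per_sec. A small sketch under the same assumptions as above (type names are hypothetical, field names follow the JSON keys):

package server

// RankingPoint matches the entries under bytes_rankings / latency_rankings.
type RankingPoint struct {
	AtNs    int64  `json:"at_ns"`
	AtStr   string `json:"at_str"`
	Latency int64  `json:"latency,omitempty"`
	Size    int64  `json:"size,omitempty"`
}

// SecondSampleV2 mirrors a per_sec entry in the v1.1.0 output.
type SecondSampleV2 struct {
	Sec             int                     `json:"sec"`
	QPS             int64                   `json:"qps"`
	Bytes           int64                   `json:"bytes"`
	BytesRankings   map[string]RankingPoint `json:"bytes_rankings"`
	LatencyRankings map[string]RankingPoint `json:"latency_rankings"`
}

// p99LatencySeries pulls one percentile's latency per second, ready for charting.
func p99LatencySeries(perSec []SecondSampleV2) []int64 {
	out := make([]int64, 0, len(perSec))
	for _, s := range perSec {
		out = append(out, s.LatencyRankings["p99"].Latency)
	}
	return out
}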

odeke-em added a commit that referenced this issue Dec 18, 2022
This change uses our fork of tm-load-test which provides
latency and bytes rankings for each second and will allow us
to visualize percentile graphs which exhibit actual behavior
necessary to determine pathological behaviour in load tests and
responses.

Updates #14
Updates #16
odeke-em added a commit that referenced this issue Dec 19, 2022
This change uses our fork of tm-load-test which provides
latency and bytes rankings for each second and will allow us
to visualize percentile graphs which exhibit actual behavior
necessary to determine pathological behaviour in load tests and
responses.

Updates #14
Updates #16
odeke-em added a commit that referenced this issue Dec 20, 2022
Updates the protobuf definitions to have per point data
that can then be used by the user interface.

Updates #14
Updates #16
willpoint added a commit that referenced this issue Dec 20, 2022
willpoint added a commit that referenced this issue Dec 20, 2022
updates #14

TODO(uzo) complete the flow using the protobuf definitions and server response
odeke-em added a commit that referenced this issue Dec 22, 2022
Updates the protobuf definitions to have per point data
that can then be used by the user interface.

Updates #14
Updates #16
kirbyquerby added a commit that referenced this issue Dec 23, 2022
descPercentile and bucketizedPerSecond need to be exported for this change to compile. I don't have push access to the tm-load-test fork, but here's where the files are:
https://github.com/orijtech/tm-load-test/blob/d37154798c88c311e880eb23ec799d09e04cb44b/pkg/loadtest/transactor.go#L271-L285

Updates #14
Updates #16
odeke-em pushed a commit that referenced this issue Dec 23, 2022
descPercentile and bucketizedPerSecond need to be exported for this change to compile. I don't have push access to the tm-load-test fork, but here's where the files are:
https://github.com/orijtech/tm-load-test/blob/d37154798c88c311e880eb23ec799d09e04cb44b/pkg/loadtest/transactor.go#L271-L285

Updates #14
Updates #16
kirbyquerby added a commit that referenced this issue Dec 23, 2022
* proto/orijtech: define per point data

Updates the protobuf definitions to have per point data
that can then be used by the user interface.

Updates #14
Updates #16

* Document proto fields + regenerate protos

* server: populate new RunLoadtestResponse fields

descPercentile and bucketizedPerSecond need to be exported for this change to compile. I don't have push access to the tm-load-test fork, but here's where the files are:
https://github.com/orijtech/tm-load-test/blob/d37154798c88c311e880eb23ec799d09e04cb44b/pkg/loadtest/transactor.go#L271-L285

Updates #14
Updates #16

* fix cosmos-sdk version

The PR this change is based on changed the cosmos-sdk version, which causes a build failure

* regenerate protos + update go modules + fix warnings

* use correct value for percentile latency and bytes sent

Co-authored-by: Nathan Dias <[email protected]>