
backend: get raw values for the various dimensions and present them to the UI for rendering #14

Open
odeke-em opened this issue Dec 16, 2022 · 4 comments

@odeke-em
Member

Right now the informalsystems/tm-load-test code computes averages of the values and displays a summary, due to the limited user interface of a CLI. We need to be able to show percentiles like we do for QPounder.

@kirbyquerby
Contributor

kirbyquerby commented Dec 16, 2022

Of the current dimensions we return:

message RunLoadtestResponse {
// The total number of transactions sent.
// Corresponds to total_txs in tm-load-test.
int64 total_txs = 1;
// The total time taken to send `total_txs` transactions.
// Corresponds to total_time in tm-load-test.
google.protobuf.Duration total_time = 2;
// The cumulative number of bytes sent as transactions.
// Corresponds to total_bytes in tm-load-test.
int64 total_bytes = 3;
// The rate at which transactions were submitted (tx/sec).
// Corresponds to avg_tx_rate in tm-load-test.
double avg_txs_per_second = 4;
// The rate at which data was transmitted in transactions (bytes/sec).
// Corresponds to avg_data_rate in tm-load-test.
double avg_bytes_per_second = 5;
}

There are three totals and two average rates. QPounder only shows percentiles for request latency. What metrics are you looking for, and which ones do you want percentiles for? Do you want a new metric added to track percentiles for request latency? Do you want graphs over time for certain metrics as well?

If we want to track latency, does it still make sense to track it for broadcast_tx_async, which returns immediately?

@odeke-em
Member Author

So to load test an application, we need to ensure that users can see minimums and maximums for transactions processed as well as bytes processed, which then allows them to try out changes like removing or adding logging while still seeing the extremes. Averages add a lot of noise and distortion when there is a lot of variance in the values. As for new metrics, great question, but perhaps we can discuss those after just using the current ones. Please see this suggestion of a graph:
[screenshot: suggested graph]

How to extract those metrics

This is a perfect case for using OpenCensus and then our custom exporter. For transactions processed we can use a count aggregation, and for bytes processed and received we can use a distribution aggregation, as sketched below.
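
For illustration, here is a minimal sketch of that approach with OpenCensus in Go. The measure names, view names, and bucket bounds are hypothetical placeholders, not anything that exists in the codebase:

package metrics

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

var (
	// Hypothetical measures for the dimensions discussed above.
	mTxs   = stats.Int64("loadtest/txs", "Transactions sent", stats.UnitDimensionless)
	mBytes = stats.Int64("loadtest/tx_bytes", "Bytes sent per transaction", stats.UnitBytes)
)

var views = []*view.View{
	{
		Name:        "loadtest/txs_count",
		Description: "Count of transactions processed",
		Measure:     mTxs,
		Aggregation: view.Count(),
	},
	{
		Name:        "loadtest/tx_bytes_distribution",
		Description: "Distribution of bytes per transaction",
		Measure:     mBytes,
		// Bucket bounds are placeholders; tune them to expected tx sizes.
		Aggregation: view.Distribution(0, 64, 256, 1024, 4096, 16384),
	},
}

// Register wires the views into OpenCensus so a custom exporter can read them.
func Register() {
	if err := view.Register(views...); err != nil {
		log.Fatalf("failed to register views: %v", err)
	}
}

// Record would be called once per transaction sent.
func Record(ctx context.Context, txBytes int) {
	stats.Record(ctx, mTxs.M(1), mBytes.M(int64(txBytes)))
}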

odeke-em added a commit to orijtech/tm-load-test that referenced this issue Dec 18, 2022
This change returns more informative stats with p50, p75, p90, p95, p99
values for the latencies which massively help in seeing the actual
performance of a node. These values are useful to properly visualize
the processing power maximums and minimums plus load tapers.

Updates orijtech/cosmosloadtester#14
@odeke-em
Member Author

@kirbyquerby @willpoint before my flight to Canada today, while on a layover in Los Angeles, I sat down and explored a bunch of options like using OpenCensus/OpenTelemetry, but their APIs are now too convoluted, so I instead rolled out simple stats in https://github.com/orijtech/tm-load-test/releases/tag/vorijtech-1.0.0 per orijtech/tm-load-test@d2a5d18. Now, if we use this diff, we can see more informative stats like this:

diff --git a/go.mod b/go.mod
index 74a202f..ae6d99b 100644
--- a/go.mod
+++ b/go.mod
@@ -9,7 +9,7 @@ require (
 	github.com/informalsystems/tm-load-test v1.0.0
 	github.com/lib/pq v1.10.6
 	github.com/sirupsen/logrus v1.9.0
-	go.opencensus.io v0.23.0
+	go.opencensus.io v0.24.0
 	google.golang.org/genproto v0.0.0-20221207170731-23e4bf6bdc37
 	google.golang.org/grpc v1.51.0
 	google.golang.org/protobuf v1.28.2-0.20220831092852-f930b1dc76e8
@@ -96,7 +96,7 @@ require (
 	github.com/spf13/jwalterweatherman v1.1.0 // indirect
 	github.com/spf13/pflag v1.0.5 // indirect
 	github.com/spf13/viper v1.13.0 // indirect
-	github.com/stretchr/testify v1.8.0 // indirect
+	github.com/stretchr/testify v1.8.1 // indirect
 	github.com/subosito/gotenv v1.4.1 // indirect
 	github.com/syndtr/goleveldb v1.0.1-0.20210819022825-2ae1ddf74ef7 // indirect
 	github.com/tendermint/btcd v0.1.1 // indirect
@@ -121,4 +121,6 @@ require (
 
 replace github.com/gogo/protobuf => github.com/regen-network/protobuf v1.3.3-alpha.regen.1
 
+replace github.com/informalsystems/tm-load-test => github.com/orijtech/tm-load-test v1.0.1-0.20221218023019-d2a5d1861a00
+
 // replace github.com/informalsystems/tm-load-test => /home/nathan/Documents/tm-load-test
diff --git a/server/server.go b/server/server.go
index 621bd33..5c416d8 100644
--- a/server/server.go
+++ b/server/server.go
@@ -3,6 +3,7 @@ package server
 import (
 	"context"
 	"encoding/csv"
+	"encoding/json"
 	"fmt"
 	"os"
 	"strconv"
@@ -75,11 +76,19 @@ func (s *Server) RunLoadtest(ctx context.Context, req *loadtestpb.RunLoadtestReq
 		return nil, status.Errorf(codes.InvalidArgument, "invalid input: %v", err)
 	}
 
-	err = loadtest.ExecuteStandalone(cfg)
+	psL, err := loadtest.ExecuteStandaloneWithStats(cfg)
 	if err != nil {
 		return nil, err
 	}
 
+	// TODO: Send over the actual values of psL to the UI
+	// instead of the CSV parsing down below.
+	blob, err := json.MarshalIndent(psL, "", "  ")
+	if err != nil {
+		return nil, err
+	}
+	println(string(blob))
+
 	f, err := os.Open(statsOutputFilePath)
 	if err != nil {
 		return nil, fmt.Errorf("failed to open stats output file: %w", err)

UI request

[screenshot: UI request, 2022-12-17]

Result

[
  {
    "avg_bytes_per_sec": 11521.651142520457,
    "avg_tx_per_sec": 2880.4127856301143,
    "total_time": 39032947139,
    "total_bytes": 449724,
    "total_txs": 112431,
    "p50": {
      "at_ns": 10581297990,
      "at_str": "10.58129799s",
      "latency": 7037
    },
    "p75": {
      "at_ns": 28128359648,
      "at_str": "28.128359648s",
      "latency": 11295
    },
    "p90": {
      "at_ns": 20108712942,
      "at_str": "20.108712942s",
      "latency": 21267
    },
    "p95": {
      "at_ns": 31586764579,
      "at_str": "31.586764579s",
      "latency": 32603
    },
    "p99": {
      "at_ns": 4631386768,
      "at_str": "4.631386768s",
      "latency": 90098
    },
    "per_sec": [
      {
        "sec": 0,
        "qps": 8838,
        "bytes": 35352
      },
      {
        "sec": 1,
        "qps": 2911,
        "bytes": 11644
      },
      {
        "sec": 2,
        "qps": 5342,
        "bytes": 21368
      },
      {
        "sec": 3,
        "qps": 3074,
        "bytes": 12296
      },
      {
        "sec": 4,
        "qps": 9218,
        "bytes": 36872
      },
      {
        "sec": 5,
        "qps": 0,
        "bytes": 0
      },
      {
        "sec": 6,
        "qps": 0,
        "bytes": 0
      },
      {
        "sec": 7,
        "qps": 10001,
        "bytes": 40004
      },
      {
        "sec": 8,
        "qps": 5403,
        "bytes": 21612
      },
      {
        "sec": 9,
        "qps": 3244,
        "bytes": 12976
      },
      {
        "sec": 10,
        "qps": 3407,
        "bytes": 13628
      },
      {
        "sec": 11,
        "qps": 3730,
        "bytes": 14920
      },
      {
        "sec": 12,
        "qps": 2433,
        "bytes": 9732
      },
      {
        "sec": 13,
        "qps": 2921,
        "bytes": 11684
      },
      {
        "sec": 14,
        "qps": 2595,
        "bytes": 10380
      },
      {
        "sec": 15,
        "qps": 2433,
        "bytes": 9732
      },
      {
        "sec": 16,
        "qps": 2758,
        "bytes": 11032
      },
      {
        "sec": 17,
        "qps": 2596,
        "bytes": 10384
      },
      {
        "sec": 18,
        "qps": 2108,
        "bytes": 8432
      },
      {
        "sec": 19,
        "qps": 2271,
        "bytes": 9084
      },
      {
        "sec": 20,
        "qps": 2758,
        "bytes": 11032
      },
      {
        "sec": 21,
        "qps": 3568,
        "bytes": 14272
      },
      {
        "sec": 22,
        "qps": 2434,
        "bytes": 9736
      },
      {
        "sec": 23,
        "qps": 2109,
        "bytes": 8436
      },
      {
        "sec": 24,
        "qps": 2595,
        "bytes": 10380
      },
      {
        "sec": 25,
        "qps": 2271,
        "bytes": 9084
      },
      {
        "sec": 26,
        "qps": 1946,
        "bytes": 7784
      },
      {
        "sec": 27,
        "qps": 1947,
        "bytes": 7788
      },
      {
        "sec": 28,
        "qps": 1460,
        "bytes": 5840
      },
      {
        "sec": 29,
        "qps": 1135,
        "bytes": 4540
      },
      {
        "sec": 30,
        "qps": 1623,
        "bytes": 6492
      },
      {
        "sec": 31,
        "qps": 1459,
        "bytes": 5836
      },
      {
        "sec": 32,
        "qps": 1947,
        "bytes": 7788
      },
      {
        "sec": 33,
        "qps": 1622,
        "bytes": 6488
      },
      {
        "sec": 34,
        "qps": 2109,
        "bytes": 8436
      },
      {
        "sec": 35,
        "qps": 2271,
        "bytes": 9084
      },
      {
        "sec": 36,
        "qps": 1622,
        "bytes": 6488
      },
      {
        "sec": 37,
        "qps": 2272,
        "bytes": 9088
      }
    ]
  }
]

With this we can now graph time-series data as suggested above, showing both QPS and bytes/second, as well as the latency progression at the various percentiles.
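
For reference, the JSON above could be modeled with plain Go structs on the server or UI-facing side, which would also let us act on the TODO in the diff and forward psL directly instead of re-parsing the CSV. The type names here are hypothetical; only the field names follow the keys emitted by the fork:

package server

// PercentilePoint mirrors the p50..p99 entries in the stats output.
type PercentilePoint struct {
	AtNs    int64  `json:"at_ns"`
	AtStr   string `json:"at_str"`
	Latency int64  `json:"latency"`
}

// SecondSample mirrors one entry of the per_sec array.
type SecondSample struct {
	Sec   int   `json:"sec"`
	QPS   int64 `json:"qps"`
	Bytes int64 `json:"bytes"`
}

// LoadtestStats mirrors one element of the top-level JSON array.
type LoadtestStats struct {
	AvgBytesPerSec float64         `json:"avg_bytes_per_sec"`
	AvgTxPerSec    float64         `json:"avg_tx_per_sec"`
	TotalTime      int64           `json:"total_time"`
	TotalBytes     int64           `json:"total_bytes"`
	TotalTxs       int64           `json:"total_txs"`
	P50            PercentilePoint `json:"p50"`
	P75            PercentilePoint `json:"p75"`
	P90            PercentilePoint `json:"p90"`
	P95            PercentilePoint `json:"p95"`
	P99            PercentilePoint `json:"p99"`
	PerSec         []SecondSample  `json:"per_sec"`
}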

@odeke-em
Member Author

I've mailed out a much better modification in our fork of tm-load-test that now shows rankings per second and thus allows visualizing percentiles: https://github.com/orijtech/tm-load-test/releases/tag/vorijtech-1.1.0. For example:

[
  {
    "avg_bytes_per_sec": 422.15393825983676,
    "avg_tx_per_sec": 105.53848456495919,
    "total_time": 18002911524,
    "total_bytes": 7600,
    "total_txs": 1900,
    "p50": {
      "at_ns": 12000740850,
      "at_str": "12.00074085s",
      "latency": 16942,
      "size": 4
    },
    "p75": {
      "at_ns": 2001010940,
      "at_str": "2.00101094s",
      "latency": 25159,
      "size": 4
    },
    "p90": {
      "at_ns": 15002840984,
      "at_str": "15.002840984s",
      "latency": 40838,
      "size": 4
    },
    "p95": {
      "at_ns": 16000796152,
      "at_str": "16.000796152s",
      "latency": 54770,
      "size": 4
    },
    "p99": {
      "at_ns": 3000639229,
      "at_str": "3.000639229s",
      "latency": 88433,
      "size": 4
    },
    "per_sec": [
      {
        "sec": 0,
        "qps": 100,
        "bytes": 400,
        "bytes_rankings": {
          "p50": {
            "at_ns": 3025761,
            "at_str": "3.025761ms",
            "size": 4
          },
          "p75": {
            "at_ns": 3126893,
            "at_str": "3.126893ms",
            "size": 4
          },
          "p90": {
            "at_ns": 710362,
            "at_str": "710.362µs",
            "size": 4
          },
          "p95": {
            "at_ns": 2010957,
            "at_str": "2.010957ms",
            "size": 4
          },
          "p99": {
            "at_ns": 633480,
            "at_str": "633.48µs",
            "size": 4
          }
        },
        "latency_rankings": {
          "p50": {
            "at_ns": 3025761,
            "at_str": "3.025761ms",
            "latency": 17800
          },
          "p75": {
            "at_ns": 3126893,
            "at_str": "3.126893ms",
            "latency": 26296
          },
          "p90": {
            "at_ns": 710362,
            "at_str": "710.362µs",
            "latency": 42240
          },
          "p95": {
            "at_ns": 2010957,
            "at_str": "2.010957ms",
            "latency": 56953
          },
          "p99": {
            "at_ns": 633480,
            "at_str": "633.48µs",
            "latency": 242929
          }
        }
      },
      {
        "sec": 1,
        "qps": 100,
        "bytes": 400,
        "bytes_rankings": {
          "p50": {
            "at_ns": 1002111655,
            "at_str": "1.002111655s",
            "size": 4
          },
          "p75": {
            "at_ns": 1002353024,
            "at_str": "1.002353024s",
            "size": 4
          },
          "p90": {
            "at_ns": 1002326247,
            "at_str": "1.002326247s",
            "size": 4
          },
          "p95": {
            "at_ns": 1002397596,
            "at_str": "1.002397596s",
            "size": 4
          },
          "p99": {
            "at_ns": 1001288678,
            "at_str": "1.001288678s",
            "size": 4
          }
        },
        "latency_rankings": {
          "p50": {
            "at_ns": 1002111655,
            "at_str": "1.002111655s",
            "latency": 14099
          },
          "p75": {
            "at_ns": 1002353024,
            "at_str": "1.002353024s",
            "latency": 23419
          },
          "p90": {
            "at_ns": 1002326247,
            "at_str": "1.002326247s",
            "latency": 39388
          },
          "p95": {
            "at_ns": 1002397596,
            "at_str": "1.002397596s",
            "latency": 42895
          },
          "p99": {
            "at_ns": 1001288678,
            "at_str": "1.001288678s",
            "latency": 81521
          }
        }
      }
    ]
  }
]
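
Since each per-second sample in the v1.1.0 shape carries its own latency_rankings and bytes_rankings, a percentile-over-time series can be read straight out of per_sec. A small sketch under the same assumptions as above (type names are hypothetical, field names follow the JSON keys):

package server

// RankingPoint matches the entries under bytes_rankings / latency_rankings.
type RankingPoint struct {
	AtNs    int64  `json:"at_ns"`
	AtStr   string `json:"at_str"`
	Latency int64  `json:"latency,omitempty"`
	Size    int64  `json:"size,omitempty"`
}

// SecondSampleV2 mirrors a per_sec entry in the v1.1.0 output.
type SecondSampleV2 struct {
	Sec             int                     `json:"sec"`
	QPS             int64                   `json:"qps"`
	Bytes           int64                   `json:"bytes"`
	BytesRankings   map[string]RankingPoint `json:"bytes_rankings"`
	LatencyRankings map[string]RankingPoint `json:"latency_rankings"`
}

// p99LatencySeries pulls one percentile's latency per second, ready for charting.
func p99LatencySeries(perSec []SecondSampleV2) []int64 {
	out := make([]int64, 0, len(perSec))
	for _, s := range perSec {
		out = append(out, s.LatencyRankings["p99"].Latency)
	}
	return out
}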

odeke-em added a commit that referenced this issue Dec 18, 2022
This change uses our fork of tm-load-test which provides
latency and bytes rankings for each second and will allow us
to visualize percentile graphs which exhibit actual behavior
necessary to determine pathological behaviour in load tests and
responses.

Updates #14
Updates #16
odeke-em added a commit that referenced this issue Dec 19, 2022
This change uses our fork of tm-load-test which provides
latency and bytes rankings for each second and will allow us
to visualize percentile graphs which exhibit actual behavior
necessary to determine pathological behaviour in load tests and
responses.

Updates #14
Updates #16
odeke-em added a commit that referenced this issue Dec 20, 2022
Updates the protobuf definitions to have per point data
that can then be used by the user interface.

Updates #14
Updates #16
willpoint added a commit that referenced this issue Dec 20, 2022
willpoint added a commit that referenced this issue Dec 20, 2022
updates #14

TODO(uzo) complete the flow using the protobuf definitions and server response
odeke-em added a commit that referenced this issue Dec 22, 2022
Updates the protobuf definitions to have per point data
that can then be used by the user interface.

Updates #14
Updates #16
kirbyquerby added a commit that referenced this issue Dec 23, 2022
descPercentile and bucketizedPerSecond need to be exported for this change to compile. I don't have push access to the tm-load-test fork, but here's where the files are:
https://github.com/orijtech/tm-load-test/blob/d37154798c88c311e880eb23ec799d09e04cb44b/pkg/loadtest/transactor.go#L271-L285

Updates #14
Updates #16
odeke-em pushed a commit that referenced this issue Dec 23, 2022
descPercentile and bucketizedPerSecond need to be exported for this change to compile. I don't have push access to the tm-load-test fork, but here's where the files are:
https://github.com/orijtech/tm-load-test/blob/d37154798c88c311e880eb23ec799d09e04cb44b/pkg/loadtest/transactor.go#L271-L285

Updates #14
Updates #16
kirbyquerby added a commit that referenced this issue Dec 23, 2022
* proto/orijtech: define per point data

Updates the protobuf definitions to have per point data
that can then be used by the user interface.

Updates #14
Updates #16

* Document proto fields + regenerate protos

* server: populate new RunLoadtestResponse fields

descPercentile and bucketizedPerSecond need to be exported for this change to compile. I don't have push access to the tm-load-test fork, but here's where the files are:
https://github.com/orijtech/tm-load-test/blob/d37154798c88c311e880eb23ec799d09e04cb44b/pkg/loadtest/transactor.go#L271-L285

Updates #14
Updates #16

* fix cosmos-sdk version

The PR this change is based on changed the cosmos-sdk version, which causes a build failure

* regenerate protos + update go modules + fix warnings

* use correct value for percentile latency and bytes sent

Co-authored-by: Nathan Dias <[email protected]>