Add retry policy for uploading requests to agent #7824

jpbempel · 2024-10-22T12:28:57Z

What Does This Do

Add a mechanism to retry on network failure or 500 response code. The retry policy store the call made to OkHttp to track number of retries made and the maximum retries allowed by failure categories: network or 500.
3 retries are allowed for snapshots, 10 for probe statuses & symdb

Motivation

Additional Notes

Contributor Checklist

Format the title according the contribution guidelines
Assign the type: and (comp: or inst:) labels in addition to any usefull labels
Squash your commits prior merging or merge using GitHub's Squash and merge
Don't use close, fix or any linking keywords when referencing an issue.
Use solves instead, and assign the PR milestone to the issue
Update the public documentation in case of new configuration flag or behavior

Jira ticket: [PROJ-IDENT]

Add a mechanism to retry on network failure or 500 response code. The retry policy store the call made to OkHttp to track number of retries made and the maximum retries allowed by failure categories: network or 500. 3 retries are allowed for snapshots, 10 for probe statuses & symdb

pr-commenter · 2024-10-22T12:40:37Z

Debugger benchmarks

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
ci_job_date	1729671190	1729671565
end_time	2024-10-23T08:14:25	2024-10-23T08:20:39
git_branch	master	jpbempel/retry-policy
git_commit_sha	`179c62a`	`9be11d1`
start_time	2024-10-23T08:13:11	2024-10-23T08:19:26

See matching parameters

	Baseline	Candidate
ci_job_id	680981191	680981191
ci_pipeline_id	47208441	47208441
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
git_commit_date	1729670552	1729670552

Summary

Found 1 performance improvements and 0 performance regressions! Performance is the same for 8 metrics, 6 unstable metrics.

scenario	Δ mean agg_http_req_duration_min	Δ mean agg_http_req_duration_p50	Δ mean agg_http_req_duration_p75	Δ mean agg_http_req_duration_p99	Δ mean throughput
scenario:basic	better [-17.652µs; -5.268µs] or [-6.389%; -1.907%]	same	same	unstable [-49.422µs; +38.986µs] or [-9.143%; +7.212%]	unstable [-74.527op/s; +209.479op/s] or [-2.907%; +8.170%]

See unchanged results

scenario	Δ mean agg_http_req_duration_min	Δ mean agg_http_req_duration_p50	Δ mean agg_http_req_duration_p75	Δ mean agg_http_req_duration_p99	Δ mean throughput
scenario:noprobe	unstable [-24.714µs; +11.994µs] or [-9.420%; +4.572%]	unstable [-40.384µs; +18.873µs] or [-13.377%; +6.251%]	unstable [-51.796µs; +30.701µs] or [-16.418%; +9.732%]	unstable [-104.199µs; +83.426µs] or [-17.201%; +13.772%]	same
scenario:loop	unsure [+0.475µs; +15.761µs] or [+0.005%; +0.156%]	same	same	same	same

Request duration reports for reports

gantt
    title reports - request duration [CI 0.99] : candidate=None, baseline=None
    dateFormat X
    axisFormat %s
section baseline
noprobe (301.898 µs) : 269, 335
.   : milestone, 302,
basic (303.834 µs) : 294, 314
.   : milestone, 304,
loop (10.293 ms) : 10247, 10339
.   : milestone, 10293,
section candidate
noprobe (291.142 µs) : 269, 314
.   : milestone, 291,
basic (293.792 µs) : 284, 304
.   : milestone, 294,
loop (10.292 ms) : 10258, 10326
.   : milestone, 10292,

baseline results

Scenario	Request median duration [CI 0.99]
noprobe	301.898 µs [268.836 µs, 334.959 µs]
basic	303.834 µs [294.108 µs, 313.559 µs]
loop	10.293 ms [10.247 ms, 10.339 ms]

candidate results

Scenario	Request median duration [CI 0.99]
noprobe	291.142 µs [268.676 µs, 313.607 µs]
basic	293.792 µs [283.733 µs, 303.851 µs]
loop	10.292 ms [10.258 ms, 10.326 ms]

pr-commenter · 2024-10-22T12:59:13Z

Benchmarks

Startup

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	jpbempel/retry-policy
git_commit_date	1729609825	1729670552
git_commit_sha	`179c62a`	`9be11d1`
release_version	1.42.0-SNAPSHOT~179c62ad1e	1.42.0-SNAPSHOT~9be11d1ad9

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1729673047	1729673047
ci_job_id	680981185	680981185
ci_pipeline_id	47208441	47208441
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
module	Agent	Agent
parent	None	None
variant	iast	iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 51 metrics, 12 unstable metrics.

Startup time reports for insecure-bank

gantt
    title insecure-bank - global startup overhead: candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.089 s) : 0, 1088878
Total [baseline] (8.594 s) : 0, 8593847
Agent [candidate] (1.078 s) : 0, 1077761
Total [candidate] (8.575 s) : 0, 8574848
section iast
Agent [baseline] (1.222 s) : 0, 1222078
Total [baseline] (9.216 s) : 0, 9216278
Agent [candidate] (1.212 s) : 0, 1211849
Total [candidate] (9.13 s) : 0, 9130102
section iast_HARDCODED_SECRET_DISABLED
Agent [baseline] (1.213 s) : 0, 1212573
Total [baseline] (9.136 s) : 0, 9135552
Agent [candidate] (1.223 s) : 0, 1222654
Total [candidate] (9.126 s) : 0, 9126248
section iast_TELEMETRY_OFF
Agent [baseline] (1.224 s) : 0, 1224186
Total [baseline] (9.155 s) : 0, 9155386
Agent [candidate] (1.209 s) : 0, 1208798
Total [candidate] (9.174 s) : 0, 9174415

baseline results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.089 s	-
Agent	iast	1.222 s	133.201 ms (12.2%)
Agent	iast_HARDCODED_SECRET_DISABLED	1.213 s	123.696 ms (11.4%)
Agent	iast_TELEMETRY_OFF	1.224 s	135.308 ms (12.4%)
Total	tracing	8.594 s	-
Total	iast	9.216 s	622.431 ms (7.2%)
Total	iast_HARDCODED_SECRET_DISABLED	9.136 s	541.704 ms (6.3%)
Total	iast_TELEMETRY_OFF	9.155 s	561.539 ms (6.5%)

candidate results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.078 s	-
Agent	iast	1.212 s	134.088 ms (12.4%)
Agent	iast_HARDCODED_SECRET_DISABLED	1.223 s	144.893 ms (13.4%)
Agent	iast_TELEMETRY_OFF	1.209 s	131.037 ms (12.2%)
Total	tracing	8.575 s	-
Total	iast	9.13 s	555.254 ms (6.5%)
Total	iast_HARDCODED_SECRET_DISABLED	9.126 s	551.4 ms (6.4%)
Total	iast_TELEMETRY_OFF	9.174 s	599.566 ms (7.0%)

gantt
    title insecure-bank - break down per module: candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (693.786 ms) : 0, 693786
BytebuddyAgent [candidate] (684.915 ms) : 0, 684915
GlobalTracer [baseline] (315.989 ms) : 0, 315989
GlobalTracer [candidate] (312.814 ms) : 0, 312814
AppSec [baseline] (54.78 ms) : 0, 54780
AppSec [candidate] (53.951 ms) : 0, 53951
Remote Config [baseline] (672.158 µs) : 0, 672
Remote Config [candidate] (669.619 µs) : 0, 670
Telemetry [baseline] (9.772 ms) : 0, 9772
Telemetry [candidate] (11.678 ms) : 0, 11678
section iast
BytebuddyAgent [baseline] (814.983 ms) : 0, 814983
BytebuddyAgent [candidate] (807.229 ms) : 0, 807229
GlobalTracer [baseline] (305.898 ms) : 0, 305898
GlobalTracer [candidate] (304.569 ms) : 0, 304569
AppSec [baseline] (58.509 ms) : 0, 58509
AppSec [candidate] (57.857 ms) : 0, 57857
IAST [baseline] (20.851 ms) : 0, 20851
IAST [candidate] (20.605 ms) : 0, 20605
Remote Config [baseline] (617.876 µs) : 0, 618
Remote Config [candidate] (600.549 µs) : 0, 601
Telemetry [baseline] (7.235 ms) : 0, 7235
Telemetry [candidate] (7.116 ms) : 0, 7116
section iast_HARDCODED_SECRET_DISABLED
BytebuddyAgent [baseline] (808.931 ms) : 0, 808931
BytebuddyAgent [candidate] (815.209 ms) : 0, 815209
GlobalTracer [baseline] (303.84 ms) : 0, 303840
GlobalTracer [candidate] (306.515 ms) : 0, 306515
AppSec [baseline] (57.636 ms) : 0, 57636
AppSec [candidate] (58.355 ms) : 0, 58355
IAST [baseline] (20.663 ms) : 0, 20663
IAST [candidate] (20.772 ms) : 0, 20772
Remote Config [baseline] (604.326 µs) : 0, 604
Remote Config [candidate] (610.439 µs) : 0, 610
Telemetry [baseline] (7.045 ms) : 0, 7045
Telemetry [candidate] (7.22 ms) : 0, 7220
section iast_TELEMETRY_OFF
BytebuddyAgent [baseline] (816.111 ms) : 0, 816111
BytebuddyAgent [candidate] (804.351 ms) : 0, 804351
GlobalTracer [baseline] (307.543 ms) : 0, 307543
GlobalTracer [candidate] (304.148 ms) : 0, 304148
AppSec [baseline] (56.77 ms) : 0, 56770
AppSec [candidate] (58.451 ms) : 0, 58451
IAST [baseline] (21.123 ms) : 0, 21123
IAST [candidate] (20.383 ms) : 0, 20383
Remote Config [baseline] (613.299 µs) : 0, 613
Remote Config [candidate] (608.381 µs) : 0, 608
Telemetry [baseline] (7.992 ms) : 0, 7992
Telemetry [candidate] (7.054 ms) : 0, 7054

Startup time reports for petclinic

gantt
    title petclinic - global startup overhead: candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.087 s) : 0, 1087356
Total [baseline] (10.499 s) : 0, 10498981
Agent [candidate] (1.085 s) : 0, 1084768
Total [candidate] (10.525 s) : 0, 10525150
section appsec
Agent [baseline] (1.214 s) : 0, 1213787
Total [baseline] (10.605 s) : 0, 10604634
Agent [candidate] (1.212 s) : 0, 1211504
Total [candidate] (10.617 s) : 0, 10616868
section iast
Agent [baseline] (1.205 s) : 0, 1204545
Total [baseline] (10.919 s) : 0, 10919069
Agent [candidate] (1.204 s) : 0, 1203986
Total [candidate] (10.897 s) : 0, 10897296
section profiling
Agent [baseline] (1.283 s) : 0, 1282772
Total [baseline] (10.709 s) : 0, 10709058
Agent [candidate] (1.286 s) : 0, 1285849
Total [candidate] (10.753 s) : 0, 10752664

baseline results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.087 s	-
Agent	appsec	1.214 s	126.431 ms (11.6%)
Agent	iast	1.205 s	117.189 ms (10.8%)
Agent	profiling	1.283 s	195.416 ms (18.0%)
Total	tracing	10.499 s	-
Total	appsec	10.605 s	105.653 ms (1.0%)
Total	iast	10.919 s	420.088 ms (4.0%)
Total	profiling	10.709 s	210.077 ms (2.0%)

candidate results

Module	Variant	Duration	Δ tracing
Agent	tracing	1.085 s	-
Agent	appsec	1.212 s	126.737 ms (11.7%)
Agent	iast	1.204 s	119.218 ms (11.0%)
Agent	profiling	1.286 s	201.081 ms (18.5%)
Total	tracing	10.525 s	-
Total	appsec	10.617 s	91.718 ms (0.9%)
Total	iast	10.897 s	372.146 ms (3.5%)
Total	profiling	10.753 s	227.514 ms (2.2%)

gantt
    title petclinic - break down per module: candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (692.118 ms) : 0, 692118
BytebuddyAgent [candidate] (691.457 ms) : 0, 691457
GlobalTracer [baseline] (315.399 ms) : 0, 315399
GlobalTracer [candidate] (314.891 ms) : 0, 314891
AppSec [baseline] (54.45 ms) : 0, 54450
AppSec [candidate] (54.106 ms) : 0, 54106
Remote Config [baseline] (661.011 µs) : 0, 661
Remote Config [candidate] (670.476 µs) : 0, 670
Telemetry [baseline] (10.898 ms) : 0, 10898
Telemetry [candidate] (9.818 ms) : 0, 9818
section appsec
BytebuddyAgent [baseline] (704.381 ms) : 0, 704381
BytebuddyAgent [candidate] (703.096 ms) : 0, 703096
GlobalTracer [baseline] (310.787 ms) : 0, 310787
GlobalTracer [candidate] (310.303 ms) : 0, 310303
AppSec [baseline] (165.938 ms) : 0, 165938
AppSec [candidate] (166.324 ms) : 0, 166324
Remote Config [baseline] (634.329 µs) : 0, 634
Remote Config [candidate] (632.668 µs) : 0, 633
Telemetry [baseline] (7.814 ms) : 0, 7814
Telemetry [candidate] (7.834 ms) : 0, 7834
IAST [baseline] (21.038 ms) : 0, 21038
IAST [candidate] (19.196 ms) : 0, 19196
section iast
BytebuddyAgent [baseline] (802.748 ms) : 0, 802748
BytebuddyAgent [candidate] (802.563 ms) : 0, 802563
GlobalTracer [baseline] (301.659 ms) : 0, 301659
GlobalTracer [candidate] (302.085 ms) : 0, 302085
AppSec [baseline] (57.327 ms) : 0, 57327
AppSec [candidate] (56.493 ms) : 0, 56493
Remote Config [baseline] (601.234 µs) : 0, 601
Remote Config [candidate] (604.702 µs) : 0, 605
Telemetry [baseline] (7.094 ms) : 0, 7094
Telemetry [candidate] (7.959 ms) : 0, 7959
IAST [baseline] (21.361 ms) : 0, 21361
IAST [candidate] (20.516 ms) : 0, 20516
section profiling
BytebuddyAgent [baseline] (685.056 ms) : 0, 685056
BytebuddyAgent [candidate] (685.917 ms) : 0, 685917
GlobalTracer [baseline] (399.159 ms) : 0, 399159
GlobalTracer [candidate] (399.683 ms) : 0, 399683
AppSec [baseline] (54.845 ms) : 0, 54845
AppSec [candidate] (55.196 ms) : 0, 55196
Remote Config [baseline] (664.553 µs) : 0, 665
Remote Config [candidate] (665.96 µs) : 0, 666
Telemetry [baseline] (12.427 ms) : 0, 12427
Telemetry [candidate] (11.868 ms) : 0, 11868
ProfilingAgent [baseline] (91.54 ms) : 0, 91540
ProfilingAgent [candidate] (93.227 ms) : 0, 93227
Profiling [baseline] (91.563 ms) : 0, 91563
Profiling [candidate] (93.25 ms) : 0, 93250

Load

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
end_time	2024-10-23T08:14:28	2024-10-23T08:21:20
git_branch	master	jpbempel/retry-policy
git_commit_date	1729609825	1729670552
git_commit_sha	`179c62a`	`9be11d1`
release_version	1.42.0-SNAPSHOT~179c62ad1e	1.42.0-SNAPSHOT~9be11d1ad9
start_time	2024-10-23T08:14:15	2024-10-23T08:21:07

See matching parameters

	Baseline	Candidate
application	insecure-bank	insecure-bank
ci_job_date	1729672028	1729672028
ci_job_id	680981186	680981186
ci_pipeline_id	47208441	47208441
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant	iast	iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 17 unstable metrics.

Request duration reports for petclinic

gantt
    title petclinic - request duration [CI 0.99] : candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.341 ms) : 1322, 1360
.   : milestone, 1341,
appsec (1.728 ms) : 1704, 1751
.   : milestone, 1728,
appsec_no_iast (1.702 ms) : 1677, 1726
.   : milestone, 1702,
iast (1.495 ms) : 1472, 1518
.   : milestone, 1495,
profiling (1.499 ms) : 1475, 1523
.   : milestone, 1499,
tracing (1.465 ms) : 1440, 1490
.   : milestone, 1465,
section candidate
no_agent (1.343 ms) : 1325, 1362
.   : milestone, 1343,
appsec (1.716 ms) : 1692, 1741
.   : milestone, 1716,
appsec_no_iast (1.747 ms) : 1723, 1770
.   : milestone, 1747,
iast (1.479 ms) : 1456, 1502
.   : milestone, 1479,
profiling (1.473 ms) : 1450, 1496
.   : milestone, 1473,
tracing (1.482 ms) : 1458, 1506
.   : milestone, 1482,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.341 ms [1.322 ms, 1.36 ms]	-
appsec	1.728 ms [1.704 ms, 1.751 ms]	386.833 µs (28.8%)
appsec_no_iast	1.702 ms [1.677 ms, 1.726 ms]	360.811 µs (26.9%)
iast	1.495 ms [1.472 ms, 1.518 ms]	154.25 µs (11.5%)
profiling	1.499 ms [1.475 ms, 1.523 ms]	158.098 µs (11.8%)
tracing	1.465 ms [1.44 ms, 1.49 ms]	124.197 µs (9.3%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	1.343 ms [1.325 ms, 1.362 ms]	-
appsec	1.716 ms [1.692 ms, 1.741 ms]	372.98 µs (27.8%)
appsec_no_iast	1.747 ms [1.723 ms, 1.77 ms]	403.144 µs (30.0%)
iast	1.479 ms [1.456 ms, 1.502 ms]	135.299 µs (10.1%)
profiling	1.473 ms [1.45 ms, 1.496 ms]	129.538 µs (9.6%)
tracing	1.482 ms [1.458 ms, 1.506 ms]	138.783 µs (10.3%)

Request duration reports for insecure-bank

gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e
    dateFormat X
    axisFormat %s
section baseline
no_agent (366.366 µs) : 347, 386
.   : milestone, 366,
iast (482.512 µs) : 461, 504
.   : milestone, 483,
iast_FULL (558.2 µs) : 537, 579
.   : milestone, 558,
iast_GLOBAL (518.163 µs) : 496, 540
.   : milestone, 518,
iast_HARDCODED_SECRET_DISABLED (485.004 µs) : 464, 506
.   : milestone, 485,
iast_INACTIVE (450.174 µs) : 429, 472
.   : milestone, 450,
iast_TELEMETRY_OFF (482.625 µs) : 460, 505
.   : milestone, 483,
tracing (444.203 µs) : 423, 465
.   : milestone, 444,
section candidate
no_agent (375.284 µs) : 354, 397
.   : milestone, 375,
iast (483.221 µs) : 462, 504
.   : milestone, 483,
iast_FULL (550.67 µs) : 530, 572
.   : milestone, 551,
iast_GLOBAL (524.643 µs) : 503, 546
.   : milestone, 525,
iast_HARDCODED_SECRET_DISABLED (479.223 µs) : 458, 500
.   : milestone, 479,
iast_INACTIVE (449.53 µs) : 428, 471
.   : milestone, 450,
iast_TELEMETRY_OFF (472.274 µs) : 451, 493
.   : milestone, 472,
tracing (446.32 µs) : 425, 468
.   : milestone, 446,

baseline results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	366.366 µs [347.094 µs, 385.637 µs]	-
iast	482.512 µs [461.241 µs, 503.783 µs]	116.146 µs (31.7%)
iast_FULL	558.2 µs [536.996 µs, 579.404 µs]	191.834 µs (52.4%)
iast_GLOBAL	518.163 µs [496.375 µs, 539.95 µs]	151.797 µs (41.4%)
iast_HARDCODED_SECRET_DISABLED	485.004 µs [463.703 µs, 506.304 µs]	118.638 µs (32.4%)
iast_INACTIVE	450.174 µs [428.53 µs, 471.817 µs]	83.808 µs (22.9%)
iast_TELEMETRY_OFF	482.625 µs [460.344 µs, 504.906 µs]	116.259 µs (31.7%)
tracing	444.203 µs [423.221 µs, 465.184 µs]	77.837 µs (21.2%)

candidate results

Variant	Request duration [CI 0.99]	Δ no_agent
no_agent	375.284 µs [354.032 µs, 396.536 µs]	-
iast	483.221 µs [461.972 µs, 504.469 µs]	107.937 µs (28.8%)
iast_FULL	550.67 µs [529.552 µs, 571.789 µs]	175.386 µs (46.7%)
iast_GLOBAL	524.643 µs [503.117 µs, 546.168 µs]	149.359 µs (39.8%)
iast_HARDCODED_SECRET_DISABLED	479.223 µs [457.959 µs, 500.486 µs]	103.939 µs (27.7%)
iast_INACTIVE	449.53 µs [428.211 µs, 470.85 µs]	74.246 µs (19.8%)
iast_TELEMETRY_OFF	472.274 µs [451.331 µs, 493.217 µs]	96.99 µs (25.8%)
tracing	446.32 µs [424.676 µs, 467.963 µs]	71.036 µs (18.9%)

Dacapo

Parameters

	Baseline	Candidate
baseline_or_candidate	baseline	candidate
git_branch	master	jpbempel/retry-policy
git_commit_date	1729609825	1729670552
git_commit_sha	`179c62a`	`9be11d1`
release_version	1.42.0-SNAPSHOT~179c62ad1e	1.42.0-SNAPSHOT~9be11d1ad9

See matching parameters

	Baseline	Candidate
application	biojava	biojava
ci_job_date	1729672557	1729672557
ci_job_id	680981187	680981187
ci_pipeline_id	47208441	47208441
cpu_model	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz	Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant	appsec	appsec

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 12 metrics, 0 unstable metrics.

Execution time for tomcat

gantt
    title tomcat - execution time [CI 0.99] : candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.467 ms) : 1456, 1478
.   : milestone, 1467,
appsec (2.335 ms) : 2294, 2376
.   : milestone, 2335,
iast (2.089 ms) : 2036, 2141
.   : milestone, 2089,
iast_GLOBAL (2.13 ms) : 2078, 2183
.   : milestone, 2130,
profiling (1.958 ms) : 1916, 2000
.   : milestone, 1958,
tracing (1.931 ms) : 1891, 1971
.   : milestone, 1931,
section candidate
no_agent (1.472 ms) : 1461, 1484
.   : milestone, 1472,
appsec (2.365 ms) : 2323, 2407
.   : milestone, 2365,
iast (2.092 ms) : 2040, 2145
.   : milestone, 2092,
iast_GLOBAL (2.14 ms) : 2087, 2193
.   : milestone, 2140,
profiling (1.955 ms) : 1912, 1998
.   : milestone, 1955,
tracing (1.925 ms) : 1885, 1965
.   : milestone, 1925,

baseline results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	1.467 ms [1.456 ms, 1.478 ms]	-
appsec	2.335 ms [2.294 ms, 2.376 ms]	868.301 µs (59.2%)
iast	2.089 ms [2.036 ms, 2.141 ms]	621.723 µs (42.4%)
iast_GLOBAL	2.13 ms [2.078 ms, 2.183 ms]	663.342 µs (45.2%)
profiling	1.958 ms [1.916 ms, 2.0 ms]	491.408 µs (33.5%)
tracing	1.931 ms [1.891 ms, 1.971 ms]	464.068 µs (31.6%)

candidate results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	1.472 ms [1.461 ms, 1.484 ms]	-
appsec	2.365 ms [2.323 ms, 2.407 ms]	892.644 µs (60.6%)
iast	2.092 ms [2.04 ms, 2.145 ms]	619.839 µs (42.1%)
iast_GLOBAL	2.14 ms [2.087 ms, 2.193 ms]	667.578 µs (45.3%)
profiling	1.955 ms [1.912 ms, 1.998 ms]	482.729 µs (32.8%)
tracing	1.925 ms [1.885 ms, 1.965 ms]	452.735 µs (30.7%)

Execution time for biojava

gantt
    title biojava - execution time [CI 0.99] : candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e
    dateFormat X
    axisFormat %s
section baseline
no_agent (15.377 s) : 15377000, 15377000
.   : milestone, 15377000,
appsec (15.269 s) : 15269000, 15269000
.   : milestone, 15269000,
iast (18.908 s) : 18908000, 18908000
.   : milestone, 18908000,
iast_GLOBAL (17.93 s) : 17930000, 17930000
.   : milestone, 17930000,
profiling (15.076 s) : 15076000, 15076000
.   : milestone, 15076000,
tracing (15.071 s) : 15071000, 15071000
.   : milestone, 15071000,
section candidate
no_agent (15.377 s) : 15377000, 15377000
.   : milestone, 15377000,
appsec (15.422 s) : 15422000, 15422000
.   : milestone, 15422000,
iast (19.011 s) : 19011000, 19011000
.   : milestone, 19011000,
iast_GLOBAL (18.344 s) : 18344000, 18344000
.   : milestone, 18344000,
profiling (15.16 s) : 15160000, 15160000
.   : milestone, 15160000,
tracing (15.073 s) : 15073000, 15073000
.   : milestone, 15073000,

baseline results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	15.377 s [15.377 s, 15.377 s]	-
appsec	15.269 s [15.269 s, 15.269 s]	-108.0 ms (-0.7%)
iast	18.908 s [18.908 s, 18.908 s]	3.531 s (23.0%)
iast_GLOBAL	17.93 s [17.93 s, 17.93 s]	2.553 s (16.6%)
profiling	15.076 s [15.076 s, 15.076 s]	-301.0 ms (-2.0%)
tracing	15.071 s [15.071 s, 15.071 s]	-306.0 ms (-2.0%)

candidate results

Variant	Execution Time [CI 0.99]	Δ no_agent
no_agent	15.377 s [15.377 s, 15.377 s]	-
appsec	15.422 s [15.422 s, 15.422 s]	45.0 ms (0.3%)
iast	19.011 s [19.011 s, 19.011 s]	3.634 s (23.6%)
iast_GLOBAL	18.344 s [18.344 s, 18.344 s]	2.967 s (19.3%)
profiling	15.16 s [15.16 s, 15.16 s]	-217.0 ms (-1.4%)
tracing	15.073 s [15.073 s, 15.073 s]	-304.0 ms (-2.0%)

ojung

Thanks 👍 Added couple of comments / questions.

ojung · 2024-10-22T20:09:40Z

dd-java-agent/agent-debugger/src/main/java/com/datadog/debugger/uploader/BatchUploader.java

@@ -301,6 +369,11 @@ public void onResponse(final Call call, final Response response) {
                response.message(),
                response.code());
          }
+          if (response.code() >= 500) {


There are also 408 (client timeout) and 429 (too many requests) that should be retried according to the official logs http api docs: https://docs.datadoghq.com/api/latest/logs/

thanks for pointing this out, added.

ojung · 2024-10-22T20:11:06Z

dd-java-agent/agent-debugger/src/main/java/com/datadog/debugger/uploader/BatchUploader.java

+        int failureCount = failure + 1;
+        if (failureCount <= maxFailures) {
+          LOGGER.debug("Retrying upload {}/{}", failureCount, maxFailures);
+          enqueueCall(client, call.request(), this, retryPolicy, failureCount, inflightRequests);


Don't want to over-complicate things, but maybe we should add a backoff? If the requests fail fast, all retries might be done in a very short time interval.

I see your point. The problem is how I deal with this backoff? where I put the new request on hold, who is responsible for enqueue again after x time?
also, we are keeping for a longest period large buffers (serialized symbols up to 50MB)

Well, we would have to keep them in a collection on the class level I guess. We could either have a scheduled thread that re-enqueues the requests or store a scheduled thread for each retry that has a given initial delay. As I said, it's making things a bit more complicated, but the more I think about it the less I like retrying immediately.

ojung · 2024-10-22T20:15:43Z

dd-java-agent/agent-debugger/src/main/java/com/datadog/debugger/uploader/BatchUploader.java

@@ -260,28 +289,67 @@ private boolean canEnqueueMoreRequests() {
    return client.dispatcher().queuedCallsCount() < MAX_ENQUEUED_REQUESTS;
  }

+  private static void enqueueCall(


How are we handling back pressure? If network is unreliable and we retry the same symdb uploads multiple times with backoff, are we going to slow down serializing new classes into memory?

we have a MAX_ENQUEUED_REQUESTS limit placed at 20 (see above)
so we are dropping requests if reach this limit

How much effort would it be to change the way we extract symbols to:

Discover jars to parse

Put them (path only) on a queue

Only parse them as fast as we can (without dropping requests)

ojung · 2024-10-22T20:16:09Z

dd-java-agent/agent-debugger/src/main/java/com/datadog/debugger/agent/DebuggerAgent.java

@@ -48,6 +48,8 @@
 /** Debugger agent implementation */
 public class DebuggerAgent {
  private static final Logger LOGGER = LoggerFactory.getLogger(DebuggerAgent.class);
+  private static final BatchUploader.RetryPolicy SNAPSHOT_RETRY_POLICY =
+      new BatchUploader.RetryPolicy(3, 3);


Not sure if we should retry snapshots 🤔 SymDB uploads and probe statuses I agree, but for logs, it might be overkill?

ok let's turn it to 0. at least we have the mechanism in place

shatzi · 2024-10-22T19:07:35Z

dd-java-agent/agent-debugger/src/main/java/com/datadog/debugger/agent/DebuggerTransformer.java

+                new BatchUploader(
+                    config,
+                    config.getFinalDebuggerSnapshotUrl(),
+                    new BatchUploader.RetryPolicy(3, 3))),


why this is not using the default SNAPSHOT_RETRY_POLICY ?

shatzi · 2024-10-22T19:08:23Z

dd-java-agent/agent-debugger/src/main/java/com/datadog/debugger/sink/DebuggerSink.java

+            config,
+            null,
+            new BatchUploader(
+                config, config.getFinalDebuggerSnapshotUrl(), new BatchUploader.RetryPolicy(3, 3))),


same here. should we just use one default so it would be easier to main. maybe even move all the other defaults to be under one location (maybe BatchUploader.Defaults ?)

shatzi · 2024-10-22T19:12:44Z

dd-java-agent/agent-debugger/src/main/java/com/datadog/debugger/agent/DebuggerAgent.java

-            config, tags, new BatchUploader(config, config.getFinalDebuggerSnapshotUrl()));
+            config,
+            tags,
+            new BatchUploader(config, config.getFinalDebuggerSnapshotUrl(), SNAPSHOT_RETRY_POLICY));


unlike ProbeStatuses and Symbols that are singular events that will not be sent again - Snapshots are stream of events. I wonder if we need to have a test that mimic timeouts/slowdowns while there are many other events been generated and to see we are not creating deadlocks or memory pressure.

Snapshots are stream of events.

not quite agree with this view. some snapshots are more valuable than others specially if you put conditions.
But as answered to Oskar, let's turn it to 0. and see how it goes.

shatzi · 2024-10-23T05:56:51Z

dd-java-agent/agent-debugger/src/main/java/com/datadog/debugger/uploader/BatchUploader.java

-  private static final Logger log = LoggerFactory.getLogger(BatchUploader.class);
+  public static class RetryPolicy {
+    public final ConcurrentMap<Call, Integer> failures = new ConcurrentHashMap<>();
+    public final int maxNetworkFailures;


all code uses those two max with same value. and we count the failures regardless of type. I recommend we simplify and just have maxFailures for now and split it back once we do need that difference.

shatzi · 2024-10-23T06:08:52Z

dd-java-agent/agent-debugger/src/main/java/com/datadog/debugger/uploader/BatchUploader.java

+      if (failure != null) {
+        int failureCount = failure + 1;
+        if (failureCount <= maxFailures) {
+          LOGGER.debug("Retrying upload {}/{}", failureCount, maxFailures);


add call.request().url() to better understand which sync is was coming from or add BatchUploader.name() ?

shatzi · 2024-10-23T06:22:05Z

dd-java-agent/agent-debugger/src/main/java/com/datadog/debugger/uploader/BatchUploader.java

+    call.enqueue(responseCallback);
+    inflightRequests.register();
+  }
+
  private static final class ResponseCallback implements Callback {


suggest this design here, would be simpler to understand.

make it private inner class (not static) - so you don't need to pass ratelimitedLoger, inflightRequests ... they can just use the parent class references.

add two members: request + failures.

make new instances of this callback for every call of buildAndSendRequest - it only need to get a request object.

this can handle all response and can use the request for retry and simple count the failures.

this would enable to remove the failures map that look out of place and not really needed.

on critical path I am always trying to reduce allocations. That's why I have tried to keep this ReponseCallback as it is because we are avoiding to allocate an instance for each request.
I can understand this is not most straightforward to handle the tracking of failures.
Not easy to place correctly the cursor about the perf impact and readability

shatzi · 2024-10-23T06:24:24Z

dd-java-agent/agent-debugger/src/test/java/com/datadog/debugger/uploader/BatchUploaderTest.java

@@ -48,6 +50,7 @@ public class BatchUploaderTest {

  private final MockWebServer server = new MockWebServer();
  private HttpUrl url;
+  private final BatchUploader.RetryPolicy retryPolicy = new BatchUploader.RetryPolicy(3, 3);


for all those tests, maybe add back that BatchUploader ctor overload with default retryPolicy ? or do we really want to specify that retryPolicy everywhere?

I am conflicted to add new overloaded constructor because sometimes it's complicated to maintain on the long run ( because we are accumulating and version for each new parameter)

jpbempel added the comp: debugger Dynamic Instrumentation label Oct 22, 2024

jpbempel requested a review from a team as a code owner October 22, 2024 12:28

jpbempel requested review from shatzi and removed request for a team October 22, 2024 12:28

ojung reviewed Oct 22, 2024

View reviewed changes

shatzi approved these changes Oct 23, 2024

View reviewed changes

comments

9be11d1

ojung approved these changes Oct 23, 2024

View reviewed changes

jpbempel merged commit 893bd2d into master Oct 23, 2024
105 checks passed

jpbempel deleted the jpbempel/retry-policy branch October 23, 2024 11:03

github-actions bot added this to the 1.42.0 milestone Oct 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add retry policy for uploading requests to agent #7824

Add retry policy for uploading requests to agent #7824

jpbempel commented Oct 22, 2024

pr-commenter bot commented Oct 22, 2024 •

edited

Loading

pr-commenter bot commented Oct 22, 2024 •

edited

Loading

ojung left a comment

ojung Oct 22, 2024

jpbempel Oct 23, 2024

ojung Oct 22, 2024

jpbempel Oct 23, 2024

ojung Oct 23, 2024

ojung Oct 22, 2024

jpbempel Oct 23, 2024

ojung Oct 23, 2024

ojung Oct 22, 2024

jpbempel Oct 23, 2024

shatzi Oct 22, 2024

jpbempel Oct 23, 2024

shatzi Oct 22, 2024

jpbempel Oct 23, 2024

shatzi Oct 22, 2024

jpbempel Oct 23, 2024 •

edited

Loading

shatzi Oct 23, 2024

jpbempel Oct 23, 2024

shatzi Oct 23, 2024

jpbempel Oct 23, 2024

shatzi Oct 23, 2024

jpbempel Oct 23, 2024

shatzi Oct 23, 2024

jpbempel Oct 23, 2024

Add retry policy for uploading requests to agent #7824

Add retry policy for uploading requests to agent #7824

Conversation

jpbempel commented Oct 22, 2024

What Does This Do

Motivation

Additional Notes

Contributor Checklist

pr-commenter bot commented Oct 22, 2024 • edited Loading

Debugger benchmarks

Parameters

Summary

pr-commenter bot commented Oct 22, 2024 • edited Loading

Benchmarks

Startup

Parameters

Summary

Load

Parameters

Summary

Dacapo

Parameters

Summary

ojung left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jpbempel Oct 23, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pr-commenter bot commented Oct 22, 2024 •

edited

Loading

pr-commenter bot commented Oct 22, 2024 •

edited

Loading

jpbempel Oct 23, 2024 •

edited

Loading