Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry policy for uploading requests to agent #7824

Merged
merged 2 commits into from
Oct 23, 2024

Conversation

jpbempel
Copy link
Member

What Does This Do

Add a mechanism to retry on network failure or 500 response code. The retry policy store the call made to OkHttp to track number of retries made and the maximum retries allowed by failure categories: network or 500.
3 retries are allowed for snapshots, 10 for probe statuses & symdb

Motivation

Additional Notes

Contributor Checklist

Jira ticket: [PROJ-IDENT]

Add a mechanism to retry on network failure or 500 response code.
The retry policy store the call made to OkHttp to track number of
retries made and the maximum retries allowed by failure categories:
network or 500.
3 retries are allowed for snapshots, 10 for probe statuses & symdb
@jpbempel jpbempel added the comp: debugger Dynamic Instrumentation label Oct 22, 2024
@jpbempel jpbempel requested a review from a team as a code owner October 22, 2024 12:28
@jpbempel jpbempel requested review from shatzi and removed request for a team October 22, 2024 12:28
@pr-commenter
Copy link

pr-commenter bot commented Oct 22, 2024

Debugger benchmarks

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
ci_job_date 1729671190 1729671565
end_time 2024-10-23T08:14:25 2024-10-23T08:20:39
git_branch master jpbempel/retry-policy
git_commit_sha 179c62a 9be11d1
start_time 2024-10-23T08:13:11 2024-10-23T08:19:26
See matching parameters
Baseline Candidate
ci_job_id 680981191 680981191
ci_pipeline_id 47208441 47208441
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
git_commit_date 1729670552 1729670552

Summary

Found 1 performance improvements and 0 performance regressions! Performance is the same for 8 metrics, 6 unstable metrics.

scenario Δ mean agg_http_req_duration_min Δ mean agg_http_req_duration_p50 Δ mean agg_http_req_duration_p75 Δ mean agg_http_req_duration_p99 Δ mean throughput
scenario:basic better
[-17.652µs; -5.268µs] or [-6.389%; -1.907%]
same same unstable
[-49.422µs; +38.986µs] or [-9.143%; +7.212%]
unstable
[-74.527op/s; +209.479op/s] or [-2.907%; +8.170%]
See unchanged results
scenario Δ mean agg_http_req_duration_min Δ mean agg_http_req_duration_p50 Δ mean agg_http_req_duration_p75 Δ mean agg_http_req_duration_p99 Δ mean throughput
scenario:noprobe unstable
[-24.714µs; +11.994µs] or [-9.420%; +4.572%]
unstable
[-40.384µs; +18.873µs] or [-13.377%; +6.251%]
unstable
[-51.796µs; +30.701µs] or [-16.418%; +9.732%]
unstable
[-104.199µs; +83.426µs] or [-17.201%; +13.772%]
same
scenario:loop unsure
[+0.475µs; +15.761µs] or [+0.005%; +0.156%]
same same same same
Request duration reports for reports
gantt
    title reports - request duration [CI 0.99] : candidate=None, baseline=None
    dateFormat X
    axisFormat %s
section baseline
noprobe (301.898 µs) : 269, 335
.   : milestone, 302,
basic (303.834 µs) : 294, 314
.   : milestone, 304,
loop (10.293 ms) : 10247, 10339
.   : milestone, 10293,
section candidate
noprobe (291.142 µs) : 269, 314
.   : milestone, 291,
basic (293.792 µs) : 284, 304
.   : milestone, 294,
loop (10.292 ms) : 10258, 10326
.   : milestone, 10292,
Loading
  • baseline results
Scenario Request median duration [CI 0.99]
noprobe 301.898 µs [268.836 µs, 334.959 µs]
basic 303.834 µs [294.108 µs, 313.559 µs]
loop 10.293 ms [10.247 ms, 10.339 ms]
  • candidate results
Scenario Request median duration [CI 0.99]
noprobe 291.142 µs [268.676 µs, 313.607 µs]
basic 293.792 µs [283.733 µs, 303.851 µs]
loop 10.292 ms [10.258 ms, 10.326 ms]

@pr-commenter
Copy link

pr-commenter bot commented Oct 22, 2024

Benchmarks

Startup

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master jpbempel/retry-policy
git_commit_date 1729609825 1729670552
git_commit_sha 179c62a 9be11d1
release_version 1.42.0-SNAPSHOT~179c62ad1e 1.42.0-SNAPSHOT~9be11d1ad9
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1729673047 1729673047
ci_job_id 680981185 680981185
ci_pipeline_id 47208441 47208441
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
module Agent Agent
parent None None
variant iast iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 51 metrics, 12 unstable metrics.

Startup time reports for insecure-bank
gantt
    title insecure-bank - global startup overhead: candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.089 s) : 0, 1088878
Total [baseline] (8.594 s) : 0, 8593847
Agent [candidate] (1.078 s) : 0, 1077761
Total [candidate] (8.575 s) : 0, 8574848
section iast
Agent [baseline] (1.222 s) : 0, 1222078
Total [baseline] (9.216 s) : 0, 9216278
Agent [candidate] (1.212 s) : 0, 1211849
Total [candidate] (9.13 s) : 0, 9130102
section iast_HARDCODED_SECRET_DISABLED
Agent [baseline] (1.213 s) : 0, 1212573
Total [baseline] (9.136 s) : 0, 9135552
Agent [candidate] (1.223 s) : 0, 1222654
Total [candidate] (9.126 s) : 0, 9126248
section iast_TELEMETRY_OFF
Agent [baseline] (1.224 s) : 0, 1224186
Total [baseline] (9.155 s) : 0, 9155386
Agent [candidate] (1.209 s) : 0, 1208798
Total [candidate] (9.174 s) : 0, 9174415
Loading
  • baseline results
Module Variant Duration Δ tracing
Agent tracing 1.089 s -
Agent iast 1.222 s 133.201 ms (12.2%)
Agent iast_HARDCODED_SECRET_DISABLED 1.213 s 123.696 ms (11.4%)
Agent iast_TELEMETRY_OFF 1.224 s 135.308 ms (12.4%)
Total tracing 8.594 s -
Total iast 9.216 s 622.431 ms (7.2%)
Total iast_HARDCODED_SECRET_DISABLED 9.136 s 541.704 ms (6.3%)
Total iast_TELEMETRY_OFF 9.155 s 561.539 ms (6.5%)
  • candidate results
Module Variant Duration Δ tracing
Agent tracing 1.078 s -
Agent iast 1.212 s 134.088 ms (12.4%)
Agent iast_HARDCODED_SECRET_DISABLED 1.223 s 144.893 ms (13.4%)
Agent iast_TELEMETRY_OFF 1.209 s 131.037 ms (12.2%)
Total tracing 8.575 s -
Total iast 9.13 s 555.254 ms (6.5%)
Total iast_HARDCODED_SECRET_DISABLED 9.126 s 551.4 ms (6.4%)
Total iast_TELEMETRY_OFF 9.174 s 599.566 ms (7.0%)
gantt
    title insecure-bank - break down per module: candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (693.786 ms) : 0, 693786
BytebuddyAgent [candidate] (684.915 ms) : 0, 684915
GlobalTracer [baseline] (315.989 ms) : 0, 315989
GlobalTracer [candidate] (312.814 ms) : 0, 312814
AppSec [baseline] (54.78 ms) : 0, 54780
AppSec [candidate] (53.951 ms) : 0, 53951
Remote Config [baseline] (672.158 µs) : 0, 672
Remote Config [candidate] (669.619 µs) : 0, 670
Telemetry [baseline] (9.772 ms) : 0, 9772
Telemetry [candidate] (11.678 ms) : 0, 11678
section iast
BytebuddyAgent [baseline] (814.983 ms) : 0, 814983
BytebuddyAgent [candidate] (807.229 ms) : 0, 807229
GlobalTracer [baseline] (305.898 ms) : 0, 305898
GlobalTracer [candidate] (304.569 ms) : 0, 304569
AppSec [baseline] (58.509 ms) : 0, 58509
AppSec [candidate] (57.857 ms) : 0, 57857
IAST [baseline] (20.851 ms) : 0, 20851
IAST [candidate] (20.605 ms) : 0, 20605
Remote Config [baseline] (617.876 µs) : 0, 618
Remote Config [candidate] (600.549 µs) : 0, 601
Telemetry [baseline] (7.235 ms) : 0, 7235
Telemetry [candidate] (7.116 ms) : 0, 7116
section iast_HARDCODED_SECRET_DISABLED
BytebuddyAgent [baseline] (808.931 ms) : 0, 808931
BytebuddyAgent [candidate] (815.209 ms) : 0, 815209
GlobalTracer [baseline] (303.84 ms) : 0, 303840
GlobalTracer [candidate] (306.515 ms) : 0, 306515
AppSec [baseline] (57.636 ms) : 0, 57636
AppSec [candidate] (58.355 ms) : 0, 58355
IAST [baseline] (20.663 ms) : 0, 20663
IAST [candidate] (20.772 ms) : 0, 20772
Remote Config [baseline] (604.326 µs) : 0, 604
Remote Config [candidate] (610.439 µs) : 0, 610
Telemetry [baseline] (7.045 ms) : 0, 7045
Telemetry [candidate] (7.22 ms) : 0, 7220
section iast_TELEMETRY_OFF
BytebuddyAgent [baseline] (816.111 ms) : 0, 816111
BytebuddyAgent [candidate] (804.351 ms) : 0, 804351
GlobalTracer [baseline] (307.543 ms) : 0, 307543
GlobalTracer [candidate] (304.148 ms) : 0, 304148
AppSec [baseline] (56.77 ms) : 0, 56770
AppSec [candidate] (58.451 ms) : 0, 58451
IAST [baseline] (21.123 ms) : 0, 21123
IAST [candidate] (20.383 ms) : 0, 20383
Remote Config [baseline] (613.299 µs) : 0, 613
Remote Config [candidate] (608.381 µs) : 0, 608
Telemetry [baseline] (7.992 ms) : 0, 7992
Telemetry [candidate] (7.054 ms) : 0, 7054
Loading
Startup time reports for petclinic
gantt
    title petclinic - global startup overhead: candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e

    dateFormat X
    axisFormat %s
section tracing
Agent [baseline] (1.087 s) : 0, 1087356
Total [baseline] (10.499 s) : 0, 10498981
Agent [candidate] (1.085 s) : 0, 1084768
Total [candidate] (10.525 s) : 0, 10525150
section appsec
Agent [baseline] (1.214 s) : 0, 1213787
Total [baseline] (10.605 s) : 0, 10604634
Agent [candidate] (1.212 s) : 0, 1211504
Total [candidate] (10.617 s) : 0, 10616868
section iast
Agent [baseline] (1.205 s) : 0, 1204545
Total [baseline] (10.919 s) : 0, 10919069
Agent [candidate] (1.204 s) : 0, 1203986
Total [candidate] (10.897 s) : 0, 10897296
section profiling
Agent [baseline] (1.283 s) : 0, 1282772
Total [baseline] (10.709 s) : 0, 10709058
Agent [candidate] (1.286 s) : 0, 1285849
Total [candidate] (10.753 s) : 0, 10752664
Loading
  • baseline results
Module Variant Duration Δ tracing
Agent tracing 1.087 s -
Agent appsec 1.214 s 126.431 ms (11.6%)
Agent iast 1.205 s 117.189 ms (10.8%)
Agent profiling 1.283 s 195.416 ms (18.0%)
Total tracing 10.499 s -
Total appsec 10.605 s 105.653 ms (1.0%)
Total iast 10.919 s 420.088 ms (4.0%)
Total profiling 10.709 s 210.077 ms (2.0%)
  • candidate results
Module Variant Duration Δ tracing
Agent tracing 1.085 s -
Agent appsec 1.212 s 126.737 ms (11.7%)
Agent iast 1.204 s 119.218 ms (11.0%)
Agent profiling 1.286 s 201.081 ms (18.5%)
Total tracing 10.525 s -
Total appsec 10.617 s 91.718 ms (0.9%)
Total iast 10.897 s 372.146 ms (3.5%)
Total profiling 10.753 s 227.514 ms (2.2%)
gantt
    title petclinic - break down per module: candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e

    dateFormat X
    axisFormat %s
section tracing
BytebuddyAgent [baseline] (692.118 ms) : 0, 692118
BytebuddyAgent [candidate] (691.457 ms) : 0, 691457
GlobalTracer [baseline] (315.399 ms) : 0, 315399
GlobalTracer [candidate] (314.891 ms) : 0, 314891
AppSec [baseline] (54.45 ms) : 0, 54450
AppSec [candidate] (54.106 ms) : 0, 54106
Remote Config [baseline] (661.011 µs) : 0, 661
Remote Config [candidate] (670.476 µs) : 0, 670
Telemetry [baseline] (10.898 ms) : 0, 10898
Telemetry [candidate] (9.818 ms) : 0, 9818
section appsec
BytebuddyAgent [baseline] (704.381 ms) : 0, 704381
BytebuddyAgent [candidate] (703.096 ms) : 0, 703096
GlobalTracer [baseline] (310.787 ms) : 0, 310787
GlobalTracer [candidate] (310.303 ms) : 0, 310303
AppSec [baseline] (165.938 ms) : 0, 165938
AppSec [candidate] (166.324 ms) : 0, 166324
Remote Config [baseline] (634.329 µs) : 0, 634
Remote Config [candidate] (632.668 µs) : 0, 633
Telemetry [baseline] (7.814 ms) : 0, 7814
Telemetry [candidate] (7.834 ms) : 0, 7834
IAST [baseline] (21.038 ms) : 0, 21038
IAST [candidate] (19.196 ms) : 0, 19196
section iast
BytebuddyAgent [baseline] (802.748 ms) : 0, 802748
BytebuddyAgent [candidate] (802.563 ms) : 0, 802563
GlobalTracer [baseline] (301.659 ms) : 0, 301659
GlobalTracer [candidate] (302.085 ms) : 0, 302085
AppSec [baseline] (57.327 ms) : 0, 57327
AppSec [candidate] (56.493 ms) : 0, 56493
Remote Config [baseline] (601.234 µs) : 0, 601
Remote Config [candidate] (604.702 µs) : 0, 605
Telemetry [baseline] (7.094 ms) : 0, 7094
Telemetry [candidate] (7.959 ms) : 0, 7959
IAST [baseline] (21.361 ms) : 0, 21361
IAST [candidate] (20.516 ms) : 0, 20516
section profiling
BytebuddyAgent [baseline] (685.056 ms) : 0, 685056
BytebuddyAgent [candidate] (685.917 ms) : 0, 685917
GlobalTracer [baseline] (399.159 ms) : 0, 399159
GlobalTracer [candidate] (399.683 ms) : 0, 399683
AppSec [baseline] (54.845 ms) : 0, 54845
AppSec [candidate] (55.196 ms) : 0, 55196
Remote Config [baseline] (664.553 µs) : 0, 665
Remote Config [candidate] (665.96 µs) : 0, 666
Telemetry [baseline] (12.427 ms) : 0, 12427
Telemetry [candidate] (11.868 ms) : 0, 11868
ProfilingAgent [baseline] (91.54 ms) : 0, 91540
ProfilingAgent [candidate] (93.227 ms) : 0, 93227
Profiling [baseline] (91.563 ms) : 0, 91563
Profiling [candidate] (93.25 ms) : 0, 93250
Loading

Load

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
end_time 2024-10-23T08:14:28 2024-10-23T08:21:20
git_branch master jpbempel/retry-policy
git_commit_date 1729609825 1729670552
git_commit_sha 179c62a 9be11d1
release_version 1.42.0-SNAPSHOT~179c62ad1e 1.42.0-SNAPSHOT~9be11d1ad9
start_time 2024-10-23T08:14:15 2024-10-23T08:21:07
See matching parameters
Baseline Candidate
application insecure-bank insecure-bank
ci_job_date 1729672028 1729672028
ci_job_id 680981186 680981186
ci_pipeline_id 47208441 47208441
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant iast iast

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 17 unstable metrics.

Request duration reports for petclinic
gantt
    title petclinic - request duration [CI 0.99] : candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.341 ms) : 1322, 1360
.   : milestone, 1341,
appsec (1.728 ms) : 1704, 1751
.   : milestone, 1728,
appsec_no_iast (1.702 ms) : 1677, 1726
.   : milestone, 1702,
iast (1.495 ms) : 1472, 1518
.   : milestone, 1495,
profiling (1.499 ms) : 1475, 1523
.   : milestone, 1499,
tracing (1.465 ms) : 1440, 1490
.   : milestone, 1465,
section candidate
no_agent (1.343 ms) : 1325, 1362
.   : milestone, 1343,
appsec (1.716 ms) : 1692, 1741
.   : milestone, 1716,
appsec_no_iast (1.747 ms) : 1723, 1770
.   : milestone, 1747,
iast (1.479 ms) : 1456, 1502
.   : milestone, 1479,
profiling (1.473 ms) : 1450, 1496
.   : milestone, 1473,
tracing (1.482 ms) : 1458, 1506
.   : milestone, 1482,
Loading
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.341 ms [1.322 ms, 1.36 ms] -
appsec 1.728 ms [1.704 ms, 1.751 ms] 386.833 µs (28.8%)
appsec_no_iast 1.702 ms [1.677 ms, 1.726 ms] 360.811 µs (26.9%)
iast 1.495 ms [1.472 ms, 1.518 ms] 154.25 µs (11.5%)
profiling 1.499 ms [1.475 ms, 1.523 ms] 158.098 µs (11.8%)
tracing 1.465 ms [1.44 ms, 1.49 ms] 124.197 µs (9.3%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 1.343 ms [1.325 ms, 1.362 ms] -
appsec 1.716 ms [1.692 ms, 1.741 ms] 372.98 µs (27.8%)
appsec_no_iast 1.747 ms [1.723 ms, 1.77 ms] 403.144 µs (30.0%)
iast 1.479 ms [1.456 ms, 1.502 ms] 135.299 µs (10.1%)
profiling 1.473 ms [1.45 ms, 1.496 ms] 129.538 µs (9.6%)
tracing 1.482 ms [1.458 ms, 1.506 ms] 138.783 µs (10.3%)
Request duration reports for insecure-bank
gantt
    title insecure-bank - request duration [CI 0.99] : candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e
    dateFormat X
    axisFormat %s
section baseline
no_agent (366.366 µs) : 347, 386
.   : milestone, 366,
iast (482.512 µs) : 461, 504
.   : milestone, 483,
iast_FULL (558.2 µs) : 537, 579
.   : milestone, 558,
iast_GLOBAL (518.163 µs) : 496, 540
.   : milestone, 518,
iast_HARDCODED_SECRET_DISABLED (485.004 µs) : 464, 506
.   : milestone, 485,
iast_INACTIVE (450.174 µs) : 429, 472
.   : milestone, 450,
iast_TELEMETRY_OFF (482.625 µs) : 460, 505
.   : milestone, 483,
tracing (444.203 µs) : 423, 465
.   : milestone, 444,
section candidate
no_agent (375.284 µs) : 354, 397
.   : milestone, 375,
iast (483.221 µs) : 462, 504
.   : milestone, 483,
iast_FULL (550.67 µs) : 530, 572
.   : milestone, 551,
iast_GLOBAL (524.643 µs) : 503, 546
.   : milestone, 525,
iast_HARDCODED_SECRET_DISABLED (479.223 µs) : 458, 500
.   : milestone, 479,
iast_INACTIVE (449.53 µs) : 428, 471
.   : milestone, 450,
iast_TELEMETRY_OFF (472.274 µs) : 451, 493
.   : milestone, 472,
tracing (446.32 µs) : 425, 468
.   : milestone, 446,
Loading
  • baseline results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 366.366 µs [347.094 µs, 385.637 µs] -
iast 482.512 µs [461.241 µs, 503.783 µs] 116.146 µs (31.7%)
iast_FULL 558.2 µs [536.996 µs, 579.404 µs] 191.834 µs (52.4%)
iast_GLOBAL 518.163 µs [496.375 µs, 539.95 µs] 151.797 µs (41.4%)
iast_HARDCODED_SECRET_DISABLED 485.004 µs [463.703 µs, 506.304 µs] 118.638 µs (32.4%)
iast_INACTIVE 450.174 µs [428.53 µs, 471.817 µs] 83.808 µs (22.9%)
iast_TELEMETRY_OFF 482.625 µs [460.344 µs, 504.906 µs] 116.259 µs (31.7%)
tracing 444.203 µs [423.221 µs, 465.184 µs] 77.837 µs (21.2%)
  • candidate results
Variant Request duration [CI 0.99] Δ no_agent
no_agent 375.284 µs [354.032 µs, 396.536 µs] -
iast 483.221 µs [461.972 µs, 504.469 µs] 107.937 µs (28.8%)
iast_FULL 550.67 µs [529.552 µs, 571.789 µs] 175.386 µs (46.7%)
iast_GLOBAL 524.643 µs [503.117 µs, 546.168 µs] 149.359 µs (39.8%)
iast_HARDCODED_SECRET_DISABLED 479.223 µs [457.959 µs, 500.486 µs] 103.939 µs (27.7%)
iast_INACTIVE 449.53 µs [428.211 µs, 470.85 µs] 74.246 µs (19.8%)
iast_TELEMETRY_OFF 472.274 µs [451.331 µs, 493.217 µs] 96.99 µs (25.8%)
tracing 446.32 µs [424.676 µs, 467.963 µs] 71.036 µs (18.9%)

Dacapo

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master jpbempel/retry-policy
git_commit_date 1729609825 1729670552
git_commit_sha 179c62a 9be11d1
release_version 1.42.0-SNAPSHOT~179c62ad1e 1.42.0-SNAPSHOT~9be11d1ad9
See matching parameters
Baseline Candidate
application biojava biojava
ci_job_date 1729672557 1729672557
ci_job_id 680981187 680981187
ci_pipeline_id 47208441 47208441
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
variant appsec appsec

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 12 metrics, 0 unstable metrics.

Execution time for tomcat
gantt
    title tomcat - execution time [CI 0.99] : candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e
    dateFormat X
    axisFormat %s
section baseline
no_agent (1.467 ms) : 1456, 1478
.   : milestone, 1467,
appsec (2.335 ms) : 2294, 2376
.   : milestone, 2335,
iast (2.089 ms) : 2036, 2141
.   : milestone, 2089,
iast_GLOBAL (2.13 ms) : 2078, 2183
.   : milestone, 2130,
profiling (1.958 ms) : 1916, 2000
.   : milestone, 1958,
tracing (1.931 ms) : 1891, 1971
.   : milestone, 1931,
section candidate
no_agent (1.472 ms) : 1461, 1484
.   : milestone, 1472,
appsec (2.365 ms) : 2323, 2407
.   : milestone, 2365,
iast (2.092 ms) : 2040, 2145
.   : milestone, 2092,
iast_GLOBAL (2.14 ms) : 2087, 2193
.   : milestone, 2140,
profiling (1.955 ms) : 1912, 1998
.   : milestone, 1955,
tracing (1.925 ms) : 1885, 1965
.   : milestone, 1925,
Loading
  • baseline results
Variant Execution Time [CI 0.99] Δ no_agent
no_agent 1.467 ms [1.456 ms, 1.478 ms] -
appsec 2.335 ms [2.294 ms, 2.376 ms] 868.301 µs (59.2%)
iast 2.089 ms [2.036 ms, 2.141 ms] 621.723 µs (42.4%)
iast_GLOBAL 2.13 ms [2.078 ms, 2.183 ms] 663.342 µs (45.2%)
profiling 1.958 ms [1.916 ms, 2.0 ms] 491.408 µs (33.5%)
tracing 1.931 ms [1.891 ms, 1.971 ms] 464.068 µs (31.6%)
  • candidate results
Variant Execution Time [CI 0.99] Δ no_agent
no_agent 1.472 ms [1.461 ms, 1.484 ms] -
appsec 2.365 ms [2.323 ms, 2.407 ms] 892.644 µs (60.6%)
iast 2.092 ms [2.04 ms, 2.145 ms] 619.839 µs (42.1%)
iast_GLOBAL 2.14 ms [2.087 ms, 2.193 ms] 667.578 µs (45.3%)
profiling 1.955 ms [1.912 ms, 1.998 ms] 482.729 µs (32.8%)
tracing 1.925 ms [1.885 ms, 1.965 ms] 452.735 µs (30.7%)
Execution time for biojava
gantt
    title biojava - execution time [CI 0.99] : candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e
    dateFormat X
    axisFormat %s
section baseline
no_agent (15.377 s) : 15377000, 15377000
.   : milestone, 15377000,
appsec (15.269 s) : 15269000, 15269000
.   : milestone, 15269000,
iast (18.908 s) : 18908000, 18908000
.   : milestone, 18908000,
iast_GLOBAL (17.93 s) : 17930000, 17930000
.   : milestone, 17930000,
profiling (15.076 s) : 15076000, 15076000
.   : milestone, 15076000,
tracing (15.071 s) : 15071000, 15071000
.   : milestone, 15071000,
section candidate
no_agent (15.377 s) : 15377000, 15377000
.   : milestone, 15377000,
appsec (15.422 s) : 15422000, 15422000
.   : milestone, 15422000,
iast (19.011 s) : 19011000, 19011000
.   : milestone, 19011000,
iast_GLOBAL (18.344 s) : 18344000, 18344000
.   : milestone, 18344000,
profiling (15.16 s) : 15160000, 15160000
.   : milestone, 15160000,
tracing (15.073 s) : 15073000, 15073000
.   : milestone, 15073000,
Loading
  • baseline results
Variant Execution Time [CI 0.99] Δ no_agent
no_agent 15.377 s [15.377 s, 15.377 s] -
appsec 15.269 s [15.269 s, 15.269 s] -108.0 ms (-0.7%)
iast 18.908 s [18.908 s, 18.908 s] 3.531 s (23.0%)
iast_GLOBAL 17.93 s [17.93 s, 17.93 s] 2.553 s (16.6%)
profiling 15.076 s [15.076 s, 15.076 s] -301.0 ms (-2.0%)
tracing 15.071 s [15.071 s, 15.071 s] -306.0 ms (-2.0%)
  • candidate results
Variant Execution Time [CI 0.99] Δ no_agent
no_agent 15.377 s [15.377 s, 15.377 s] -
appsec 15.422 s [15.422 s, 15.422 s] 45.0 ms (0.3%)
iast 19.011 s [19.011 s, 19.011 s] 3.634 s (23.6%)
iast_GLOBAL 18.344 s [18.344 s, 18.344 s] 2.967 s (19.3%)
profiling 15.16 s [15.16 s, 15.16 s] -217.0 ms (-1.4%)
tracing 15.073 s [15.073 s, 15.073 s] -304.0 ms (-2.0%)

Copy link
Member

@ojung ojung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 👍 Added couple of comments / questions.

@@ -301,6 +369,11 @@ public void onResponse(final Call call, final Response response) {
response.message(),
response.code());
}
if (response.code() >= 500) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also 408 (client timeout) and 429 (too many requests) that should be retried according to the official logs http api docs: https://docs.datadoghq.com/api/latest/logs/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for pointing this out, added.

int failureCount = failure + 1;
if (failureCount <= maxFailures) {
LOGGER.debug("Retrying upload {}/{}", failureCount, maxFailures);
enqueueCall(client, call.request(), this, retryPolicy, failureCount, inflightRequests);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't want to over-complicate things, but maybe we should add a backoff? If the requests fail fast, all retries might be done in a very short time interval.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. The problem is how I deal with this backoff? where I put the new request on hold, who is responsible for enqueue again after x time?
also, we are keeping for a longest period large buffers (serialized symbols up to 50MB)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we would have to keep them in a collection on the class level I guess. We could either have a scheduled thread that re-enqueues the requests or store a scheduled thread for each retry that has a given initial delay. As I said, it's making things a bit more complicated, but the more I think about it the less I like retrying immediately.

@@ -260,28 +289,67 @@ private boolean canEnqueueMoreRequests() {
return client.dispatcher().queuedCallsCount() < MAX_ENQUEUED_REQUESTS;
}

private static void enqueueCall(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we handling back pressure? If network is unreliable and we retry the same symdb uploads multiple times with backoff, are we going to slow down serializing new classes into memory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have a MAX_ENQUEUED_REQUESTS limit placed at 20 (see above)
so we are dropping requests if reach this limit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much effort would it be to change the way we extract symbols to:

  1. Discover jars to parse
  2. Put them (path only) on a queue
  3. Only parse them as fast as we can (without dropping requests)

@@ -48,6 +48,8 @@
/** Debugger agent implementation */
public class DebuggerAgent {
private static final Logger LOGGER = LoggerFactory.getLogger(DebuggerAgent.class);
private static final BatchUploader.RetryPolicy SNAPSHOT_RETRY_POLICY =
new BatchUploader.RetryPolicy(3, 3);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should retry snapshots 🤔 SymDB uploads and probe statuses I agree, but for logs, it might be overkill?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok let's turn it to 0. at least we have the mechanism in place

new BatchUploader(
config,
config.getFinalDebuggerSnapshotUrl(),
new BatchUploader.RetryPolicy(3, 3))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this is not using the default SNAPSHOT_RETRY_POLICY ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

config,
null,
new BatchUploader(
config, config.getFinalDebuggerSnapshotUrl(), new BatchUploader.RetryPolicy(3, 3))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here. should we just use one default so it would be easier to main. maybe even move all the other defaults to be under one location (maybe BatchUploader.Defaults ?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

config, tags, new BatchUploader(config, config.getFinalDebuggerSnapshotUrl()));
config,
tags,
new BatchUploader(config, config.getFinalDebuggerSnapshotUrl(), SNAPSHOT_RETRY_POLICY));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlike ProbeStatuses and Symbols that are singular events that will not be sent again - Snapshots are stream of events. I wonder if we need to have a test that mimic timeouts/slowdowns while there are many other events been generated and to see we are not creating deadlocks or memory pressure.

Copy link
Member Author

@jpbempel jpbempel Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snapshots are stream of events.

not quite agree with this view. some snapshots are more valuable than others specially if you put conditions.
But as answered to Oskar, let's turn it to 0. and see how it goes.

private static final Logger log = LoggerFactory.getLogger(BatchUploader.class);
public static class RetryPolicy {
public final ConcurrentMap<Call, Integer> failures = new ConcurrentHashMap<>();
public final int maxNetworkFailures;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all code uses those two max with same value. and we count the failures regardless of type. I recommend we simplify and just have maxFailures for now and split it back once we do need that difference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (failure != null) {
int failureCount = failure + 1;
if (failureCount <= maxFailures) {
LOGGER.debug("Retrying upload {}/{}", failureCount, maxFailures);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add call.request().url() to better understand which sync is was coming from or add BatchUploader.name() ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

call.enqueue(responseCallback);
inflightRequests.register();
}

private static final class ResponseCallback implements Callback {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest this design here, would be simpler to understand.

  1. make it private inner class (not static) - so you don't need to pass ratelimitedLoger, inflightRequests ... they can just use the parent class references.
  2. add two members: request + failures.
  3. make new instances of this callback for every call of buildAndSendRequest - it only need to get a request object.
  4. this can handle all response and can use the request for retry and simple count the failures.

this would enable to remove the failures map that look out of place and not really needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on critical path I am always trying to reduce allocations. That's why I have tried to keep this ReponseCallback as it is because we are avoiding to allocate an instance for each request.
I can understand this is not most straightforward to handle the tracking of failures.
Not easy to place correctly the cursor about the perf impact and readability

@@ -48,6 +50,7 @@ public class BatchUploaderTest {

private final MockWebServer server = new MockWebServer();
private HttpUrl url;
private final BatchUploader.RetryPolicy retryPolicy = new BatchUploader.RetryPolicy(3, 3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for all those tests, maybe add back that BatchUploader ctor overload with default retryPolicy ? or do we really want to specify that retryPolicy everywhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am conflicted to add new overloaded constructor because sometimes it's complicated to maintain on the long run ( because we are accumulating and version for each new parameter)

@jpbempel jpbempel merged commit 893bd2d into master Oct 23, 2024
105 checks passed
@jpbempel jpbempel deleted the jpbempel/retry-policy branch October 23, 2024 11:03
@github-actions github-actions bot added this to the 1.42.0 milestone Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp: debugger Dynamic Instrumentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants