Add retry policy for uploading requests to agent #7824
Conversation
Add a mechanism to retry on network failure or a 500 response code. The retry policy stores the calls made to OkHttp to track the number of retries made and the maximum retries allowed per failure category: network or 500. 3 retries are allowed for snapshots, 10 for probe statuses & SymDB.
Debugger benchmarks
Summary: Found 1 performance improvement and 0 performance regressions! Performance is the same for 8 metrics, 6 unstable metrics.
Request duration report for reports [CI 0.99] (gantt chart omitted; scenarios: noprobe, basic, loop): candidate=None, baseline=None.
Startup benchmarks
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 51 metrics, 12 unstable metrics.
Startup time reports for insecure-bank and petclinic (gantt charts omitted): global startup overhead and break down per module, candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e.
Load benchmarks
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 17 unstable metrics.
Request duration reports [CI 0.99] for petclinic and insecure-bank (gantt charts omitted): candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e.
Dacapo benchmarks
Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 12 metrics, 0 unstable metrics.
Execution time reports [CI 0.99] for tomcat and biojava (gantt charts omitted): candidate=1.42.0-SNAPSHOT~9be11d1ad9, baseline=1.42.0-SNAPSHOT~179c62ad1e.
Thanks 👍 Added a couple of comments / questions.
@@ -301,6 +369,11 @@ public void onResponse(final Call call, final Response response) {
            response.message(),
            response.code());
      }
      if (response.code() >= 500) {
There are also 408 (request timeout) and 429 (too many requests) that should be retried according to the official logs HTTP API docs: https://docs.datadoghq.com/api/latest/logs/
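For illustration, a minimal sketch of what the retryable-status check could look like once 408 and 429 are included; the helper name is hypothetical, not the actual PR code:

```java
// Hypothetical helper, not the PR's actual code: decide whether a response
// status warrants a retry.
private static boolean isRetryableStatus(int code) {
  // 408 Request Timeout and 429 Too Many Requests are retryable per the
  // Datadog logs HTTP API docs, in addition to server-side 5xx errors.
  return code == 408 || code == 429 || code >= 500;
}
```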
Thanks for pointing this out, added.
      int failureCount = failure + 1;
      if (failureCount <= maxFailures) {
        LOGGER.debug("Retrying upload {}/{}", failureCount, maxFailures);
        enqueueCall(client, call.request(), this, retryPolicy, failureCount, inflightRequests);
Don't want to over-complicate things, but maybe we should add a backoff? If the requests fail fast, all retries might be done in a very short time interval.
I see your point. The problem is how to deal with this backoff: where do I put the new request on hold, and who is responsible for enqueueing it again after x time?
Also, we would be keeping large buffers (serialized symbols up to 50 MB) around for a longer period.
Well, we would have to keep them in a collection at the class level, I guess. We could either have a scheduled thread that re-enqueues the requests, or schedule a task for each retry with a given initial delay. As I said, it makes things a bit more complicated, but the more I think about it, the less I like retrying immediately.
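As a rough illustration of that second option, a sketch of delayed re-enqueueing with exponential backoff; the class and method names are assumptions, not the PR's BatchUploader code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import okhttp3.Callback;
import okhttp3.OkHttpClient;
import okhttp3.Request;

// Hypothetical sketch: re-enqueue a failed request after a growing delay
// instead of retrying immediately.
final class RetryScheduler {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final OkHttpClient client;

  RetryScheduler(OkHttpClient client) {
    this.client = client;
  }

  void scheduleRetry(Request request, int failureCount, Callback callback) {
    // Exponential backoff: 1s, 2s, 4s, ... capped at 30s.
    long delayMs = Math.min(30_000L, 1_000L << Math.max(0, failureCount - 1));
    scheduler.schedule(
        () -> client.newCall(request).enqueue(callback), delayMs, TimeUnit.MILLISECONDS);
  }
}
```

The trade-off raised above still applies: the pending request bodies have to live somewhere until the scheduled retry fires.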
@@ -260,28 +289,67 @@ private boolean canEnqueueMoreRequests() {
    return client.dispatcher().queuedCallsCount() < MAX_ENQUEUED_REQUESTS;
  }

private static void enqueueCall( |
How are we handling back pressure? If the network is unreliable and we retry the same SymDB uploads multiple times with backoff, are we going to slow down serializing new classes into memory?
We have a MAX_ENQUEUED_REQUESTS limit set at 20 (see above), so we drop requests if we reach this limit.
How much effort would it be to change the way we extract symbols to:
- Discover jars to parse
- Put them (path only) on a queue
- Only parse them as fast as we can upload them (without dropping requests)
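A rough sketch of that flow, assuming jar discovery is cheap and parsing is the expensive part; all names here are illustrative, not the existing SymDB extraction code:

```java
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: queue only jar paths, and parse/upload them no faster
// than the uploader can keep up, instead of serializing everything up front.
final class JarSymbolPipeline {
  interface SymbolSink {
    boolean canEnqueueMoreRequests(); // e.g. queuedCallsCount() < MAX_ENQUEUED_REQUESTS
    void parseAndUpload(Path jar);
  }

  private final BlockingQueue<Path> jarQueue = new ArrayBlockingQueue<>(1024);

  void discovered(Path jar) throws InterruptedException {
    jarQueue.put(jar); // cheap: only the path is kept in memory
  }

  void drain(SymbolSink sink) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      Path jar = jarQueue.take();
      // Back off parsing while the upload queue is full, so large serialized
      // symbol buffers do not pile up during retries.
      while (!sink.canEnqueueMoreRequests()) {
        Thread.sleep(100);
      }
      sink.parseAndUpload(jar);
    }
  }
}
```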
@@ -48,6 +48,8 @@
 /** Debugger agent implementation */
 public class DebuggerAgent {
   private static final Logger LOGGER = LoggerFactory.getLogger(DebuggerAgent.class);
   private static final BatchUploader.RetryPolicy SNAPSHOT_RETRY_POLICY =
       new BatchUploader.RetryPolicy(3, 3);
Not sure if we should retry snapshots 🤔 SymDB uploads and probe statuses I agree, but for logs, it might be overkill?
OK, let's turn it to 0. At least we have the mechanism in place.
            new BatchUploader(
                config,
                config.getFinalDebuggerSnapshotUrl(),
                new BatchUploader.RetryPolicy(3, 3))),
Why is this not using the default SNAPSHOT_RETRY_POLICY?
done
            config,
            null,
            new BatchUploader(
                config, config.getFinalDebuggerSnapshotUrl(), new BatchUploader.RetryPolicy(3, 3))),
Same here. Should we just use one default so it would be easier to maintain? Maybe even move all the other defaults to be under one location (maybe BatchUploader.Defaults?)
done
        config, tags, new BatchUploader(config, config.getFinalDebuggerSnapshotUrl()));
        config,
        tags,
        new BatchUploader(config, config.getFinalDebuggerSnapshotUrl(), SNAPSHOT_RETRY_POLICY));
Unlike ProbeStatuses and Symbols, which are singular events that will not be sent again, Snapshots are a stream of events. I wonder if we need a test that mimics timeouts/slowdowns while many other events are being generated, to check that we are not creating deadlocks or memory pressure.
> Snapshots are a stream of events.

I don't quite agree with this view: some snapshots are more valuable than others, especially if you put conditions on them.
But as answered to Oskar, let's turn it to 0 and see how it goes.
  private static final Logger log = LoggerFactory.getLogger(BatchUploader.class);

  public static class RetryPolicy {
    public final ConcurrentMap<Call, Integer> failures = new ConcurrentHashMap<>();
    public final int maxNetworkFailures;
All code uses those two max values with the same value, and we count the failures regardless of type. I recommend we simplify and just have maxFailures for now, and split it back once we actually need that distinction.
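A sketch of the simplified policy being suggested, with a single counter shared by both failure types; the enclosing class name is a stand-in, not the real BatchUploader:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import okhttp3.Call;

// Sketch only: "BatchUploaderSketch" stands in for the real BatchUploader.
public class BatchUploaderSketch {
  // Single limit shared by network and 5xx failures, as suggested above;
  // split back into separate limits only if a real need appears.
  public static class RetryPolicy {
    public final ConcurrentMap<Call, Integer> failures = new ConcurrentHashMap<>();
    public final int maxFailures;

    public RetryPolicy(int maxFailures) {
      this.maxFailures = maxFailures;
    }
  }
}
```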
done
    if (failure != null) {
      int failureCount = failure + 1;
      if (failureCount <= maxFailures) {
        LOGGER.debug("Retrying upload {}/{}", failureCount, maxFailures);
Add call.request().url() to better understand which sync it was coming from, or add a BatchUploader.name()?
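For example, the debug line could include the target URL; the exact format is a suggestion only:

```java
// Suggested log line, adding the upload URL so it is clear which uploader
// (snapshots, probe statuses, SymDB) is retrying.
LOGGER.debug(
    "Retrying upload {}/{} to {}", failureCount, maxFailures, call.request().url());
```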
done
    call.enqueue(responseCallback);
    inflightRequests.register();
  }
|
private static final class ResponseCallback implements Callback { |
I suggest this design here, which would be simpler to understand:
- Make it a private inner class (not static), so you don't need to pass ratelimitedLogger, inflightRequests, etc.; they can just use the parent class references.
- Add two members: request + failures.
- Make a new instance of this callback for every call of buildAndSendRequest; it only needs to get a request object.
- This can handle all responses and can use the request for retry and simply count the failures.

This would enable removing the failures map, which looks out of place and is not really needed. (A sketch of this design follows below.)
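A sketch of what that per-call design could look like; class and field names are illustrative, and the enclosing class is a stand-in for BatchUploader:

```java
import java.io.IOException;
import okhttp3.Call;
import okhttp3.Callback;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

// Sketch of the suggested design; "UploaderSketch" stands in for BatchUploader.
final class UploaderSketch {
  private final OkHttpClient client = new OkHttpClient();
  private final int maxFailures = 3;

  // Non-static inner class: it can use the enclosing uploader's fields directly
  // and carries its own request and failure count, so no shared failures map.
  private final class ResponseCallback implements Callback {
    private final Request request;
    private int failureCount;

    ResponseCallback(Request request) {
      this.request = request;
    }

    @Override
    public void onFailure(Call call, IOException e) {
      retry();
    }

    @Override
    public void onResponse(Call call, Response response) {
      if (response.code() >= 500) {
        retry();
      }
      response.close();
    }

    private void retry() {
      if (++failureCount <= maxFailures) {
        client.newCall(request).enqueue(this); // re-enqueue the same request
      }
    }
  }
}
```

The cost is one small callback allocation per request, in exchange for dropping the shared Call-to-failure-count map.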
On the critical path I am always trying to reduce allocations. That's why I have tried to keep this ResponseCallback as it is: we avoid allocating an instance for each request.
I can understand this is not the most straightforward way to handle the tracking of failures.
It's not easy to place the cursor correctly between the perf impact and readability.
@@ -48,6 +50,7 @@ public class BatchUploaderTest {

  private final MockWebServer server = new MockWebServer();
  private HttpUrl url;
  private final BatchUploader.RetryPolicy retryPolicy = new BatchUploader.RetryPolicy(3, 3);
For all those tests, maybe add back the BatchUploader ctor overload with a default retryPolicy? Or do we really want to specify that retryPolicy everywhere?
I'm conflicted about adding a new overloaded constructor, because sometimes it's complicated to maintain in the long run (we end up accumulating a version for each new parameter).
What Does This Do
Add a mechanism to retry on network failure or a 500 response code. The retry policy stores the calls made to OkHttp to track the number of retries made and the maximum retries allowed per failure category: network or 500.
3 retries are allowed for snapshots, 10 for probe statuses & SymDB.
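For context, a hedged sketch of how the uploaders could be wired with the two policies described above; the constructor shape matches the snippets in this PR, but the SymDB URL variable is a hypothetical placeholder:

```java
// Illustrative wiring only, based on the PR description; not the exact dd-trace-java code.
BatchUploader snapshotUploader =
    new BatchUploader(
        config,
        config.getFinalDebuggerSnapshotUrl(),
        new BatchUploader.RetryPolicy(3, 3)); // 3 retries for snapshots

BatchUploader symbolUploader =
    new BatchUploader(
        config,
        symbolDbUrl, // hypothetical placeholder for the SymDB/probe status endpoint
        new BatchUploader.RetryPolicy(10, 10)); // 10 retries for probe statuses & SymDB
```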
Motivation
Additional Notes
Contributor Checklist
- Assign the type: and (comp: or inst:) labels in addition to any useful labels
- Avoid close, fix or any linking keywords when referencing an issue. Use solves instead, and assign the PR milestone to the issue

Jira ticket: [PROJ-IDENT]