Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support/db: Delay canceling queries from client side when there's a statement / transaction timeout configured in postgres #5223

Merged
merged 13 commits into from
Mar 14, 2024

Conversation

tamirms
Copy link
Contributor

@tamirms tamirms commented Feb 26, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've updated the relevant CHANGELOG (here for Horizon) if
    needed with deprecations, added features, breaking changes, and DB schema changes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

Add a --cancel-db-query-timeout flag which configures a timeout on all db queries invoked by a horizon HTTP request. This timeout is a postgres client side timeout which means that it is triggered by horizon establishing a new connection to postgres in order to cancel the ongoing query.

Why

Previously, we configured both postgres client side timeouts and server side timeouts (in the form of the statement_timeout postgres setting). However, the client side timeout is redundant when the postgres server will enforce its own statement timeout. In fact, the client side timeout can result in additional load which will exacerbate situations when the postgres DB cannot keep up with the ongoing query traffic.

In our horizon deployments, we plan to configure the client side db timeout to be longer than the server side db timeout. We expect the server side timeout to terminate a majority of the long running queries and the client side timeout will be a fallback in the unexpected scenario where the db query has surpassed the server side timeout.

Known limitations

[N/A]

@tamirms tamirms force-pushed the db-fixes branch 2 times, most recently from 7ff5919 to 3014772 Compare February 29, 2024 20:55
@tamirms tamirms force-pushed the db-fixes branch 6 times, most recently from 40bf0c3 to c20c9bf Compare March 11, 2024 09:34
@tamirms tamirms changed the base branch from master to release-horizon-v2.29.0 March 11, 2024 21:00
Comment on lines +48 to +51
serverSidePGTimeoutConfigs := []db.ClientConfig{
db.StatementTimeout(app.config.ConnectionTimeout),
db.IdleTransactionTimeout(app.config.ConnectionTimeout),
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ingestion system uses a separate connection pool

HistorySession: mustNewDBSession(
db.IngestSubservice, app.config.DatabaseURL, ingest.MaxDBConnections, ingest.MaxDBConnections, app.prometheusRegistry,
),

so we can apply these timeout settings to app.historyQ and app.primaryHistoryQ because we know that those db connections will never be used in ingestion and will only be used when serving HTTP requests

@tamirms tamirms marked this pull request as ready for review March 11, 2024 21:48
@tamirms tamirms requested a review from a team March 11, 2024 21:52
Usage: "defines the timeout for when horizon will cancel all postgres queries connected to an HTTP request. The timeout is measured in seconds since the start of the HTTP request. Note, this timeout does not apply to POST /transactions. " +
"The difference between cancel-db-query-timeout and connection-timeout is that connection-timeout applies a postgres statement timeout whereas cancel-db-query-timeout will send an additional request to postgres to cancel the ongoing query. " +
"Generally, cancel-db-query-timeout should be configured to be higher than connection-timeout to allow the postgres statement timeout to kill long running queries without having to send the additional cancel request to postgres. " +
"By default, cancel-db-query-timeout will be set to twice the connection-timeout.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"By default, cancel-db-query-timeout will be set to twice the connection-timeout.",
"When set to 0, which is the default, it will trigger cancel-db-query-timeout to be set to twice the connection-timeout.",

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, should the flag allow for disabling by the user, I see later in middleware, 0 equates to skipping application of the flag, but it would never be exercised as of now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to allow the flag to be disabled because I think it could result in confusion. If we skip the application of this flag, the connection timeout will be set to both the postgres server side statement timeouts and the query cancelations on the horizon side. I think this would be unintuitive because, as a user, my first thought is that disabling the query cancelation timeout would result in no query cancelation whatsoever.

If we wanted to allow the option of having no query cancelation at all we could equate 0 to that behavior. -1 could be a default which means cancel-db-query-timeout will be set to twice the connection-timeout value. The user still has the option of setting cancel-db-query-timeout and connection-timeout to the same value which will result in more or less the status quo behavior of horizon prior to this PR.

support/db/session.go Outdated Show resolved Hide resolved
@@ -437,6 +437,18 @@ func Flags() (*Config, support.ConfigOptions) {
Usage: "defines the timeout of connection after which 504 response will be sent or stream will be closed, if Horizon is behind a load balancer with idle connection timeout, this should be set to a few seconds less that idle timeout, does not apply to POST /transactions",
UsedInCommands: ApiServerCommands,
},
&support.ConfigOption{
Name: "cancel-db-query-timeout",
Copy link
Contributor

@Shaptic Shaptic Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just driving by - can we simplify the name to query-timeout? I think it delivers the same meaning with less verbosity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Shaptic how about cancel-query-timeout? I would like to emphasize that this timeout is only for client side query canceling and not like the postgres server side statement timeout

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, good point! maybe client-query-timeout then? I guess in my mind I read the flag as "cancel query" and think "what the heck is a cancel query, and why does it need a timeout?"

(also to be clear I don't want to delay this PR at all)

Copy link
Contributor

@sreuland sreuland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice integration, left some optional comments for consideration.

support/db/session.go Outdated Show resolved Hide resolved
return requestCtx, nil, requestCtx.Err()
}

if deadline.IsZero() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the caller context is dropped here, not propagated further, if the http client as caller were to close its connection early, middleware responds with ClientDisconnected http response anyways, but before making that response, will middleware still block on the call stack to finish first such as when these Session methods now continue to run longer to finish or hit new deadline? Will this possibly change log output behavior? Just trying to confirm for my own understanding, not suggesting change. I was looking for an integration test that exercises this case of an http request that leads into db and initiates an early client cancel but not seeing one so far.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the http client disconnects, the context attached to the http request in horizon would be canceled. If that happens before the query is made to the DB, horizon will not attempt to do the query and will just return with an error that eventually gets translated to ClientDisconnected in the response. If the request context is canceled during the query, horizon will still wait until the query finishes. If that was the last query in the horizon endpoint, horizon will respond with a 200 and populate the response body with the json payload. If there was another query required, the session would detect the request context was cancelled and would return an error that eventually gets translated to ClientDisconnected in the response.

Even without this change, it is still possible to get a 200 response when the http client disconnects. That case will occur if the client disconnects after the last db query finishes but before horizon written any http header / response

Copy link
Contributor

@sreuland sreuland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for incorporating feedback.

@tamirms tamirms merged commit 50677ee into stellar:release-horizon-v2.29.0 Mar 14, 2024
29 checks passed
@tamirms tamirms deleted the db-fixes branch March 14, 2024 10:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants