Allow push queries to be streamed for more than ten minutes at a time #6970
Comments
A common approach for clients to prevent idle connections from being timed out is to send a "ping" request (which does nothing) every few seconds.
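To illustrate, here is a minimal client-side heartbeat sketch in Java. It assumes the server exposes a cheap endpoint (such as ksqlDB's `/info`) and that the HTTP client multiplexes the ping over the same underlying connection as the streaming query, as HTTP/2 clients typically do:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatExample {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        // A cheap request that exercises the same connection; /info is
        // used here as an example of a lightweight server endpoint.
        HttpRequest ping = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8088/info"))
                .GET()
                .build();

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Fire a no-op request every 30 seconds so the server never sees
        // the connection as idle.
        scheduler.scheduleAtFixedRate(() -> {
            client.sendAsync(ping, HttpResponse.BodyHandlers.discarding())
                  .exceptionally(t -> null); // ignore transient failures
        }, 30, 30, TimeUnit.SECONDS);
    }
}
```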
A workaround that can be used in the meantime (courtesy of @AlexCasdu): https://groups.google.com/g/ksql-users/c/iB6mo1tHgn0/m/ARfID-RoAgAJ
Moving to 0.19, we'll need to tackle this as part of #7203
As far as I can tell from the code, and based on the reports and the proposed workaround, the problem just seems to be the idle timeout.

We can certainly add a heartbeat mechanism to the Java client, but that doesn't solve the general problem for users of the server APIs: each API caller would have to set up their own heartbeat mechanism.

Alternatively, we could just increase the server's idle connection timeout, possibly to a large number. However, this defeats the server's ability to close connections that aren't being used. Maybe this is ok; if clients actually hang up, the socket itself will close, closing the connection, so this only affects clients that are still holding an open socket but not using it. If we think that the number of client connections to ksqlDB is limited, then perhaps closing idle connections doesn't actually provide much value. From my armchair, it seems like the number of ksqlDB client connections should be limited: this isn't like a public web server; only internal hosts should be connecting to a database.

Another approach is to let the server send "pings" back over the response channel when it's working on a long-running response. For example, when handling a request, the server could set up a timer to send a null byte back over the response channel once a minute or so. That way, truly idle connections (no requests) would still get closed, while active connections that have slow responses or low data volume would stay open. The downside of this approach is that we have to pick a "ping" datum that doesn't break the response itself. E.g., is it actually ok to insert random null bytes into a JSON response stream? If not, then what data, if any, is ok to insert?

Given the relative ROI of each approach and the known-ness of the downsides, I'm leaning toward a larger idle timeout, for example 24 hours, possibly with a server config property so people can set it to anything they like. Any objections?
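For the larger-idle-timeout option, a rough sketch of what the knob looks like at the Vert.x layer (which ksqlDB's API server is built on). This is illustrative only, not the actual ksqlDB server setup code:

```java
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpServerOptions;
import java.util.concurrent.TimeUnit;

public class IdleTimeoutSketch {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        HttpServerOptions options = new HttpServerOptions()
                // Close connections that carry no traffic for 24 hours.
                // A server config property could feed this value instead
                // of hard-coding it.
                .setIdleTimeout(24)
                .setIdleTimeoutUnit(TimeUnit.HOURS);
        vertx.createHttpServer(options)
             .requestHandler(req -> { /* handle request */ })
             .listen(8088);
    }
}
```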
This assumption is probably valid today. What would be a number of clients large enough to actually cause an issue here? I'm wondering, if we start supporting IoT-style use cases with push queries, whether 1000s of clients will be an issue (cc @AlanConfluent). Otherwise, I'm fine with the suggestion - increasing the idle timeout seems like a high-ROI suggestion to me.
Thanks, @agavra. I should clarify my statement: I don't think it's a good idea to open up any database for connections from the internet, regardless of the use case. For example, an IoT setup that distributes database credentials to all the devices and then lets them connect directly to the database seems dubiously wise to me. I'd suggest such a system set up a web server with the appropriate technology to accept connections from the IoT fleet, and then itself form a connection to the database for persistence/queries. That's also not a trivial job, but there are managed offerings specifically for that use case (such as AWS API Gateway). I think it could be a cool offering if we wanted to take KSQL in a Firebase-like direction, but there's a huge pile of work we'd have to tackle to get there, so for now I think we can lean on the assumption that ksqlDB sits in a similar place in an application architecture as a traditional database.
yeah that makes sense, don't know what I was thinking. +1 to the suggestion
Follow-up: while testing my proposed fix, I noticed that we actually do write a newline character about once a second when you're using the HTTP/1.1 endpoint. Its purpose is actually to verify that the connection is still open. So, it seems like this issue only affects the HTTP/2 API.

I spent some time digging around in the code, but the async code in there makes it hard to find a good place to add a similar "ping" operation. We should also consider that we need to keep alive not only the client/server connection, but also the server/server connections within the cluster. (Although it's likely that the nodes in the cluster will wind up chatting with each other frequently enough that we won't time out in practice.)

The most general place to insert these pings would be at the actual data source. Here is a POC of this approach: #7555
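As a rough illustration of that ping idea at the response layer (not the actual code in #7555), a Vert.x sketch that assumes a bare newline is a safe separator for the newline-delimited format, per the HTTP/1.1 trick above:

```java
import io.vertx.core.Vertx;
import io.vertx.core.http.HttpServerResponse;

public class ResponsePingSketch {
    // Called when a long-running streamed response begins.
    static void startKeepAlive(Vertx vertx, HttpServerResponse response) {
        // Write a bare newline once a minute; for newline-delimited JSON
        // this is a no-op between records, which is the same trick the
        // HTTP/1.1 endpoint already uses.
        long timerId = vertx.setPeriodic(60_000, id -> {
            if (!response.ended()) {
                response.write("\n");
            }
        });
        // Stop pinging once the response is done.
        response.endHandler(v -> vertx.cancelTimer(timerId));
    }
}
```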
Ok, here's the config PR: #7556

@agavra (and maybe @AlanConfluent), now that there is a POC of the "ping" approach (#7555), what's your sense of whether we really want the config, or whether we should double down on the ping approach?
This is not really true. TCP is a reliable protocol and will not close a connection if the two ends of the connection stop being able to communicate - TCP assumes this might be some transient network issue that it can recover from. If the client machine suddenly disappears (the OS hard crashes or the client machine loses power), or someone pulls the network cable to the client, the server will have no way of distinguishing that from a transient network failure, and the connection will not be closed until it eventually times out. This is why pinging, or a short idle timeout, is important to prevent lots of dead connections hanging around. (Failure to detect dead connections via pinging, or overly long idle timeouts, can also be exploited by bad clients to create denial-of-service attacks.) Most HTTP servers, for example, have very short idle timeouts (seconds, or maybe a small number of minutes). That's not appropriate for long-lived connections such as a push query's, so pinging seems the only sensible option imho.
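A small Java sketch of why TCP alone doesn't close such connections promptly: OS-level keepalive probes exist, but with common default settings they fire only after roughly two hours of silence, far too slow for shedding dead connections:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class KeepAliveSketch {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("localhost", 8088));
            // TCP-level keepalive probes are sent by the OS, typically
            // only after ~2 hours of silence by default, so they don't
            // help a server detect a vanished peer quickly.
            socket.setKeepAlive(true);
            // Application-level pings (or a short idle timeout) are what
            // actually detect a peer that disappeared without a FIN/RST.
        }
    }
}
```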
Is your feature request related to a problem? Please describe.
Currently, a single push query may be streamed for a maximum of ten minutes at a time, after which the server closes the connection. ksqlDB should provide a way for users to stream push queries for longer if desired, while still not allowing disconnected clients to hold server resources for unreasonable amounts of time.
Describe the solution you'd like
It may make sense to allow users to specify that certain queries should have longer timeouts (see the sketch after the code references below).
Details here need to be thought out more. There are two places today where the 10-minute timeout is enforced on the server side:
- `ksql/ksqldb-rest-app/src/main/java/io/confluent/ksql/api/server/Server.java` (line 295 at commit 9988e6a)
- `ksql/ksqldb-rest-app/src/main/java/io/confluent/ksql/rest/server/resources/streaming/QueryStreamWriter.java` (line 42 at commit 9988e6a)
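Here is a rough sketch of the per-query override idea mentioned above; the property name and lookup are hypothetical, purely to make the idea concrete:

```java
import java.time.Duration;
import java.util.Map;

public class PushQueryTimeoutSketch {
    private static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(10);

    // Resolve the streaming timeout for a query, letting a per-query
    // property override the default. The property name is hypothetical.
    static Duration timeoutFor(Map<String, Object> queryProperties) {
        Object override = queryProperties.get("ksql.query.push.timeout.seconds");
        return override == null
                ? DEFAULT_TIMEOUT
                : Duration.ofSeconds(Long.parseLong(override.toString()));
    }
}
```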
Additional context
Example community request for this: https://groups.google.com/g/ksql-users/c/iB6mo1tHgn0/m/dEzpEDHwAAAJ