queries are A LOT slower after losing and re-establishing a connection to zookeeper #183
Maybe what you are seeing is due to the time it takes for a ZooKeeper server to be ready to respond to its clients' requests. A ZK server can take a significantly long time to be ready. It is not just a matter of starting the server. During this time its networking is running, so at the low level clients can connect. They don't get a response, though, until the server is ready. The time it takes a ZK server to come up to speed gets longer the longer the cluster has been in operation, because it replays the ZK log. There is also a leader election. One approach I have seen is to add a server to the cluster, making N+1, then move a server out (e.g. for maintenance). The cluster can still operate. Simply restarting a server drops the cluster below the critical threshold, which causes knock-on effects.
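To illustrate the point about clients connecting before the server is ready (a generic Curator example, not code from rdf-delta): Curator's `blockUntilConnected()` waits for the full ZooKeeper session handshake rather than just the TCP connection, so a client can avoid issuing requests while the server is still replaying its log or electing a leader.

```java
// Illustration only: wait for the ZooKeeper session to be fully established,
// not merely for a TCP connection to be accepted.
import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;

public class WaitForZkSketch {
    public static boolean waitReady(CuratorFramework client) throws InterruptedException {
        client.start();
        // Wait up to 60s for the session handshake to complete; returns false on timeout.
        return client.blockUntilConnected(60, TimeUnit.SECONDS);
    }
}
```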
I've seen the slowness regardless of whether I wait several minutes or test immediately after the zookeeper server is back up and running. Also, at initial startup, I don't see the slowness at all. It only occurs AFTER the rdf-delta server has connected to zookeeper, then loses that connection and re-establishes it. No amount of waiting changes the slowness of the responses from what I'm seeing.
Why did the zookeeper server restart? I'm not sure what is going on there - I don't know which part of the system is pausing. Is there always a pause? Is it always the same length of time? Maybe there is a stale TCP connection (not properly shut down). The patch server doesn't know the original zookeeper has gone until usage is attempted and then fails. When the Fuseki server makes a query - is it a query or an update? Updates involve more ZK interaction.
I'm not exactly sure. Regardless, the main point is that the connection to zookeeper is lost for one reason or another.
After a connection to zookeeper is re-established, yes. After that connection is re-established and then the entire rdf-delta server is restarted, no. A full restart of the rdf-delta server always fixes the issue.
It seems to be. I was clocking the times to complete the simple ask query that I mention in the original post at 10 seconds very consistently.
There isn't a timeout on the rdf-delta side; it does complete eventually, it just takes a very long time to complete after the zookeeper connection is re-established.
It is a query. The only request sent to rdf-delta is the log description RPC call to determine whether any new patches are available. To me, all of the evidence I'm seeing points to the issue being inside the rdf-delta server somewhere (or an API that it is using in the curator libs), especially given that the only way I've found to resolve the issue is to restart the rdf-delta server itself. I wonder if there are connections that, when closed, aren't actually disposed of, or something like that? I really have no idea though. My knowledge of interacting with zookeeper, let alone via curator, is extremely limited.
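For illustration only (an assumed pattern, not necessarily what rdf-delta does): Curator lets a client register a `ConnectionStateListener`, which reports `SUSPENDED`/`LOST`/`RECONNECTED` transitions, rather than the client only discovering a dead session when the next request fails.

```java
// Assumed sketch: be told about lost/re-established ZooKeeper connections
// instead of discovering them when a request fails.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

public class ConnectionWatchSketch {
    public static void watch(CuratorFramework client) {
        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                // SUSPENDED/LOST on disconnect, RECONNECTED once the session is back.
                System.out.println("ZooKeeper connection state: " + newState);
            }
        });
    }
}
```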
Loss of a ZK server is supposed to be a rare event. There are several 10-second time-related constants in Zk.createCuratorClient:
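(The original snippet is not reproduced above; the following is only a sketch of how a Curator client is typically built with such 10-second constants, not the actual `Zk.createCuratorClient` source. The exact values and retry policy are assumptions.)

```java
// Hypothetical sketch of a Curator client built with 10-second timeouts.
import org.apache.curator.RetryPolicy;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkClientSketch {
    public static CuratorFramework create(String connectString) {
        RetryPolicy retry = new ExponentialBackoffRetry(1000, 3); // base sleep 1s, 3 retries (assumed)
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString(connectString)
                .sessionTimeoutMs(10_000)      // assumed 10s session timeout
                .connectionTimeoutMs(10_000)   // assumed 10s connection timeout
                .retryPolicy(retry)
                .build();
        client.start();
        return client;
    }
}
```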
I can say that the loss of connection is definitely rare for us, but the few times where it has happened, it has created some serious issues. Those 10s constants are a good point. That could definitely point us in the right direction.
Also, not all of the zookeeper instances have gone down when the loss of connection occurs; it seems that it will happen if only one of our (more than two) instances drops... I'll have to double check though.
Speaking of those timeouts, it would be really nice to be able to customize them via dcmd.
I'm not sure what that means.
Sorry, I mean that we have more than 2 zookeeper servers running and that if one of them is "lost" momentarily, then this seems to happen. |
I hope you have at least 3! You do have a SPOF at the single patch log server. As these are locally stateless, you won't lose data but can have a service interruption for updates.
So far: some sort of pause when the ZK server changes is to be expected. When you say "and will continue to take a long time to complete" - any queries in the switchover window will have a delay. Are queries issued serially or overlapping? If a sequence of queries is sent, waiting for each to complete before sending the next, do queries 2 onwards experience the delay? Even after several minutes? (This is probing the workings of the Curator/ZooKeeper libraries.)
Ha. Yes, we do.
That is understood.
In our test and production environments, they are overlapping. However, when I did my local testing, I did a single query at a time and waited for a response before submitting another, and got the same results (i.e. the 10-second delay in the responses).
As we do run into the same problem, we did some tests and came to the following conclusions:
The tests were done with queries issued serially and random intervals between the queries.
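For reference, a minimal sketch of that kind of serial test, timing a trivial ASK query against a Fuseki endpoint with random pauses between requests. The endpoint URL and query here are placeholders, not values from this issue.

```java
// Sketch: serial, timed SPARQL ASK queries with random pauses between them.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ThreadLocalRandom;

public class SerialQueryTimer {
    public static void main(String[] args) throws Exception {
        String endpoint = "http://localhost:3030/ds/sparql";   // placeholder endpoint
        String query = "ASK { ?s ?p ?o }";                      // placeholder ASK query
        HttpClient http = HttpClient.newHttpClient();
        for (int i = 0; i < 10; i++) {
            HttpRequest req = HttpRequest.newBuilder(URI.create(endpoint))
                    .header("Content-Type", "application/sparql-query")
                    .POST(HttpRequest.BodyPublishers.ofString(query))
                    .build();
            long start = System.nanoTime();
            HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("query " + i + ": HTTP " + resp.statusCode() + " in " + ms + " ms");
            // Wait for each response before sending the next, with a random gap.
            Thread.sleep(ThreadLocalRandom.current().nextLong(500, 5000));
        }
    }
}
```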
Hi @oliverbrandt -- thank you very much for the information; it helps to have another environment reporting. In what environment are you running? Is it a cloud platform?
May be relevant: Apache Curator 5.4.0 has just been released. It includes an upgrade of the ZooKeeper dependency from 3.6.3 (as in RDF Delta 1.1.2) to 3.7.1.
The tests to determine the effect of different timeouts were done locally. |
@bsara @oliverbrandt -- In #154, the use of a load balancer confused ZooKeeper. Do either of you use load balancers in front of the ZooKeeper servers? Azure question: what is being used for the object blob storage? #154 uses MinIO in front of the Azure Blob Store in a k8s environment. It's probably unrelated, but it's always useful to know about the deployment setup.
We did configure the components to directly address the ZKs instead of going through a service. For the blob storage, MinIO is used as a frontend to the Azure Blob Storage. |
@bsara's deployment does not have a load balancer in front of zookeeper. |
In the tests that I conducted on my local machine, I was seeing query speeds of ~50ms go to no better than 10 seconds for the same query after re-establishing a lost connection to zookeeper.
Setup
Steps to Reproduce
RDF Delta Logs
I've added labels to the below log to indicate when each step described above (except for step 1) was performed during the logging.