Firestore: Deadline Exceeded, retrying doesn't help #499
Comments
I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.
Just to add my similar experience: I have a simple process (Ubuntu 18, Node 10, pm2 as process manager) that updates an object with document.set(..., { merge: true }) at a frequency of one document.set per minute (which I think can be considered low frequency). After some hours (8 to 10), I get a DEADLINE_EXCEEDED exception on every attempt. If I call process.exit (and pm2 restarts the process), the restarted process works without issue for hours. The set that fails with DEADLINE_EXCEEDED takes about 40 seconds to raise the exception, and the exception returns code 4.
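For context, a write like the one described above would look roughly like this with the Node Admin SDK (the collection, document, and field names here are made up for illustration):

  const admin = require('firebase-admin');
  admin.initializeApp();
  const db = admin.firestore();

  // Runs once per minute; merges the new field into the existing document.
  async function updateStatus() {
    await db.collection('devices').doc('device-1').set(
      { lastSeen: admin.firestore.FieldValue.serverTimestamp() },
      { merge: true }
    );
  }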
Any news on this? We were able to work around this by splitting our workload among 15 processes and running them on servers other than the one where we got this error. The worry is that when we restart the servers, this may start happening again. I tried making it self-heal by sending a data point to another server when deadline exceeded is received, but this requires writing to Firestore, and it just received deadline exceeded again.
@wilhuff can somebody from your team take a look?
Generally, a deadline exceeded error indicates that the RPC took too long to complete. Firestore sets a deadline of 60s, so I'd expect these errors to occur pretty much precisely 60s after the query was initiated. I suppose an exceptionally large query could cause this to occur, or perhaps your process was preempted for 60s(!). Based on what's been reported here, I wouldn't expect these operations to take that long though. There's a related issue here: #349. If you're able to reliably reproduce this issue, could you grab the logs by following the debugging instructions there, specifically:
(Suggestions courtesy of @schmidt-sebastian and @hiranya911). It's also worthwhile to check out https://firebase.google.com/docs/firestore/quotas to ensure you're within these limits. I think you should be getting a quota exceeded error (rather than deadline exceeded) if one of these is violated, so this probably isn't the cause. But perhaps a good sanity check regardless.
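For anyone trying to capture those logs, here is a minimal sketch of one way to turn on verbose output; it is offered as a general suggestion rather than a quote of the instructions in #349:

  // Enable debug logging in the server-side Firestore SDK.
  const { Firestore } = require('@google-cloud/firestore');
  Firestore.setLogFunction((msg) => console.log(`Firestore: ${msg}`));

  // Lower-level gRPC tracing can be enabled via environment variables when
  // starting the process, e.g.:
  //   GRPC_TRACE=all GRPC_VERBOSITY=DEBUG node app.js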
I am able to reproduce it reliably. Here's the log:
To others who are experiencing this issue: given what we've seen here, this doesn't appear to be an SDK bug, so there isn't much we can do about this here. This is likely data- or project-specific, so the best course of action is to file a direct support request with your project details and what you're trying to do. @ariofrio: The log essentially says that you started a write and while waiting for it, got a bunch of no-change messages for the listeners you have active. Finally, after almost exactly 60 seconds the server has timed out the write request. The update you're making is standard and tiny, and the SDK seems to be operating as expected. This should work. At this point I suspect that there's something specific about the data or indexes you have or something else internal to the backend that's affecting your project. If you're able to reproduce this reliably, the best thing you can do is package this up in a small self-contained example so that we can pass this along to a backend engineer with access to the data. The best way to get that to us is to file a support request so that people with the right permissions can dig into this and we can discuss project details and data privately. Some other things to include:
The system should reply with a case number. Send that to me at [email protected] and I'll make sure the right people see it.
@ariofrio We just received the report that you filed via Google's external support. I will ask the backend team to take a look at your project. In the meantime, do you have a rough estimate on how many active listeners you have? We currently only support 100 operations per GRPC channel, and if you have 100 active listeners, other requests on that channel cannot get through.
@schmidt-sebastian We currently have roughly 1,000 active listeners at a time, and we aim to be able to scale that up to 20,000 or more.
Any updates on this?
Not yet, I will keep you posted!
Ping. We're waiting on this to be resolved for a critical feature in our software.
Sorry for the long delay. The backend team was able to look at your project. While we were not able to find the root cause for your issues, we do see elevated latencies during the time window of your last test run (2019-04-13). We haven't seen any recent latency spikes. If you are able to re-run your tests and let us know the exact time window during which you encounter issues, we might be able to narrow it down further. Please note that it is likely that occasional latency spikes are not completely avoidable. To avoid the errors, you could increase the request timeouts for your
A request that finished with "deadline exceeded" was ongoing at Wed Apr 24 19:04:00 PDT 2019. Update: Increasing the "Listen", "UpdateDocument", and "Write" timeouts to
Thanks for the additional update. We are still looking into this with our backend team.
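For readers who want to experiment with the timeout workaround mentioned above, an override along these lines may work. Note that the clientConfig option and the interface/method names below are assumptions about how the underlying gax-generated client is configured, not settings quoted from this thread, and they may differ between SDK versions:

  const { Firestore } = require('@google-cloud/firestore');

  // Hypothetical override: raise the per-RPC timeouts from the default 60s
  // to 5 minutes for the methods discussed above.
  const db = new Firestore({
    clientConfig: {
      interfaces: {
        'google.firestore.v1.Firestore': {
          methods: {
            Listen: { timeout_millis: 300000 },
            Write: { timeout_millis: 300000 },
            UpdateDocument: { timeout_millis: 300000 },
          },
        },
      },
    },
  });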
I am seeing the same issue as ariofrio: the problem is that, after the first failure, every subsequent set fails with "deadline exceeded". My workaround is to restart the process. In my case the write frequency is really low (once per minute), and the document size and the size of the entire database are so small that the issue can't depend on them. It seems that the first "deadline exceeded" permanently corrupts the connection.
@FVilli Please send me your project id and some time intervals where you've been seeing this by email (to [email protected]) and I'll add them to the internal issue tracking this.
@wilhuff I gathered some more time intervals where this occurred. I sent them by e-mail.
Is there an update to this issue?
We're still working on trying to understand why this is happening. It's not coming easily, unfortunately :-(.
You are likely running into an internal limitation of how many RPCs can be active on a channel (aka connection). As a workaround (until the right limit is found and adjusted) you can remove onSnapshot listeners when you no longer need them, and make sure you never have more than 100 of them active. You can see this behavior with a minimal example:
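A sketch of such a minimal example might look like this (not the exact snippet from this comment; collection and document names are arbitrary):

  const admin = require('firebase-admin');
  admin.initializeApp();
  const db = admin.firestore();

  async function main() {
    for (let i = 1; i <= 150; i++) {
      const ref = db.collection('repro').doc(`doc${i}`);

      // Attach a listener that is never unsubscribed, so the number of
      // concurrent Listen RPCs keeps growing with every iteration.
      ref.onSnapshot(
        () => {},
        (err) => console.error(`listener ${i}:`, err.message)
      );

      try {
        await ref.set({ index: i });
        console.log(`wrote document ${i}`);
      } catch (err) {
        // Once roughly 100 listeners are active, writes start failing with
        // code 4 (DEADLINE_EXCEEDED).
        console.error(`write ${i} failed:`, err.message);
      }
    }
  }

  main();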
This will print deadline exceeded errors starting with document 102 (101 active snapshot listeners):
I don't know why it would be 101 and not 100 active snapshot listeners, however.
Hrm. What's supposed to be happening here is that the Node SDK pools the clients and creates new ones once the number of active RPCs exceeds the limit of 100. Something's not working right in this case. I'll look into this. @ariofrio Is there any chance that the ever-growing list of listeners is not intended? If so, closing listeners when you don't need them anymore will address this in the short term while we figure out what's going on with our pooling logic.
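For the short-term mitigation mentioned above: onSnapshot returns an unsubscribe function, so closing a listener looks like this (the document path is just an example):

  const admin = require('firebase-admin');
  admin.initializeApp();
  const db = admin.firestore();

  // onSnapshot returns a function that detaches the listener and frees its
  // slot on the underlying gRPC channel.
  const unsubscribe = db.collection('jobs').doc('job-1').onSnapshot((snap) => {
    console.log('current data:', snap.data());
  });

  // Later, when the listener is no longer needed:
  unsubscribe();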
I have one process that registers (only) 4 onSnapshot listeners.
Fixes firebase/firebase-admin-node#499

Previously _initializeStream returned a Promise that indicated that the stream was "released", i.e. that it was ready for attaching listeners. #256 added pooled clients and changed the callers of _initializeStream to reuse this promise such that when it was resolved, the stream could be returned to the pool. This works when listeners are short-lived, but fails when listeners run indefinitely.

This change arranges to release the clients back to the pool only after the stream has completed, which allows an arbitrary number of indefinite listens to run without problems.

This turns out to be fiendishly difficult to test given the current structure of the code. A second pass at this that reformulates this as just another stream that composes with the others would make this easier to understand and test. For now, this fix unblocks the customers waiting on the referenced issue.
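As a generic illustration of the lifecycle change described in that commit message (this is not the SDK's actual pooling code): the problem was releasing a pooled client as soon as its stream was ready, rather than when the stream actually finished.

  const { PassThrough } = require('stream');

  const checkedOut = new Set();   // clients currently lent out by the pool

  function openListenStream(client) {
    checkedOut.add(client);
    const stream = new PassThrough({ objectMode: true });

    // Before the fix (broken for long-lived listeners): the client was
    // released here, while the listen was still running.
    //   checkedOut.delete(client);

    // After the fix: release only once the stream has completed or failed.
    stream.on('end', () => checkedOut.delete(client));
    stream.on('error', () => checkedOut.delete(client));

    return stream;
  }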
The Node 6 deprecation has made the timing of releasing this fix somewhat challenging, so my best advice right now is that if this fix is critical to you, you should pin to the commit that implements the fix. If you're using firebase-admin, your dependencies would look something like this:

  "dependencies": {
    "@google-cloud/firestore": "googleapis/nodejs-firestore#479bc9c4847cc2a5632a266da706b349a1b74a41",
    "firebase-admin": "^7.3.0"
  }

Note that order matters, as far as
A large number of long-lived listeners is the intended behavior of our application. Confirmed that this fixes our issue.
I am noticing this error in my AWS::CloudWatch logs for production on an AWS Lambda function via API Gateway. This is marked as closed but the issue is still happening in 2020.

  {
    "code": 4,
    "details": "Deadline exceeded",
    "metadata": {
      "internalRepr": {},
      "options": {}
    }
  }

This issue is discussed on StackOverflow as well.
I am going to use AWS::SQS to place the write requests into a queue, as it's important that these don't fail.
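One way the queueing approach described above might look inside a Lambda handler; the environment variable and payload shape are assumptions for illustration:

  const AWS = require('aws-sdk');
  const admin = require('firebase-admin');

  admin.initializeApp();
  const db = admin.firestore();
  const sqs = new AWS.SQS();

  async function writeOrEnqueue(docPath, data) {
    try {
      await db.doc(docPath).set(data, { merge: true });
    } catch (err) {
      // gRPC code 4 is DEADLINE_EXCEEDED; push the write onto a queue so a
      // separate consumer can retry it later instead of dropping it.
      if (err.code === 4) {
        await sqs.sendMessage({
          QueueUrl: process.env.WRITE_QUEUE_URL,
          MessageBody: JSON.stringify({ docPath, data }),
        }).promise();
      } else {
        throw err;
      }
    }
  }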
[REQUIRED] Step 2: Describe your environment
[REQUIRED] Step 3: Describe the problem
We are running a server on AWS (though this also reproduces locally on a MacBook Pro). It makes a series of (mostly) sequential operations on Firestore, including onSnapshot, get, set, and delete. When we run it with a reduced data set, it works well. However, when a larger data set is used, we get the DEADLINE_EXCEEDED error after a long delay (a minute or more). Retrying when this error occurs doesn't fix the problem; the next call also takes a long time and throws DEADLINE_EXCEEDED.
Sometimes a new data point comes in and it is processed in parallel (they are mostly batched, so mostly sequential). The new data point operates on a different Firestore document than the data point that keeps failing. Surprisingly, the new data point is processed successfully on the first try, while the data point that keeps failing continues failing over and over.
Relevant Code:
It usually happens on this call: