Publishing fails after seemingly random interval, requiring application restart #1810
Labels
api: pubsub
priority: p2
type: bug
After a period of successful message publishing, all publish requests begin timing out and keep timing out until the application is restarted. The interval before the failures start has been observed to be anywhere from 1 to 7 days.
The app runs within a container in GKE. It is a Node.js application built from a TypeScript project using the following versions:
There are 300 instances of our application running at all times, and each individual instance generates approximately 35 to 45 messages per second. The messages are split between three topics; the most heavily published topic receives up to approximately 35 messages per second per instance. The total message rate for all instances combined is between 11,000 and 13,000 messages per second.
Observed publication failures appear to impact only a single instance at random, and can occur half a dozen times or more throughout the day. It does not appear to be uptime-related, as some instances can run for a week before being impacted.
Once an application instance begins failing to publish messages, it fails all message publish requests until it is restarted.
The restart is triggered by k8s when the instance's memory consumption exceeds its maximum value. This can take 10 minutes or more, as message publishing attempts back up and become queued.
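As a possible stopgap for the slow restart described above, and not part of the attached source, an instance could count consecutive publish failures and report itself unhealthy so that a liveness probe restarts it well before the memory limit is hit. The following is only a minimal sketch under assumptions: `PublishFn`, `PublishHealthTracker`, and the threshold of 5 are hypothetical names and values, and the publish function stands in for whatever call the application actually makes to the Pub/Sub client.

```typescript
// Hypothetical stand-in for the application's real publish call.
// It resolves with a message ID and rejects on failure (for example, a timeout).
type PublishFn<T> = (payload: T) => Promise<string>;

// Tracks consecutive publish failures so a health endpoint can report
// when the instance appears to be stuck.
class PublishHealthTracker<T> {
  private consecutiveFailures = 0;

  constructor(
    private readonly publish: PublishFn<T>,
    // Assumed threshold; tune it to the real failure pattern.
    private readonly maxConsecutiveFailures: number = 5,
  ) {}

  // Publishes one payload, updating the failure counter either way.
  async send(payload: T): Promise<string> {
    try {
      const messageId = await this.publish(payload);
      this.consecutiveFailures = 0; // any success clears the streak
      return messageId;
    } catch (err) {
      this.consecutiveFailures += 1;
      throw err; // callers still see the original error
    }
  }

  // False once the failure streak reaches the threshold.
  isHealthy(): boolean {
    return this.consecutiveFailures < this.maxConsecutiveFailures;
  }
}
```

A liveness endpoint could return a non-success status whenever `isHealthy()` is false, letting k8s restart the container after a handful of failed publishes. This only shortens the outage; it does not address whatever causes the timeouts.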
The impact of a single application instance being in this state is that as many as 2,500 of our customers are unable to interact with their products until the instance they are communicating with is restarted. This is a significant degradation of service for these customers and is generating sizable support issues for us.
The errors reported by the Pub/Sub client all appear to indicate that the request has timed out.
"Error: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received."
Other than this being observed on our production applications, I don't have a specific set of circumstances which can be used to reproduce it.
I have included the TypeScript source file that is responsible for message publishing and error handling.