Preventing PubSub.Receive from preloading too many messages #1097
Labels
api: pubsub - Issues related to the Pub/Sub API.
priority: p1 - Important issue which blocks shipping the next release. Will be fixed prior to the next release.
🚨 - This issue needs some love.
type: feature request - ‘Nice-to-have’ improvement, new feature or different behavior or design.
Client
PubSub (cloud.google.com/go v0.25.0)
Describe Your Environment
GKE. Services are Docker images built from "scratch" containing a Go binary.
Expected Behavior
The Pub/Sub streaming client should have a configuration option that limits the number of messages pulled in for processing. We set MaxOutstandingMessages to control how many messages are processed concurrently, but the client floods the process with more messages than it can complete within 30 minutes. Individual messages are typically consumed within 0.5-5 seconds.
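For reference, this is roughly how we configure the subscriber today (paraphrased; project and subscription names are placeholders, and the MaxPrefetchMessages field in the comment is hypothetical, it does not exist in v0.25.0 and only illustrates the kind of knob we are asking for):

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}

	sub := client.Subscription("my-subscription") // placeholder subscription name

	// What we set today: at most 3 messages processed concurrently per instance,
	// with ack-deadline extension for up to 30 minutes.
	sub.ReceiveSettings.MaxOutstandingMessages = 3
	sub.ReceiveSettings.MaxExtension = 30 * time.Minute

	// What we would like: something that also caps how many messages the client
	// pulls off the stream ahead of processing, e.g. a hypothetical
	//   sub.ReceiveSettings.MaxPrefetchMessages = 3
	// (not a real field in v0.25.0; shown only to illustrate the request).

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// Each message typically takes 0.5-5 s to handle (indexing into Elasticsearch).
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```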
I'm hoping there is an easy answer to this that we are missing. To be clear, this is only an issue when processing a backlog of items. We're not perfect, and sometimes new data introduces an error that we have to fix in order to continue processing.
Actual Behavior
Upon starting our service, the process is flooded with StreamingPull operations, so many in fact that we cannot process them within the 30-minute MaxExtension window. Our processes then begin working on and acking messages that have already expired, causing a lot of duplicate work.
Below is a screenshot of our Stackdriver metrics demonstrating the problem. I was told by GCP support that the orange line on Acknowledge Requests means messages acked within the deadline, and the green line means messages acked outside of the deadline. Orange == good. Green == bad. (no comment)
At this time we were running 12 instances of our service, each with MaxOutstandingMessages set to 3. In particular, we are indexing into an Elasticsearch cluster, so we can't simply scale up beyond the capabilities of that cluster.
At the beginning of the period we are only acking messages outside of the extension window. Then we restart the services and start acking messages within the window; after 30 minutes we are back to only acking messages outside the window. You can see the effect on our Undelivered Messages count (queue size): it flatlines or climbs slowly during the periods when we are processing messages beyond the max extension window. This extra work adds a lot of load to our Elasticsearch cluster and makes it really hard to catch up from a backlog.
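As a rough back-of-the-envelope check (using the numbers above and assuming the ~5 s worst case per message), the most we can actually finish within one 30-minute extension window is:

```
12 instances × 3 concurrent messages × (1800 s / ~5 s per message) ≈ 12,960 messages per window
```

Anything the client preloads beyond roughly that number seems guaranteed to expire before we can get to it, which appears consistent with what we see in the metrics.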