Backlog spikes reported every hour #850

pjm17971 · 2020-01-12T20:42:52Z

We run many workers on GKE that essentially pull messages from a Pubsub subscription at a rate of about 120k/min, do some processing and write the result into Bigtable. As part of a recent change we upgraded this Pubsub client library from 0.29.1 to 1.2 and immediately started to see alerts.

What we started to see after this upgrade was spikes in the reported backlog (and oldest unacked message age). These happened hourly. However, our service appeared not to suffer, and continued to output its product at a steady rate.

Here is an overview of this upgrade as seen in Stackdriver (running Pubsub v1.2 highlighted in red, then after 10am Friday we reverted JUST the pubsub version and the process returned to normal):

Zooming into approx Friday 12am until noon, and showing backlog at the top and oldest message at the bottom:

It is pretty clearly something that happens every hour.

I know there's another Github issue for memory spikes, but at least as far as we can tell that's not the case for us. In fact, I don't think we actually saw a real impact on processing output. This assessment is based on: 1) we didn't see lag in out downstream client which is usually the case with actual backlogs and 2) we didn't see increase in worker cpu when the backlog recovered. The biggest problem is we use these alerts on this subscription as our main indicator that the process may be experiencing problems for some reason.

Environment details

OS: GKE
Node.js version: 12.x
npm version: 6.11.3
@google-cloud/pubsub version: 1.2

Steps to reproduce

Not sure we do anything special. Each worker instance creates a simple client which processes messages. We run this on GKE, with one node instance per pod. Approximately 64 workers are all pulling from the same subscription. We generally just ack messages regardless of successful processing because in this application it's ok to just drop them (a separate process will take care of it).

Hope this is helpful. Let me know if I can provide any additional data.

Thanks!

The text was updated successfully, but these errors were encountered:

feywind · 2020-01-13T22:22:24Z

@pjm17971 Hi Peter, we've been seeing a number of issues lately that involved upgrading the nodejs-pubsub version from one of the 0.x to the latest one. The common suggestion I've seen is to try switching to the C++ gRPC bindings, so you might give that a try.

const {PubSub} = require('@google-cloud/pubsub');
const grpc = require('grpc');
const pubsub = new PubSub({grpc});

(Or however you instantiate the PubSub object, you can provide grpc to replace its implementation.)

I'll look at this some more, but I wanted to give you something quick to try!

pmcnr-hx · 2020-01-14T16:36:52Z

We have seen similar behaviour when using the @grpc/grpc-node pure JavaScript module that is the default in >0.29.1 versions and I can confirm that this is solved, like suggested above, by opting in to the native bindings gRPC module instead.

pjm17971 · 2020-01-15T15:00:08Z

@feywind Thanks for the info. If we give it a try I'll report back here if it worked or not. Is there an issue on the grpc javascript side that we can track so we'll know when to switch this back out?

feywind · 2020-01-16T18:07:42Z

@pjm17971 I'm actually pretty new on the team, so I don't have a map of everything that's going on in my head yet :) but there have been a number of issues lately with gRPC vs grpc-js. I'll be on the lookout for a useful grpc issue for you.

@bcoe Do you know of one off the top of your head?

feywind · 2020-01-16T21:50:16Z

@pjm17971 If you have a chance, can you try turning on the GRPC debug flags with these environment variables?

GRPC_TRACE=all
GRPC_VERBOSITY=DEBUG

Specifically with the configuration described as problematic in the original post, if possible.

Thanks!

ctavan · 2020-02-19T09:03:02Z

Just chiming in to let you know that I'm experience the precise same behavior as @pjm17971 with a couple of subscribers running on GKE.

Here are the versions and the respective behavior that I have been observing:

2020-01-26  pubsub-0.31.1 grpc-js-0.5.2  --> No spikes
2020-02-17  pubsub-1.1.5  grpc-js-0.6.9  --> Spikes
2020-02-17  pubsub-1.5.0  grpc-js-0.6.16 --> Maxed out CPU on idle subscriptions (will report in separate issue)
2020-02-17  pubsub-1.1.5  grpc-js-0.6.9  --> Spikes (reverted due to above issues)
2020-02-18  pubsub-1.5.0  grpc-1.24.2    --> No spikes since then (using C++ grpc instead of grpc-js)

Providing the C++ grpc bindings to the PubSub() constructor as suggested by @feywind seems to be a viable workaround so far.

That said, with the pubsub-1.1.5 grpc-js-0.6.9 combo I occasionally got precisely this error crashing my node process: googleapis/nodejs-firestore#739

node[1]: ../src/node_http2.cc:892:ssize_t node::http2::Http2Session::ConsumeHTTP2Data(): Assertion `(flags_ & SESSION_STATE_READING_STOPPED) != (0)' failed.
 1: 0x9bcb80 node::Abort() [node]
 2: 0x9bcc07  [node]
 3: 0x9f253a node::http2::Http2Session::ConsumeHTTP2Data() [node]
 4: 0x9f2ec0 node::http2::Http2Session::OnStreamRead(long, uv_buf_t const&) [node]
 5: 0xad941b node::TLSWrap::ClearOut() [node]
 6: 0xadb878 node::TLSWrap::OnStreamRead(long, uv_buf_t const&) [node]
 7: 0xa71e66 node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) [node]
 8: 0x12baaa9  [node]
 9: 0x12bb0d0  [node]
10: 0x12c0c08  [node]
11: 0x12af54b uv_run [node]
12: 0xa00127 node::NodeMainInstance::Run() [node]
13: 0x993dd8 node::Start(int, char**) [node]
14: 0x7fc6d2cf42e1 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
15: 0x932cb5  [node]

@pjm17971 have you been observing these errors as well?

I was not able to reliably correlate these crashes with the spikes in unacked messages though.

Also, according to this grpc issue the crashes problem might have been fixed in grpc-js-0.6.13 which I haven't been able to verify yet, because at least grpc-js-0.6.16 seems to be causing a whole different range of problems… Will let you know if I find out more.

ctavan · 2020-02-19T09:37:18Z

I just realized that in the original issue description the spikes appear exactly each full hour. This is not the case in my setup. In my setup the spikes appear in irregular intervals around 2-5 times per day.

ctavan · 2020-02-21T19:20:20Z

Just a quick confirmation that the issue is gone for me since switching back to the C++ bindings. Need to wait for #890 until I can try out grpc-js again.

pjm17971 · 2020-02-21T19:47:45Z

@feywind Sorry I didn't reply, but kind of a non-update from my end: we went back to the C++ gRPC implementation and our process behaved fine. We couldn't reproduce easily in a non-production environment so we did not attempt to deploy again with the bad library version. We can try any new version of grpc-js again down the road if a fix is believed to be in place and give feedback as to if this works for us. Sounds like #890 kind of has this problem covered.

feywind · 2020-02-28T01:01:38Z

Thanks, everyone! There's a PR out now to update to the grpc-js that should fix the issues in #890. I'll update here again once that's merged, and an update to that nodejs-pubsub should pull it in.

ctavan · 2020-03-03T12:39:39Z

I have removed the C++ bindings again and upgraded to @grpc/[email protected] (which fixed #890 for me). I have not observed a single spike for 24h now, a situation I have never seen with @grpc/[email protected].

So for the time being I conclude that

@google-cloud/[email protected]
@grpc/[email protected]

is finally a stable setup for my use case again!

@pjm17971 I think the above versions are a setup that is worth trying out.

feywind · 2020-03-03T18:45:35Z

Thanks for the update! I'm working on getting the default grpc-js version pushed up to 0.6.18. There might be a next tag coming up that will pull in 0.7.0, but the timing on that is TBD.

feywind · 2020-04-23T17:25:34Z

grpc-js is up to 0.6.18 in the last few releases, and this issue hasn't seen much traffic lately, so I'm going to go ahead and close it. Please feel free to re-open/comment if necessary.

pjm17971 changed the title ~~Backlog spikes every hour~~ Backlog spikes reported every hour Jan 12, 2020

yoshi-automation added the triage me I really want to be triaged. label Jan 13, 2020

meredithslota assigned feywind Jan 13, 2020

feywind mentioned this issue Jan 27, 2020

Various problems seem to be happening in regards to grpc-js #868

Closed

google-cloud-label-sync bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Jan 30, 2020

ctavan mentioned this issue Feb 19, 2020

Subscriber maxes out CPU after exactly 60 minutes on GKE #890

Closed

feywind removed the external This issue is blocked on a bug with the actual product. label Feb 28, 2020

yoshi-automation added the 🚨 This issue needs some love. label Feb 28, 2020

feywind closed this as completed Apr 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backlog spikes reported every hour #850

Backlog spikes reported every hour #850

pjm17971 commented Jan 12, 2020

feywind commented Jan 13, 2020

pmcnr-hx commented Jan 14, 2020 •

edited

Loading

pjm17971 commented Jan 15, 2020

feywind commented Jan 16, 2020

feywind commented Jan 16, 2020

ctavan commented Feb 19, 2020

ctavan commented Feb 19, 2020

ctavan commented Feb 21, 2020

pjm17971 commented Feb 21, 2020

feywind commented Feb 28, 2020

ctavan commented Mar 3, 2020

feywind commented Mar 3, 2020

feywind commented Apr 23, 2020

Backlog spikes reported every hour #850

Backlog spikes reported every hour #850

Comments

pjm17971 commented Jan 12, 2020

Environment details

Steps to reproduce

feywind commented Jan 13, 2020

pmcnr-hx commented Jan 14, 2020 • edited Loading

pjm17971 commented Jan 15, 2020

feywind commented Jan 16, 2020

feywind commented Jan 16, 2020

ctavan commented Feb 19, 2020

ctavan commented Feb 19, 2020

ctavan commented Feb 21, 2020

pjm17971 commented Feb 21, 2020

feywind commented Feb 28, 2020

ctavan commented Mar 3, 2020

feywind commented Mar 3, 2020

feywind commented Apr 23, 2020

pmcnr-hx commented Jan 14, 2020 •

edited

Loading