Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publishing fails after seemingly random interval, requiring application restart #1810

Closed
nrheckman opened this issue Aug 30, 2023 · 2 comments
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@nrheckman
Copy link

After a period of successful message publishing, all requests will begin timing out until the application is restarted. Interval has been observed to be anywhere from 1 to 7 days.

App is running within a container in GKE. It is a nodejs application built from a Typescript project using the following versions:

# nodejs --version
v14.21.3

# npm --version
6.14.18

# head -n4 node_modules/\@google-cloud/pubsub/package.json 
{
  "name": "@google-cloud/pubsub",
  "description": "Cloud Pub/Sub Client Library for Node.js",
  "version": "4.0.1",

There are 300 instances of our application running at all times, and each individual instance generates approximately 35 to 45 messages per second. The messages are split between three topics, the most heavily published topic receives up to approximately 35 messages per second per instance. Total message rate for all instances combined is between 11,000 and 13,000 messages per second.
Observed publication failures appear to impact only a single instance at random, and can occur half a dozen times or more throughout the day. It does not appear to be uptime-related, as some instances can run for a week before being impacted.
Once an application instance begins failing to publish messages, it fails all message publish requests until it is restarted.
Restart gets triggered by k8s due to memory consumption exceeding its maximum value. This can take 10 minutes or more as message publishing attemps back up and become queued.

The impact of a single application instance being in this state is that as many as 2,500 of our customers are unable to interact with their products until the instance they are communicating with is restarted. This is a significant degradation of service for these customers and is generating sizable support issues for us.

The errors reported by the Pub/Sub client all appear to indicate that the request has timed out.
"Error: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received."

Other than this being observed on our production applications, I don't have a specific set of circumstances which can be used to reproduce it.

I have included the typescript source file, which is responsible for message publishing and error handling.

import {EventStreamPubSubConfig} from "../config";
import {eventStreamMetrics} from "../observability";
import {EventStream} from "./ievent_stream";
import {EventMessage} from "./event_message";

import * as bunyan from "bunyan";
import {PublishOptions, PubSub, Topic} from "@google-cloud/pubsub";

export interface EventStreamPubSubOptions {
    config: EventStreamPubSubConfig;
    log: bunyan;
    messageOrdering: boolean;
    topic: string;
    pubSub: PubSub;
}

export class EventStreamPubSub implements EventStream {
    private options: EventStreamPubSubOptions;
    private log: bunyan;
    private topic: Topic;

    constructor(options: EventStreamPubSubOptions) {
        this.options = options;
        this.log = options.log;
        this.topic = this.attachToTopic();
    }

    private attachToTopic(): Topic {
        const publishOptions: PublishOptions = {
            batching: {
                // We are intentionally processing 1 message at a time in queue::enqueue,
                // so we should tell the PubSub library to publish right away.
                maxMessages: 1
            },
            gaxOpts: {
                timeout: this.options.config.timeoutMillis
            },
            messageOrdering: this.options.messageOrdering,
        }

        if(this.options.config.retry.enabled) {
            publishOptions.gaxOpts!.retry = {
                backoffSettings: this.options.config.retry.backoffSettings
            }
        }

        this.log.info(`Creating publisher for topic ${this.options.topic} with options: ${JSON.stringify(publishOptions)}`)

        return this.options.pubSub.topic(this.options.topic, publishOptions);
    }

    async create(message: EventMessage): Promise<void> {
        const timer = eventStreamMetrics?.pubSubWriter.startTimer();
        let m = {
            data: Buffer.from(JSON.stringify(message.message)),
            orderingKey: message.orderingKey
        };
        return this.topic.publishMessage(m).then(
            (resp) => {
                this.log.trace("Publish successful: " + resp);
                timer?.({
                    result: "success",
                    topic: this.options.topic
                });
            },
            (reason) => {
                this.log.error("Publish failed: " + reason);
                timer?.({
                    result: "failure",
                    topic: this.options.topic
                });

                // Drop the current message and move on
                this.topic.resumePublishing(m.orderingKey);
            }
        );
    }
}
@nrheckman nrheckman added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Aug 30, 2023
@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Aug 30, 2023
@feywind
Copy link
Collaborator

feywind commented Sep 8, 2023

@nrheckman Can you try the latest version or try implementing the defaults from this PR in your PubSub constructor call?

#1814

We've been seeing a bunch of these and have a belief that it may be fixed by enabling the grpc keepalives.

@nrheckman
Copy link
Author

Looks like enabling GRPC keepalives fixed our problem!
Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

2 participants