Support custom subscription name for Pub/Sub health check #330

Merged
merged 21 commits into from
Mar 24, 2021

Contributor

@patpe patpe commented Feb 21, 2021

First, the main issue that I want to address can be seen by looking at https://github.com/Predictly/gcp-pubsub-health, where Kubernetes puts the pod in a restart loop because it fails to respond to /actuator/health.

Here are my thoughts that went into designing a solution:

  1. Started by investigating adding Pub/Sub state to PubSubTemplate, PubSubSubscriberTemplate or PubSubPublisherTemplate, but decided not to go down this route since it would add code and variables that have nothing to do with the main flow of these classes.
  2. Decided to handle the issues I saw by pulling messages async in the existing PubSubHealthIndicator, which makes it possible to return an unknown state during the health check even if the application is under heavy load.
  3. Added support for customizing both the health check subscription and the timeout, according to @elefeint's initial ideas.

My concerns related to this could be mitigated by documenting how these attributes in the new PubSubHealthIndicatorProperties should NOT be used

I have tested my code in the gcp-pubsub-health application, and it removes the issue with Kubernetes restarting the application due to liveness timeouts. Under load, it responds based on the timeout configured in the new property spring.cloud.gcp.pubsub.health.timeout-millis with the following:

"pubSub": {
  "status": "UNKNOWN",
  "details": {
    "error": "java.util.concurrent.TimeoutException: null"
  }
}

If this looks good I can go ahead and update the docs as well. First time contributing so any feedback is welcome.

@ttomsu ttomsu left a comment

Mostly looks good - great work for a first pass!

Left a comment for @meltsufin / @elefeint re: which return codes we want to use for up/down signaling.


public PubSubHealthTemplate(PubSubTemplate pubSubTemplate, String subscription,
long timeoutMillis) {
super();

Not necessary to invoke super() here.

if (t instanceof ApiException) {
ApiException aex = (ApiException) t;
Code errorCode = aex.getStatusCode().getCode();
if (errorCode == StatusCode.Code.NOT_FOUND || errorCode == Code.PERMISSION_DENIED) {

Hmm... Given that we're now encouraging "bring your own subscription to use for health checks", do we really want to say that a not found is ok? Seems to me like this would mask a misconfiguration error, and the error would be swallowed.

Member

I don't know if we want to require users to bring their own subscription. This certainly would need to be documented well to clear up any confusion about why setting a custom subscription for the health check is optional.

That being said, if the user provides a subscription, NOT_FOUND or PERMISSION_DENIED would indeed be error conditions.

Contributor

I agree that custom subscription should be optional -- for backwards compatibility, and for users who can't create a dedicated subscription due to permission structure.

So there are two paths through the healthcheck, which I'd probably handle as follows:

  1. a random subscription expected to return NOT_FOUND or PERMISSION_DENIED. Any other exception would signal downtime.
  2. a custom subscription, for which the expected "up" signal is not getting any exception. Either receiving messages or pulling an empty batch would be "up". But getting not found OR permission denied should be unexpected. My reasoning is that if a team/org has enough permissions to create a dedicated subscription, they are likely able to set permissions appropriately.

Contributor

The if/else will become even more unwieldy with my suggestions, btw, so you may want a helper method.
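
To make the two paths described above concrete, here is a minimal sketch of what such a helper might look like. It is an illustration only, not the code from this PR: the `Code` enum is a stand-in for `com.google.api.gax.rpc.StatusCode.Code`, the class and method names are hypothetical, and it assumes that a pull which raises no error counts as "up" on either path. Timeouts are left out of this sketch.

```java
// Hypothetical helper; names and types are illustrative, not the PR's code.
final class HealthCheckOutcome {

    // Stand-in for com.google.api.gax.rpc.StatusCode.Code (assumption).
    enum Code { NOT_FOUND, PERMISSION_DENIED, UNAVAILABLE }

    enum Status { UP, DOWN }

    private HealthCheckOutcome() { }

    /**
     * Classifies the result of a health check pull.
     *
     * @param customSubscription true if the user configured a subscription
     * @param errorCode the error code of the failed pull, or null if the pull succeeded
     */
    static Status classify(boolean customSubscription, Code errorCode) {
        if (errorCode == null) {
            // No exception: messages or an empty batch both mean "up".
            return Status.UP;
        }
        if (customSubscription) {
            // With a dedicated subscription, every error is unexpected.
            return Status.DOWN;
        }
        // Random subscription: NOT_FOUND / PERMISSION_DENIED prove the backend answered.
        return isExpectedForRandomSubscription(errorCode) ? Status.UP : Status.DOWN;
    }

    static boolean isExpectedForRandomSubscription(Code code) {
        return code == Code.NOT_FOUND || code == Code.PERMISSION_DENIED;
    }
}
```

Keeping the "expected error" test in its own method is one way to keep the if/else in the indicator short.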

@dzou
Contributor

dzou commented Feb 22, 2021

Would it be better to keep the custom subscription as a field of the PubSubHealthIndicator or introduce a new wrapper class like in here (via PubSubHealthTemplate)?

On my first pass I thought keeping the custom subscription in the PubSubHealthIndicator as an optional string would be simpler to avoid introducing a new class, but am curious to hear everyone's thoughts.

Member

@meltsufin meltsufin left a comment

I don't see much benefit in making PubSubHealthTemplate part of our public API.
It can either be subsumed under PubSubHealthIndicator, or just made package-private. I would be fine with either.


try {
this.pubSubTemplate.pull("subscription-" + UUID.randomUUID().toString(), 1, true);
future.get(this.pubSubHealthTemplate.getTimeoutMillis(), TimeUnit.MILLISECONDS);
Member

Wouldn't you want builder.up() after this line?

long timeoutMillis) {
super();
this.pubSubTemplate = pubSubTemplate;
this.subscription = subscription;
Member

This subscription should be validated as existing as early as possible. (Fail fast)

@patpe
Contributor Author

patpe commented Feb 24, 2021

Continued working on this and got to a point where it started to get a bit confusing. Consider the following scenario:

PubSubHealthIndicatorAutoConfiguration#pubSubHealthContributor is called with a map of more than one PubSubTemplate and the properties contains a specified subscription. What does this setup mean?

Will it be possible to have several PubSubTemplate beans each pointing to different projects? If so, how should we use the specified subscription? Throw IllegalArgumentException if a subscription is specified and there is more than one PubSubTemplate bean?

Validate in PubSubHealthIndicatorAutoConfiguration#pubSubHealthContributor which of the PubSubTemplate beans can read from the specified subscription, let the rest pull from UUID.randomUUID()?

@meltsufin
Member

@patpe One thing to note is that the subscription name could be relative or fully-qualified, as in projects/<project_name>/subscriptions/<subscription_name>. If a relative one is used, it will automatically be routed to the project of the PubSubTemplate bean. Otherwise, the user is explicitly asking to connect to a specific project, which will work seamlessly regardless of which project is defined for the specific PubSubTemplate bean.
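
For readers following along, the relative versus fully-qualified distinction can be sketched roughly as below. This is a hypothetical helper written for illustration, not the library's actual resolution code; the class name, method name, and `templateProjectId` parameter are assumptions.

```java
// Hypothetical illustration of relative vs. fully-qualified subscription names.
final class SubscriptionNames {

    private SubscriptionNames() { }

    /**
     * Returns a fully-qualified name. A relative name is placed in the
     * project of the template; a fully-qualified name is returned unchanged.
     */
    static String resolve(String subscription, String templateProjectId) {
        if (subscription.startsWith("projects/")) {
            // Expect exactly: projects/<project_name>/subscriptions/<subscription_name>
            String[] parts = subscription.split("/");
            if (parts.length != 4 || !"subscriptions".equals(parts[2])
                    || parts[1].isEmpty() || parts[3].isEmpty()) {
                throw new IllegalArgumentException(
                        "Malformed subscription name: " + subscription);
            }
            return subscription;
        }
        // Relative name: route it to the template's own project.
        return "projects/" + templateProjectId + "/subscriptions/" + subscription;
    }
}
```

With this shape, a relative name follows whichever project the template uses, while a fully-qualified one always points at the project the user named.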
So, I don't really see much of an issue here.

@patpe
Contributor Author

patpe commented Feb 27, 2021

@meltsufin I understand your point. My concern is the scenario where multiple PubSubTemplate beans are configured with separate CredentialsProvider instances that ultimately do not refer to the same project. This is, in theory, possible, and it is a likely reason for having more than one PubSubTemplate bean. Think of a proxy service that reads from one project and writes to the project it is itself running in, across organizations or departments in a business.

Which of the PubSubTemplate beans (i.e. GCP projects) should the health check assume that the configured subscription is in? I will go down the route of validating that a configured subscription is available in all configured PubSubTemplate beans and fail to start if it does not exist in one of them. This could/should then be documented as a limitation: a configured subscription has to exist in all projects, and if this is not possible to set up, the user should leave the subscription out, in which case the health check will pull from a randomly generated subscription.

@patpe
Contributor Author

patpe commented Feb 27, 2021

Updated PR which now supports the following

  1. User can configure a custom subscription and timeout for health check
  2. If specified, the subscription has to exist in all available PubSubTemplate or the application will not start (BeanInitializationException will be thrown)
  3. Health checks against PubSub done async and expected to return response within specified timeout (1000ms default)
  4. If a custom subscription has been specified, the async pull against pubsub will signal unknown for TimeoutException and down for any other exception
  5. If no custom subscription has been specified, the async pull is done against a random subscription name and will signal ok for NOT_FOUND and PERMISSION_DENIED in ApiException, unknown for TimeoutException and down for anything else

Member

@meltsufin meltsufin left a comment

Was the intent to make the health-check async? If so, currently it doesn't seem to be doing it because of the future.get(). You would instead need to register a success and a failure callback on the future, which will in turn set the health status. This will probably be easier to do if you merged the PubSubHealthTemplate with the PubSubHealthIndicator into a single class.
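
As a rough illustration of the callback approach described here, the sketch below records the last observed status when the pull completes instead of blocking on it. It is only a sketch under assumptions: `java.util.concurrent.CompletableFuture` stands in for the future type used in the PR so that the example is self-contained, and the class and its `Status` enum are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for a health status that is updated by callbacks.
final class CallbackHealthState {

    enum Status { UNKNOWN, UP, DOWN }

    // Starts as UNKNOWN until the first pull has completed.
    private final AtomicReference<Status> lastStatus =
            new AtomicReference<>(Status.UNKNOWN);

    /** Registers a completion callback; does not block the caller. */
    void track(CompletableFuture<?> pullResult) {
        pullResult.whenComplete((messages, error) ->
                lastStatus.set(error == null ? Status.UP : Status.DOWN));
    }

    /** Returns the most recently recorded status without waiting. */
    Status current() {
        return lastStatus.get();
    }
}
```

The health endpoint would then report `current()` immediately, at the cost of possibly returning a status from an earlier pull.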

@ttomsu

ttomsu commented Mar 4, 2021

Also please take a look at the Sonar Cloud report for the 5 code smells - they should be pretty simple to fix.

@patpe
Contributor Author

patpe commented Mar 8, 2021

Thanks for the feedback, incorporated your suggestions. ⛷️ for a week, hence late reply.

Was the intent to make the health-check async? If so, currently it doesn't seem to be doing it because of the future.get().

The intent was never to make the health check async; this is just poor naming of the PR on my part. The intent was to use the async functionality of the Pub/Sub API to guarantee a response from the health check within a configurable amount of time, so that the Kubernetes health probe still gets a response under heavy load.
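
The idea of a bounded wait can be sketched as follows. This is a simplified illustration of the custom-subscription behaviour described in this thread (timeout gives unknown, any other failure gives down), not the PR's actual code: the class, method, and `Status` enum are hypothetical, and a plain `java.util.concurrent.Future` stands in for the future returned by the async pull.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: wait for an async pull, but never longer than the timeout.
final class BoundedHealthCheck {

    enum Status { UP, DOWN, UNKNOWN }

    private BoundedHealthCheck() { }

    static Status check(Future<?> pull, long timeoutMillis) {
        try {
            // Blocks, but only up to timeoutMillis.
            pull.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Status.UP;
        } catch (TimeoutException e) {
            // No answer in time: the state is unknown rather than down.
            return Status.UNKNOWN;
        } catch (ExecutionException e) {
            // The pull itself failed.
            return Status.DOWN;
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report an unknown state.
            Thread.currentThread().interrupt();
            return Status.UNKNOWN;
        }
    }
}
```

The call is still synchronous from the caller's point of view; the async pull is only used to put an upper bound on how long the health endpoint can take.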

That being said, if making the health probe async is a better solution I can continue working with that as a target. My gut feeling about making it async was that the use case did not merit introducing that level of complexity.


public PubSubHealthIndicator(PubSubTemplate pubSubTemplate) {
public PubSubHealthIndicator(PubSubHealthTemplate pubSubHealthTemplate) {
Member

Since PubSubHealthTemplate is now a package-private class, we can't make it part of the public interface here. So, we'll have to accept a bean of PubSubTemplate like before and just create the PubSubHealthTemplate internally. Alternatively, you can just move the methods from the PubSubHealthTemplate class into this class and remove PubSubHealthTemplate altogether.

* if connection is successful by pulling message asynchronously from the pubSubHealthTemplate.
*
* If a custom subscription has been specified, this health indicator will only signal up
* if messages are successfully pulled and acknowledged.
Member

I think it's worth noting here that the custom subscription doesn't have to have messages on it. Also, it's important for users to realize that the topic that the custom subscription is for needs to be dedicated to the health check. I'm very worried that users will accidentally create a health subscription to a topic that is also used for other purposes and potentially lose messages.

return createContributor(pubSubHealthTemplates);
}

private void validatePubSubHealthTemplate(String name, PubSubHealthTemplate pubSubHealthTemplate) {
Member

Can this bit of code duplication with the PubSubHealthIndicator be avoided?

* @author Patrik Hörlin
*
*/
class PubSubHealthTemplate {
Member

I know at first I said it probably didn't matter that we're introducing this additional class, as long as it's package-private, but I'm now seeing more complexity introduced due to it. What's the benefit you're seeing from having this class?

@patpe
Contributor Author

patpe commented Mar 12, 2021

Refactored so that PubSubHealthTemplate isn't needed anymore and added javadoc to PubSubHealthIndicator about the behaviour when pulling messages.

I have some thoughts about the way the health indicator is validated. Should it

  1. be moved to the PubSubHealthIndicator constructor?
  2. be removed (no other HealthIndicator seems to have this behaviour)?

@meltsufin I agree with your concerns and I have some doubts about the current implementation in this PR. If forced to choose between the following two scenarios

  1. pulling and acking a message from a business subscription which has been accidentally configured as health check subscription
  2. not acking messages on a health check subscription, which will produce side effects in GCP statistics (expired metric)

I would prefer to trigger option 2 since it is the lesser of the two evils. Should we perhaps remove the ack?

@meltsufin
Member

I have some thoughts about the way the health indicator is validated. Should it

  1. be moved to the PubSubHealthIndicator constructor?
  2. be removed (no other HealthIndicator seems to have this behaviour)?

The nice thing about validation in the constructor is that we force any problems to surface during app startup, but on the other hand, if no other health indicator does it, I think it should be fine to just remove it.

@meltsufin I agree with your concerns and I have some doubts about the current implementation in this PR. If forced to choose between the following two scenarios

  1. pulling and acking a message from a business subscription which has been accidentally configured as health check subscription
  2. not acking messages on a health check subscription, which will produce side effects in GCP statistics (expired metric)

I would prefer to trigger option 2 since it is the lesser of the two evils. Should we perhaps remove the ack?

What do you think of a compromise where we just add a configuration property for acking and default it to false?

@meltsufin
Member

@patpe I would like to help you polish this PR. Would you mind giving me access to your fork?

@patpe
Contributor Author

patpe commented Mar 21, 2021

@patpe I would like to help you polish this PR. Would you mind giving me access to your fork?

Thanks @meltsufin, I have given you access and you should be receiving a notification about it soon. Combining this PR with work proved a greater challenge than I anticipated, will dive into it again right now.

@patpe
Contributor Author

patpe commented Mar 21, 2021

@meltsufin Added support for toggling acknowledge of messages on or off, off by default.

Member

@meltsufin meltsufin left a comment

@patpe I've polished up the pull request and added some reference documentation. Please see if it looks good to you.

Contributor

@elefeint elefeint left a comment

My only substantial comment is about subscription names needing to start with a letter.

Otherwise lgtm with minor suggestions.

this.subscription = healthCheckSubscription;
}
else {
this.subscription = UUID.randomUUID().toString();
Contributor

Subscription names must start with a letter. I'd suggest a prefix of "spring-cloud-gcp-healthcheck".
https://googleapis.dev/java/google-cloud-pubsub/latest/com/google/pubsub/v1/Subscription.Builder.html#setName-java.lang.String-
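
A minimal sketch of this suggestion is shown below. The prefix comes from the comment above; the trailing hyphen separator, the class name, and the method name are assumptions made for illustration.

```java
import java.util.UUID;

// Hypothetical helper for generating a random health check subscription name.
final class HealthCheckSubscriptionNames {

    // Prefix suggested in the review; the trailing "-" is an assumption.
    static final String PREFIX = "spring-cloud-gcp-healthcheck-";

    private HealthCheckSubscriptionNames() { }

    /** Returns a random name that always starts with a letter. */
    static String randomName() {
        // A bare UUID can start with a digit, which is why the prefix is needed.
        return PREFIX + UUID.randomUUID();
    }
}
```

A recognizable prefix also makes it easier to tell where these pull attempts come from if they show up in logs.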

catch (ExecutionException e) {
if (!isHealthyException(e)) {
validationFailed(e);
}
Contributor

I'd add a comment after the if block for // ignore expected exceptions in lieu of the else clause.

public void healthIndicatorPresent() {
public void healthIndicatorPresent() throws Exception {
PubSubTemplate mockPubSubTemplate = mock(PubSubTemplate.class);
ListenableFuture<List<AcknowledgeablePubsubMessage>> future = mock(ListenableFuture.class);
Contributor

Don't need to mock it; use ApiFutures.immediateFuture(). It might help with unchecked warnings, too, since you can return a typed list.

Member

Since I actually want to verify calls on the future, I think I'll keep the mock.

"management.health.pubsub.enabled=true",
"spring.cloud.gcp.pubsub.health.subscription=test",
"spring.cloud.gcp.pubsub.health.timeout-millis=1500",
"spring.cloud.gcp.pubsub.health.acknowledgeMessages=true")
.run(ctx -> {
PubSubHealthIndicator healthIndicator = ctx.getBean(PubSubHealthIndicator.class);
assertThat(healthIndicator).isNotNull();
Contributor

Since this is testing context loading, might want to verify that the health indicator validation got invoked (that template.pullAsync() got called)

properties.setSubscription("test");

PubSubTemplate mockPubSubTemplate = mock(PubSubTemplate.class);
ListenableFuture<List<AcknowledgeablePubsubMessage>> future = mock(ListenableFuture.class);
Contributor

ApiFutures.immediateFailedFuture() will help.

dzou previously approved these changes Mar 24, 2021
}

@Test
void customSubsription_TimeoutException() throws Exception {
Contributor

Suggested change
void customSubsription_TimeoutException() throws Exception {
void customSubscription_TimeoutException() throws Exception {


private void pullMessage() throws InterruptedException, ExecutionException, TimeoutException {
ListenableFuture<List<AcknowledgeablePubsubMessage>> future = pubSubTemplate.pullAsync(this.subscription, 1, true);
List<AcknowledgeablePubsubMessage> messages = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
Contributor

I think using future.get(..) is blocking, right? Should you register a callback here?

Member

We decided to keep it synchronous.

Contributor

I see. Is there a benefit to using pullAsync vs. the Sync pull then?

…spring/autoconfigure/pubsub/health/PubSubHealthIndicator.java

Co-authored-by: Elena Felder <[email protected]>
@dzou dzou dismissed their stale review March 24, 2021 16:03

didn't mean to approve yet

@meltsufin meltsufin requested review from elefeint and dzou March 24, 2021 16:26
@sonarcloud

sonarcloud bot commented Mar 24, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: 100.0%
Duplication: 0.0%

@meltsufin meltsufin changed the title from "Async pubsub health check and custom subscription name (#236)" to "Support custom subscription name for Pub/Sub health check" Mar 24, 2021
@meltsufin meltsufin merged commit bb71964 into GoogleCloudPlatform:main Mar 24, 2021
kateryna216 added a commit to kateryna216/spring-cloud-gcp that referenced this pull request Oct 20, 2022
prash-mi pushed a commit that referenced this pull request Jun 20, 2023