Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve][broker] Gracefully shut down load balancer extension #20315

Merged

Conversation

heesung-sn
Copy link
Contributor

@heesung-sn heesung-sn commented May 13, 2023

PIP: #16691

Motivation

Load balancer extension needs to shut down gracefully, especially when shutting down the leader broker. When the leader broker closes the leader election service too late, service unit lookups to the leader broker could fail during the shutdown. This could delay client reconnection time.

Lookup failure logs

(shutdown starts)
pulsar-broker-8 pulsar-broker 2023-04-22T00:19:52,630+0000 [pulsar-service-shutdown] INFO  org.apache.pulsar.broker.PulsarService - Closing PulsarService
pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,690+0000 [pulsar-io-4-5] INFO  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [pulsar-18-9] Closed connection [id: 0x907e74b0, L:/10.0.13.6:35838 ! R:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local/10.0.18.18:6650] -- Will try again in 0.1 s
pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,691+0000 [pulsar-io-4-5] INFO  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [reader-30e6b40dd9] Closed connection [id: 0x907e74b0, L:/10.0.13.6:35838 ! R:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local/10.0.18.18:6650] -- Will try again in 0.1 s

(znode is gone)
pulsar-broker-8 pulsar-broker 2023-04-22T00:19:52,652+0000 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl - BrokerRegistry detected the broker:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 registry has been deleted.

(lookup failure)
org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [reader-30e6b40dd9] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$ConnectException: Disconnected from server at pulsar-broker-3.pulsar-broker.pulsar.svc.cluster.local/10.0.13.6:6650 -- Will try again in 0.193 s
pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,827+0000 [main-EventThread] INFO  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://pulsar/system/loadbalancer-service-unit-state with error Failed to look up a broker registry:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 for bundle:pulsar/system/0x40000000_0x50000000
pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,827+0000 [main-EventThread] INFO  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://pulsar/system/loadbalancer-service-unit-state with error Failed to look up a broker registry:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 for bundle:pulsar/system/0x40000000_0x50000000

(leader election service has been closed)
pulsar-broker-3 pulsar-broker 2023-04-22T00:19:55,569+0000 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl - This broker:pulsar-broker-3.pulsar-broker.pulsar.svc.cluster.local:8080 plays the leader now.

Modifications

During the shutdown flow, when calling loadbalancer.disableBroker(), the load balancer extension gracefully gives up the owned bundles to other brokers. After that, it closes the leader election service and removes its register in the metadata store.

Verifying this change

  • Make sure that the change passes the CI checks.

This change is already covered by existing tests, and I added a new test case to cover this diableBroker logic.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: heesung-sn#47

@heesung-sn heesung-sn force-pushed the new-lb-leader-election-close-faster branch from ed773f4 to 4ee4b99 Compare May 13, 2023 00:24
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label May 13, 2023
@@ -1316,7 +1324,11 @@ public boolean isRunning() {
* @return a reference of the current <code>LeaderElectionService</code> instance.
*/
public LeaderElectionService getLeaderElectionService() {
return this.leaderElectionService;
if (ExtensibleLoadManagerImpl.isLoadManagerExtensionEnabled(config)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Demogorgon314 Plz confirm this change.

}

public CompletableFuture<Optional<String>> selectAsync(ServiceUnitId bundle,
Optional<Set<String>> excludeBrokerSet) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can just use empty set like Collections.emptySet() to replace Optional

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@Demogorgon314 Demogorgon314 added type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages area/broker labels May 13, 2023
@heesung-sn heesung-sn force-pushed the new-lb-leader-election-close-faster branch from f72b25f to 2e8eb09 Compare May 17, 2023 00:41
@heesung-sn heesung-sn force-pushed the new-lb-leader-election-close-faster branch from c9cac97 to a3f7b26 Compare May 18, 2023 20:32
@heesung-sn heesung-sn force-pushed the new-lb-leader-election-close-faster branch from a3f7b26 to 0426d30 Compare May 18, 2023 22:53
@Demogorgon314 Demogorgon314 merged commit a9f2f28 into apache:master May 22, 2023
@Technoboy- Technoboy- added this to the 3.1.0 milestone May 28, 2023
Technoboy- pushed a commit that referenced this pull request May 30, 2023
### Motivation

Load balancer extension needs to shut down gracefully, especially when shutting down the leader broker. When the leader broker closes the leader election service too late, service unit lookups to the leader broker could fail during the shutdown. This could delay client reconnection time.

Lookup failure logs
```
(shutdown starts)
pulsar-broker-8 pulsar-broker 2023-04-22T00:19:52,630+0000 [pulsar-service-shutdown] INFO  org.apache.pulsar.broker.PulsarService - Closing PulsarService
pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,690+0000 [pulsar-io-4-5] INFO  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [pulsar-18-9] Closed connection [id: 0x907e74b0, L:/10.0.13.6:35838 ! R:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local/10.0.18.18:6650] -- Will try again in 0.1 s
pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,691+0000 [pulsar-io-4-5] INFO  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [reader-30e6b40dd9] Closed connection [id: 0x907e74b0, L:/10.0.13.6:35838 ! R:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local/10.0.18.18:6650] -- Will try again in 0.1 s

(znode is gone)
pulsar-broker-8 pulsar-broker 2023-04-22T00:19:52,652+0000 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl - BrokerRegistry detected the broker:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 registry has been deleted.

(lookup failure)
org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [reader-30e6b40dd9] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$ConnectException: Disconnected from server at pulsar-broker-3.pulsar-broker.pulsar.svc.cluster.local/10.0.13.6:6650 -- Will try again in 0.193 s
pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,827+0000 [main-EventThread] INFO  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://pulsar/system/loadbalancer-service-unit-state with error Failed to look up a broker registry:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 for bundle:pulsar/system/0x40000000_0x50000000
pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,827+0000 [main-EventThread] INFO  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://pulsar/system/loadbalancer-service-unit-state with error Failed to look up a broker registry:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 for bundle:pulsar/system/0x40000000_0x50000000

(leader election service has been closed)
pulsar-broker-3 pulsar-broker 2023-04-22T00:19:55,569+0000 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl - This broker:pulsar-broker-3.pulsar-broker.pulsar.svc.cluster.local:8080 plays the leader now.

```

### Modifications
During the shutdown flow, when calling loadbalancer.disableBroker(), the load balancer extension gracefully gives up the owned bundles to other brokers. After that, it closes the leader election service and removes its register in the metadata store.
@heesung-sn heesung-sn deleted the new-lb-leader-election-close-faster branch April 2, 2024 17:44
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Apr 24, 2024
…rService

Fixes apache#22569

### Motivation

`BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to
unload namespaces gracefully. With extensible load manager, it
eventually calls `TableViewLoadDataStoreImpl#validateProducer`:

```
BrokerService#unloadNamespaceBundlesGracefully
  ExtensibleLoadManagerWrapper#disableBroker
    ExtensibleLoadManagerImpl#disableBroker
      ServiceUnitStateChannelImpl#cleanOwnerships
        ServiceUnitStateChannelImpl#doCleanup
          TableViewLoadDataStoreImpl#removeAsync
            TableViewLoadDataStoreImpl#validateProducer
```

In `validateProducer`, if the producer is not connected, it will
recreate the producer synchronously. However, since the state of
`PulsarService` has already been changed to `Closing`, all connect or
lookup requests will fail with `ServiceNotReady`. Then the client will
retry until timeout.

Besides, the unload operation could also trigger the reconnection
because the extensible load manager sends the unload event to the
`loadbalancer-service-unit-state` topic.

### Modifications

The major fix:
Before changing PulsarService's state to `Closing`, call
`BrokerService#unloadNamespaceBundlesGracefully` first to make the load
manager complete the unload operations first.

Minor fixes:
- Record the time when `LoadManager#disableBroker` is done.
- Don't check if producer is disconnected because the producer could
  retry if it's disconnected.

### Verifications

Add `ExtensibleLoadManagerCloseTest` to verify closing `PulsarService`
won't take too much time. Here are some test results locally:

```
2024-04-24T19:43:38,851 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310]
2024-04-24T19:44:26,711 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298]
2024-04-24T19:46:16,791 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263]
2024-04-24T20:13:05,763 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299]
2024-04-24T20:13:43,979 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310]
```

As you can see, each broker takes only about 3 seconds to close due to
`OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in
apache#20315
BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Apr 24, 2024
…rService

Fixes apache#22569

### Motivation

`BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to
unload namespaces gracefully. With extensible load manager, it
eventually calls `TableViewLoadDataStoreImpl#validateProducer`:

```
BrokerService#unloadNamespaceBundlesGracefully
  ExtensibleLoadManagerWrapper#disableBroker
    ExtensibleLoadManagerImpl#disableBroker
      ServiceUnitStateChannelImpl#cleanOwnerships
        ServiceUnitStateChannelImpl#doCleanup
          TableViewLoadDataStoreImpl#removeAsync
            TableViewLoadDataStoreImpl#validateProducer
```

In `validateProducer`, if the producer is not connected, it will
recreate the producer synchronously. However, since the state of
`PulsarService` has already been changed to `Closing`, all connect or
lookup requests will fail with `ServiceNotReady`. Then the client will
retry until timeout.

Besides, the unload operation could also trigger the reconnection
because the extensible load manager sends the unload event to the
`loadbalancer-service-unit-state` topic.

### Modifications

The major fix:
Before changing PulsarService's state to `Closing`, call
`BrokerService#unloadNamespaceBundlesGracefully` first to make the load
manager complete the unload operations first.

Minor fixes:
- Record the time when `LoadManager#disableBroker` is done.
- Don't check if producer is disconnected because the producer could
  retry if it's disconnected.

### Verifications

Add `ExtensibleLoadManagerCloseTest` to verify closing `PulsarService`
won't take too much time. Here are some test results locally:

```
2024-04-24T19:43:38,851 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310]
2024-04-24T19:44:26,711 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298]
2024-04-24T19:46:16,791 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263]
2024-04-24T20:13:05,763 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299]
2024-04-24T20:13:43,979 - INFO  - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310]
```

As you can see, each broker takes only about 3 seconds to close due to
`OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in
apache#20315
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker cherry-picked/branch-3.0 doc-not-needed Your PR changes do not impact docs release/3.0.1 type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants