-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[improve][broker] Gracefully shut down load balancer extension #20315
[improve][broker] Gracefully shut down load balancer extension #20315
Conversation
ed773f4
to
4ee4b99
Compare
@@ -1316,7 +1324,11 @@ public boolean isRunning() { | |||
* @return a reference of the current <code>LeaderElectionService</code> instance. | |||
*/ | |||
public LeaderElectionService getLeaderElectionService() { | |||
return this.leaderElectionService; | |||
if (ExtensibleLoadManagerImpl.isLoadManagerExtensionEnabled(config)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Demogorgon314 Plz confirm this change.
} | ||
|
||
public CompletableFuture<Optional<String>> selectAsync(ServiceUnitId bundle, | ||
Optional<Set<String>> excludeBrokerSet) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can just use empty set like Collections.emptySet()
to replace Optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
f72b25f
to
2e8eb09
Compare
...ava/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java
Show resolved
Hide resolved
c9cac97
to
a3f7b26
Compare
a3f7b26
to
0426d30
Compare
### Motivation Load balancer extension needs to shut down gracefully, especially when shutting down the leader broker. When the leader broker closes the leader election service too late, service unit lookups to the leader broker could fail during the shutdown. This could delay client reconnection time. Lookup failure logs ``` (shutdown starts) pulsar-broker-8 pulsar-broker 2023-04-22T00:19:52,630+0000 [pulsar-service-shutdown] INFO org.apache.pulsar.broker.PulsarService - Closing PulsarService pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,690+0000 [pulsar-io-4-5] INFO org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [pulsar-18-9] Closed connection [id: 0x907e74b0, L:/10.0.13.6:35838 ! R:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local/10.0.18.18:6650] -- Will try again in 0.1 s pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,691+0000 [pulsar-io-4-5] INFO org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [reader-30e6b40dd9] Closed connection [id: 0x907e74b0, L:/10.0.13.6:35838 ! R:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local/10.0.18.18:6650] -- Will try again in 0.1 s (znode is gone) pulsar-broker-8 pulsar-broker 2023-04-22T00:19:52,652+0000 [pulsar-load-manager-1-1] INFO org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl - BrokerRegistry detected the broker:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 registry has been deleted. (lookup failure) org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/system/loadbalancer-service-unit-state] [reader-30e6b40dd9] Could not get connection to broker: org.apache.pulsar.client.api.PulsarClientException$ConnectException: Disconnected from server at pulsar-broker-3.pulsar-broker.pulsar.svc.cluster.local/10.0.13.6:6650 -- Will try again in 0.193 s pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,827+0000 [main-EventThread] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://pulsar/system/loadbalancer-service-unit-state with error Failed to look up a broker registry:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 for bundle:pulsar/system/0x40000000_0x50000000 pulsar-broker-3 pulsar-broker 2023-04-22T00:19:52,827+0000 [main-EventThread] INFO org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup null for topic persistent://pulsar/system/loadbalancer-service-unit-state with error Failed to look up a broker registry:pulsar-broker-8.pulsar-broker.pulsar.svc.cluster.local:8080 for bundle:pulsar/system/0x40000000_0x50000000 (leader election service has been closed) pulsar-broker-3 pulsar-broker 2023-04-22T00:19:55,569+0000 [pulsar-load-manager-1-1] INFO org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl - This broker:pulsar-broker-3.pulsar-broker.pulsar.svc.cluster.local:8080 plays the leader now. ``` ### Modifications During the shutdown flow, when calling loadbalancer.disableBroker(), the load balancer extension gracefully gives up the owned bundles to other brokers. After that, it closes the leader election service and removes its register in the metadata store.
…rService Fixes apache#22569 ### Motivation `BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to unload namespaces gracefully. With extensible load manager, it eventually calls `TableViewLoadDataStoreImpl#validateProducer`: ``` BrokerService#unloadNamespaceBundlesGracefully ExtensibleLoadManagerWrapper#disableBroker ExtensibleLoadManagerImpl#disableBroker ServiceUnitStateChannelImpl#cleanOwnerships ServiceUnitStateChannelImpl#doCleanup TableViewLoadDataStoreImpl#removeAsync TableViewLoadDataStoreImpl#validateProducer ``` In `validateProducer`, if the producer is not connected, it will recreate the producer synchronously. However, since the state of `PulsarService` has already been changed to `Closing`, all connect or lookup requests will fail with `ServiceNotReady`. Then the client will retry until timeout. Besides, the unload operation could also trigger the reconnection because the extensible load manager sends the unload event to the `loadbalancer-service-unit-state` topic. ### Modifications The major fix: Before changing PulsarService's state to `Closing`, call `BrokerService#unloadNamespaceBundlesGracefully` first to make the load manager complete the unload operations first. Minor fixes: - Record the time when `LoadManager#disableBroker` is done. - Don't check if producer is disconnected because the producer could retry if it's disconnected. ### Verifications Add `ExtensibleLoadManagerCloseTest` to verify closing `PulsarService` won't take too much time. Here are some test results locally: ``` 2024-04-24T19:43:38,851 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310] 2024-04-24T19:44:26,711 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298] 2024-04-24T19:46:16,791 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263] 2024-04-24T20:13:05,763 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299] 2024-04-24T20:13:43,979 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310] ``` As you can see, each broker takes only about 3 seconds to close due to `OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in apache#20315
…rService Fixes apache#22569 ### Motivation `BrokerService#closeAsync` calls `unloadNamespaceBundlesGracefully` to unload namespaces gracefully. With extensible load manager, it eventually calls `TableViewLoadDataStoreImpl#validateProducer`: ``` BrokerService#unloadNamespaceBundlesGracefully ExtensibleLoadManagerWrapper#disableBroker ExtensibleLoadManagerImpl#disableBroker ServiceUnitStateChannelImpl#cleanOwnerships ServiceUnitStateChannelImpl#doCleanup TableViewLoadDataStoreImpl#removeAsync TableViewLoadDataStoreImpl#validateProducer ``` In `validateProducer`, if the producer is not connected, it will recreate the producer synchronously. However, since the state of `PulsarService` has already been changed to `Closing`, all connect or lookup requests will fail with `ServiceNotReady`. Then the client will retry until timeout. Besides, the unload operation could also trigger the reconnection because the extensible load manager sends the unload event to the `loadbalancer-service-unit-state` topic. ### Modifications The major fix: Before changing PulsarService's state to `Closing`, call `BrokerService#unloadNamespaceBundlesGracefully` first to make the load manager complete the unload operations first. Minor fixes: - Record the time when `LoadManager#disableBroker` is done. - Don't check if producer is disconnected because the producer could retry if it's disconnected. ### Verifications Add `ExtensibleLoadManagerCloseTest` to verify closing `PulsarService` won't take too much time. Here are some test results locally: ``` 2024-04-24T19:43:38,851 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3342, 3276, 3310] 2024-04-24T19:44:26,711 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3357, 3258, 3298] 2024-04-24T19:46:16,791 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3313, 3257, 3263] 2024-04-24T20:13:05,763 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3304, 3279, 3299] 2024-04-24T20:13:43,979 - INFO - [main:ExtensibleLoadManagerCloseTest] - Brokers close time: [3343, 3308, 3310] ``` As you can see, each broker takes only about 3 seconds to close due to `OWNERSHIP_CLEAN_UP_CONVERGENCE_DELAY_IN_MILLIS` value added in apache#20315
PIP: #16691
Motivation
Load balancer extension needs to shut down gracefully, especially when shutting down the leader broker. When the leader broker closes the leader election service too late, service unit lookups to the leader broker could fail during the shutdown. This could delay client reconnection time.
Lookup failure logs
Modifications
During the shutdown flow, when calling loadbalancer.disableBroker(), the load balancer extension gracefully gives up the owned bundles to other brokers. After that, it closes the leader election service and removes its register in the metadata store.
Verifying this change
This change is already covered by existing tests, and I added a new test case to cover this diableBroker logic.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: heesung-sn#47