SAI-5162: Add sysprop `solrcloud.publishDownOnStart` to control whether publish down on node start or not #233

patsonluk · 2024-10-11T19:03:23Z

Description

Detailed in https://fullstory.atlassian.net/browse/SAI-5162

We are adding a system property `solrcloud.publishDownOnStart` that gives us the option to bypass downnode publishing on node start. If set to true, the node publishes down on start (the 9.7 behavior). If set to false or left undefined, the node does NOT publish down on start (bypassing the 9.7 fix).
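As a rough illustration of the semantics described above, here is a minimal sketch of gating a startup step on the property. Only the property name comes from this PR; the class `PublishDownOnStartGate` and its methods are hypothetical and are not the actual Solr code. It relies on the JDK's `Boolean.getBoolean`, which returns true only when the named system property is set to "true" (case-insensitive), so an undefined property behaves like false.

```java
/**
 * Sketch of gating the startup "publish down" step on a JVM system property.
 * Class and method names are hypothetical; only the property name is from the PR.
 */
final class PublishDownOnStartGate {
  static final String PROP = "solrcloud.publishDownOnStart";

  private PublishDownOnStartGate() {}

  /** True only when the JVM was started with -Dsolrcloud.publishDownOnStart=true. */
  static boolean shouldPublishDown() {
    // Boolean.getBoolean yields false when the property is unset or is not "true".
    return Boolean.getBoolean(PROP);
  }

  /**
   * Runs the supplied publish-down action only when the property enables it.
   *
   * @return true if the action ran, false if it was bypassed
   */
  static boolean maybePublishDown(Runnable publishDown) {
    if (!shouldPublishDown()) {
      return false; // property undefined or false: bypass publishing down
    }
    publishDown.run();
    return true;
  }
}
```

With this shape, starting the JVM without the flag (or with `-Dsolrcloud.publishDownOnStart=false`) skips the action, and `-Dsolrcloud.publishDownOnStart=true` runs it.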

We are also adding logging to measure the actual overhead of the downnode call, so we can determine whether further action is needed.
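For illustration, the kind of measurement described above could look like the sketch below. The commits in this PR mention Solr's `RTimer`; this sketch uses plain `System.nanoTime()` instead so that it has no Solr dependency. The class `StartupStepTimer` and its method are hypothetical names, not the code in this PR.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

/** Sketch of timing one startup step and handing the elapsed time to a log sink. */
final class StartupStepTimer {
  private StartupStepTimer() {}

  /**
   * Runs the step, passes the elapsed wall-clock milliseconds to the sink
   * (for example a logger call), and returns the same value.
   */
  static long timeMillis(Runnable step, LongConsumer sink) {
    long start = System.nanoTime(); // monotonic clock, suitable for intervals
    step.run();
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    sink.accept(elapsedMs);
    return elapsedMs;
  }
}
```

A caller might wrap the whole publish-down operation, for example `StartupStepTimer.timeMillis(step, ms -> System.out.println("publish down took " + ms + " ms"))`, where `step` stands in for the real call.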

This change is likely to be temporary. Depending on the latency reported, we might pursue further optimization or just take the 9.7 change as is.

hiteshk25 · 2024-10-11T20:13:06Z

QQ: is this the PRS message that adds the replica as down, "core_node2:54:D:L"?

hiteshk25 · 2024-10-11T20:26:38Z

@patsonluk Here are the tests that I think Ishan/Noble used to run. It would be good to run them against 9.7 and 9.3 to compare the results.

1. cluster-test.json: Creates an 8-node cluster, creates 1000 collections with various numShards, and measures shutdown and restart performance.
2. stress-facets-local.json: Indexes 20 million documents from an ecommerce events dataset and issues 5k facet queries against it.

patsonluk · 2024-10-11T23:01:35Z

> QQ: is this the PRS message that adds the replica as down, "core_node2:54:D:L"?

Yes
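To make the message shape concrete, here is a small sketch that splits a string like "core_node2:54:D:L" into its parts. The field meanings are assumptions read off the example in this thread: replica name, a numeric version, a one-letter state where `D` means down, and an optional trailing `L` taken to mark the leader. The class `PrsEntry` is hypothetical and is not Solr's own parser.

```java
/** Sketch: parse a per-replica state entry of the assumed form name:version:state[:L]. */
final class PrsEntry {
  final String replica;
  final int version;
  final char state;     // assumed one-letter state code; 'D' = down per this thread
  final boolean leader; // assumed: a trailing "L" marks the leader

  private PrsEntry(String replica, int version, char state, boolean leader) {
    this.replica = replica;
    this.version = version;
    this.state = state;
    this.leader = leader;
  }

  static PrsEntry parse(String entry) {
    String[] parts = entry.split(":");
    if (parts.length < 3 || parts.length > 4) {
      throw new IllegalArgumentException("expected name:version:state[:L], got " + entry);
    }
    if (parts[2].length() != 1) {
      throw new IllegalArgumentException("state must be a single letter: " + entry);
    }
    if (parts.length == 4 && !"L".equals(parts[3])) {
      throw new IllegalArgumentException("unexpected trailing field: " + entry);
    }
    return new PrsEntry(
        parts[0], Integer.parseInt(parts[1]), parts[2].charAt(0), parts.length == 4);
  }

  boolean isDown() {
    return state == 'D';
  }
}
```

Under these assumptions, `PrsEntry.parse("core_node2:54:D:L")` describes replica `core_node2` at version 54, marked down and flagged as leader.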

> @patsonluk Here are the tests that I think Ishan/Noble used to run. It would be good to run them against 9.7 and 9.3 to compare the results.
>
> 1. cluster-test.json: Creates an 8-node cluster, creates 1000 collections with various numShards, and measures shutdown and restart performance.
> 2. stress-facets-local.json: Indexes 20 million documents from an ecommerce events dataset and issues 5k facet queries against it.

Thanks, I will run those tests!

patsonluk · 2024-10-15T17:36:20Z

I have only run cluster-test.json, as that one is more relevant to this PR, which focuses on node startup. The results are in https://fullstory.atlassian.net/browse/SAI-5162 description -> benchmarking -> version comparison. Note that running this test as is might not be ideal: the ZK time could be greatly underestimated because the Solr processes and the ZK process all run on the same machine.

We probably want to run it on solrperf clusters. However, I suspect the impact will be very similar to the test that isolates the ZK fetching part (https://fullstory.atlassian.net/browse/SAI-5162 description -> benchmarking -> Cluster state fetching).

Testing 9.7 against 9.3 could also hide the performance impact of this change, because other changes (?) might actually speed up startup (we even see that 9.7 starts faster in the solrbench test). That, however, does not mean that publishing downnode on start has no performance impact.

That being said, I think we should still run 9.7 vs 9.3 benchmarking (with the FS changes and setup), similar to what we ran for the Solr 8 -> 9 migration (https://fullstory.atlassian.net/issues/SAI-4430?jql=text%20~%20%22benchmark%20solr%209%2A%22), plus another test for restart with a high number of collections/replicas. Although the new test will not pinpoint the publish-downnode-on-start change, it should still give us confidence in restart performance in general.

hiteshk25 · 2024-10-15T17:42:28Z

QQ: does this message "core_node2:54:D:L" go to the overseer node, and does the overseer node then write it to ZK?

patsonluk · 2024-10-15T17:50:14Z

> QQ: does this message "core_node2:54:D:L" go to the overseer node, and does the overseer node then write it to ZK?

No. For PRS, the downnode change is applied from the data node to ZK directly as in here

patsonluk · 2024-11-12T20:00:16Z

@hiteshk25 can we get this into our fs/branch_9x? This is likely to be temporary, and we could remove it entirely once we confirm that publishNodeAsDown does not adversely affect performance in our prod environment.

This means that if the flag is NOT defined (so solrcloud.publishDownOnStart is effectively false), startup bypasses publishAndWaitForDownStates by default.

hiteshk25

LGTM

…ether publish down on node start or not (#233)

* Add sysprop `solrcloud.skipPublishDownOnStart` to controller whether publish down on node start or not
* Use RTimer instead
* Added timer for the whole publish down ops, including persist ops
* ./gradlew tidy
* Changed solrcloud.skipPublishDownOnStart to solrcloud.publishDownOnStart

which means if such flag is NOT defined (hence solrcloud.publishDownOnStart=false), then by default it will bypass publishAndWaitForDownStates

Add sysprop solrcloud.skipPublishDownOnStart to controller whether …

7ee632d

…publish down on node start or not

patsonluk assigned magibney and hiteshk25 Oct 11, 2024

patsonluk added 3 commits October 11, 2024 15:48

Use RTimer instead

6a79752

Added timer for the whole publish down ops, including persist ops

941d95d

./gradlew tidy

5469598

patsonluk changed the title ~~SAI-5162: Add sysprop solrcloud.skipPublishDownOnStart to controller whether publish down on node start or not~~ SAI-5162: Add sysprop solrcloud.publishDownOnStart to controller whether publish down on node start or not Nov 12, 2024

Changed solrcloud.skipPublishDownOnStart to solrcloud.publishDownOnStart

f8a25e0

which means if such flag is NOT defined (hence solrcloud.publishDownOnStart=false), then by default it will bypass publishAndWaitForDownStates

hiteshk25 approved these changes Nov 12, 2024

patsonluk merged commit d5de9d9 into fs/branch_9x Nov 12, 2024
2 of 3 checks passed
