Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAI-5162: Add sysprop solrcloud.publishDownOnStart to controller whether publish down on node start or not #233

Merged
merged 5 commits into from
Nov 12, 2024

Conversation

patsonluk
Copy link
Collaborator

@patsonluk patsonluk commented Oct 11, 2024

Descriptions

Detailed in https://fullstory.atlassian.net/browse/SAI-5162

We are adding a sys prop solrcloud.publishDownOnStart to give us an option to bypass downnode publishing upon node start. If set to true, it will publish down on start (as in the 9.7 behavior). However, if set to false or undefined, it will NOT publish down on start (bypass the fix in 9.7)

Also adding logging to assess actual overhead of the downnode call to determine if we need further action

This change is likely to be temporary, depending on the latency reported, we might pursue further optimization or just take 9.7 change as is.

@hiteshk25
Copy link
Collaborator

QQ: is this PRS message to add the replica as down "core_node2:54:D:L" ?

@hiteshk25
Copy link
Collaborator

@patsonluk Here are the tests which I think Ishan/Noble used to run. Would be good to run those tests with 9.7 and 9.3 to compare the results.

1. cluster-test.json : Creates an 8 node cluster and create 1000 collections of various numShards and measure shutdown & restart performance
2. stress-facets-local.json : Indexes 20 million documents from an ecommerce events dataset, issues 5k facet queries against it.

@patsonluk
Copy link
Collaborator Author

QQ: is this PRS message to add the replica as down "core_node2:54:D:L" ?

Yes

@patsonluk Here are the tests which I think Ishan/Noble used to run. Would be good to run those tests with 9.7 and 9.3 to compare the results.

1. cluster-test.json : Creates an 8 node cluster and create 1000 collections of various numShards and measure shutdown & restart performance
2. stress-facets-local.json : Indexes 20 million documents from an ecommerce events dataset, issues 5k facet queries against it.

Thanks I will run those tests!

@patsonluk
Copy link
Collaborator Author

patsonluk commented Oct 15, 2024

I have only run the cluster-test.json as that one is more relevant to this PR, which focus on node startup. The results are updated in https://fullstory.atlassian.net/browse/SAI-5162 description -> benchmarking -> version comparison. Take note that running such test as is might not be ideal, as the ZK time could be greatly underestimated as both the solr processes and the ZK process are run on the same machine.

We probably want to run it using solrperf clusters, however i suspect the impact will be very similar to the test that isolate out the ZK fetching part (https://fullstory.atlassian.net/browse/SAI-5162 description -> benchmarking -> Cluster state fetching)

Testing against 9.7 vs 9.3 could also hide performance issue of such change as other changes (?) might actually speed up start up (we even see 9.7 has faster startup with the solrbench test), that however, does not mean publish downnode on start has no performance impact.

That being said, I think we should still run 9.7 vs 9.3 benchmarking (with the FS changes and setup). Which is similar to what we have run for Solr 8 -> 9 migration https://fullstory.atlassian.net/issues/SAI-4430?jql=text%20~%20%22benchmark%20solr%209%2A%22) + another test for restart with high number of collections/replicas. Even though the new test will not pinpoint the publish downnode on start change, however, it should still give us confidence on restart performance in general.

@hiteshk25
Copy link
Collaborator

QQ: is this message "core_node2:54:D:L" goes to overseer node and then overseer node updates this message to zk?

@patsonluk
Copy link
Collaborator Author

QQ: is this message "core_node2:54:D:L" goes to overseer node and then overseer node updates this message to zk?

No. For PRS, the downnode change is applied from the data node to ZK directly as in here

@patsonluk
Copy link
Collaborator Author

@hiteshk25 can we get this into our fs/branch_9x. This is likely to be temporary and we could totally remove it after we confirm the performance of publishNodeAsDown does not adversely affect our prod environment

@patsonluk patsonluk changed the title SAI-5162: Add sysprop solrcloud.skipPublishDownOnStart to controller whether publish down on node start or not SAI-5162: Add sysprop solrcloud.publishDownOnStart to controller whether publish down on node start or not Nov 12, 2024
which means if such flag is NOT defined (hence solrcloud.publishDownOnStart=false), then by default it will bypass publishAndWaitForDownStates
Copy link
Collaborator

@hiteshk25 hiteshk25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@patsonluk patsonluk merged commit d5de9d9 into fs/branch_9x Nov 12, 2024
2 of 3 checks passed
patsonluk added a commit that referenced this pull request Nov 12, 2024
…ether publish down on node start or not (#233)

* Add sysprop `solrcloud.skipPublishDownOnStart` to controller whether publish down on node start or not

* Use RTimer instead

* Added timer for the whole publish down ops, including persist ops

* ./gradlew tidy

* Changed solrcloud.skipPublishDownOnStart to solrcloud.publishDownOnStart

which means if such flag is NOT defined (hence solrcloud.publishDownOnStart=false), then by default it will bypass publishAndWaitForDownStates
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants