Change synonyms index auto-expand replicas to 0-1 #115078

carlosdelest · 2024-10-18T08:01:15Z

Changes auto-expand replicas in synonyms index to 0-1 from 0-all. This is the only system index that uses a 0-all setting, and that causes failures in CI related to #113478.

#113478 changed fast indices so now they read from search shards. This broke some CI tests, as synonyms system index is only created when accessed for the first time, and accessing it immediately afterwards did not guarantee that a search replica shard had been created yet (see elasticsearch serverless issue 2922).

The obvious solution was to wait to shards been green, which was introduced in #114400. However, this broke BwC tests in non-serverless CI.

The reason is the index cannot be green in BwC tests:

A 4 node cluster is created for BwC testing
2 nodes are updated to the newer version
Synonyms test start, and create the synonym index. It should create 4 replica shards as we have 0-all auto-expand setting.
In case a node in a newer version is used, it creates the shard but it cannot be replicated to the older nodes, as they are using an index and node version that is less than the updated one
Index can't be green as 2 shards can be replicated to the new nodes, but 2 cannot.

This change should fix the above problem when backported to 8.x, and is a prerequirement to unmute synonyms tests (see issues #114432, #114443, #114444). Once this is merged, we can fix the CI tests by waiting for a green index status, as 1 replica will be able to be created (1 primary and 1 replica corresponding to the two nodes updated to the new version).

I can't think of a simple change in tests that would fix this (as we have non-bwc and bwc tests involved, which have different number of nodes), and this would align synonyms system index replica settings with the rest of the system indices.

elasticsearchmachine · 2024-10-18T08:08:09Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

kingherc

and that causes failures in CI related to #113478.

Note we've just reverted this in #115019 due to an incident (for another component). However, we'll be re-introducing it in the near future, so it's indeed best to have the synonyms test working for that future.

I can't think of a simple change in tests that would fix this

Another one I mentioned would be to ignore the timeout of the green status like:

  - do:
      cluster.health:
        index: .synonyms
        timeout: 1m
        wait_for_status: green
        ignore: 408

Your tests either way previously worked with 2 unassigned replicas, because the synonyms were got/searched in the 2 assigned replicas. It was in yellow state. A 1m timeout will give time to the search shard in stateless to be ready for searches. However it does not explicitly require it, e.g., if for some reason the search shard takes more than 1m to start, it will break, although that should be rare I hope.

Changes auto-expand replicas in synonyms index to 0-1 from 0-all.

Yes this should work then after reading your description! However, please consider whether 0-1 is fine for your use case -- I am not an expert in this, so I'm reviewing just by the perspective of making the test(s) work. So please get approval(s) from your team as well.

benwtrent · 2024-10-18T12:12:05Z

@carlosdelest I have a stupid question. Why can't we wait for yellow status?

kingherc · 2024-10-18T12:13:40Z

I can answer that @benwtrent I believe: yellow means only primary is ready. In stateless though that means the search shard might not yet be ready to service searches, so the test fails.

benwtrent · 2024-10-18T12:47:28Z

@kingherc seems like we are chasing the wrong thing here. That seems like a bug in how we determine health in serverless. Traditionally, yellow means "Data is available but degraded", now you are saying yellow in serverless means "Your data isn't available". That seems bad.

@carlosdelest why did we have this replica to all data nodes to begin with? What happens if for a search, we hit a node where a replica doesn't exist?

Maybe I am misunderstanding the purpose of this system index.

carlosdelest · 2024-10-18T12:55:12Z

@carlosdelest why did we have this replica to all data nodes to begin with?

I don't think we were looking closely at this when we created the system index, as no other system index does have 0-all. I don't remember a specific reason - do you @mayya-sharipova ?

What happens if for a search, we hit a node where a replica doesn't exist?

The content of this index is retrieved when building the analyzers (either shard recovery or analyzer reloading). So it doesn't matter in terms of replica availability as long as we could load the synonyms from the system index when the analyzers were loaded.

The index is not needed for the search operation after analyzers are created.

benwtrent · 2024-10-18T13:00:54Z

The content of this index is retrieved when building the analyzers (either shard recovery or analyzer reloading). So it doesn't matter in terms of replica availability as long as we could load the synonyms from the system index when the analyzers were loaded.

OK, this means at the time of index opening, we must serialize potentially large things between nodes. Besides the load time delay this will add, it will add more inter-node traffic costs to our users. With 0-all we only incur this serialization cost once, at the time of the data node being added or index updated.

With this change, every time an analyzer is opened, the cost will be larger.

kingherc · 2024-10-18T13:24:02Z

Traditionally, yellow means "Data is available but degraded", now you are saying yellow in serverless means "Your data isn't available". That seems bad.

I think there have been discussion on that for serverless, but I also believe that these APIs are internal in serverless. Note also that there are some subtle differences between using the health API and the cluster health API. I would say this discussion is out of the scope of this PR, but we can open it elsewhere/broadly if needed. From my point of view, it's weird that the test was running with yellow index to begin with -- I expect it to be green for OK/passing tests. So at least 0-1 does that, although I'm not familiar to say whether it's good for the index and use case.

carlosdelest · 2024-10-18T13:59:40Z

With this change, every time an analyzer is opened, the cost will be larger.

It's a tradeoff:

We're not replicating changes to the synonyms index, which will be replicated independently of whether a node has an index that depends on synonyms
We're creating traffic when an index is opened. The synonyms used by the index (no the entire index) need to be retrieved from the synonyms index.

Synonyms retrieval being limited to shard recovery and synonyms updates, and retrieving just the synonyms used by the analyzer, seem enough guarantee to me that this traffic should be limited.

I don't necessarily see this as a bad tradeoff, unless I'm missing something?

benwtrent · 2024-10-18T14:50:23Z

@kingherc

I would say this discussion is out of the scope of this PR, but we can open it elsewhere/broadly if needed. From my point of view, it's weird that the test was running with yellow index to begin with -- I expect it to be green for OK/passing tests. So at least 0-1 does that, although I'm not familiar to say whether it's good for the index and use case.

I would say its 100% in scope as we are making a production code change because of our inability to correctly test it. What we want is "Is this index searchable, please wait until its searchable", that is what we need. Does that not exist at all?

We don't care about green, yellow, etc. Just, is this index searchable, please wait until it is and that our settings for the index are applied in such a way that eventually the index is searchable.

kingherc · 2024-10-18T14:58:25Z

I would say its 100% in scope as we are making a production code change because of our inability to correctly test it. What we want is "Is this index searchable, please wait until its searchable", that is what we need. Does that not exist at all?

Yes there is. Wait for green.

What I meant to say is that we should not delve into the details of yellow. Green makes total sense for a test like this.

as we are making a production code change because of our inability to correctly test it.

No need for me to make a production code change if there's another way to make it wait for green BTW. Since synonyms are out of my realm of expertise, I will revoke my approval, just to avoid a misinterpretation that the production change is generally approved. Search team definitely should agree on best way to wait for green in the test.

Until best way to wait for green is agreed upon.

carlosdelest · 2024-10-18T17:48:13Z

Alternatively, you may consider for the test waiting somehow on the Health API for the index's started_replicas >= 1. This would mean the search shard has started in Serverless, and at least one of the stateful replicas assigned. This means the health would still be yellow (which is weird IMHO), but enough to search in both stateful & stateless.

@kingherc I'm afraid that won't work - there is no need for replicas to be present in stateful. BwC tests do use replicas as there are more than 1 node, but default YAML tests are run against a single node - so no replicas.

We would need different test conditions for stateful and BwC / serverless 😢

kingherc · 2024-10-18T20:21:06Z

True. I actually deleted my comment as soon as I posted it 😆 Still believe somehow waiting for green health is best. But I do not have an immediate idea of an easy way to do that in the upgrade/bwc case.

kingherc · 2024-10-21T13:42:26Z

@carlosdelest another idea I got from the team is that you can try to wait for yellow combined with wait_for_no_initializing_shards=true . Could you check?

carlosdelest · 2024-10-22T13:08:34Z

@kingherc , checked and that works, thank you very much!

Closing this in favour of #114641. We could apply the auto expand replicas change, but let's have it as a separate issue.

mayya-sharipova · 2024-10-22T13:31:53Z

Sorry for being late with the conversation, I am +1 to set auto-expand replicas to 0-1. Reasons for that:

all other Elasticsearch system indices use the same 0-1 setting (even security indices which are used more broadly)
an implication is minor. If there are no local replicas available, it will a little slower to start synonym analyzer and a its corresponding index (but considering that overall initializing a shard is slow, the impact is minor).

Change synonyms index auto-expand replicas to 0-1

4bb9d7b

carlosdelest marked this pull request as ready for review October 18, 2024 08:01

carlosdelest requested review from benwtrent, kingherc and mayya-sharipova October 18, 2024 08:01

This was referenced Oct 18, 2024

Fix Synonyms CI for 8.16 #115012

Closed

Fix synonyms CI tests timeout #114641

Merged

kingherc previously approved these changes Oct 18, 2024

View reviewed changes

carlosdelest closed this Oct 22, 2024

carlosdelest mentioned this pull request Oct 23, 2024

Synonyms: Update auto expand replicas in system index to 0-1 #115382

Open

carlosdelest mentioned this pull request Nov 22, 2024

Wait for all shards to be available in synonyms test #117281

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change synonyms index auto-expand replicas to 0-1 #115078

Change synonyms index auto-expand replicas to 0-1 #115078

carlosdelest commented Oct 18, 2024 •

edited

Loading

elasticsearchmachine commented Oct 18, 2024

kingherc left a comment

benwtrent commented Oct 18, 2024

kingherc commented Oct 18, 2024

benwtrent commented Oct 18, 2024

carlosdelest commented Oct 18, 2024 •

edited

Loading

benwtrent commented Oct 18, 2024

kingherc commented Oct 18, 2024

carlosdelest commented Oct 18, 2024

benwtrent commented Oct 18, 2024

kingherc commented Oct 18, 2024

carlosdelest commented Oct 18, 2024

kingherc commented Oct 18, 2024 •

edited

Loading

kingherc commented Oct 21, 2024

carlosdelest commented Oct 22, 2024

mayya-sharipova commented Oct 22, 2024

Change synonyms index auto-expand replicas to 0-1 #115078

Change synonyms index auto-expand replicas to 0-1 #115078

Conversation

carlosdelest commented Oct 18, 2024 • edited Loading

elasticsearchmachine commented Oct 18, 2024

kingherc left a comment

Choose a reason for hiding this comment

benwtrent commented Oct 18, 2024

kingherc commented Oct 18, 2024

benwtrent commented Oct 18, 2024

carlosdelest commented Oct 18, 2024 • edited Loading

benwtrent commented Oct 18, 2024

kingherc commented Oct 18, 2024

carlosdelest commented Oct 18, 2024

benwtrent commented Oct 18, 2024

kingherc commented Oct 18, 2024

carlosdelest commented Oct 18, 2024

kingherc commented Oct 18, 2024 • edited Loading

kingherc commented Oct 21, 2024

carlosdelest commented Oct 22, 2024

mayya-sharipova commented Oct 22, 2024

carlosdelest commented Oct 18, 2024 •

edited

Loading

carlosdelest commented Oct 18, 2024 •

edited

Loading

kingherc commented Oct 18, 2024 •

edited

Loading