Shards in a down state after an HPA scale up / scale down event. #682

Closed
aloosnetmatch opened this issue Jan 31, 2024 · 2 comments · Fixed by #692
@aloosnetmatch

I installed the Solr operator 0.8.0 with Solr image 9.4.1 on AKS.
I used this video as a guideline: Rethinking Autoscaling for Apache Solr using Kubernetes - Berlin Buzzwords 2023

The setup uses persistent disks.
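Roughly, the setup looked like this. This is a sketch rather than the exact commands: the name, chart version, storage size, and domain are placeholders, and the SolrCloud field names are from the v1beta1 CRD docs (double-check against your operator version):

# CRDs (and the zookeeper-operator dependency) may need to be installed first; see the operator install docs
helm repo add apache-solr https://solr.apache.org/charts
helm repo update
helm install solr-operator apache-solr/solr-operator --version 0.8.0

# a minimal SolrCloud: 5 nodes, Solr 9.4.1, persistent data, ingress-based addressing
kubectl apply -f - <<EOF
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: solr-cluster-netm
spec:
  replicas: 5
  solrImage:
    tag: "9.4.1"
  dataStorage:
    persistent:
      pvcTemplate:
        spec:
          resources:
            requests:
              storage: 5Gi
  solrAddressability:
    external:
      method: Ingress
      domainName: ing.local.domain
      useExternalAddress: true
EOF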

I created 2 indexes and put some data in them (created roughly as sketched below):
index test: 3 shards and 2 replicas
index test2: 6 shards and 2 replicas
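The collections were created through the Collections API, along these lines (the host is a placeholder for the exposed common endpoint):

curl "http://<solr-host>/solr/admin/collections?action=CREATE&name=test&numShards=3&replicationFactor=2"
curl "http://<solr-host>/solr/admin/collections?action=CREATE&name=test2&numShards=6&replicationFactor=2"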

I configured an HPA and stressed the cluster a bit to make sure the cluster would scale up from 5 to 11 nodes.
Scaling up went fine. Shards for the 2 indexes got moved to the new nodes.
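The HPA targets the SolrCloud resource itself (the operator exposes a scale subresource on SolrCloud for this). A sketch of the manifest, with the name and CPU target as placeholders:

kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: solr-cluster-netm-hpa
spec:
  scaleTargetRef:
    apiVersion: solr.apache.org/v1beta1
    kind: SolrCloud
    name: solr-cluster-netm
  minReplicas: 5
  maxReplicas: 11
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
EOF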

During scaling down, however, some shards get a lot of "down" replicas.
(screenshot: Down_shards)

The HPA said it would scale down to 5 pods, but 6 kept running.

The logs, of course, reveal:

2024-01-31 10:32:57.332 ERROR (recoveryExecutor-10-thread-16-processing-test2_shard4_replica_n113 netm-solr-operator-solr-cluster-netm-solrcloud-1.ing.local.domain-7376 move-replicas-solr-cluster-netm-solrcloud-6162936163817514 core_node114 create netm-solr-operator-solr-cluster-netm-solrcloud-1.ing.local.domain:80_solr test2 shard4) [c:test2 s:shard4 r:core_node114 x:test2_shard4_replica_n113 t:netm-solr-operator-solr-cluster-netm-solrcloud-1.ing.local.domain-7376] o.a.s.c.RecoveryStrategy Failed to connect leader http://netm-solr-operator-solr-cluster-netm-solrcloud-5.ing.local.domain:80/solr on recovery, try again
2024-01-31 10:32:57.472 ERROR (recoveryExecutor-10-thread-11-processing-netm-solr-operator-solr-cluster-netm-solrcloud-1.ing.local.domain:80_solr test_shard3_replica_n75 test shard3 core_node76) [c:test s:shard3 r:core_node76 x:test_shard3_replica_n75 t:] o.a.s.c.RecoveryStrategy Failed to connect leader http://netm-solr-operator-solr-cluster-netm-solrcloud-6.ing.local.domain:80/solr on recovery, try again
2024-01-31 10:32:57.472 ERROR (recoveryExecutor-10-thread-13-processing-netm-solr-operator-solr-cluster-netm-solrcloud-1.ing.local.domain:80_solr test_shard3_replica_n87 test shard3 core_node88) [c:test s:shard3 r:core_node88 x:test_shard3_replica_n87 t:] o.a.s.c.RecoveryStrategy Failed to connect leader http://netm-solr-operator-solr-cluster-netm-solrcloud-6.ing.local.domain:80/solr on recovery, try again

In the overseer, there are still items in the work queue.

(screenshot: overseer errors)
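For anyone reproducing this, the overseer queues and stats can be inspected through the Collections API (host is a placeholder):

curl "http://<solr-host>/solr/admin/collections?action=OVERSEERSTATUS"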

On the disk for the given shards, I can see the folders of the shards:

solr@solr-cluster-netm-solrcloud-1:/var/solr/data$ ls -l
total 108
drwxrws--- 2 root solr 16384 Jan 30 10:37 lost+found
-rw-r-xr-- 1 root solr 1203 Jan 30 13:56 solr.xml
drwxrwsr-x 3 solr solr 4096 Jan 30 12:59 test2_shard1_replica_n12
drwxrwsr-x 3 solr solr 4096 Jan 30 12:59 test2_shard3_replica_n2
drwxr-sr-x 3 solr solr 4096 Jan 31 01:30 test2_shard4_replica_n101
drwxr-sr-x 3 solr solr 4096 Jan 31 03:31 test2_shard4_replica_n113
drwxr-sr-x 3 solr solr 4096 Jan 31 05:32 test2_shard4_replica_n125
drwxr-sr-x 3 solr solr 4096 Jan 31 07:33 test2_shard4_replica_n137
drwxr-sr-x 3 solr solr 4096 Jan 31 09:34 test2_shard4_replica_n149
drwxr-sr-x 3 solr solr 4096 Jan 30 15:24 test2_shard4_replica_n41
drwxr-sr-x 3 solr solr 4096 Jan 30 17:25 test2_shard4_replica_n53
drwxr-sr-x 3 solr solr 4096 Jan 30 19:26 test2_shard4_replica_n65
drwxr-sr-x 3 solr solr 4096 Jan 30 21:28 test2_shard4_replica_n77
drwxr-sr-x 3 solr solr 4096 Jan 30 23:29 test2_shard4_replica_n89
drwxr-sr-x 3 solr solr 4096 Jan 31 04:31 test_shard3_replica_n111
drwxr-sr-x 3 solr solr 4096 Jan 31 06:32 test_shard3_replica_n123
drwxr-sr-x 3 solr solr 4096 Jan 31 08:33 test_shard3_replica_n135
drwxr-sr-x 3 solr solr 4096 Jan 30 16:25 test_shard3_replica_n39
drwxr-sr-x 3 solr solr 4096 Jan 30 18:26 test_shard3_replica_n51
drwxrwsr-x 3 solr solr 4096 Jan 30 11:18 test_shard3_replica_n6
drwxr-sr-x 3 solr solr 4096 Jan 30 20:27 test_shard3_replica_n63
drwxr-sr-x 3 solr solr 4096 Jan 30 22:28 test_shard3_replica_n75
drwxr-sr-x 3 solr solr 4096 Jan 31 00:29 test_shard3_replica_n87
drwxr-sr-x 3 solr solr 4096 Jan 31 02:30 test_shard3_replica_n99
solr@solr-cluster-netm-solrcloud-1:/var/solr/data$

They all seemed empty though.

So I suspect something is wrong with the scale-down/scale-up or migration of the shards.
Every pod gets restarted during the scale-down...

What could cause the number of down shards to be so large?

PS: I did the same test on a kind cluster, with the same results.

@sabaribose

@HoustonPutman
The same issue is happening to me.

The HPA said it would scale to 14 pods, but 58 kept running.

For me, the leader election was successful, but I could see a lot of down replicas, which is causing query issues and errors saying the shards are down.
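A quick way to count the down replicas across the cluster (host is a placeholder):

curl -s "http://<solr-host>/solr/admin/collections?action=CLUSTERSTATUS" | grep -o '"state":"down"' | wc -l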


@HoustonPutman (Contributor)

OK, so y'all's issues seem somewhat related.

I have seen problems with Solr failing to delete bad replicas during an unsuccessful migration, and that's why you are seeing a large increase in the number of replicas.
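As a stopgap, a leftover down replica can be removed manually with the Collections API once it is confirmed against CLUSTERSTATUS; for example, using names taken from the logs above:

curl "http://<solr-host>/solr/admin/collections?action=DELETEREPLICA&collection=test2&shard=shard4&replica=core_node114"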

So I suspect something is wrong with the scale-down/scale-up or migration of the shards. Every pod gets restarted during the scale-down...

This is definitely a problem, and it is related to the fact that you are addressing your Solr nodes through the ingress. So that all Solr traffic is not directed through the ingress (which would slow things down considerably), we basically use /etc/hosts on the pods to map each ingress address to the IP of the pod it points to. And since you are scaling down, some of the /etc/hosts entries get removed, thus requiring full restarts every time.
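(If that mapping is done through the pod spec's hostAliases, which Kubernetes writes into /etc/hosts, the entries currently managed for a pod can be checked with something like the following; pod and namespace are placeholders:)

kubectl get pod <solr-pod> -n <namespace> -o jsonpath='{.spec.hostAliases}'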

An easy solution to this would be to only update the /etc/hosts if an IP is changed or added. It doesn't really matter if we have unused entries there.

Anyways, we should definitely have an integration test that stresses the HPA with ingresses, because this seems like a very iffy edge case.

The same issue is happening to me

@sabaribose I think this is separate, because you are not using an ingress, but using the headless service.

I think yours is from the BalanceReplicas command not queueing a retry when it fails. But I will do more investigation here.

@HoustonPutman added the autoscaling, bug, cloud, and networking labels on Mar 11, 2024
@HoustonPutman added this to the v0.8.1 milestone on Mar 11, 2024