Shard stuck in relocating state with recovery stage=translog #9226
When the shard got unstuck, did it become green on the same node or did it move to another node? Also, which version of ES was it?
Once it got unstuck, it became green on the expected target node, thx. This is 1.4.2. The following is from the output of running the empty reroute command (which got it unstuck):

```json
"shards" : {
  "0" : [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "ctCRm_huQsSBoTobhmqJdg",
    "relocating_node" : null,
    "shard" : 0,
    "index" : "index_name"
  } ],
```
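For reference, a minimal sketch of the "empty reroute" being described, assuming a node reachable on localhost:9200 (adjust host/port to your setup):

```sh
# An empty-bodied reroute moves nothing by itself; it asks the master to run
# the allocator again, which can nudge a shard that is stuck in RELO/INIT.
curl -XPOST 'http://localhost:9200/_cluster/reroute?pretty'
```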
I've been seeing what I think is a similar issue. We upgraded from 1.3.5 to 1.4.2 last week, but now from time to time we see a shard get stuck in RELO or INIT. The most recent example was overnight, after a node was restarted late yesterday: it moved a shard on restart, and this morning the shard was still sitting in the translog stage well beyond what would be normal recovery time. I canceled the allocation and the shard is now allocating on another node, currently in the translog phase. It hasn't been running long enough to say whether there is a problem with this allocation. There seems to be nothing in the logs that indicates a problem.
Nothing in the logs that I can see. I've seen this happen maybe 4-5 times since we upgraded to 1.4.2; I don't remember seeing this particular issue on 1.3.0 or 1.3.5. The new allocation of this same shard is also taking a long time. Is there any way to see deeper into the translog activity? If I look at the initializing shard on disk, I see the translog files grow and then disappear, to be replaced by a new file that repeats the cycle. There's also a second translog file sitting in the same directory that doesn't seem to change.
@kstaken can you enable trace logging on the
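The logger name is cut off above, so the following is only a guess at what was meant: in 1.x, loggers can be bumped to TRACE at runtime via the cluster settings API (the `indices.recovery` logger here is an assumption, not confirmed by the thread):

```sh
# Hypothetical logger name; revert by setting it back to INFO or removing the setting.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : { "logger.indices.recovery" : "TRACE" }
}'
```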
This recently happened with a user who was about to delete the transaction log a day later; right before they were going to do it, the recovery finished. Ready to apply the logger for next time, though.
@bleskes I figured out the issue with the second allocation of the problem shard. When the shard got stuck, the translog had grown so large that it couldn't be replayed during recovery: it appeared to be replaying at a rate that was about 50% of the rate at which new records were being added. It's conceivable that the shard appearing to be stuck originally was the result of the same thing; now that I know what to look for, I can confirm that the next time it happens. The normal replicas seem to have no issue keeping up. Is there throttling on the replay of translogs during recovery? BTW, to recover this shard I canceled the allocation and forced a fresh allocation on a new node. That seemed to allow the primary to clear the translog, and then it was able to get back in sync.
@kstaken sorry for the late response. I was travelling.
There is no throttling, but replay is done by a single thread, which is fewer than the number of threads available for indexing. That said, recovery works in two stages: it first takes a snapshot of the current translog and replays it on the replica (while indexing is ongoing on the primary). Once that is done, it blocks indexing on the primary and replicates the last operations. So in theory it should always be able to catch up. I do share the feeling that this is due to a very large transaction log. If this happens again, do you mind sharing this information, issued once and then 5m later:
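The exact commands being requested are cut off above; a plausible way to capture comparable numbers twice, five minutes apart, would be something like the following (the index name is a placeholder):

```sh
# The translog section of the _recovery output shows how many operations have
# been replayed on the recovering copy; diffing two samples gives the replay rate.
curl 'http://localhost:9200/index_name/_recovery?pretty' > recovery_t0.json
sleep 300
curl 'http://localhost:9200/index_name/_recovery?pretty' > recovery_t5.json
```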
@bleskes Thanks. I've had a couple more instances of the issue, all on the same shard, and I've been able to confirm that on that shard it's a case of translog replay not being able to keep up with the rate of indexing; if I stop indexing, recovery will complete. It turns out we have a hotspot on that shard due to routing: it had grown much too large, at over 1 billion docs and 220GB in size. We're in the process of re-indexing to deal with this, but at this size the translog on the primary seemed to be growing at about 2x the rate of consumption on the recovering node, and I'm pretty sure it would never catch up. I'm not sure if this is an actual issue or just an example of why a 220GB shard is not a good thing to have.
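As an aside, a quick way to spot that kind of routing hotspot is to compare per-shard doc counts and store sizes (the index name is a placeholder):

```sh
# One line per shard copy, including doc count and on-disk size.
curl 'http://localhost:9200/_cat/shards/index_name?v'
```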
@kstaken thx. Replying in reverse order.
A 220GB shard is indeed unusually big, but it should work.
I see. For what it's worth, as long as there is an ongoing recovery the translog will not be trimmed as we need to make sure everything is replicated. Once replication is done, the translog will be dropped as one.
The replica is guaranteed to catch up because, in the final stage, indexing is paused so it can complete the operation. We need to figure out why it's so slow to deal with the initial snapshot, which allows concurrent indexing. When you say 2x the rate of consumption, are you measuring the operation count in the `GET {index}/_recovery` output?
That's interesting. Maybe indexing just put load on the machine. I've been chasing this for the last couple of days, and if you can share the following, it would be very helpful:
@kstaken one more question - what environment are you running on? Which OS?
@bleskes I've been measuring progress by looking at {index}/_status?recovery=true and comparing translog.recovered on the replica vs. translog.operations on the primary. Is that a valid thing to do? When this has occurred there has been nothing in the logs indicating any kind of issue, not even the throttling message. If I see the problem again I'll try to gather the other things you asked for; however, I'll be retiring this index in the next couple of days, once the re-indexing process completes. I thought I had seen the issue on other shards, but that was before I was paying specific attention to this shard, and the last four occurrences have all been on the same shard, so I now doubt I've seen it elsewhere. Our environment is Ubuntu 14.04 on physical hardware, ES 1.4.2 on Java 8. We have 72 ES nodes of two types, one with SSDs and one with HDDs. The impacted index is on the nodes with SSDs.
@bleskes I should also add that the node holding the primary here is indeed under very heavy load due to indexing. It bounces around 60-90 in load average; the load on the replicas is not elevated and is consistent with other nodes in the cluster. Stopping indexing returns the load to normal. Since it's a major hot spot, it's conceivable that the shard could be receiving well over 10,000 indexing requests per second. The index overall is receiving 50K-100K/sec, sometimes more. Those requests are also 'create' requests with unique IDs, and most will be dropped as duplicates; the actual write volume on the index overall is less than 1000/sec.
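For context, the "create with unique IDs" pattern described here looks roughly like the following; a second request with the same ID is rejected rather than re-indexed (index, type, and ID are placeholders):

```sh
# op_type=create makes the request fail with a conflict if the ID already exists.
curl -XPUT 'http://localhost:9200/index_name/events/de305d54-75b4?op_type=create' -d '{
  "field" : "value"
}'
```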
One change in 1.4.0 was to disable loading bloom filters by default, since we made other performance improvements that should have made them unnecessary in most cases: #6959. I wonder if that is causing the performance issues here? Can you try setting index.codec.bloom.load to true and see if it makes a difference?
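A sketch of @mikemccand's suggestion, assuming the setting can be applied to a live index on 1.4.x (the index name is a placeholder; if the setting is not dynamic on your version, apply it while the index is closed):

```sh
# Re-enable loading of the _uid bloom filters that 1.4 stopped loading by default.
curl -XPUT 'http://localhost:9200/index_name/_settings' -d '{
  "index.codec.bloom.load" : true
}'
```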
@kstaken thx. The index _status API is a good way, but it's deprecated and replaced by the _recovery API. Can you let us know how it goes with @mikemccand's suggestion?
Unfortunately I'm not able to test re-enabling bloom filters, as our re-indexing process completed over the weekend and the problem index was removed from usage. I'll have the index around for a couple more days and then it will be deleted completely, as the variation in shard sizes is also causing disk allocation problems.
@kstaken OK. Let us know if this happens again.
@bleskes sorry I can't test enabling bloom filters to confirm, but I did some more digging, and the historical CPU usage of the node holding the problem primary does show a dramatic increase in load right after it was upgraded to 1.4.2. Prior to 1.4.2 we didn't have any nodes that showed consistently high load; after the upgrade, whichever node this shard resided on pushed 80-90 load non-stop. Does this imply we should enable bloom filters on the new index for the use case where we're generating the IDs?
@kstaken it would be great if you can do a "before" and "after" check regarding the effect bloom filters have on your index.
Bumped into the same issue during shard initialization on ES 1.4.2 and tested with bloom filters enabled - this didn't help. Only deleting the transaction logs allowed the initialization to finish.
@drax68 do you know how big the translog was before you deleted it? Did you see any index throttling messages in the log? Also, how long was it "stuck" in the TRANSLOG phase?
A 4-6 GB translog; on the replica its size was hovering around 200 MB. Shard size was about 40 GB. It was stuck for hours, then I had to delete the translogs to complete initialization.
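For future occurrences, the translog size can usually be read from the stats API instead of the filesystem, assuming the translog stats group is available on your 1.x version (the index name is a placeholder):

```sh
# Reports translog operation count and size_in_bytes; add level=shards for per-shard detail.
curl 'http://localhost:9200/index_name/_stats/translog?pretty'
```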
This happened to us over the weekend too. In most cases it's fine. But sometimes, just sometimes, the target host doesn't finish catching up on the translog and it gets pretty big. Unfortunately I wasn't able to get
@drax68 see my previous comment - I don't know if you still have the logs of the node with the primary shard, but if you do, it would be great if you could check them.
And it happened again. hot_threads from the target host:
Source host:
We can't keep stopping nodes to delete their translog, as it just results in the problem jumping to a different host. Nothing mentioning throttling at all :(
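For reference, the hot_threads output mentioned above can be pulled per node like this (the node name is a placeholder):

```sh
# Dumps the hottest threads of the named node; omit the node filter for all nodes.
curl 'http://localhost:9200/_nodes/node_name/hot_threads'
```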
Typical, computers making me look bad.
It looks like this came in right around the time the replica's translog caught up. It only happened once on this node, and possibly once on another node at the time it got over this problem.
@avleen as far as I can tell, this code is the same in 1.3 vs 1.4, so I'm not sure this theory can explain what you're seeing ...
@avleen actually bloom filters (enabled in 1.3 and disabled in 1.4) could have masked this issue, since they would have somewhat hidden the cost of segment explosion.
That makes a lot of sense. Thanks!
@mikemccand @bleskes I have some questions: where and when does ES normally start the async merging, e.g. when a new index is created? And if we start the async merging on engine init, would that fix this issue? Could it have any negative effects?
@sylvae it is started when recovery is finished (which is no good). @mikemccand is already working on a fix.
This does not affect 2.0, where we let Lucene launch merges normally (#8643). In 1.x, every 1 sec (default), we ask Lucene to kick off any new merges, but we unfortunately don't turn that logic on in the target shard until after recovery has finished. This means if you have a large translog, and/or a smallish index buffer, way too many segments can accumulate in the target shard during recovery, making version lookups slower and slower (O(N^2)) and possibly causing slow recovery issues like #9226. This fix changes IndexShard to launch merges as soon as the shard is created, so merging runs during recovery. Closes #10463
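A rough way to watch the segment buildup described here, assuming the copy in question is reported by the segments API on your version (the index name is a placeholder); if it isn't, counting segment files in the shard's data directory, as was done earlier in this thread for the translog, is the on-disk fallback:

```sh
# Lists the Lucene segments per shard copy; a steadily growing list during a
# long TRANSLOG phase is consistent with merges not running on the target shard.
curl 'http://localhost:9200/index_name/_segments?pretty'
```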
@mikemccand @bleskes I modified the code along the lines of #10463 and ran some tests; the issue no longer reproduces and shards recover normally. I think this issue has been fixed. Thank you.
That's great news. Thank you for pointing us at the segment count (i.e., many small files). Let's wait until 1.5.1 is out (soon) and see how it works for the others.
@bleskes thanks! I'll upgrade sometime next week and let you know how it goes.
@twitchjorge and others on this thread - any news you can share?
Sorry @bleskes, priorities have somewhat pulled me away from this. Reviewing the fix is still on my todo list though!
Unfortunately I haven't had a chance to upgrade yet either.
@bleskes We upgraded our production cluster to 1.5.1 last weekend and have not spotted any long-running translog recovery since. I believe the issue is fixed.
@xgwu thanks! I'm going to leave this open for a little longer to give others some time as well. If we don't hear anything in a week or two, I'll close this.
Hey all, it's been almost 2 months since 1.5.1 was released. Since no one has reported that this still happens, I'm going to close it, as we seem to have found the root cause (#10463). Thank you all for the information and help.
A primary shard was stuck in the RELOCATING state. The recovery API showed that all files had completed at 100% and that the stage was TRANSLOG, but it had been sitting there for > 15 hours.
Running a reroute command with no post body got it unstuck:
Not sure how it got into this state (there are no log entries related to recovery of this shard in the data or master logs) or why we had to run an empty reroute request to get it unstuck.
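A quick way to see the per-shard recovery stage and file/translog progress described above, assuming 1.x (the index name is a placeholder):

```sh
# One row per recovering shard copy, including the stage (e.g. TRANSLOG) and percentages.
curl 'http://localhost:9200/_cat/recovery/index_name?v'
```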