Possible to not ACK upgrade action #760
Comments
Agreed we should identify the root cause. I also opened this PR which could serve as a stop-gap to mark the agent as upgraded when we see that the version number changed: elastic/fleet-server#1731 |
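A minimal sketch of the stop-gap idea described above, not the code in elastic/fleet-server#1731: if an upgrade is in flight and the agent checks in with a different version than the one stored, treat the upgrade as done. The document is a plain dict, and the field names `agent_version` and `upgrade_started_at` are assumptions for illustration (`upgraded_at` is the field discussed later in this thread).

```python
from datetime import datetime, timezone


def reconcile_upgrade(agent_doc, reported_version, now=None):
    """Return the fields to write if the agent's version changed mid-upgrade.

    agent_doc is a plain dict standing in for the agent's document.
    Field names are hypothetical and only illustrate the idea.
    """
    started = agent_doc.get("upgrade_started_at")
    finished = agent_doc.get("upgraded_at")
    if started is None or finished is not None:
        return {}  # no upgrade in flight, nothing to reconcile
    if reported_version == agent_doc.get("agent_version"):
        return {}  # version unchanged, the upgrade is still pending
    now = now or datetime.now(timezone.utc)
    return {
        "agent_version": reported_version,
        "upgrade_started_at": None,
        "upgraded_at": now.isoformat(),
    }
```

This only masks a lost ack; it does not explain why the ack was lost, which is why the root cause still needs to be found.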
@scunningham and I took a look at this today and produced this summary of how Elastic Agent's and Horde's implementations of action acking work, as well as some potential Fleet Server issues.
There's potentially an issue where Fleet Server:
|
@blakerouse I think this is where we could use the most help from you right now. I'm working with @pjbertels to get a cluster where this has been reproduced in a scale test with debug logs enabled. In those cases we've been seeing agents POST'ing a successful 200 ack, but the ES document isn't really updated correctly. |
@blakerouse does it make sense to close this issue now, or do we suspect there are other causes based on elastic/fleet-server#1894? |
@joshdover I am looking to keep this open until I can get all the 75k agents to upgrade successfully in scale testing. Once it has been confirmed it is completely fixed then I will close this. I don't want to take an early victory just yet. Other than adding the version conflict retries and the proper return of errors, no other fix has been implemented. The change might expose other errors that need to be handled/fixed. |
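The two changes mentioned above (retrying on version conflicts and returning errors properly) can be sketched roughly as follows. This is an illustration under assumptions, not the Fleet Server implementation: `VersionConflict` and `update_with_retry` are hypothetical names, and `apply_update` stands in for whatever performs the write.

```python
class VersionConflict(Exception):
    """Raised by a (hypothetical) store when the document changed underneath us."""


def update_with_retry(apply_update, max_attempts=3):
    """Call apply_update, retrying on version conflicts.

    If every attempt conflicts, the last error is raised so the caller
    cannot mistake a dropped write for a successful ack.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return apply_update()
        except VersionConflict as err:
            last_error = err  # another writer won the race; try again
    raise last_error
```

The important part is the final `raise`: a write that never lands is reported as a failure rather than swallowed.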
Had agents stuck in
|
The snapshot used in @pjbertels's previous comment did not include the fix in elastic/fleet-server#1894. The latest snapshot build does contain the fix, but BC2 and the snapshot for 8.5 are currently broken because of https://github.com/elastic/ingest-dev/issues/1282. This prevents us from actually testing the fix made in elastic/fleet-server#1894. Hopefully a quick fix to https://github.com/elastic/ingest-dev/issues/1282 is near, so we can get back to testing upgrade. |
@pjbertels Was able to successfully get latest snapshot running with 8.5 and perform an upgrade. Resulted in ~5k still being stuck in upgrading. Digging into the logs now. |
I was unable to find any errors being reported by Fleet Server. From the logs, all upgrade actions were dispatched to the Elastic Agents and all upgrade actions were ack'd. The issue is that the update request to Elasticsearch to update the document seems to be updating only 1 of the fields.
Only the At the moment I need help from the Elasticsearch team, because I believe this could be an issue internal to Elasticsearch. |
Hey @dakrone we definitely need Elasticsearch team help here - If not you, would you please help us find someone to help on the issue above ^ which seems to potentially be an Elasticsearch bug |
I'm having a hard time parsing exactly what this is doing (being not at all familiar with agent's internals), from what I can guess, it's doing something like this:
Which works fine for Elasticsearch. If that's incorrect, can someone translate this into an Elasticsearch reproduction so I can try this out? |
What's the confidence level that we're not getting any errors returned by ES? My understanding is that we may still be dropping logs in Cloud, but that may not matter to prove this depending on the logging. +1 a simple repro would be helpful. Could we use two tight loops of concurrent updates similar to what is happening with upgrade acks and the checkin 30s heartbeat to emulate this? |
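A rough harness for the "two tight loops" idea above might look like the sketch below. It is only a skeleton: `InMemoryStore` is a stand-in that always merges partial updates correctly, so the run reports no lost fields. To test anything real, the store would have to be replaced by a client that issues the same partial updates against a cluster. The field names `upgrade_started_at` and `last_checkin` are assumptions for illustration.

```python
import threading


class InMemoryStore:
    """Stand-in for the real index; merges partial updates under a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}

    def update(self, doc_id, fields):
        with self._lock:
            self._docs.setdefault(doc_id, {}).update(fields)

    def get(self, doc_id):
        with self._lock:
            return dict(self._docs.get(doc_id, {}))


def run_repro(store, doc_id, iterations=500):
    """Run ack-style and checkin-style partial updates concurrently.

    Returns the ack iterations whose fields did not read back as written.
    """
    lost = []

    def ack_loop():
        for i in range(iterations):
            expected = {"upgraded_at": i, "upgrade_started_at": None}
            store.update(doc_id, expected)
            doc = store.get(doc_id)
            if any(doc.get(k, "<missing>") != v for k, v in expected.items()):
                lost.append(i)

    def checkin_loop():
        for i in range(iterations):
            store.update(doc_id, {"last_checkin": i})

    threads = [threading.Thread(target=ack_loop), threading.Thread(target=checkin_loop)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lost
```

A non-empty return value against a real backend would be the reproduction being asked for.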
@dakrone That is not what Fleet Server is doing. It is using
This bulk update action can occur with any other action that needs to occur with Fleet Server at an interval of 250ms or when the bulk buffer is full. I am seeing that it works sometimes to update all the fields and in other cases it is not updating all fields. |
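The flush behaviour described above (send the batch every 250ms or as soon as the buffer is full) can be sketched like this. It is a simplified illustration, not Fleet Server's bulker: `BulkBuffer`, its `capacity`, and the injected `clock` are hypothetical names chosen for the example.

```python
import time


class BulkBuffer:
    """Collects operations and flushes them on a timer or when full."""

    def __init__(self, flush, capacity=4, interval=0.25, clock=time.monotonic):
        self._flush = flush        # callable that receives one batch (a list)
        self._capacity = capacity  # flush as soon as this many ops are queued
        self._interval = interval  # seconds to wait after the first queued op
        self._clock = clock
        self._items = []
        self._deadline = None

    def add(self, op):
        if not self._items:
            self._deadline = self._clock() + self._interval
        self._items.append(op)
        if len(self._items) >= self._capacity:
            self.flush()

    def tick(self):
        """Call periodically; flushes once the interval has elapsed."""
        if self._items and self._clock() >= self._deadline:
            self.flush()

    def flush(self):
        batch, self._items = self._items, []
        self._deadline = None
        if batch:
            self._flush(batch)
```

With a design like this, an upgrade ack and an unrelated update for the same agent can end up in the same batch, which is the situation the comment above describes.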
Okay, even doing something like this, which creates and then updates the document:
Succeeds locally. Perhaps a reproduction with the tight loops would be necessary to see if this can be reproduced (if it can, please open an issue in the Elasticsearch repository). |
Hey y'all, I stumbled across this issue because today in production I have 1 agent that is stuck "Updating". (It has been like this for a little while. This device likes to roam on and off the network very frequently.) Rebooting, stopping/starting the service, or assigning it to other policies will not get the status away from Updating. Furthermore, the Agent is at 8.4.0 (Kibana and .\elastic-agent.exe status say so) and I can't upgrade it: the option is grayed out in some spots, and in the others where it lets me try, the upgrade fails immediately. So I can't tell the health of this agent via the UI; however, .\elastic-agent.exe status reveals everything is healthy. I will be happy to share anything that might help the situation. I am sure removing and re-enrolling the agent will do the trick, but I very much do not like doing that to fix these issues. |
Additional detail: 8.4.1 on-premise stack, virtualized on modern Windows Server OS. |
Further info: When doing a search on the specific endpoint to .fleet-agent here is the redacted result:
Two things in this output stick out to me (however, I have no clue how these should read). The first is that the device is Windows 11 but the data shows Windows 10. The second, which I think might be the issue, is that the upgraded_at is set to null, even though it started the upgrade on 9/2. Perhaps the null in the upgraded_at is why it is shown stuck at updating? Please let me know if this is helpful info or just annoying and I can buzz off :) |
Hi there @nicpenning, you do indeed seem to be encountering this same issue. I recommend trying this as a workaround, you can run this from Dev Tools in Kibana:
|
So, I went ahead and updated the
I will try the upgrade again. |
I got the Upgrading Agent notification after I pushed the upgrade, so that single field seems to have been the culprit. Furthermore, the upgrade went on to work. Not sure if any of this info helps your case but figured I would share. You have helped me take a look under the hood to see what could be broken, and a simple field update was a workaround. Hopefully this will be addressed to help improve the reliability of such awesome capabilities. |
@joshdover @nicpenning then I assume that the following issue might resolve the UI display - elastic/kibana#139704 |
No, I don't think that will fix the issue. The issue is that Fleet Server (or maybe ES) is losing the writes that mark the upgrade as completed. This is why updating the We still need to get to the bottom of where and why these updates are not being persisted in Elasticsearch when the agent acks its upgrade. |
@nicpenning @blakerouse, I had a short Slack conversation with @tomcallahan and he suggested enabling audit logging in ES while this is happening. Did we try that? |
Better than audit logging would be the indexing slowlog: https://www.elastic.co/guide/en/elasticsearch/reference/8.4/index-modules-slowlog.html#index-slow-log
|
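As a hedged illustration of the slowlog suggestion above, the sketch below builds a settings request that lowers the indexing slowlog threshold. The setting names come from the linked slowlog documentation. The `0ms` threshold (intended to log every indexing operation), the index name, and the helper `slowlog_request` are assumptions for the example, and whether these settings can be changed on the index in question has not been verified here.

```python
import json


def slowlog_request(index, threshold="0ms", source_chars=1000):
    """Build the path and body for an index settings update.

    Only constructs the request; sending it (for example from Dev Tools)
    is left to the reader.
    """
    path = f"/{index}/_settings"
    body = {
        # log indexing operations slower than the threshold at trace level
        "index.indexing.slowlog.threshold.index.trace": threshold,
        # how many characters of the document source to include in the log
        "index.indexing.slowlog.source": source_chars,
    }
    return path, body


path, body = slowlog_request("my-index")  # "my-index" is a placeholder
print("PUT", path)
print(json.dumps(body, indent=2))
```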
Going to keep this open until we get confirmation that all 75k upgrade during scale testing. |
We may need to move away from a model where both Kibana and Fleet Server are writing to the We would need a way to easily query agents based on their pending, delivered, and ack'd actions so we could determine if an agent has a pending upgrade and display this in the UI. This wouldn't completely solve the problem, though, if you consider actions like unenrollment and policy reassignments. There are still potentially multiple actors that would want to update a single |
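One way to picture the proposal above is to derive the "updating" status from action records instead of from a field that several writers update. The sketch below assumes a hypothetical action shape (`type`, `agents`, `delivered_to`, `acked_by`) and a list ordered oldest to newest; none of these names come from the actual Fleet data model.

```python
def latest_upgrade_state(actions, agent_id):
    """Return 'pending', 'delivered', 'acked', or None for the agent's
    most recent upgrade action. actions is assumed ordered oldest to newest."""
    state = None
    for action in actions:
        if action["type"] != "UPGRADE" or agent_id not in action["agents"]:
            continue
        if agent_id in action.get("acked_by", ()):
            state = "acked"
        elif agent_id in action.get("delivered_to", ()):
            state = "delivered"
        else:
            state = "pending"
    return state


def shows_as_updating(actions, agent_id):
    """The UI would show 'updating' only while an upgrade is unacknowledged."""
    return latest_upgrade_state(actions, agent_id) in ("pending", "delivered")
```

As the comment notes, this covers upgrades only; unenrollment and policy reassignment would still need their own handling.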
Raised elastic/kibana#142364 |
I agree 100%, please let's avoid more than one application writing to the same document. |
+1 |
FYI I'm working on a draft proposal that will encompass addressing these issues |
Tests at 50K and 75K confirm we no longer have agents stuck in 'updating' at the end of an upgrade. |
Great news. @blakerouse I think we should go ahead and backport elastic/fleet-server#1896 to 8.4 in case we have an additional 8.4 release |
@nicpenning can you please email me at [email protected] ? |
FYI: Upgrading from 8.4.1 to 8.4.3 - We had 1 asset that got stuck updating and had a Will this be resolved in 8.5.0 or was it supposed to be resolved in 8.4.3? |
I tried this except for using 8.4.3 but I got back no handler found error: |
@nicpenning this will be resolved in 8.5.0 as it was not backported to 8.4.3 |
I am still having the same issue upgrading agents from 8.6.0 to 8.6.1. Initially, most agents were stuck in updating, then after a policy update most returned healthy, but some are still stubbornly stating updating even though the version listed is 8.6.1. |
Confirmed we saw this in 8.6.0 -> 8.6.1. 2 out of 24 agents successfully upgraded, all others were "stuck" updating. I will note that this only occurred on the scheduled upgrades by setting it to an hour into the future and setting the time frame for over an hour. Any upgrades set to execute immediately worked without any hesitation. In the end, we had to force the upgrades via the API to get upgraded. I believe then that there is an issue with the scheduled aspect of the upgrades. |
I updated the agents manually from the CLI, but that didn't resolve the updating status either. I'm not sure how to reset the status now, the agents appear happy. How did you force things via the API? |
elastic/kibana#135539 (comment)
|
This changed the status to unhealthy until I restarted the agent and then it went back to updating. I should stress that our agents are already upgraded to 8.6.1. Agent log:
|
Okay, the difference with my agents is that they were still reporting 8.5.3 and stuck updating. After the force upgrade with API they upgraded with no issues and got out of the Updating state. |
I probably should have been a bit more patient, and not jumped the gun on manually updating the agents. For the agents reporting to the fleet server I can unenroll and redo the installation, but having to do this for an increasing number of agents each time we upgrade is far from ideal. And as for the Fleet server itself, this is not so easy, it's also stuck in updating. |
I am not sure if these reports are all experiencing the same root cause, without agent logs it is impossible to tell. It is unlikely to be the same bug described in this issue, just a similar symptom.
This is coming from the code below. Is there an accompany elastic-agent/internal/pkg/agent/application/upgrade/step_unpack.go Lines 31 to 41 in 973af90
|
The error is given after a reboot, I'm not sure if the agent or the fleet server would for some reason try to rerun the upgrade. Or if it fails because it's trying to reinstall the same version. There are no further accompanying error messages. The rest of the messages show a normal restart of the agent and everything eventually returning healthy. Yet Fleet will still show the agent as updating. I'm tempted to leave things as they are with 15 healthy agents and 8 agents showing as updating but healthy when checked locally. And then see how things go with 8.6.2. |
Just to clarify, did the agents that show as updating complete the upgrade to the desired version successfully? If so, then I suspect this is a synchronization issue in the UI and it would be fine to leave the agents like this although we would like to know why this is. I'll need to find someone from the Fleet UI team to think about why this might be the case if this is what is happening. |
No, they rolled back to 8.6.0 after a while but remained as updating in Fleet. I couldn't retry as Fleet was stuck in updating for these agents. So I did what at the time I thought was the only thing I could do and did TBH, I must have had two different issues across my agents as some agents came back as healthy when I update the policy applied to them. What is certain though is that it would be nice to have a way to force Fleet to check the status of the agent and if the agent is happy and on the right version then surely the status should be updated in Fleet. As for the Fleet server itself showing as updating, I'm at a loss on how to fix that other than removing the Fleet server and reinstalling the agent there. |
OK thanks we have one bug we just discovered where if an upgrade fails and then rolls back, the local agent downloads directory needs to be restored or the next upgrade will fail: #2222. The That would not have been the original problem but it may be affecting you now. If you can get us the logs for these failed upgrades we would love to look at them to confirm what is happening. If you have failed upgrades you will have multiple |
I found that by issuing the upgrade command locally, the upgrade did succeed. There's only one
So my current state is that the agent upgrade succeeded, but the Fleet server remains unaware of it. |
@dmgeurts thanks for reporting this. I agree it's unlikely to be the same root cause. A few questions that may help us narrow down this part of the problem:
|
Yes, Fleet shows the version number as 8.6.1 and the status as updating.
Aren't the logs deleted when an upgrade succeeds? This is my assumption as the logs folder is inside the agent folder and I only have one agent folder under
No, I haven't created a LogStash output yet. All agents are still using the default (direct to Elasticsearch). |
Seems it is possible that an upgrade action can result in the Elastic Agent successfully upgrading, but never fully ACKing the action.
Seems to happen when an Elastic Agent is having connectivity issues with Fleet Server on successful upgrade. I have seen this issue reported in an SDH and worked around it, but we need to track down how and why this happens. The upgrade process needs to be foolproof, and this is an issue that needs to be tracked down and fixed.
For confirmed bugs, please report:
Known causes: