-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ark server needs to handle crashed/stopped plugin processes #481
Comments
Hi @lli-hiya, thanks for the question! The error When you have a well-functioning pod, could you please |
You can also try setting the server's log level to debug:
args:
- server
- --log-level
- debug This will include more information about launching and terminating plugins. |
@skriss this is interesting reading: hashicorp/go-plugin#31 |
@jfoy awesome, thanks! Now I'm looking forward to seeing what's in an unhealthy pod... |
@ncdc I will try to give that information to you later if the error happens again. |
@ncdc if this is in fact due to the plugin process exiting (sounds like it), moving to short-lived plugins for object/block store as you've proposed would probably mitigate this issue. |
@ncdc Since I don't know when the error will happen, as a demonstration I manually killed this process Although this cannot prove that our original failure was because of the cloudprovider process got terminated, at least it can show some evidence. |
Follow up: the error happened in my original cluster during the weekend, and the |
@lli-hiya thanks for the update. We are actively redesigning our plugin management code to be more robust and handle this sort of situation. Is there anything in your pod's logs that indicates what happened to the plugin process? |
No more than what I posted before because due to some unknown reason the log-level was reverted back to info. I just changed it to debug again, and will update this thread if I see something useful |
@ncdc Here it is. This is the log around the our daily backup time, and you can see it shows that the plugin process terminates.
|
Just add to this thread. We identified the reason for the failure. It was because the memory usage jumped beyond its limit and causing the |
@lli-hiya thanks for the update. What limit did you have set for the pod? |
@ncdc We increased the memory limit to avoid it to be killed. Now it looks like
|
Fixed by #710 |
I just encountered the same issue with velero 1.1.0:
I’ll create a new issue if this reproduces again. More details are below. A few hours before, backups were completing successfully. Now, backups are marked as Failed with the following in the container logs:
Exec-ing into the pod shows that the plugins process is not running:
After deleting the pod, the new pod launched with the plugins process:
The pod has the following limits:
|
@iamakulov please follow this issue for an error like what you reported: #1856. |
I've had a question that I don't know how to debug -- I noticed that Ark can suddenly fail to make its backup (after it made several successful backups) and the error logs looks like
This error was first noticed when I was using ark v0.6, and now I upgrade to ark v0.8.1 and the error still persists.
Some symptoms that I noticed when the error happens
ark backup get
works), but it cannot show backups logs (ark backup logs <backup>
fails)kubectl logs -f <pod>
works)kubectl delete pod
), everything works again and further backups can succeed maybe for a couple of days and then this error will happen againMy interpretation
What I am asking here is
Thanks for all the attentions.
The text was updated successfully, but these errors were encountered: