
Ark server needs to handle crashed/stopped plugin processes #481

Closed

lli-hiya opened this issue May 10, 2018 · 19 comments

lli-hiya (Contributor) commented May 10, 2018

I have a question that I don't know how to debug: I noticed that Ark can suddenly fail to make its backups (after making several successful ones), and the error log looks like:

2018-05-10T00:04:13.972+00:00 10.1.69.73 dockerd[4085]: time="2018-05-10T00:04:13Z" level=error msg="Error uploading log file" bucket=<some-bucket> error="rpc error: code = Internal desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin763922082: connect: connection refused"" key=<some-key>-logs.gz logSource="pkg/cloudprovider/backup_service.go:148".

I first noticed this error when I was using Ark v0.6; I have since upgraded to Ark v0.8.1 and the error still persists.

Some symptoms I noticed when the error happens:

  • The Ark client can still list backups (ark backup get works), but it cannot show backup logs (ark backup logs <backup> fails)
  • Kubectl can still show pod logs (kubectl logs -f <pod> works)
  • If I don't take any action, all future backups fail and are not uploaded to S3.
  • If I bounce the pod (kubectl delete pod), everything works again; further backups succeed for maybe a couple of days, and then the error happens again.

My interpretation:

  • This error is caused by the failure to connect to S3
  • It is not an authentication/authorization error, because it DOES work for a while after I bounce the pod

What I am asking here is:

  1. Has anyone seen this problem, or this kind of error log, before?
  2. What might cause Ark to suddenly lose its ability to connect to S3?

Thanks for your attention.

ncdc (Contributor) commented May 10, 2018

Hi @lli-hiya, thanks for the question! I believe the error transport: error while dialing: dial unix /tmp/plugin763922082: connect: connection refused is coming from the communication between the Ark server process and one of its plugins. It sounds like the plugin process has exited and the Ark server isn't handling that well.

When you have a well-functioning pod, could you please kubectl exec into it and get ps information on all the running processes? And then when you start having issues, could you capture the same information?
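
For reference, a minimal sketch of the commands involved (the namespace and pod name below are placeholders, assuming a default heptio-ark install):

# namespace and pod name are placeholders; substitute the actual Ark server pod
kubectl -n heptio-ark get pods
kubectl -n heptio-ark exec -it <ark-server-pod> -- ps aux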

ncdc added the question label May 10, 2018
ncdc (Contributor) commented May 10, 2018

You can also try setting the server's log level to debug:

kubectl -n heptio-ark edit deploy/ark, then change args so it looks like this:

args:
  - server
  - --log-level
  - debug

This will include more information about launching and terminating plugins.
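
If you'd rather not edit the deployment interactively, an equivalent patch (a sketch, assuming the container's args are exactly the three values shown above):

# JSON-patch the first container's args on the ark deployment (assumed default layout)
kubectl -n heptio-ark patch deploy/ark --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": ["server", "--log-level", "debug"]}]'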

ncdc (Contributor) commented May 10, 2018

@skriss this is interesting reading: hashicorp/go-plugin#31

jfoy commented May 10, 2018

@ncdc Here's what we see in a healthy pod. (I work with @lli-hiya.)

PID   USER     TIME   COMMAND
    1 nobody     0:14 /ark server --logtostderr --v 4
   19 nobody     0:00 /ark run-plugin cloudprovider aws
   52 nobody     0:00 sh
   60 nobody     0:00 ps uaxf

ncdc (Contributor) commented May 10, 2018

@jfoy awesome, thanks! Now I'm looking forward to seeing what's in an unhealthy pod...

lli-hiya (Contributor Author)

@ncdc I will try to give that information to you later if the error happens again.

skriss (Contributor) commented May 10, 2018

@ncdc if this is in fact due to the plugin process exiting (it sounds like it is), moving to short-lived plugins for the object/block stores, as you've proposed, would probably mitigate this issue.

lli-hiya (Contributor Author) commented May 11, 2018

@ncdc Since I don't know when the error will happen, as a demonstration I manually killed the process 19 nobody 0:00 /ark run-plugin cloudprovider aws, and then I observed EXACTLY the same symptoms: the Ark client hangs when showing logs, backups fail, etc.

Although this doesn't prove that our original failure happened because the cloudprovider process was terminated, it at least provides some evidence.
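
For anyone trying to reproduce this, the kill step can be done from outside the pod, roughly like this (a sketch; the pod name is a placeholder and the PID comes from the ps output earlier in this thread):

# kill the cloudprovider plugin process inside the Ark server pod
kubectl -n heptio-ark exec -it <ark-server-pod> -- kill 19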

ncdc changed the title from "[Question and need help] Backup failed and ark client stops responding" to "Ark server needs to handle crashed/stopped plugin processes" May 11, 2018
lli-hiya (Contributor Author)

Follow-up: the error happened in my original cluster over the weekend, and ps aux inside the pod shows that the cloudprovider process is gone. I think this confirms our guess: Ark failed to handle the plugin process exiting.

ncdc (Contributor) commented May 14, 2018

@lli-hiya thanks for the update. We are actively redesigning our plugin management code to be more robust and handle this sort of situation.

Is there anything in your pod's logs that indicates what happened to the plugin process?

lli-hiya (Contributor Author)

No more than what I posted before, because for some unknown reason the log level was reverted back to info. I just changed it to debug again and will update this thread if I see something useful.

lli-hiya (Contributor Author)

@ncdc Here it is. This is the log around our daily backup time; you can see that the plugin process terminates.

time="2018-05-16T00:00:46Z" level=debug msg="Backup has not expired yet, skipping" backup=heptio-ark/<xxx> expiration="2018-05-24 00:00:46 +0000 UTC" logSource="pkg/controller/gc_controller.go:129"
time="2018-05-16T00:00:46Z" level=debug msg="waiting for RPC address" logSource="pkg/plugin/logrus_adapter.go:74" path=/ark
time="2018-05-16T00:00:46Z" level=info msg="Checking for expired DeleteBackupRequests" controller=backup-deletion logSource="pkg/controller/backup_deletion_controller.go:285"
time="2018-05-16T00:00:46Z" level=info msg="Done checking for expired DeleteBackupRequests" controller=backup-deletion logSource="pkg/controller/backup_deletion_controller.go:313"
time="2018-05-16T00:00:46Z" level=debug msg="starting plugin" args="[/ark run-plugin backupitemaction pod]" logSource="pkg/plugin/logrus_adapter.go:74" path=/ark
time="2018-05-16T00:00:46Z" level=debug msg="plugin address" address=/tmp/plugin537000931 logSource="pkg/plugin/logrus_adapter.go:74" network=unix pluginName=ark
time="2018-05-16T00:00:46Z" level=debug msg="waiting for RPC address" logSource="pkg/plugin/logrus_adapter.go:74" path=/ark
time="2018-05-16T00:00:47Z" level=debug msg=gcController.enqueueAllBackups logSource="pkg/controller/gc_controller.go:88"
time="2018-05-16T00:00:47Z" level=debug msg="plugin address" address=/tmp/plugin854132941 logSource="pkg/plugin/logrus_adapter.go:74" network=unix pluginName=ark
I0516 00:00:50.982700       1 reflector.go:428] github.com/heptio/ark/pkg/generated/informers/externalversions/factory.go:73: Watch close - *v1.Backup total 2 items received
time="2018-05-16T00:01:16Z" level=debug msg="plugin process exited" logSource="pkg/plugin/logrus_adapter.go:74" path=/ark
time="2018-05-16T00:01:16Z" level=error msg="Error uploading log file" bucket=<xxx> error="rpc error: code = Unavailable desc = transport is closing" key=<xxx>/<yyy>-logs.gz logSource="pkg/cloudprovider/backup_service.go:148"
time="2018-05-16T00:01:16Z" level=info msg="Backup completed" backup=heptio-ark/<yyy> logSource="pkg/controller/backup_controller.go:375"
time="2018-05-16T00:01:17Z" level=error msg="backup failed" error="rpc error: code = Internal desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin457010137: connect: connection refused\"" key=heptio-ark/<yyy> logSource="pkg/controller/backup_controller.go:273"
time="2018-05-16T00:01:17Z" level=debug msg="Updating backup's final status" key=heptio-ark/<yyy> logSource="pkg/controller/backup_controller.go:277"```

ncdc mentioned this issue May 16, 2018
lli-hiya (Contributor Author)

Just adding to this thread: we identified the reason for the failure. Memory usage jumped beyond its limit, causing the cloudprovider process to be killed.
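
One way to confirm this kind of kill (a sketch, assuming you have access to the node the Ark pod was scheduled on): when a process in the container's cgroup exceeds the memory limit, the kernel's oom-killer logs an event in the node's kernel log naming the process it killed.

# run on the node, not inside the pod
dmesg | grep -i -E 'oom|killed process'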

ncdc (Contributor) commented Jun 12, 2018

@lli-hiya thanks for the update. What limit did you have set for the pod?

lli-hiya (Contributor Author) commented Jun 12, 2018

@ncdc We increased the memory limit to keep it from being killed. Now it looks like:

            limits:
              cpu: 200m
              memory: 300Mi
            requests:
              cpu: 10m
              memory: 40Mi

rosskukulinski added this to the v1.0.0 milestone Jun 24, 2018
rosskukulinski (Contributor)

References #94 regarding cpu/memory limits. PR #495 is in flight and should resolve this issue.

ncdc (Contributor) commented Jul 31, 2018

Fixed by #710

ncdc closed this as completed Jul 31, 2018
iamakulov

I just encountered the same issue with velero 1.1.0:

$ velero version
Client:
        Version: v1.1.0
        Git commit: a357f21aec6b39a8244dd23e469cc4519f1fe608
Server:
        Version: v1.1.0

I’ll create a new issue if this reproduces again. More details are below.


A few hours before, backups were completing successfully. Now, backups are marked as Failed with the following in the container logs:

$ kubectl logs velero-74c7b8c545-w5cgn -n velero | grep error     
time="2019-09-07T11:11:55Z" level=error msg="Error uploading log file" backup=before-kops-1.11 bucket=chorus-backup-bucket-velero-production error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/persistence/object_store.go:196" prefix=
time="2019-09-07T11:11:55Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" key=velero/before-kops-1.11 logSource="pkg/controller/backup_controller.go:233"
time="2019-09-07T11:12:20Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/backup_sync_controller.go:180"
time="2019-09-07T11:13:20Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/backup_sync_controller.go:180"
time="2019-09-07T11:14:21Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/backup_sync_controller.go:180"
time="2019-09-07T11:15:21Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/backup_sync_controller.go:180"
time="2019-09-07T11:16:22Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/backup_sync_controller.go:180"
time="2019-09-07T11:17:23Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/backup_sync_controller.go:180"

Exec-ing into the pod shows that the plugins process is not running:

nobody@velero-74c7b8c545-w5cgn:/$ ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody       1  1.3 12.0 1128296 986756 ?      Ssl  09:02   1:52 /velero server
nobody    1485  0.0  0.0  18508  3412 ?        Ss   11:19   0:00 bash
nobody    1493  0.0  0.0  34400  2792 ?        R+   11:19   0:00 ps -aux

After deleting the pod, the new pod launched with the plugins process:

nobody@velero-74c7b8c545-scsq8:/$ ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody       1  0.4  0.4 157140 40664 ?        Ssl  11:24   0:00 /velero server
nobody      22  0.2  0.5 157140 41816 ?        Sl   11:24   0:00 /velero run-plugins --log-level info
nobody      40  0.0  0.0  18508  3308 ?        Ss   11:25   0:00 bash
nobody      46  0.0  0.0  34400  2896 ?        R+   11:25   0:00 ps -aux

The pod has the following limits:

resources:
  limits:
    cpu: 400m
    memory: 1Gi
  requests:
    cpu: 200m
    memory: 512Mi

carlisia (Contributor)

@iamakulov please follow this issue for errors like the one you reported: #1856.
