
Test got terminated randomly after upgrading to gcr.io/k8s-testimages/kubekins-e2e:v20191017-ac4b4b5-master #14938

Closed
Random-Liu opened this issue Oct 23, 2019 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Random-Liu (Member)

Right after the upgrade in 71aae2c#diff-bc1a19dd8cab7a55902e9a81d5f4d935, the containerd Windows test started being randomly terminated in the middle of a run. We started seeing this immediately after that image update.

Random-Liu added the kind/bug label on Oct 23, 2019
@Random-Liu (Member, Author)

Random-Liu commented Oct 23, 2019

An example test failure: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-cri-containerd-cri-validation-windows/1184908137358954497

And from the dashboard https://k8s-testgrid.appspot.com/sig-node-containerd#cri-validation-windows, we can see that the first failure occurred at 10-17 12:04, while the change was merged at 10-17 11:08.

@BenTheElder (Member)

BenTheElder commented Oct 23, 2019

Do you have a specific example?
Edit: there's a lot of noise in https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-cri-containerd-cri-validation-windows/1184908137358954497, which part specifically?

@Random-Liu (Member, Author)

In some of the test failures, we can find this:
cat: write error: Resource temporarily unavailable

For example: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-cri-containerd-cri-validation-windows/1185015668978749444/build-log.txt

@Random-Liu (Member, Author)

This seems to be a bash issue; there are some previous bug reports, but I haven't looked into them yet.

If it is a bash issue, it could be caused by the Debian update from stretch to buster.

@Random-Liu (Member, Author)

Random-Liu commented Oct 23, 2019

Based on nodejs/node#14752, it seems that something set stdout to O_NONBLOCK mode, and cat-ing a large file in that mode hits cat: write error: Resource temporarily unavailable.
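
Here is a minimal repro sketch of that failure mode (my own illustration, not from the test logs; assumes Linux and Python 3): put stdout into non-blocking mode and keep writing until the pipe buffer fills, at which point the write fails with EAGAIN, the same errno behind the cat error. Run it with a reader that never drains the pipe, e.g. python repro.py | sleep 10.

import fcntl
import os
import sys

# Flip stdout into non-blocking mode, mimicking whatever the test pipeline does.
fd = sys.stdout.fileno()
flags = fcntl.fcntl(fd, fcntl.F_GETFL)
fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

written = 0
try:
    # Keep writing until the pipe buffer (typically 64 KiB on Linux) is full.
    while True:
        written += os.write(fd, b"x" * 65536)
except BlockingIOError as err:
    # EAGAIN: "Resource temporarily unavailable", the same error cat reports.
    print("write failed after %d bytes: %s" % (written, err), file=sys.stderr)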

The Windows test runs a lot of gcloud ssh and gcloud scp commands, and gcloud was also updated in that window:

$ docker run --entrypoint=/bin/bash gcr.io/k8s-testimages/kubekins-e2e:v20191017-ac4b4b5-master gcloud version
Google Cloud SDK 267.0.0
alpha 2019.05.17
beta 2019.05.17
bq 2.0.49
core 2019.10.15
gsutil 4.44
kubectl 2019.09.22

$ docker run --entrypoint=/bin/bash gcr.io/k8s-testimages/kubekins-e2e:v20191012-482f444-master gcloud version
Google Cloud SDK 250.0.0
alpha 2019.05.17
beta 2019.05.17
bq 2.0.43
core 2019.06.07
gsutil 4.38
kubectl 2019.06.07

It seems very likely that the new gcloud version has a bug that sets stdout to O_NONBLOCK mode.
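
A quick way to check that suspicion (a sketch, assuming gcloud is on PATH and the child process inherits our stdout): read stdout's flags before and after invoking gcloud. fcntl F_SETFL changes the open file description, which parent and child share, so if gcloud flips O_NONBLOCK, the parent sees it afterwards.

import fcntl
import os
import subprocess
import sys

def stdout_nonblock_bit():
    # 2048 (os.O_NONBLOCK on Linux) if stdout is non-blocking, 0 otherwise.
    return fcntl.fcntl(sys.stdout.fileno(), fcntl.F_GETFL) & os.O_NONBLOCK

print("before:", stdout_nonblock_bit())   # expect 0
# gcloud inherits our stdout file description here.
subprocess.run(["gcloud", "version"], check=True)
print("after:", stdout_nonblock_bit())    # 2048 here would implicate gcloud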

For now, I'll try applying the fix from uwcirg/truenth-portal#2689 to the test, to set stdout back to blocking mode before the large cat.

@Random-Liu (Member, Author)

It does turn out that O_NONBLOCK is set by something (containerd/cri#1324):

+ python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); print(flags&os.O_NONBLOCK);'
2048
+ python -c 'import os,sys,fcntl; flags = fcntl.fcntl(sys.stdout, fcntl.F_GETFL); fcntl.fcntl(sys.stdout, fcntl.F_SETFL, flags & ~os.O_NONBLOCK);'

I highly suspect that it is gcloud.

Anyway, I think this is more a gcloud issue than a test-infra issue, so I'll close this one for now. People who hit a similar issue in the future can reference this issue to find the workaround.
