[umbrella issue] arm ci for node e2e testing #29693

Closed
9 of 11 tasks
Tracked by #29946
pacoxu opened this issue Jun 6, 2023 · 25 comments
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@pacoxu
Member

pacoxu commented Jun 6, 2023

What should be cleaned up or changed:
https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-arm64-ubuntu-serial

Tasks

  1. endocrimes pacoxu
  2. dims
  3. area/testgrid size/S
  4. approved cncf-cla: yes lgtm ok-to-test size/XS
    MushuEE spiffxp
  5. approved cncf-cla: yes kind/bug lgtm priority/important-soon release-note sig/node sig/testing size/XS triage/accepted
  6. SergeyKanzhelev
  7. approved area/config area/jobs cncf-cla: yes lgtm sig/node sig/testing size/XS
    chaodaiG

Provide any links for context:
part of kubernetes/kubernetes#118441

/sig node
/kind failing-test

@pacoxu pacoxu added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Jun 6, 2023
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels Jun 6, 2023
@pacoxu pacoxu changed the title kubelet-gce-e2e-arm64-ubuntu-serial keeps failing ci-kubernetes-node-arm64-ubuntu-serial keeps failing Jun 6, 2023
@pacoxu
Member Author

pacoxu commented Jun 6, 2023

/assign @chendave
/cc @SergeyKanzhelev

@chendave
Member

chendave commented Jun 6, 2023

@pacoxu thanks for this! I will find some time to check it.

@SergeyKanzhelev
Member

/assign @ike-ma

@ike-ma can you please take a look?

@k8s-ci-robot
Contributor

@SergeyKanzhelev: GitHub didn't allow me to assign the following users: ike-ma.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


@SergeyKanzhelev
Member

/assign @ike-ma

Looking at https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-arm64-ubuntu-serial I see the following in logs:

./ginkgo: 1: �ELF����: not found
./ginkgo: 1: Syntax error: end of file unexpected (expecting ")")

I think this indicates that a binary built for the wrong architecture is being executed on the arm64 machine.

@ike-ma
Contributor

ike-ma commented Jun 9, 2023

This ELF "not found" error typically indicates that we are running an x86_64 binary on an arm64 machine.

Looking at the failure log: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-arm64-ubuntu-serial/1667168379250675712/artifacts/build-log.txt

The build targets linux/amd64 instead of linux/arm64:

I0609 14:00:17.433987    9511 remote.go:72] Building archive...
I0609 14:00:17.434329    9511 build.go:45] Building k8s binaries...
+++ [0609 14:00:17] Setting GOMAXPROCS: 4
+++ [0609 14:00:17] Building go targets for linux/amd64
    k8s.io/kubernetes/cmd/kubelet (non-static)
    k8s.io/kubernetes/test/e2e_node/e2e_node.test (test)
    github.com/onsi/ginkgo/v2/ginkgo (non-static)
    k8s.io/kubernetes/cluster/gce/gci/mounter (non-static)
    k8s.io/kubernetes/test/e2e_node/plugins/gcp-credential-provider (non-static)

@ike-ma
Contributor

ike-ma commented Jun 9, 2023

It seems this wasn't built in a dockerized environment: https://github.com/kubernetes/kubernetes/blob/c840c947554e10e6b2918e731c86dbd0e541361a/test/e2e_node/builder/build.go#L52-L56

	if IsDockerizedBuild() {
		klog.Infof("Building dockerized k8s binaries targets %s for architecture %s", targets, GetTargetBuildArch())
		// Multi-architecture build is only supported in dockerized build
		cmd = exec.Command(filepath.Join(k8sRoot, "build/run.sh"), "make", fmt.Sprintf("WHAT=%s", targets), fmt.Sprintf("KUBE_BUILD_PLATFORMS=%s", GetTargetBuildArch()))
	}
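
To make the effect of the two flags concrete, here is a simplified, self-contained sketch of the selection logic. It is an approximation for illustration, not the actual build.go: the function buildCommand, the placeholder root path, the flag defaults, and the exact form of the non-dockerized make invocation are assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// buildCommand returns the argv that would be used to build the targets.
// Hypothetical helper mirroring the dockerized branch quoted above.
func buildCommand(k8sRoot string, targets []string, dockerized bool, arch string) []string {
	what := "WHAT=" + strings.Join(targets, " ")
	if dockerized {
		// Cross-architecture builds go through build/run.sh.
		return []string{
			filepath.Join(k8sRoot, "build/run.sh"),
			"make", what, "KUBE_BUILD_PLATFORMS=" + arch,
		}
	}
	// Assumption: without the dockerized build, a plain make builds for
	// the host platform and the requested architecture is ignored.
	return []string{"make", "-C", k8sRoot, what}
}

func main() {
	fs := flag.NewFlagSet("build-sketch", flag.ExitOnError)
	// Flag names come from the thread; the defaults are assumptions.
	dockerized := fs.Bool("use-dockerized-build", false, "build inside the dockerized environment")
	arch := fs.String("target-build-arch", "linux/amd64", "target platform for the dockerized build")
	fs.Parse(os.Args[1:])

	targets := []string{"cmd/kubelet", "test/e2e_node/e2e_node.test"}
	cmd := buildCommand("/path/to/kubernetes", targets, *dockerized, *arch)
	fmt.Println(strings.Join(cmd, " "))
}
```

With neither flag set, the sketch prints a plain make invocation, which matches the "Building go targets for linux/amd64" line in the failing job's log.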

@ike-ma
Contributor

ike-ma commented Jun 9, 2023

Oh I see, in #29662 the args --use-dockerized-build=true and --target-build-arch=linux/arm64 were removed, since we are using kubetest2 to trigger the test now.

These flags are defined here: https://github.com/kubernetes/kubernetes/blob/c840c947554e10e6b2918e731c86dbd0e541361a/test/e2e_node/builder/build.go#L32-L33

We should add support to kubetest2 for these two args so that it can pass the flags to run_remote.go, before any binaries are built.

If we move them to test-arg, I think they will just be passed to an already-built ginkgo as ginkgo-flags, which is too late to affect the build.

A working command (that will be generated by kubetest2) should look something like this:

go run /usr/local/google/home/ikema/go-k8s-oss/src/k8s.io/kubernetes/test/e2e_node/runner/remote/run_remote.go \
  --cleanup -vmodule=*=4 --ssh-env=gce --results-dir=/tmp/e2e-node-results/ason --project=ikema-gke-dev-2 \
  --use-dockerized-build=true --target-build-arch=linux/arm64 \
   --zone=us-central1-a --ssh-user=ikema --ssh-key=/usr/local/google/home/ikema/.ssh/google_compute_engine \
   --ginkgo-flags='--nodes=1 --focus="\[Serial\]" --skip="\[Flaky\]|\[Slow\]|\[Benchmark\]|\[NodeSpecialFeature:.+\]|\[NodeSpecialFeature\]|\[NodeAlphaFeature:.+\]|\[NodeAlphaFeature\]|\[NodeFeature:Eviction\]|\[NodeFeature:NodeProblemDetector\]|\[NodeFeature:OOMScoreAdj\]|\[NodeFeature:DevicePluginProbe\]|\[NodeConformance\]" '  \
   --test-timeout=5h0m0s --image-config-file=/tmp/e2e-node-results/ason/image-config.yaml 

@ike-ma
Contributor

ike-ma commented Jun 9, 2023

We have two options:

  1. Roll back all kubetest2-related changes for now. I think adding the extra_ref to k8s.io/test-infra, so that the image config file can be found, has already fixed the issue seen in https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-arm64-ubuntu-serial/1664144092101087232

    • Pro: Quick to get the test run green (since we have previously tested running the test the bootstrap way)
    • Con: We still need to migrate to kubetest2 at some point
  2. Investigate and fix within the kubetest2 paradigm

    • Pro: This is the eventual state we want to be in
    • Con: Might take longer to turn the test green (since we still need some trial and error to figure out the kubetest2 wiring)

Which option do folks prefer?

@SergeyKanzhelev
Member

@BenTheElder wdyt?

/sig testing

I'd prefer to go green faster so number 1.

@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jun 9, 2023
@chendave
Member

chendave commented Jun 10, 2023

@ike-ma @SergeyKanzhelev @BenTheElder

The fix is already up for review; it just needs someone to approve it.

Here are the PRs:
kubernetes-sigs/kubetest2#229
kubernetes/kubernetes#118567
#29727

@chendave
Member

chendave commented Jun 10, 2023

Rollback all kubetest2 related changes for now

We had trouble getting kubetest to recognize the --use-dockerized-build parameter, which is why we moved to kubetest2 in the first place:

#29617

Which option do folks prefer?

Let's go with option 2; we need the migration eventually.

@chendave
Member

I just gave a shout-out in Slack; I hope we can get the fix in ASAP. https://kubernetes.slack.com/archives/C09QZ4DQB/p1686368336431369

Also, reviews are very welcome:
kubernetes-sigs/kubetest2#229
kubernetes/kubernetes#118567
#29727

@chendave
Member

chendave commented Jun 14, 2023

I have commented on this issue: kubernetes/kubernetes#118441 (comment)

For the record, here are the related PRs:

kubernetes-sigs/kubetest2#229
kubernetes/kubernetes#118567
#29727
#29712

There is more work to do to make sure every node e2e test case passes on arm64. I will start working on it next month.

@pacoxu
Member Author

pacoxu commented Jun 14, 2023

Good job! The test can run completely now.

For the remaining DRA-related failures, I am not sure whether you want to open a new issue to track them or use this one.

@chendave
Member

chendave commented Jun 14, 2023

We can leave this issue open to track all the fixes and problems for the CI job. You can retitle it to something like "umbrella issue of arm ci for node e2e testing" if that works for you.

@chendave
Member

By the way, the current job skips several test cases. As discussed with @ike-ma, we will try to enable all of them eventually.

@pacoxu pacoxu changed the title ci-kubernetes-node-arm64-ubuntu-serial keeps failing umbrella issue of arm ci for node e2e testing Jun 14, 2023
@pacoxu pacoxu changed the title umbrella issue of arm ci for node e2e testing [umbrella issue] arm ci for node e2e testing Jun 14, 2023
@chendave
Member

The DRA test cases fail on x86_64 as well: https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu-serial

@chendave
Member

cc @pohly

@pohly
Contributor

pohly commented Jun 14, 2023

The DRA testcases depend on alpha features and a suitable container runtime. They should get skipped in "normal" E2E node testing jobs based on the tags in the test description. I am not up-to-date on the current usage of NodeFeature. See https://github.com/kubernetes/enhancements/tree/master/keps/sig-testing/3041-node-conformance-and-features for some background.

/cc @bart0sh

@bart0sh
Contributor

bart0sh commented Jun 14, 2023

According to the Kubelet logs the DRA plugin couldn't even register. This is not related to the container runtime in use.
However, even once registration is fixed, CRI-O or Containerd 1.7 should be used as the container runtime for the Node DRA tests to succeed.

@pohly
Contributor

pohly commented Jun 14, 2023

According to the Kubelet logs the DRA plugin couldn't even register.

Perhaps because the feature gate wasn't enabled?

@bart0sh
Contributor

bart0sh commented Jun 14, 2023

Most probably. I doubt the failure is related to the arm changes, as I'm able to run all DRA tests on my M2 Mac (arm64) in a multipass (QEMU) instance.

@chendave
Member

/close

All tasks from this issue have been finished.

@k8s-ci-robot
Contributor

@chendave: Closing this issue.


@github-project-automation github-project-automation bot moved this from Issues - To do to Done in SIG Node CI/Test Board Oct 27, 2023
7 participants