"invalid memory address or nil pointer dereference" #1382

Closed
qiankunli opened this issue Aug 27, 2021 · 6 comments

Comments

@qiankunli
Contributor

The old tf-operator version was v0.5.3. I upgraded only the tf-operator controller, without upgrading the CRD, service account, etc.
The new tf-operator image is public.ecr.aws/j1r0q0g6/training/tf-operator:47a74b738920edbf4207160cec7e1dff9cdab3f2

When I create the following TFJob, the tf-operator panics:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  creationTimestamp: "2021-08-27T09:06:31Z"
  generation: 1
  name: v1-tensorflow-0728-1
  namespace: xdl-system
  resourceVersion: "237134489"
  selfLink: /apis/kubeflow.org/v1/namespaces/xdl-system/tfjobs/v1-tensorflow-0728-1
spec:
  cleanPodPolicy: Running
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: queue
                    operator: In
                    values:
                    - dev
                    - product-nlp
                    - product-recsys
          containers:
          - command:
            - python
            - mnist.py
            env:
            ...
            image: ...
            imagePullPolicy: Always
            name: tensorflow
            resources:
              limits:
                cpu: "5"
                memory: 5Gi
                nvidia.com/gpu: "10"
            volumeMounts:
            - mountPath: /dashboard
              name: dashboard-volume
              readOnly: false
          tolerations:
          - effect: NoSchedule
            key: node-role.kubernetes.io/master
            operator: Exists
          volumes:
          - hostPath:
              path: /mnt/cephfs/xdl/dashboard
              type: DirectoryOrCreate
            name: dashboard-volume
  ttlSecondsAfterFinished: 259200

tf-operator container logs:

{"filename":"tensorflow/controller.go:304","job":"xdl-system.v1-tensorflow-0728-1","level":"info","msg":"Finished syncing tfjob \"xdl-system/v1-tensorflow-0728-1\" (886.184µs)","time":"2021-08-27T09:10:44Z"}
E0827 09:10:44.704207       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 552 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x153e720, 0x245e5a0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x82
panic(0x153e720, 0x245e5a0)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).calcPGMinResources(0xc000205200, 0xc000000001, 0xc000e4f020, 0x1)
	/go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:429 +0x168
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc000205200, 0x16db720, 0xc001291a00, 0xc000e4f020, 0xc0001bf3b0, 0x1, 0x1, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:300 +0x1fae
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).syncTFJob(0xc000205200, 0xc0016ede40, 0x1f, 0x0, 0x0, 0x0)
	/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:335 +0x5b2
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).processNextWorkItem(0xc000205200, 0x0)
	/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:272 +0x6e0
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).runWorker(0xc000205200)
	/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:222 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00155d190)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00155d190, 0x3b9aca00, 0x0, 0x1, 0xc00018a000)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc00155d190, 0x3b9aca00, 0xc00018a000)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).Run
	/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:208 +0x2c4
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x13bd3b8]

goroutine 552 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0x105
panic(0x153e720, 0x245e5a0)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).calcPGMinResources(0xc000205200, 0xc000000001, 0xc000e4f020, 0x1)
	/go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:429 +0x168
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc000205200, 0x16db720, 0xc001291a00, 0xc000e4f020, 0xc0001bf3b0, 0x1, 0x1, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:300 +0x1fae
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).syncTFJob(0xc000205200, 0xc0016ede40, 0x1f, 0x0, 0x0, 0x0)
	/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:335 +0x5b2
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).processNextWorkItem(0xc000205200, 0x0)
	/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:272 +0x6e0
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).runWorker(0xc000205200)
	/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:222 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00155d190)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00155d190, 0x3b9aca00, 0x0, 0x1, 0xc00018a000)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc00155d190, 0x3b9aca00, 0xc00018a000)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).Run
	/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:208 +0x2c4
@gaocegege
Member

/kind bug

@gaocegege
Member

github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).calcPGMinResources(0xc000205200, 0xc000000001, 0xc000e4f020, 0x1)
	/go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:429 +0x168
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc000205200, 0x16db720, 0xc001291a00, 0xc000e4f020, 0xc0001bf3b0, 0x1, 0x1, 0x0, 0x0, 0x0, ...)
	/go/pkg/mod/github.com/kubeflow/[email protected]/pkg/controller.v1/common/job.go:300 +0x1fae
github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow.(*TFController).syncTFJob(0xc000205200, 0xc0016ede40, 0x1f, 0x0, 0x0, 0x0)
	/go/src/github.com/kubeflow/tf-operator/pkg/controller.v1/tensorflow/controller.go:335 +0x5b2

Let me take a look at kubeflow/common.

@gaocegege
Member

It seems that PriorityClassLister, which is used at https://github.com/kubeflow/common/blob/9d86006b1ee7d5eb9e9ee4023858cc3398877ad0/pkg/controller.v1/common/job.go#L429, is not initialized correctly.
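
For readers unfamiliar with this failure mode, here is a minimal, self-contained sketch of how an unset lister field can produce this kind of panic, and how a nil guard turns it into an ordinary error. It is not the kubeflow/common code: `jobController`, `priorityClassLister`, `mapLister`, and both `priority*` methods are hypothetical names made up for illustration, and the real fix may well differ (for example, initializing the lister when the controller is constructed).

```go
package main

import (
	"errors"
	"fmt"
)

// priorityClassLister is a hypothetical stand-in for the lister interface
// a controller would use to look up priority classes.
type priorityClassLister interface {
	Get(name string) (int32, error)
}

// mapLister is a toy in-memory implementation used only in this sketch.
type mapLister map[string]int32

func (m mapLister) Get(name string) (int32, error) {
	v, ok := m[name]
	if !ok {
		return 0, fmt.Errorf("priority class %q not found", name)
	}
	return v, nil
}

// jobController mimics a controller whose lister field may be left unset.
type jobController struct {
	lister priorityClassLister
}

var errListerNotInitialized = errors.New("priority class lister is not initialized")

// priorityUnsafe calls the lister without checking it. With a nil interface
// field, this panics with a nil pointer dereference.
func (c *jobController) priorityUnsafe(name string) (int32, error) {
	return c.lister.Get(name)
}

// priorityGuarded returns an error instead of panicking when the lister is unset.
func (c *jobController) priorityGuarded(name string) (int32, error) {
	if c.lister == nil {
		return 0, errListerNotInitialized
	}
	return c.lister.Get(name)
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() {
		if r := recover(); r != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	broken := &jobController{} // lister was never set
	fmt.Println("unsafe call panics:", panics(func() { broken.priorityUnsafe("high") }))

	_, err := broken.priorityGuarded("high")
	fmt.Println("guarded call error:", err)

	ok := &jobController{lister: mapLister{"high": 1000}}
	v, _ := ok.priorityGuarded("high")
	fmt.Println("initialized lister value:", v)
}
```

A guard like this only prevents the crash. The controller would still need a properly initialized lister to compute anything useful.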

gaocegege added a commit to gaocegege/tf-operator that referenced this issue Aug 27, 2021
google-oss-robot pushed a commit that referenced this issue Aug 27, 2021
* feat(init): Initialize podpriority

Signed-off-by: cegao <[email protected]>

* fix(priority): Fix #1382

Signed-off-by: cegao <[email protected]>
@johnugeorge
Member

@gaocegege Is gang scheduling used in this issue?

@gaocegege
Member

Yes, the operator will crash if gang scheduling is enabled.

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
