
Cannot launch driver after changing Spark default CPU value to int32 #721

Closed
caldempsey opened this issue Dec 8, 2019 · 17 comments

@caldempsey
caldempsey commented Dec 8, 2019

Hey there!

Firstly, thank you for everything you are trying to achieve. When you're trying to carve your own path, projects like this are a great way for new players in the data engineering space to get started running clusters and build great data experiences :)

Here's the problem...

I've noticed that the Apache Spark on k8s specification has changed the number of CPUs from a float to an integer, and this has been reflected in the latest API version (#578).

I'm fairly new to Kubernetes, but this seems to create a conflict with the driver runner, where I'm seeing...

Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/spark-operator/pods. Message: Pod "spark-pi-driver" is invalid: spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.containers[0].resources.requests, message=Invalid value: "1": must be less than or equal to cpu limit, reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, name=spark-pi-driver, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Pod "spark-pi-driver" is invalid: spec.containers[0].resources.requests: Invalid value: "1": must be less than or equal to cpu limit, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).

This is likely because of Invalid value: "1": must be less than or equal to cpu limit. So our new minimum value of 1 seems to be the default value, but the error indicates that the Kubernetes CPU limit must be at least 1. Perhaps this is related to kubernetes/kubernetes#51430. I'm not sure how to resolve this; it may be my Kubernetes configuration that's at fault here more than anything.
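
To illustrate the conflict (this is just my reading of the error, not actual output from the cluster), the generated driver pod seems to end up with a CPU request of 1 against a lower limit, roughly like:

resources:
  requests:
    cpu: "1"      # from spec.driver.cores, now an int32 with a minimum of 1
  limits:
    cpu: "200m"   # whatever cpu limit applies (a sub-1 value here for illustration); lower than the request, so the pod is rejected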

Environment

Here is the environment I'm running (from the pod)...

λ kubectl describe pod spark-sparkoperator-7c6d6f9cfd-6257n
Name:               spark-sparkoperator-7c6d6f9cfd-6257n
Namespace:          spark-operator
Priority:           0
PriorityClassName:  <none>
Node:               docker-desktop/192.168.65.3
Start Time:         Sun, 08 Dec 2019 01:10:57 +0000
Labels:             app.kubernetes.io/name=sparkoperator
                    app.kubernetes.io/version=v1beta2-1.0.1-2.4.4
                    pod-template-hash=7c6d6f9cfd
Annotations:        prometheus.io/path: /metrics
                    prometheus.io/port: 10254
                    prometheus.io/scrape: true
Status:             Running
IP:                 10.1.0.9
Controlled By:      ReplicaSet/spark-sparkoperator-7c6d6f9cfd
Containers:
  sparkoperator:
    Container ID:  docker://96d7a6908bad62e35fcfd530ca5337073a27602c468f2b7580f65cce4c48fd38
    Image:         gcr.io/spark-operator/spark-operator:v1beta2-1.0.1-2.4.4
    Image ID:      docker-pullable://gcr.io/spark-operator/spark-operator@sha256:ce769e5c6a5d8fa78ceb1a0abaf961fb2424767f9535c97baac04a18169654bd
    Port:          10254/TCP
    Host Port:     0/TCP
    Args:
      -v=2
      -namespace=
      -ingress-url-format=
      -controller-threads=10
      -resync-interval=30
      -logtostderr
      -enable-metrics=true
      -metrics-labels=app_type
      -metrics-port=10254
      -metrics-endpoint=/metrics
      -metrics-prefix=
    State:          Running
      Started:      Sun, 08 Dec 2019 01:10:58 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from spark-sparkoperator-token-w7dmr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  spark-sparkoperator-token-w7dmr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spark-sparkoperator-token-w7dmr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From                     Message
  ----    ------     ----   ----                     -------
  Normal  Scheduled  3m11s  default-scheduler        Successfully assigned spark-operator/spark-sparkoperator-7c6d6f9cfd-6257n to docker-desktop
  Normal  Pulled     3m10s  kubelet, docker-desktop  Container image "gcr.io/spark-operator/spark-operator:v1beta2-1.0.1-2.4.4" already present on machine
  Normal  Created    3m10s  kubelet, docker-desktop  Created container sparkoperator
  Normal  Started    3m10s  kubelet, docker-desktop  Started container sparkoperator
@liyinan926
Collaborator

Can you paste your SparkApplication manifest here?

@damache

damache commented Dec 19, 2019

Same issue here. I'm testing, so I'm using the example Lightbend manifest.

The original one uses v1beta1 and worked fine against the v1beta1 CRD:

apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "lightbend/spark:2.0.1-OpenShift-2.4.0-rh"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  restartPolicy:
    type: Never
  volumes:
    - name: config-vol
      configMap:
        name: my-cm
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: eyewitness-orangutan-spark
    volumeMounts:
      - name: config-vol
        mountPath: /opt/spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: config-vol
        mountPath: /opt/spark

I removed v1beta1 from our k8s cluster and redeployed the v1beta2 Helm chart, then changed the version in the manifest accordingly. Here is the update:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-operator
spec:
  sparkVersion: 2.4.4
  type: Scala
  mode: cluster
  image: "lightbend/spark:2.0.1-OpenShift-2.4.0-rh"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  restartPolicy:
    type: Never
  volumes:
    - name: config-vol
      configMap:
        name: my-cm
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: eyewitness-orangutan-spark
    volumeMounts:
      - name: config-vol
        mountPath: /opt/spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: config-vol
        mountPath: /opt/spark

error:

{ Error: SparkApplication.sparkoperator.k8s.io "spark-pi" is invalid: []: Invalid value: map[string]interface {}{"apiVersion":"sparkoperator.k8s.io/v1beta2", "kind":"SparkApplication", "metadata":map[string]interface {}{"name":"spark-pi", "namespace":"spark-operator", "creationTimestamp":"2019-12-19T21:45:12Z", "generation":1, "uid":"d581ee6f-22a8-11ea-a54e-1afbce1e3f39"}, "spec":map[string]interface {}{"type":"Scala", "mode":"cluster", "mainClass":"org.apache.spark.examples.SparkPi", "restartPolicy":map[string]interface {}{"type":"Never"}, "image":"lightbend/spark:2.0.1-OpenShift-2.4.0-rh", "imagePullPolicy":"Always", "mainApplicationFile":"local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar", "volumes":[]interface {}{map[string]interface {}{"name":"config-vol", "configMap":map[string]interface {}{"name":"my-cm"}}}, "driver":map[string]interface {}{"cores":0.1, "coreLimit":"200m", "memory":"512m", "labels":map[string]interface {}{"version":"2.4.0"}, "serviceAccount":"eyewitness-orangutan-spark", "volumeMounts":[]interface {}{map[string]interface {}{"name":"config-vol", "mountPath":"/opt/spark"}}}, "executor":map[string]interface {}{"cores":1, "instances":1, "memory":"512m", "labels":map[string]interface {}{"version":"2.4.0"}, "volumeMounts":[]interface {}{map[string]interface {}{"name":"config-vol", "mountPath":"/opt/spark"}}}}}: validation failure list:
spec.driver.cores in body must be of type int32: "float64"
spec.driver.cores in body should be greater than or equal to 1
spec.sparkVersion in body is required

@caldempsey
Author

@liyinan926 So sorry, I've been away from my affected machine over the holidays; I'll try to give you something sooner rather than later <3

@liyinan926
Collaborator

liyinan926 commented Dec 19, 2019

This is due to a recent change in #578 that introduced the v1beta2 version of the API. The change made the type of .spec.driver.cores an integer to be consistent with the Spark config property spark.driver.cores, which is an integer. Spark 3.0 will have a new config property spark.kubernetes.driver.request.cores for setting the CPU request for the driver pod, and we will add support for it soon. This new config property supports Kubernetes-conformant values, e.g., 0.1 and 100m, and is used for specifying the CPU request of the driver pod, independently of spark.driver.cores, which .spec.driver.cores maps to.

@caldempsey
Author

@liyinan926 Thanks buddy!

@damache

damache commented Dec 20, 2019

Attached is the output of kubectl describe sparkapp spark-pi -n=spark-jobs-operator. We are not able to create the Spark jobs because of the limit error: Invalid value: "1": must be less than or equal to cpu limit.

Any idea what is causing this?

@damache

damache commented Dec 20, 2019

The error doesn't always appear, but this time I removed the coreLimit definition from the manifest and it deployed.

@liyinan926
Collaborator

The change to the type of spec.driver.cores causes its default and minimum value to become 1. Currently (in Spark 2.4.x), that value is used to set the cpu request for the driver pod. Meanwhile, the value of spec.driver.coreLimit is used to set the cpu limit of the driver pod. If the value of coreLimit is less than that of cores, the resulting pod spec is invalid, and that's why you saw the error. Please make sure the value of coreLimit is no less than that of cores; see the sketch below.
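
As a minimal sketch (values are just for illustration), a v1beta2 driver section that satisfies this constraint looks like:

driver:
  cores: 1            # int32, minimum 1; also used as the driver pod's cpu request in Spark 2.4.x
  coreLimit: "1200m"  # cpu limit for the driver pod; must be no less than cores (1200m >= 1 cpu)
  memory: "512m"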

@caldempsey
Author

caldempsey commented Dec 20, 2019

@liyinan926 That error @damache posted is familiar. What I recall experiencing is that if the default cpu limit for a namespace is 1, it also behaves as the max limit and throws a different error (related to kubernetes/kubernetes#51430), but if you set a string value (e.g. "200m") then you get spec.driver.cores in body must be of type int32: "float64". So the answer should be to set a CPU limit above the default (2, 3, 4, etc.). I think the operator no longer works "out of the box" on a fresh environment with the default CPU limit as it is, which is what many users might be expecting from their experience.
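
For anyone hitting this: the namespace default I'm referring to comes from a LimitRange, so something along these lines (name and values are just illustrative, adjust for your cluster) raises the default limit above 1:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults          # illustrative name
  namespace: spark-operator   # whichever namespace the driver pods run in
spec:
  limits:
    - type: Container
      default:
        cpu: "2"              # default cpu limit applied to containers that don't set one
      defaultRequest:
        cpu: "500m"           # default cpu request applied to containers that don't set one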

@liyinan926
Collaborator

liyinan926 commented Dec 21, 2019

Let's make it clear that:

  1. v1beta2 changed the type of field spec.driver.cores from a string to int32 to be consistent with the type of spark.driver.cores. This means the minimum value it takes is now 1, or 1 cpu. Because the value of the field is also used to set the cpu request of the driver pod in Spark 2.4.x, it means that the minimum cpu request for the driver pod is 1 in Spark 2.4.x.
  2. The field spec.driver.coreLimit is used to set the cpu limit of the driver pod, which defaults to the default limit of the namespace if that field is not set. This means the default cpu limit could be invalid for the driver pod, if it's lower than the cpu request controlled by spec.driver.cores.

Whether the operator works out of the box depends on the environment's default cpu limit.

To mitigate this issue, I think we should have the operator set spec.driver.coreLimit at runtime based on the value of spec.driver.cores if spec.driver.coreLimit is not set. For example, if spec.driver.cores=2 and spec.driver.coreLimit is not set, the operator will set spec.driver.coreLimit to 2 before submitting the app to run.

Spark 3.0 has a new config property spark.kubernetes.driver.request.cores specifically for setting the cpu request for the driver pod, and we have already added a new field spec.driver.coreRequest that maps to it. See #748.
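
As a sketch of how that would look (assuming a Spark 3.x image and the coreRequest field from #748; values are illustrative):

driver:
  cores: 1              # maps to spark.driver.cores (int32)
  coreRequest: "200m"   # maps to spark.kubernetes.driver.request.cores (Spark 3.x only)
  coreLimit: "1200m"    # cpu limit for the driver pod
  memory: "512m"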

@damache

damache commented Dec 23, 2019

In Spark 2.4.x in k8s mode the value is extracted as a string.

We use this today in our Spark services when deploying Spark 2.4.4 to k8s via spark-submit directly:
--conf spark.driver.cores=250m

Not sure why issue #578 references the standalone Spark configs, but based on this line it appears this CRD deploys in k8s mode.

I would recommend letting spark-submit do the validation; otherwise the CRD is going to require more maintenance and possibly make incorrect assumptions.

@liyinan926
Collaborator

liyinan926 commented Dec 23, 2019

@damache Yes, it's parsed as a string in the k8s mode, but it's actually defined and treated as an integer elsewhere (see https://github.com/apache/spark/blob/cdc8fc6233450ed040f2f0272d06510c1eedbefb/core/src/main/scala/org/apache/spark/internal/config/package.scala#L81). That's why the new config spark.kubernetes.driver.request.cores was introduced in apache/spark@1a8c093. The purpose of the changes in #578 was to make it consistent with the type in Spark moving forward. The k8s mode in 2.4.x incorrectly treated it as a string, and this got fixed in Spark 3.0.

@liyinan926
Collaborator

BTW: we added support for spark.kubernetes.driver.request.cores in #748.

@damache

damache commented Dec 23, 2019

That's only applicable to Spark 3.0, so v1beta2 is only compatible with Spark 3.0 if you don't want to set the driver's cores to 1.

v1beta1 says it supports Spark 2.4.0.

So there's no real support for Spark 2.4.4.

@liyinan926
Collaborator

liyinan926 commented Dec 23, 2019

That's only applicable to Spark 3.0, so v1beta2 is only compatible with Spark 3.0 if you don't want to set the driver's cores to 1.

The specific field works with Spark 3.x only as the documentation clearly indicates. The rest of the API in v1beta2 works with both 2.4.x and 3.x. v1beta1 only works with 2.4.x.

So there's no real support for Spark 2.4.4.

Not sure what you meant by this. If you would like to stick to the semantics of treating driver.cores as a string, then use the v1beta1 version of the API and version v2.4.0-v1beta1-0.9.0 of the operator.

@damache

damache commented Dec 23, 2019

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator#version-matrix

That matrix says 2.4.0 is the base Spark image. Also, deploying with the Helm chart and setting --set operatorVersion=v2.4.0-v1beta1-0.9.0 results in the Helm chart showing the APP VERSION as v1beta2-1.0.1-2.4.4. When I tried to deploy a manifest with v1beta1 and a Spark 2.4.4 Docker image it wasn't able to deploy because of RBAC issues. I can go back to that, see if it will run, and open a new issue here if there are more problems.

@liyinan926
Collaborator

OK, the APP VERSION you saw is the version defined in https://github.com/helm/charts/blob/master/incubator/sparkoperator/Chart.yaml, which corresponds to the version of the operator deployed by default. When you change the version used by setting operatorVersion, it only changes the version of the image used, but it doesn't change the value of appVersion. Please create a separate issue if necessary. Thanks!
