dataproc_cluster.html.markdown

subcategory: Dataproc
description: Manages a Cloud Dataproc cluster resource.

google_dataproc_cluster

Manages a Cloud Dataproc cluster resource within GCP.

!> Warning: Due to limitations of the API, all arguments except labels, cluster_config.worker_config.num_instances, and cluster_config.preemptible_worker_config.num_instances are non-updatable. Changes to cluster_config.worker_config.min_num_instances are ignored. Changing any other argument causes the whole cluster to be recreated!

Example Usage - Basic

resource "google_dataproc_cluster" "simplecluster" {
  name   = "simplecluster"
  region = "us-central1"
}

Example Usage - Advanced

resource "google_service_account" "default" {
  account_id   = "service-account-id"
  display_name = "Service Account"
}

resource "google_dataproc_cluster" "mycluster" {
  name     = "mycluster"
  region   = "us-central1"
  graceful_decommission_timeout = "120s"
  labels = {
    foo = "bar"
  }

  cluster_config {
    staging_bucket = "dataproc-staging-bucket"

    master_config {
      num_instances = 1
      machine_type  = "e2-medium"
      disk_config {
        boot_disk_type    = "pd-ssd"
        boot_disk_size_gb = 30
      }
    }

    worker_config {
      num_instances    = 2
      machine_type     = "e2-medium"
      min_cpu_platform = "Intel Skylake"
      disk_config {
        boot_disk_size_gb = 30
        num_local_ssds    = 1
      }
    }

    preemptible_worker_config {
      num_instances = 0
    }

    # Override or set some custom properties
    software_config {
      image_version = "2.0.35-debian10"
      override_properties = {
        "dataproc:dataproc.allow.zero.workers" = "true"
      }
    }

    gce_cluster_config {
      tags = ["foo", "bar"]
      # Google recommends custom service accounts that have cloud-platform scope and permissions granted via IAM Roles.
      service_account = google_service_account.default.email
      service_account_scopes = [
        "cloud-platform"
      ]
    }

    # You can define multiple initialization_action blocks
    initialization_action {
      script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
      timeout_sec = 500
    }
  }
}

Example Usage - Using a GPU accelerator

resource "google_dataproc_cluster" "accelerated_cluster" {
  name   = "my-cluster-with-gpu"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      zone = "us-central1-a"
    }

    master_config {
      accelerators {
        accelerator_type  = "nvidia-tesla-k80"
        accelerator_count = "1"
      }
    }
  }
}

Argument Reference

  • name - (Required) The name of the cluster, unique within the project and zone.

  • project - (Optional) The ID of the project in which the cluster will exist. If it is not provided, the provider project is used.

  • region - (Optional) The region in which the cluster and associated nodes will be created. Defaults to global.

  • labels - (Optional) The list of labels (key/value pairs) configured on the resource through Terraform and to be applied to instances in the cluster. Note: This field is non-authoritative, and will only manage the labels present in your configuration. Please refer to the field effective_labels for all of the labels present on the resource.

  • terraform_labels - The combination of labels configured directly on the resource and default labels configured on the provider.

  • effective_labels - (Computed) The list of labels (key/value pairs) to be applied to instances in the cluster. GCP generates some itself including goog-dataproc-cluster-name which is the name of the cluster.

  • virtual_cluster_config - (Optional) Allows you to configure a virtual Dataproc on GKE cluster. Structure defined below.

  • cluster_config - (Optional) Allows you to configure various aspects of the cluster. Structure defined below.

  • graceful_decommission_timeout - (Optional) Allows graceful decommissioning when you change the number of worker nodes directly through a terraform apply. Does not affect autoscaling decommissioning from an autoscaling policy. Graceful decommissioning allows removing nodes from the cluster without interrupting jobs in progress. Timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes (and potentially interrupting jobs). Default timeout is 0 (for forceful decommission), and the maximum allowed timeout is 1 day (see JSON representation of Duration). Only supported on Dataproc image versions 1.2 and higher. For more context see the docs.
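
The difference between labels and effective_labels is easiest to see by exporting both. The following is a minimal sketch; the resource name `labeled`, the label values, and the output names are illustrative assumptions, not part of the provider.

```hcl
resource "google_dataproc_cluster" "labeled" {
  name   = "labeled-cluster"
  region = "us-central1"

  # Only these labels are managed by this configuration.
  labels = {
    team = "data-eng"
  }
}

# Labels declared in this configuration only.
output "configured_labels" {
  value = google_dataproc_cluster.labeled.labels
}

# All labels present on the resource, including ones GCP adds itself
# (such as goog-dataproc-cluster-name).
output "all_labels" {
  value = google_dataproc_cluster.labeled.effective_labels
}
```

Reading effective_labels rather than labels is the safer choice when downstream tooling needs the complete label set.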


The virtual_cluster_config block supports:

    virtual_cluster_config {
        auxiliary_services_config { ... }
        kubernetes_cluster_config { ... }
    }
  • staging_bucket - (Optional) The Cloud Storage staging bucket used to stage files, such as Hadoop jars, between client machines and the cluster. Note: If you don't explicitly specify a staging_bucket then GCP will auto create / assign one for you. However, you are not guaranteed an auto generated bucket which is solely dedicated to your cluster; it may be shared with other clusters in the same region/zone also choosing to use the auto generation option.

  • auxiliary_services_config (Optional) Configuration of auxiliary services used by this cluster. Structure defined below.

  • kubernetes_cluster_config (Required) The configuration for running the Dataproc cluster on Kubernetes. Structure defined below.


The auxiliary_services_config block supports:

    virtual_cluster_config {
      auxiliary_services_config {
        metastore_config {
          dataproc_metastore_service = google_dataproc_metastore_service.metastore_service.id
        }

        spark_history_server_config {
          dataproc_cluster = google_dataproc_cluster.dataproc_cluster.id
        }
      }
    }
  • metastore_config (Optional) The Hive Metastore configuration for this workload.

    • dataproc_metastore_service (Required) Resource name of an existing Dataproc Metastore service.
  • spark_history_server_config (Optional) The Spark History Server configuration for the workload.

    • dataproc_cluster (Optional) Resource name of an existing Dataproc Cluster to act as a Spark History Server for the workload.

The kubernetes_cluster_config block supports:

    virtual_cluster_config {
      kubernetes_cluster_config {
        kubernetes_namespace = "foobar"

        kubernetes_software_config {
          component_version = {
            "SPARK" : "3.5-dataproc-17"
          }

          properties = {
            "spark:spark.eventLog.enabled": "true"
          }
        }

        gke_cluster_config {
          gke_cluster_target = google_container_cluster.primary.id

          node_pool_target {
            node_pool = "dpgke"
            roles = ["DEFAULT"]

            node_pool_config {
              autoscaling {
                min_node_count = 1
                max_node_count = 6
              }
              
              config {
                machine_type      = "n1-standard-4"
                preemptible       = true
                local_ssd_count   = 1
                min_cpu_platform  = "Intel Sandy Bridge"
              }

              locations = ["us-central1-c"]
            }
          }
        }
      }
    }
  • kubernetes_namespace (Optional) A namespace within the Kubernetes cluster to deploy into. If this namespace does not exist, it is created. If it exists, Dataproc verifies that another Dataproc VirtualCluster is not installed into it. If not specified, the name of the Dataproc Cluster is used.

  • kubernetes_software_config (Required) The software configuration for this Dataproc cluster running on Kubernetes.

    • component_version (Required) The components that should be installed in this Dataproc cluster. The key must be a string from the KubernetesComponent enumeration. The value is the version of the software to be installed. At least one entry must be specified.

      • NOTE : component_version[SPARK] is mandatory to set, or the creation of the cluster will fail.
    • properties (Optional) The properties to set on daemon config files. Property keys are specified in prefix:property format, for example spark:spark.kubernetes.container.image.

  • gke_cluster_config (Required) The configuration for running the Dataproc cluster on GKE.

    • gke_cluster_target (Optional) A target GKE cluster to deploy to. It must be in the same project and region as the Dataproc cluster (the GKE cluster can be zonal or regional)

    • node_pool_target (Optional) GKE node pools where workloads will be scheduled. At least one node pool must be assigned the DEFAULT GkeNodePoolTarget.Role. If a GkeNodePoolTarget is not specified, Dataproc constructs a DEFAULT GkeNodePoolTarget. Each role can be given to only one GkeNodePoolTarget. All node pools must have the same location settings.

      • node_pool (Required) The target GKE node pool.

      • roles (Required) The roles associated with the GKE node pool. One of "DEFAULT", "CONTROLLER", "SPARK_DRIVER" or "SPARK_EXECUTOR".

      • node_pool_config (Input only) The configuration for the GKE node pool. If specified, Dataproc attempts to create a node pool with the specified shape. If one with the same name already exists, it is verified against all specified fields. If a field differs, the virtual cluster creation will fail.

        • autoscaling (Optional) The autoscaler configuration for this node pool. The autoscaler is enabled only when a valid configuration is present.

          • min_node_count (Optional) The minimum number of nodes in the node pool. Must be >= 0 and <= maxNodeCount.

          • max_node_count (Optional) The maximum number of nodes in the node pool. Must be >= minNodeCount, and must be > 0.

        • config (Optional) The node pool configuration.

          • machine_type (Optional) The name of a Compute Engine machine type.

          • local_ssd_count (Optional) The number of local SSD disks to attach to the node, which is limited by the maximum number of disks allowable per zone.

          • preemptible (Optional) Whether the nodes are created as preemptible VM instances. Preemptible nodes cannot be used in a node pool with the CONTROLLER role or in the DEFAULT node pool if the CONTROLLER role is not assigned (the DEFAULT node pool will assume the CONTROLLER role).

          • min_cpu_platform (Optional) Minimum CPU platform to be used by this instance. The instance may be scheduled on the specified or a newer CPU platform. Specify the friendly names of CPU platforms, such as "Intel Haswell" or "Intel Sandy Bridge".

          • spot (Optional) Spot flag for enabling Spot VM, which is a rebrand of the existing preemptible flag.

        • locations (Optional) The list of Compute Engine zones where node pool nodes associated with a Dataproc on GKE virtual cluster will be located.


The cluster_config block supports:

    cluster_config {
        gce_cluster_config        { ... }
        master_config             { ... }
        worker_config             { ... }
        preemptible_worker_config { ... }
        software_config           { ... }

        # You can define multiple initialization_action blocks
        initialization_action     { ... }
        encryption_config         { ... }
        endpoint_config           { ... }
        metastore_config          { ... }
    }
  • staging_bucket - (Optional) The Cloud Storage staging bucket used to stage files, such as Hadoop jars, between client machines and the cluster. Note: If you don't explicitly specify a staging_bucket then GCP will auto create / assign one for you. However, you are not guaranteed an auto generated bucket which is solely dedicated to your cluster; it may be shared with other clusters in the same region/zone also choosing to use the auto generation option.

  • temp_bucket - (Optional) The Cloud Storage temp bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files. Note: If you don't explicitly specify a temp_bucket then GCP will auto create / assign one for you.

  • gce_cluster_config (Optional) Common config settings for resources of Google Compute Engine cluster instances, applicable to all instances in the cluster. Structure defined below.

  • master_config (Optional) The Google Compute Engine config settings for the master instances in a cluster. Structure defined below.

  • worker_config (Optional) The Google Compute Engine config settings for the worker instances in a cluster. Structure defined below.

  • preemptible_worker_config (Optional) The Google Compute Engine config settings for the additional instances in a cluster. Structure defined below.

    • NOTE : preemptible_worker_config is an alias for the API's secondaryWorkerConfig. The name does not necessarily mean the workers are preemptible; it is kept for legacy/compatibility reasons.
  • software_config (Optional) The config settings for software inside the cluster. Structure defined below.

  • security_config (Optional) Security related configuration. Structure defined below.

  • autoscaling_config (Optional) The autoscaling policy config associated with the cluster. Note that once set, if autoscaling_config is the only field set in cluster_config, it can only be removed by setting policy_uri = "", rather than removing the whole block. Structure defined below.

  • initialization_action (Optional) Commands to execute on each node after config is completed. You can specify multiple versions of these. Structure defined below.

  • encryption_config (Optional) The Customer managed encryption keys settings for the cluster. Structure defined below.

  • lifecycle_config (Optional) The settings for auto deletion cluster schedule. Structure defined below.

  • endpoint_config (Optional) The config settings for port access on the cluster. Structure defined below.

  • dataproc_metric_config (Optional) The config settings for Dataproc metrics collection on the cluster. Structure defined below.

  • auxiliary_node_groups (Optional) A Dataproc NodeGroup resource is a group of Dataproc cluster nodes that execute an assigned role. Structure defined below.

  • metastore_config (Optional) The config setting for metastore service with the cluster. Structure defined below.
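
The note on removing autoscaling_config can be illustrated as follows. This is a minimal sketch, assuming a cluster that previously had a policy attached and has no other fields set in cluster_config; the resource name is hypothetical.

```hcl
resource "google_dataproc_cluster" "autoscaled" {
  name   = "autoscaled-cluster"
  region = "us-central1"

  cluster_config {
    autoscaling_config {
      # Previously set to a policy resource name. An empty string
      # detaches the policy while keeping the block in place, as
      # deleting the block itself would not remove it.
      policy_uri = ""
    }
  }
}
```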


The cluster_config.gce_cluster_config block supports:

  cluster_config {
    gce_cluster_config {
      zone = "us-central1-a"

      # Set only one of the following to hook into a custom network or
      # subnetwork; network conflicts with subnetwork.
      network    = google_compute_network.dataproc_network.name
      # subnetwork = google_compute_subnetwork.dataproc_subnetwork.name

      tags = ["foo", "bar"]
    }
  }
  • zone - (Optional, Computed) The GCP zone where your data is stored and used (i.e. where the master and the worker nodes will be created). If region is set to 'global' (default) then zone is mandatory; otherwise GCP can use Auto Zone Placement to determine it automatically. Note: This setting additionally determines and restricts which computing resources are available for use with other configs such as cluster_config.master_config.machine_type and cluster_config.worker_config.machine_type.

  • network - (Optional, Computed) The name or self_link of the Google Compute Engine network the cluster will be part of. Conflicts with subnetwork. If neither is specified, this defaults to the "default" network.

  • subnetwork - (Optional) The name or self_link of the Google Compute Engine subnetwork the cluster will be part of. Conflicts with network.

  • service_account - (Optional) The service account to be used by the Node VMs. If not specified, the "default" service account is used.

  • service_account_scopes - (Optional, Computed) The set of Google API scopes to be made available on all of the node VMs under the service_account specified. Both OAuth2 URLs and gcloud short names are supported. To allow full access to all Cloud APIs, use the cloud-platform scope. See a complete list of scopes here.

  • tags - (Optional) The list of instance tags applied to instances in the cluster. Tags are used to identify valid sources or targets for network firewalls.

  • internal_ip_only - (Optional) By default, clusters are not restricted to internal IP addresses, and will have ephemeral external IP addresses assigned to each instance. If set to true, all instances in the cluster will only have internal IP addresses. Note: Private Google Access (also known as privateIpGoogleAccess) must be enabled on the subnetwork that the cluster will be launched in.

  • metadata - (Optional) A map of the Compute Engine metadata entries to add to all instances (see Project and instance metadata).

  • reservation_affinity - (Optional) Reservation Affinity for consuming zonal reservation.

    • consume_reservation_type - (Optional) Corresponds to the type of reservation consumption.
    • key - (Optional) Corresponds to the label key of reservation resource.
    • values - (Optional) Corresponds to the label values of reservation resource.
  • node_group_affinity - (Optional) Node Group Affinity for sole-tenant clusters.

    • node_group_uri - (Required) The URI of a sole-tenant node group resource that the cluster will be created on.
  • confidential_instance_config - (Optional) Confidential Instance Config for clusters using Confidential VMs

    • enable_confidential_compute - (Optional) Defines whether the instance should have confidential compute enabled.
  • shielded_instance_config (Optional) Shielded Instance Config for clusters using Compute Engine Shielded VMs.
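
The internal_ip_only requirement can be sketched as below: the subnetwork has Private Google Access enabled, and the cluster attaches to it through subnetwork only (not network). Resource names and the CIDR range are illustrative assumptions.

```hcl
resource "google_compute_network" "dataproc_network" {
  name                    = "dataproc-network"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "dataproc_subnetwork" {
  name          = "dataproc-subnetwork"
  region        = "us-central1"
  network       = google_compute_network.dataproc_network.id
  ip_cidr_range = "10.0.0.0/24"

  # Required when the cluster uses internal IP addresses only.
  private_ip_google_access = true
}

resource "google_dataproc_cluster" "private" {
  name   = "private-cluster"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      # subnetwork conflicts with network, so only one is set.
      subnetwork       = google_compute_subnetwork.dataproc_subnetwork.name
      internal_ip_only = true
    }
  }
}
```

Firewall rules that allow the cluster nodes to talk to each other are not shown and would be needed on a custom network.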


The cluster_config.gce_cluster_config.shielded_instance_config block supports:

cluster_config {
  gce_cluster_config {
    shielded_instance_config {
      enable_secure_boot          = true
      enable_vtpm                 = true
      enable_integrity_monitoring = true
    }
  }
}
  • enable_secure_boot - (Optional) Defines whether instances have Secure Boot enabled.

  • enable_vtpm - (Optional) Defines whether instances have the vTPM enabled.

  • enable_integrity_monitoring - (Optional) Defines whether instances have integrity monitoring enabled.


The cluster_config.master_config block supports:

cluster_config {
  master_config {
    num_instances    = 1
    machine_type     = "e2-medium"
    min_cpu_platform = "Intel Skylake"

    disk_config {
      boot_disk_type    = "pd-ssd"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}
  • num_instances - (Optional, Computed) Specifies the number of master nodes to create. If not specified, GCP will default to a predetermined computed value (currently 1).

  • machine_type - (Optional, Computed) The name of a Google Compute Engine machine type to create for the master. If not specified, GCP will default to a predetermined computed value (currently n1-standard-4).

  • min_cpu_platform - (Optional, Computed) The name of a minimum generation of CPU family for the master. If not specified, GCP will default to a predetermined computed value for each zone. See the guide for details about which CPU families are available (and defaulted) for each zone.

  • image_uri (Optional) The URI for the image to use for the master nodes. See the guide for more information.

  • disk_config (Optional) Disk Config

    • boot_disk_type - (Optional) The disk type of the primary disk attached to each node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".

    • boot_disk_size_gb - (Optional, Computed) Size of the primary disk attached to each node, specified in GB. The primary disk contains the boot volume and system libraries, and the smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.

    • num_local_ssds - (Optional) The number of local SSD disks that will be attached to each master cluster node. Defaults to 0.

    • local_ssd_interface - (Optional) Interface type of local SSDs (default is "scsi"). Valid values: "scsi" (Small Computer System Interface), "nvme" (Non-Volatile Memory Express). See local SSD performance.

  • accelerators (Optional) The Compute Engine accelerator (GPU) configuration for these instances. Can be specified multiple times.

    • accelerator_type - (Required) The short name of the accelerator type to expose to this instance. For example, nvidia-tesla-k80.

    • accelerator_count - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.

~> The Cloud Dataproc API can return unintuitive error messages when using accelerators; even when you have defined an accelerator, Auto Zone Placement does not exclusively select zones that have that accelerator available. If you get a 400 error that the accelerator can't be found, this is a likely cause. Make sure you check accelerator availability by zone if you are trying to use accelerators in a given zone.


The cluster_config.worker_config block supports:

cluster_config {
  worker_config {
    num_instances     = 3
    machine_type      = "e2-medium"
    min_cpu_platform  = "Intel Skylake"
    min_num_instances = 2
    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
  }
}
  • num_instances - (Optional, Computed) Specifies the number of worker nodes to create. If not specified, GCP will default to a predetermined computed value (currently 2). There is currently a beta feature which allows you to run a Single Node Cluster. To take advantage of this, set "dataproc:dataproc.allow.zero.workers" = "true" in cluster_config.software_config.override_properties.

  • machine_type - (Optional, Computed) The name of a Google Compute Engine machine type to create for the worker nodes. If not specified, GCP will default to a predetermined computed value (currently n1-standard-4).

  • min_cpu_platform - (Optional, Computed) The name of a minimum generation of CPU family for the worker nodes. If not specified, GCP will default to a predetermined computed value for each zone. See the guide for details about which CPU families are available (and defaulted) for each zone.

  • disk_config (Optional) Disk Config

    • boot_disk_type - (Optional) The disk type of the primary disk attached to each node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".

    • boot_disk_size_gb - (Optional, Computed) Size of the primary disk attached to each worker node, specified in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.

    • num_local_ssds - (Optional) The number of local SSD disks that will be attached to each worker cluster node. Defaults to 0.

  • image_uri (Optional) The URI for the image to use for this worker. See the guide for more information.

  • min_num_instances (Optional) The minimum number of primary worker instances to create. If min_num_instances is set, cluster creation will succeed if the number of primary workers created is at least equal to the min_num_instances number.

  • accelerators (Optional) The Compute Engine accelerator configuration for these instances. Can be specified multiple times.

    • accelerator_type - (Required) The short name of the accelerator type to expose to this instance. For example, nvidia-tesla-k80.

    • accelerator_count - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.

~> The Cloud Dataproc API can return unintuitive error messages when using accelerators; even when you have defined an accelerator, Auto Zone Placement does not exclusively select zones that have that accelerator available. If you get a 400 error that the accelerator can't be found, this is a likely cause. Make sure you check accelerator availability by zone if you are trying to use accelerators in a given zone.
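
The Single Node Cluster option mentioned under num_instances can be sketched as follows: zero workers, combined with the dataproc:dataproc.allow.zero.workers property set through override_properties as in the other examples in this document. The resource name and machine type are illustrative assumptions.

```hcl
resource "google_dataproc_cluster" "single_node" {
  name   = "single-node-cluster"
  region = "us-central1"

  cluster_config {
    master_config {
      num_instances = 1
      machine_type  = "e2-medium"
    }

    # No primary workers; the master runs all workloads.
    worker_config {
      num_instances = 0
    }

    software_config {
      override_properties = {
        # Without this property a zero-worker cluster is rejected.
        "dataproc:dataproc.allow.zero.workers" = "true"
      }
    }
  }
}
```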


The cluster_config.preemptible_worker_config block supports:

cluster_config {
  preemptible_worker_config {
    num_instances = 1

    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 30
      num_local_ssds    = 1
    }
    instance_flexibility_policy {
      instance_selection_list {
        machine_types = ["n2-standard-2","n1-standard-2"]
        rank          = 1
      }
      instance_selection_list {
        machine_types = ["n2d-standard-2"]
        rank          = 3
      }
    }
  }
}

Note: Unlike worker_config, you cannot set the machine_type value directly. This will be set for you based on whatever was set for the worker_config.machine_type value.

  • num_instances - (Optional) Specifies the number of preemptible nodes to create. Defaults to 0.

  • preemptibility - (Optional) Specifies the preemptibility of the secondary workers. The default value is PREEMPTIBLE. Accepted values are:

    • PREEMPTIBILITY_UNSPECIFIED
    • NON_PREEMPTIBLE
    • PREEMPTIBLE
    • SPOT
  • disk_config (Optional) Disk Config

    • boot_disk_type - (Optional) The disk type of the primary disk attached to each preemptible worker node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".

    • boot_disk_size_gb - (Optional, Computed) Size of the primary disk attached to each preemptible worker node, specified in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.

    • num_local_ssds - (Optional) The number of local SSD disks that will be attached to each preemptible worker node. Defaults to 0.

  • instance_flexibility_policy (Optional) Instance flexibility Policy allowing a mixture of VM shapes and provisioning models.

    • instance_selection_list - (Optional) List of instance selection options that the group will use when creating new VMs.
      • machine_types - (Optional) Full machine-type names, e.g. "n1-standard-16".

      • rank - (Optional) Preference of this instance selection. A lower number means higher preference. Dataproc first tries to create a VM using the machine types with the highest-preference rank and falls back to the next rank based on availability. Machine types and instance selections with the same rank have the same preference.
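
The preemptibility argument can be used to request Spot VMs for the secondary workers. The following is a minimal sketch; the resource name, machine type, and instance counts are illustrative assumptions. machine_type is set only on worker_config, since the secondary workers inherit it.

```hcl
resource "google_dataproc_cluster" "spot_workers" {
  name   = "spot-workers-cluster"
  region = "us-central1"

  cluster_config {
    # Primary workers; secondary workers inherit this machine type.
    worker_config {
      num_instances = 2
      machine_type  = "n1-standard-2"
    }

    # Secondary workers provisioned as Spot VMs.
    preemptible_worker_config {
      num_instances  = 2
      preemptibility = "SPOT"

      disk_config {
        boot_disk_type    = "pd-standard"
        boot_disk_size_gb = 30
      }
    }
  }
}
```

Setting preemptibility to "NON_PREEMPTIBLE" instead would give standard VMs as secondary workers, which is why the block name alone does not imply preemptible instances.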


The cluster_config.software_config block supports:

cluster_config {
  # Override or set some custom properties
  software_config {
    image_version = "2.0.35-debian10"

    override_properties = {
      "dataproc:dataproc.allow.zero.workers" = "true"
    }
  }
}
  • image_version - (Optional, Computed) The Cloud Dataproc image version to use for the cluster - this controls the sets of software versions installed onto the nodes when you create clusters. If not specified, defaults to the latest version. For a list of valid versions see Cloud Dataproc versions

  • override_properties - (Optional) A list of override and additional properties (key/value pairs) used to modify various aspects of the common configuration files used when creating a cluster. For a list of valid properties please see Cluster properties

  • optional_components - (Optional) The set of optional components to activate on the cluster. See Available Optional Components.
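
As an illustration of optional_components, the sketch below enables one component alongside a pinned image version. The choice of "JUPYTER" is an assumption for the example; which components are available depends on the image version, so check the list of available optional components before relying on it.

```hcl
resource "google_dataproc_cluster" "with_components" {
  name   = "components-cluster"
  region = "us-central1"

  cluster_config {
    software_config {
      image_version = "2.0.35-debian10"

      # Assumed to be available on the image version above.
      optional_components = ["JUPYTER"]
    }
  }
}
```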


The cluster_config.security_config block supports:

cluster_config {
  # Kerberos settings for the cluster
  security_config {
    kerberos_config {
      kms_key_uri = "projects/projectId/locations/locationId/keyRings/keyRingId/cryptoKeys/keyId"
      root_principal_password_uri = "bucketId/o/objectId"
    }
  }
}
  • kerberos_config (Required) Kerberos Configuration

    • cross_realm_trust_admin_server - (Optional) The admin server (IP or hostname) for the remote trusted realm in a cross realm trust relationship.

    • cross_realm_trust_kdc - (Optional) The KDC (IP or hostname) for the remote trusted realm in a cross realm trust relationship.

    • cross_realm_trust_realm - (Optional) The remote realm the Dataproc on-cluster KDC will trust, should the user enable cross realm trust.

    • cross_realm_trust_shared_password_uri - (Optional) The Cloud Storage URI of a KMS encrypted file containing the shared password between the on-cluster Kerberos realm and the remote trusted realm, in a cross realm trust relationship.

    • enable_kerberos - (Optional) Flag to indicate whether to Kerberize the cluster.

    • kdc_db_key_uri - (Optional) The Cloud Storage URI of a KMS encrypted file containing the master key of the KDC database.

    • key_password_uri - (Optional) The Cloud Storage URI of a KMS encrypted file containing the password to the user provided key. For the self-signed certificate, this password is generated by Dataproc.

    • keystore_uri - (Optional) The Cloud Storage URI of the keystore file used for SSL encryption. If not provided, Dataproc will provide a self-signed certificate.

    • keystore_password_uri - (Optional) The Cloud Storage URI of a KMS encrypted file containing the password to the user provided keystore. For the self-signed certificate, the password is generated by Dataproc.

    • kms_key_uri - (Required) The URI of the KMS key used to encrypt various sensitive files.

    • realm - (Optional) The name of the on-cluster Kerberos realm. If not specified, the uppercased domain of hostnames will be the realm.

    • root_principal_password_uri - (Required) The Cloud Storage URI of a KMS encrypted file containing the root principal password.

    • tgt_lifetime_hours - (Optional) The lifetime of the ticket granting ticket, in hours.

    • truststore_password_uri - (Optional) The Cloud Storage URI of a KMS encrypted file containing the password to the user provided truststore. For the self-signed certificate, this password is generated by Dataproc.

    • truststore_uri - (Optional) The Cloud Storage URI of the truststore file used for SSL encryption. If not provided, Dataproc will provide a self-signed certificate.


The cluster_config.autoscaling_config block supports:

cluster_config {
  # Attach an existing autoscaling policy
  autoscaling_config {
    policy_uri = "projects/projectId/locations/region/autoscalingPolicies/policyId"
  }
}
  • policy_uri - (Required) The autoscaling policy used by the cluster.

Only resource names including projectid and location (region) are valid. Examples:

  • https://www.googleapis.com/compute/v1/projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]
  • projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]

Note that the policy must be in the same project and Cloud Dataproc region.
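
Rather than writing the policy resource name by hand, it can be taken from a google_dataproc_autoscaling_policy resource in the same configuration. The following is a sketch; the policy ID, limits, and scaling factors are illustrative assumptions. The policy and the cluster use the same region, as required.

```hcl
resource "google_dataproc_autoscaling_policy" "asp" {
  policy_id = "dataproc-policy"
  location  = "us-central1"

  worker_config {
    max_instances = 3
  }

  basic_algorithm {
    yarn_config {
      graceful_decommission_timeout = "30s"
      scale_up_factor               = 0.5
      scale_down_factor             = 0.5
    }
  }
}

resource "google_dataproc_cluster" "autoscaling_cluster" {
  name   = "autoscaling-cluster"
  region = "us-central1"

  cluster_config {
    autoscaling_config {
      # The policy's name attribute is its full resource name.
      policy_uri = google_dataproc_autoscaling_policy.asp.name
    }
  }
}
```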


The initialization_action block (Optional) can be specified multiple times and supports:

cluster_config {
  # You can define multiple initialization_action blocks
  initialization_action {
    script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
    timeout_sec = 500
  }
}
  • script - (Required) The script to be executed during initialization of the cluster. The script must be a GCS file with a gs:// prefix.

  • timeout_sec - (Optional, Computed) The maximum duration (in seconds) which script is allowed to take to execute its action. GCP will default to a predetermined computed value if not set (currently 300).
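
Since the block can be repeated, several scripts can be declared on the same cluster. In this sketch the second bucket and script path are hypothetical; its timeout_sec is omitted, so the computed default applies.

```hcl
resource "google_dataproc_cluster" "with_init_actions" {
  name   = "init-actions-cluster"
  region = "us-central1"

  cluster_config {
    initialization_action {
      script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
      timeout_sec = 500
    }

    initialization_action {
      # Hypothetical bucket and script path, for illustration only.
      script = "gs://my-example-bucket/scripts/install-deps.sh"
    }
  }
}
```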


The encryption_config block supports:

cluster_config {
  encryption_config {
    kms_key_name = "projects/projectId/locations/region/keyRings/keyRingName/cryptoKeys/keyName"
  }
}
  • kms_key_name - (Required) The Cloud KMS key name to use for PD disk encryption for all instances in the cluster.

The dataproc_metric_config block supports:

dataproc_metric_config {
  metrics {
    metric_source    = "HDFS"
    metric_overrides = ["yarn:ResourceManager:QueueMetrics:AppsCompleted"]
  }
}

The auxiliary_node_groups block supports:

auxiliary_node_groups {
  node_group {
    roles = ["DRIVER"]
    node_group_config {
      num_instances    = 2
      machine_type     = "n1-standard-2"
      min_cpu_platform = "AMD Rome"
      disk_config {
        boot_disk_size_gb = 35
        boot_disk_type    = "pd-standard"
        num_local_ssds    = 1
      }
      accelerators {
        accelerator_count = 1
        accelerator_type  = "nvidia-tesla-t4"
      }
    }
  }
}
  • node_group - (Required) Node group configuration.

    • roles - (Required) Node group roles. One of "DRIVER".

    • name - (Optional) The Node group resource name.

    • node_group_config - (Optional) The node group instance group configuration.

      • num_instances - (Optional, Computed) Specifies the number of nodes to create in the node group. Must be greater than 0; a node group must have at least 1 instance.

      • machine_type - (Optional, Computed) The name of a Google Compute Engine machine type to create for the node group. If not specified, GCP will default to a predetermined computed value (currently n1-standard-4).

      • min_cpu_platform - (Optional, Computed) The name of a minimum generation of CPU family for the node group. If not specified, GCP will default to a predetermined computed value for each zone. See the guide for details about which CPU families are available (and defaulted) for each zone.

      • disk_config (Optional) Disk Config

        • boot_disk_type - (Optional) The disk type of the primary disk attached to each node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".

        • boot_disk_size_gb - (Optional, Computed) Size of the primary disk attached to each node, specified in GB. The primary disk contains the boot volume and system libraries, and the smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.

        • num_local_ssds - (Optional) The number of local SSD disks that will be attached to each node in the node group. Defaults to 0.

      • accelerators (Optional) The Compute Engine accelerator (GPU) configuration for these instances. Can be specified multiple times.

        • accelerator_type - (Required) The short name of the accelerator type to expose to this instance. For example, nvidia-tesla-k80.

        • accelerator_count - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.


The lifecycle_config block supports:

cluster_config {
  lifecycle_config {
    idle_delete_ttl = "10m"
    auto_delete_time = "2120-01-01T12:00:00.01Z"
  }
}
  • idle_delete_ttl - (Optional) The duration to keep the cluster alive while idling (no jobs running). After this TTL, the cluster will be deleted. Valid range: [10m, 14d].

  • auto_delete_time - (Optional) The time when the cluster will be auto-deleted. A timestamp in RFC3339 UTC "Zulu" format, accurate to nanoseconds. Example: "2014-10-02T15:01:23.045123456Z".


The endpoint_config block (Optional, Computed, Beta) supports:

cluster_config {
  endpoint_config {
    enable_http_port_access = true
  }
}
  • enable_http_port_access - (Optional) The flag to enable HTTP access to specific ports on the cluster from external sources (aka Component Gateway). Defaults to false.

The metastore_config block (Optional, Computed, Beta) supports:

cluster_config {
  metastore_config {
    dataproc_metastore_service = "projects/projectId/locations/region/services/serviceName"
  }
}
  • dataproc_metastore_service - (Required) Resource name of an existing Dataproc Metastore service.

Only resource names that include the project ID and location (region) are valid. Example:

projects/[projectId]/locations/[dataproc_region]/services/[service-name]

Attributes Reference

In addition to the arguments listed above, the following computed attributes are exported:

  • cluster_config.0.master_config.0.instance_names - List of master instance names which have been assigned to the cluster.

  • cluster_config.0.worker_config.0.instance_names - List of worker instance names which have been assigned to the cluster.

  • cluster_config.0.preemptible_worker_config.0.instance_names - List of preemptible instance names which have been assigned to the cluster.

  • cluster_config.0.bucket - The name of the Cloud Storage bucket ultimately used to house the staging data for the cluster. If staging_bucket is specified, it will contain this value; otherwise it will be the auto-generated name.

  • cluster_config.0.software_config.0.properties - A list of the properties used to set the daemon config files. This will include any values supplied by the user via cluster_config.software_config.override_properties.

  • cluster_config.0.lifecycle_config.0.idle_start_time - Time when the cluster became idle (most recent job finished) and became eligible for deletion due to idleness.

  • cluster_config.0.endpoint_config.0.http_ports - The map of port descriptions to URLs. Will only be populated if enable_http_port_access is true.
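These attributes can be read from other parts of a configuration using index syntax. The outputs below are a sketch that assumes the `mycluster` resource from the advanced example; the output names are arbitrary, and the last one is only meaningful when `enable_http_port_access` is true.

```hcl
# Names of the master instances assigned to the cluster.
output "master_instance_names" {
  value = google_dataproc_cluster.mycluster.cluster_config[0].master_config[0].instance_names
}

# Bucket that ends up holding the staging data.
output "staging_bucket" {
  value = google_dataproc_cluster.mycluster.cluster_config[0].bucket
}

# Map of port descriptions to URLs (Component Gateway).
output "component_gateway_urls" {
  value = google_dataproc_cluster.mycluster.cluster_config[0].endpoint_config[0].http_ports
}
```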

Import

This resource does not support import.

Timeouts

This resource provides the following Timeouts configuration options:

  • create - Default is 45 minutes.
  • update - Default is 45 minutes.
  • delete - Default is 45 minutes.
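The defaults can be overridden with a `timeouts` block inside the resource. A minimal sketch, with durations chosen only for illustration:

```hcl
resource "google_dataproc_cluster" "simplecluster" {
  name   = "simplecluster"
  region = "us-central1"

  # Any option left out keeps its 45 minute default.
  timeouts {
    create = "60m"
    delete = "30m"
  }
}
```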