Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable TPUs to use Shared VPC #3939

Merged
merged 10 commits into from
Sep 17, 2020
16 changes: 14 additions & 2 deletions products/tpu/api.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ objects:
guides:
'Official Documentation':
'https://cloud.google.com/tpu/docs/'
api: 'https://cloud.google.com/tpu/docs/reference/rest/'
api: 'https://cloud.google.com/tpu/docs/reference/rest/v1/projects.locations.nodes'
parameters:
- !ruby/object:Api::Type::String # TODO: resourceref?
name: 'zone'
Expand Down Expand Up @@ -94,8 +94,9 @@ objects:
used.
- !ruby/object:Api::Type::String
name: 'cidrBlock'
required: true
input: true
conflicts:
- use_service_networking
description: |
The CIDR block that the TPU node will use when selecting an IP
address. This CIDR block must be a /29 block; the Compute Engine
Expand All @@ -114,6 +115,17 @@ objects:
node. To share resources, including Google Cloud Storage data, with
the Tensorflow job running in the Node, this account must have
permissions to that data.
- !ruby/object:Api::Type::Boolean
name: 'useServiceNetworking'
description: |
Whether the VPC peering for the node is set up through Service Networking API.
The VPC Peering should be set up before provisioning the node. If this field is set,
cidr_block field should not be specified. If the network that you want to peer the
TPU Node to is a Shared VPC network, the node must be created with this this field enabled.
input: true
default_value: false
conflicts:
- cidr_block
- !ruby/object:Api::Type::NestedObject
name: 'schedulingConfig'
input: true
Expand Down
6 changes: 4 additions & 2 deletions products/tpu/terraform.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -33,13 +33,15 @@ overrides: !ruby/object:Overrides::ResourceOverrides
properties:
name: !ruby/object:Overrides::Terraform::PropertyOverride
custom_flatten: 'templates/terraform/custom_flatten/name_from_self_link.erb'
network: !ruby/object:Overrides::Terraform::PropertyOverride
cidrBlock: !ruby/object:Overrides::Terraform::PropertyOverride
default_from_api: true
diff_suppress_func: 'compareSelfLinkOrResourceName'
schedulingConfig: !ruby/object:Overrides::Terraform::PropertyOverride
diff_suppress_func: 'compareTpuNodeSchedulingConfig'
schedulingConfig.preemptible: !ruby/object:Overrides::Terraform::PropertyOverride
diff_suppress_func: 'compareTpuNodeSchedulingConfig'
network: !ruby/object:Overrides::Terraform::PropertyOverride
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be overwriting the network block from line 36, and also, I don't think we want to always suppress this diff. What if the intent is to change the network, I don't think it'd recognize the change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was suppressing the values the api returns from get call in the Read function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked it to check for a change in network.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will make a huge difference, because if the user really wants to change the network, this will prevent them from doing so. I wonder if we could maybe do a diffsuppress that would parse the network and see if the only difference is the project and then check that the project number is the correct project number? Does that make sense? It's a little hacky, but would be the workaround until the upstream issue was solved.
Also, I think we want to use the diff_suppress_func from line 38 above rather than creating a new network block here. Then, in the new diff suppress func that you're implementing, make sure you include the same logic that is in compareSelfLinkOrResourceName so we don't lose that as well. Does that make sense?
Let me know if you have questions.

Copy link
Contributor Author

@upodroid upodroid Sep 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a look my last commit. How do I get the *Config in to the diffsupress function? I'm using other functions that expect the config block as arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @upodroid! What a finicky issue! I spent my morning looking into how we can do this. Although diff suppress would be ideal, it looks like we can't use it here. If we want access to config we'll have to use a CustomizeDiff.
Basically what I did in the CustomizeDiff was get the old (value using project number) and new (value using project name) values, used regex to find the submatch for project. Then called out to the projects API using the project name and if the returned project.Number was the same as the project number, then i did a diff.SetNew(old). Does that make sense? So basically if they're the same project, we're just going to set it to use the project number so there is no diff.
Let me know if you have questions! Thanks again!!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I amended it to use CustomizeDiff now but i'm still getting diffs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm. It is ready to go.

 REDACTED  MCW0CDP3YY  ~  go  …  github.com  hashicorp  terraform-provider-google-beta   master  66⬇  5✎  USAGE  $    make testacc TEST=./google-beta TESTARGS='-run=TestAccTPUNode_tpuNodeFullExample'
==> Checking source code against gofmt...
==> Checking that code complies with gofmt requirements...
TF_ACC=1 TF_SCHEMA_PANIC_ON_ERROR=1 go test ./google-beta -v -run=TestAccTPUNode_tpuNodeFullExample -timeout 240m -ldflags="-X=github.com/hashicorp/terraform-provider-google-beta/version.ProviderVersion=acc"
=== RUN   TestAccTPUNode_tpuNodeFullExample
=== PAUSE TestAccTPUNode_tpuNodeFullExample
=== CONT  TestAccTPUNode_tpuNodeFullExample
--- PASS: TestAccTPUNode_tpuNodeFullExample (364.17s)
PASS
ok      github.com/hashicorp/terraform-provider-google-beta/google-beta 366.543s

default_from_api: true
diff_suppress_func: 'compareTpuNodeNetwork'
zone: !ruby/object:Overrides::Terraform::PropertyOverride
ignore_read: true
custom_code: !ruby/object:Provider::Terraform::CustomCode
Expand Down
33 changes: 33 additions & 0 deletions templates/terraform/constants/tpu_node.erb
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,39 @@ func compareTpuNodeSchedulingConfig(k, old, new string, d *schema.ResourceData)
return false
}


// The API takes the project_id in the network field and returns it as a number which causes permadiff.
upodroid marked this conversation as resolved.
Show resolved Hide resolved
func compareTpuNodeNetwork(k, old, new string, d *schema.ResourceData) bool {
// 1st part of diffsuppress

retval := false
network := GetResourceNameFromSelfLink(old)

serviceNetworkingNetworkName, err := retrieveServiceNetworkingNetworkName(d, config, network)
if err != nil {
return err
}

if new == serviceNetworkingNetworkName {
retval = true
}

// 2nd part of diffsuppress
newParts := strings.Split(new, "/")

if len(newParts) == 1 {
// `new` is a name
// `old` is always a self_link
if GetResourceNameFromSelfLink(old) == newParts[0] {
retval = true
}
}

// The `new` string is a self_link
retval = retval && compareSelfLinkRelativePaths("", old, new, nil)
return retval
}

func validateHttpHeaders() schema.SchemaValidateFunc {
return func(i interface{}, k string) (s []string, es []error) {
headers := i.(map[string]interface{})
Expand Down
23 changes: 21 additions & 2 deletions templates/terraform/examples/tpu_node_full.tf.erb
Original file line number Diff line number Diff line change
Expand Up @@ -13,17 +13,17 @@ resource "google_tpu_node" "<%= ctx[:primary_resource_id] %>" {

accelerator_type = "v3-8"

cidr_block = "10.3.0.0/29"
tensorflow_version = data.google_tpu_tensorflow_versions.available.versions[0]

description = "Terraform Google Provider test TPU"
use_service_networking = true
<%#-
We previously used a separate network resource here, but TPUs only allow using 50
different network names, ever. This caused our tests to start failing, so just
use the default network in order to still demonstrate using as many fields as
possible on the resource.
-%>
network = "default"
network = google_service_networking_connection.private_service_connection.network

labels = {
foo = "bar"
Expand All @@ -33,3 +33,22 @@ resource "google_tpu_node" "<%= ctx[:primary_resource_id] %>" {
preemptible = true
}
}

data "google_compute_network" "network" {
name = "default"
}

resource "google_compute_global_address" "service_range" {
provider = google-beta
upodroid marked this conversation as resolved.
Show resolved Hide resolved
name = "tf-test%{random_suffix}"
purpose = "VPC_PEERING"
address_type = "INTERNAL"
prefix_length = 16
network = data.google_compute_network.network.id
}

resource "google_service_networking_connection" "private_service_connection" {
network = data.google_compute_network.network.id
service = "servicenetworking.googleapis.com"
reserved_peering_ranges = [google_compute_global_address.service_range.name]
}