Support BigQuery custom schemas for external data using CSV / NDJSON #3717

Merged
46 changes: 41 additions & 5 deletions third_party/terraform/resources/resource_bigquery_table.go
@@ -109,6 +109,21 @@ func resourceBigQueryTable() *schema.Resource {
Default: "NONE",
Description: `The compression type of the data source. Valid values are "NONE" or "GZIP".`,
},
// Schema: [Optional] The schema for the data.
// Schema is required for CSV and JSON formats if autodetect is not on.
// Schema is disallowed for Google Cloud Bigtable, Cloud Datastore backups, Avro, ORC and Parquet formats.
"schema": {
Type: schema.TypeString,
Optional: true,
Computed: true,
ForceNew: true,
ValidateFunc: validation.ValidateJsonString,
StateFunc: func(v interface{}) string {
json, _ := structure.NormalizeJsonString(v)
return json
},
Description: `A JSON schema for the external table. Schema is required for CSV and JSON formats and is disallowed for Google Cloud Bigtable, Cloud Datastore backups, and Avro formats when using external tables.`,
},
// CsvOptions: [Optional] Additional properties to set if
// sourceFormat is set to CSV.
"csv_options": {
@@ -275,9 +290,6 @@ func resourceBigQueryTable() *schema.Resource {
},

// Schema: [Optional] Describes the schema of this table.
// Schema is required for external tables in CSV and JSON formats
// and disallowed for Google Cloud Bigtable, Cloud Datastore backups,
// and Avro formats.
"schema": {
Type: schema.TypeString,
Optional: true,
@@ -287,7 +299,7 @@ func resourceBigQueryTable() *schema.Resource {
json, _ := structure.NormalizeJsonString(v)
return json
},
Description: `A JSON schema for the table. Schema is required for CSV and JSON formats and is disallowed for Google Cloud Bigtable, Cloud Datastore backups, and Avro formats when using external tables.`,
Description: `A JSON schema for the table.`,
},

// View: [Optional] If specified, configures this table as a view.
@@ -636,7 +648,6 @@ func resourceBigQueryTableCreate(d *schema.ResourceData, meta interface{}) error
}

log.Printf("[INFO] BigQuery table %s has been created", res.Id)

d.SetId(fmt.Sprintf("projects/%s/datasets/%s/tables/%s", res.TableReference.ProjectId, res.TableReference.DatasetId, res.TableReference.TableId))

return resourceBigQueryTableRead(d, meta)
@@ -683,6 +694,24 @@ func resourceBigQueryTableRead(d *schema.ResourceData, meta interface{}) error {
return err
}

if v, ok := d.GetOk("external_data_configuration"); ok {
// The API response doesn't return the `external_data_configuration.schema`
// used when creating the table, and it cannot be queried.
// After creation, a computed schema is stored in the top-level `schema`,
// which combines `external_data_configuration.schema`
// with any hive partitioning fields found in the `source_uri_prefix`.
// So just assume the configured schema has been applied after successful
// creation by copying the configured value back into the resource state.
// This prevents the read-back of this field from being flagged as a change.
// The `ForceNew=true` on `external_data_configuration.schema` will ensure
// the users' expectation that changing the configured input schema will
// recreate the resource.
edc := v.([]interface{})[0].(map[string]interface{})
if edc["schema"] != nil {
externalDataConfiguration[0]["schema"] = edc["schema"]
}
}

Contributor:
I don't totally follow why not reading back the external schema requires us to do ForceNew. Couldn't we still allow updating that field even while not detecting drift on it?

Contributor (Author):
The ForceNew is not required in terms of reading back the data; I just consider it better UX with respect to the user's expectations. If you change the schema, the most probable intent is to recreate the table with that schema — as far as I know, you cannot change the schema of an existing table in place.

The external_data_configuration.schema is only used as an input parameter when creating the table; when we read back the table, this field is always empty. There is a computed schema returned at the top level, which reflects the effective schema of the created table, but that value is calculated by combining the schema provided here with any field/type mappings BigQuery can infer by autodetection and/or from the source_uri_prefix. I wanted to avoid having to determine whether external_data_configuration.schema is accurately reflected in the computed schema and reimplement BigQuery's logic in doing so, so I just assume it is successfully reflected after creation; hence I ignore this field by ensuring there are no changes from what is configured.

Perhaps there's a smarter way to do this?

Contributor:
That makes sense. If it's not clear that it can be updated and users wouldn't necessarily expect it to be, we can leave it as ForceNew. The worst thing that happens is someone files an issue asking for update support.

d.Set("external_data_configuration", externalDataConfiguration)
}

@@ -804,6 +833,13 @@ func expandExternalDataConfiguration(cfg interface{}) (*bigquery.ExternalDataConfiguration, error) {
if v, ok := raw["max_bad_records"]; ok {
edc.MaxBadRecords = int64(v.(int))
}
if v, ok := raw["schema"]; ok {
schema, err := expandSchema(v)
if err != nil {
return nil, err
}
edc.Schema = schema
}
if v, ok := raw["source_format"]; ok {
edc.SourceFormat = v.(string)
}
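For context, the `expandSchema` helper invoked in the hunk above already exists in the provider for the top-level `schema` field. As a rough sketch of what such a helper does — illustrative only, not necessarily the provider's exact code — it decodes the JSON schema string held in Terraform state into the BigQuery API representation:

```go
package sketch

import (
	"encoding/json"
	"fmt"

	"google.golang.org/api/bigquery/v2"
)

// expandSchema decodes a JSON array of field definitions (as stored in
// Terraform state) into the API's TableSchema type. Error handling and
// nil checks in the real helper may differ.
func expandSchema(raw interface{}) (*bigquery.TableSchema, error) {
	var fields []*bigquery.TableFieldSchema
	if err := json.Unmarshal([]byte(raw.(string)), &fields); err != nil {
		return nil, fmt.Errorf("error decoding JSON schema: %v", err)
	}
	return &bigquery.TableSchema{Fields: fields}, nil
}
```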
90 changes: 82 additions & 8 deletions third_party/terraform/tests/resource_bigquery_table_test.go
@@ -119,6 +119,31 @@ func TestAccBigQueryTable_HivePartitioning(t *testing.T) {
})
}

func TestAccBigQueryTable_HivePartitioningCustomSchema(t *testing.T) {
t.Parallel()
bucketName := testBucketName(t)
resourceName := "google_bigquery_table.test"
datasetID := fmt.Sprintf("tf_test_%s", randString(t, 10))
tableID := fmt.Sprintf("tf_test_%s", randString(t, 10))

vcrTest(t, resource.TestCase{
PreCheck: func() { testAccPreCheck(t) },
Providers: testAccProviders,
CheckDestroy: testAccCheckBigQueryTableDestroyProducer(t),
Steps: []resource.TestStep{
{
Config: testAccBigQueryTableHivePartitioningCustomSchema(bucketName, datasetID, tableID),
},
{
ResourceName: resourceName,
ImportState: true,
ImportStateVerify: true,
ImportStateVerifyIgnore: []string{"external_data_configuration.0.schema"},
},
},
})
}

func TestAccBigQueryTable_RangePartitioning(t *testing.T) {
t.Parallel()
resourceName := "google_bigquery_table.test"
@@ -480,23 +505,72 @@ resource "google_storage_bucket_object" "test" {
}

resource "google_bigquery_dataset" "test" {
dataset_id = "%s"
dataset_id = "%s"
}

resource "google_bigquery_table" "test" {
table_id = "%s"
dataset_id = google_bigquery_dataset.test.dataset_id

external_data_configuration {
source_format = "CSV"
autodetect = true
source_uris= ["gs://${google_storage_bucket.test.name}/*"]
source_format = "CSV"
autodetect = true
source_uris= ["gs://${google_storage_bucket.test.name}/*"]

hive_partitioning_options {
mode = "AUTO"
source_uri_prefix = "gs://${google_storage_bucket.test.name}/"
}
hive_partitioning_options {
mode = "AUTO"
source_uri_prefix = "gs://${google_storage_bucket.test.name}/"
}

}
depends_on = ["google_storage_bucket_object.test"]
}
`, bucketName, datasetID, tableID)
}

func testAccBigQueryTableHivePartitioningCustomSchema(bucketName, datasetID, tableID string) string {
return fmt.Sprintf(`
resource "google_storage_bucket" "test" {
name = "%s"
force_destroy = true
}

resource "google_storage_bucket_object" "test" {
name = "key1=20200330/data.json"
content = "{\"name\":\"test\", \"last_modification\":\"2020-04-01\"}"
bucket = google_storage_bucket.test.name
}

resource "google_bigquery_dataset" "test" {
dataset_id = "%s"
}

resource "google_bigquery_table" "test" {
table_id = "%s"
dataset_id = google_bigquery_dataset.test.dataset_id

external_data_configuration {
source_format = "NEWLINE_DELIMITED_JSON"
autodetect = false
source_uris = ["gs://${google_storage_bucket.test.name}/*"]

hive_partitioning_options {
mode = "CUSTOM"
source_uri_prefix = "gs://${google_storage_bucket.test.name}/{key1:STRING}"
}

schema = <<EOH
[
{
"name": "name",
"type": "STRING"
},
{
"name": "last_modification",
"type": "DATE"
}
]
EOH
}
depends_on = ["google_storage_bucket_object.test"]
}
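Given this fixture, the computed top-level `schema` read back after creation should combine the configured fields with the hive partitioning key inferred from `source_uri_prefix` — roughly the following (a sketch based on the behavior described in the resource comments; exact field attributes such as mode may differ):

```json
[
  {"name": "name", "type": "STRING"},
  {"name": "last_modification", "type": "DATE"},
  {"name": "key1", "type": "STRING"}
]
```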
18 changes: 13 additions & 5 deletions third_party/terraform/website/docs/r/bigquery_table.html.markdown
@@ -112,11 +112,7 @@ The following arguments are supported:

* `labels` - (Optional) A mapping of labels to assign to the resource.

* `schema` - (Optional) A JSON schema for the table. Schema is required
for CSV and JSON formats and is disallowed for Google Cloud
Bigtable, Cloud Datastore backups, and Avro formats when using
external tables. For more information see the
[BigQuery API documentation](https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource).
* `schema` - (Optional) A JSON schema for the table.
~>**NOTE**: Because this field expects a JSON string, any changes to the
string will create a diff, even if the JSON itself hasn't changed.
If the API returns a different value for the same schema, e.g. it
@@ -167,6 +163,18 @@ The `external_data_configuration` block supports:
* `max_bad_records` (Optional) - The maximum number of bad records that
BigQuery can ignore when reading data.

* `schema` - (Optional) A JSON schema for the external table. Schema is required
for CSV and JSON formats if autodetect is not on. Schema is disallowed
for Google Cloud Bigtable, Cloud Datastore backups, Avro, ORC and Parquet formats.
~>**NOTE**: Because this field expects a JSON string, any changes to the
string will create a diff, even if the JSON itself hasn't changed.
Furthermore, drift for this field cannot be detected because BigQuery
only uses this schema to compute the effective schema for the table; therefore,
any change to the configured value will force the table to be recreated.
This schema is effectively only applied when creating a table from an external
data source; after creation, the computed schema will be stored in
`google_bigquery_table.schema` (see the usage sketch at the end of this page).

* `source_format` (Required) - The data format. Supported values are:
"CSV", "GOOGLE_SHEETS", "NEWLINE_DELIMITED_JSON", "AVRO", "PARQUET",
and "DATSTORE_BACKUP". To use "GOOGLE_SHEETS"
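To illustrate the new field, a minimal configuration sketch for an external NDJSON table with an explicit schema might look like the following (resource and bucket names are illustrative, and assume a `google_bigquery_dataset.example` exists):

```hcl
resource "google_bigquery_table" "example" {
  dataset_id = google_bigquery_dataset.example.dataset_id
  table_id   = "example"

  external_data_configuration {
    source_format = "NEWLINE_DELIMITED_JSON"
    autodetect    = false
    source_uris   = ["gs://example-bucket/*"]

    # Changing this value forces the external table to be recreated (ForceNew).
    schema = <<EOF
[
  {"name": "name", "type": "STRING"},
  {"name": "last_modification", "type": "DATE"}
]
EOF
  }
}
```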