🎉 Destination Bigquery: added gcs upload option #5614

Merged
Changes from 13 commits
17 changes: 16 additions & 1 deletion airbyte-integrations/connectors/destination-bigquery/README.md
@@ -1,3 +1,13 @@
## Uploading options
There are two options for uploading data to BigQuery: `Standard` and `GCS Staging`.
- `Standard` uploads data directly from your source to BigQuery storage. This option is faster and requires fewer resources than the GCS one.
Contributor:
I think we need to explain further when to choose which option. It's not clear to me when I should choose Standard vs. GCS Uploading (CSV format).

Contributor Author:
Updated, thanks

Be aware that syncs may fail for large datasets and slow sources, i.e. if reading from the source takes more than 10-12 hours.
This is caused by limitations of the Google BigQuery SDK client. For more details, see https://github.com/airbytehq/airbyte/issues/3549
- `GCS Uploading (CSV format)`. This approach was implemented to avoid the issue with large datasets mentioned above.
First, all data is uploaded to a GCS bucket; then it is loaded into BigQuery in one shot, stream by stream.
The destination-gcs connector is partially used under the hood, so check its documentation for more details.
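
As a rough illustration of how a connector might pick between the two options, the sketch below reads a `loading_method` object from a parsed config and checks whether its `method` equals `GCS Staging`. The key names and the `GCS Staging` value mirror the constants added in this PR. The `LoadingMethodSketch` class, the `UploadingMethod` enum, and the use of a plain `Map` in place of a JSON node are assumptions made to keep the example self-contained; this is not the connector's actual code.

```java
import java.util.Map;

public class LoadingMethodSketch {

  // Hypothetical enum for illustration; the connector may model this differently.
  enum UploadingMethod { STANDARD, GCS }

  // Key names and value mirror BigQueryConsts in this PR.
  static final String LOADING_METHOD = "loading_method";
  static final String METHOD = "method";
  static final String GCS_STAGING = "GCS Staging";

  // Returns GCS only when config.loading_method.method equals "GCS Staging";
  // any other shape (missing key, different value) falls back to STANDARD.
  static UploadingMethod resolve(Map<String, Object> config) {
    Object loadingMethod = config.get(LOADING_METHOD);
    if (loadingMethod instanceof Map) {
      Object method = ((Map<?, ?>) loadingMethod).get(METHOD);
      if (GCS_STAGING.equals(method)) {
        return UploadingMethod.GCS;
      }
    }
    return UploadingMethod.STANDARD;
  }

  public static void main(String[] args) {
    Map<String, Object> gcsConfig = Map.of(LOADING_METHOD, Map.of(METHOD, GCS_STAGING));
    System.out.println(resolve(gcsConfig));
    System.out.println(resolve(Map.of()));
  }
}
```

Falling back to `STANDARD` when the key is absent is a design choice of this sketch: it keeps configs that predate the new option working unchanged.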


# BigQuery Test Configuration

In order to test the BigQuery destination, you need a service account key file.
@@ -10,9 +20,14 @@ As a community contributor, you will need access to a GCP project and BigQuery t
1. Click on the `+ Create Service Account` button
1. Fill out a descriptive name/id/description
1. Click the edit icon next to the service account you created on the `IAM` page
1. Add the `BigQuery Data Editor` and `BigQuery User` role
1. Add the `BigQuery Data Editor`, `BigQuery User` and `GCS User` roles. For more details, see https://cloud.google.com/storage/docs/access-control/iam-roles
1. Go back to the `Service Accounts` page and use the actions modal to `Create Key`
1. Download this key as a JSON file
1. Create a GCS bucket for testing.
1. Generate an [HMAC key](https://cloud.google.com/storage/docs/authentication/hmackeys) for the bucket with read and write permissions. Note that currently only the HMAC key credential is supported. More credential types will be added in the future.
1. Paste the bucket and key information into the config files under [`./sample_secrets`](./sample_secrets).
1. Rename the directory from `sample_secrets` to `secrets`.
1. Feel free to modify the config files with different settings in the acceptance test file as long as they follow the schema defined in [spec.json](src/main/resources/spec.json).
1. Move and rename this file to `secrets/credentials.json`

## Airbyte Employee
@@ -15,6 +15,8 @@ dependencies {
implementation project(':airbyte-config:models')
implementation project(':airbyte-integrations:bases:base-java')
implementation project(':airbyte-protocol:models')
implementation project(':airbyte-integrations:connectors:destination-s3')
implementation project(':airbyte-integrations:connectors:destination-gcs')

integrationTestJavaImplementation project(':airbyte-integrations:bases:standard-destination-test')
integrationTestJavaImplementation files(project(':airbyte-integrations:bases:base-normalization').airbyteDocker.outputs)
@@ -0,0 +1,24 @@
{
  "basic_bigquery_config": {
    "type": "service_account",
    "project_id": "",
    "private_key_id": "",
    "private_key": "",
    "client_email": "",
    "client_id": "",
    "auth_uri": "",
    "token_uri": "",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": ""
  },
  "gcs_config": {
    "gcs_bucket_name": "",
    "gcs_bucket_path": "test_path",
    "gcs_bucket_region": "us-west1",
    "credential": {
      "credential_type": "HMAC_KEY",
      "hmac_key_access_id": "",
      "hmac_key_secret": ""
    }
  }
}
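
Before running the tests it can help to confirm that the `gcs_config` section of the template has been filled in. The sketch below is one way to do that, assuming the section has already been parsed into nested `Map`s. The class and method names are hypothetical, and treating every field as required is an assumption of this sketch; the authoritative schema is the connector's `spec.json`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class GcsConfigCheckSketch {

  // Returns the names of fields that are missing or blank in a gcs_config map.
  // Assumption: all fields from the sample template are required.
  static List<String> missingFields(Map<String, Object> gcsConfig) {
    List<String> missing = new ArrayList<>();
    for (String key : List.of("gcs_bucket_name", "gcs_bucket_path", "gcs_bucket_region")) {
      if (isBlank(gcsConfig.get(key))) {
        missing.add(key);
      }
    }
    Object credential = gcsConfig.get("credential");
    if (!(credential instanceof Map)) {
      missing.add("credential");
      return missing;
    }
    Map<?, ?> cred = (Map<?, ?>) credential;
    for (String key : List.of("credential_type", "hmac_key_access_id", "hmac_key_secret")) {
      if (isBlank(cred.get(key))) {
        missing.add("credential." + key);
      }
    }
    return missing;
  }

  // A value counts as blank when it is absent, not a string, or only whitespace.
  static boolean isBlank(Object value) {
    return !(value instanceof String) || ((String) value).trim().isEmpty();
  }

  // The gcs_config section exactly as it appears in the sample template.
  static Map<String, Object> sampleTemplate() {
    return Map.of(
        "gcs_bucket_name", "",
        "gcs_bucket_path", "test_path",
        "gcs_bucket_region", "us-west1",
        "credential", Map.of(
            "credential_type", "HMAC_KEY",
            "hmac_key_access_id", "",
            "hmac_key_secret", ""));
  }

  public static void main(String[] args) {
    // Lists the fields that still need values in the untouched template.
    System.out.println(missingFields(sampleTemplate()));
  }
}
```

Run against the untouched template, the check reports the bucket name and both HMAC fields, which are the values the README steps ask you to paste in.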
@@ -0,0 +1,55 @@
/*
* MIT License
*
* Copyright (c) 2020 Airbyte
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all
* copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/

package io.airbyte.integrations.destination.bigquery;

public class BigQueryConsts {

  public static final int MiB = 1024 * 1024;
  public static final String CONFIG_DATASET_ID = "dataset_id";
  public static final String CONFIG_PROJECT_ID = "project_id";
  public static final String CONFIG_DATASET_LOCATION = "dataset_location";
  public static final String CONFIG_CREDS = "credentials_json";
  public static final String BIG_QUERY_CLIENT_CHUNK_SIZE = "big_query_client_buffer_size_mb";

  public static final String LOADING_METHOD = "loading_method";
  public static final String METHOD = "method";
  public static final String GCS_STAGING = "GCS Staging";
  public static final String GCS_BUCKET_NAME = "gcs_bucket_name";
  public static final String GCS_BUCKET_PATH = "gcs_bucket_path";
  public static final String GCS_BUCKET_REGION = "gcs_bucket_region";
  public static final String CREDENTIAL = "credential";
  public static final String FORMAT = "format";
  public static final String KEEP_GCS_FILES = "keep_files_in_gcs-bucket";
  public static final String KEEP_GCS_FILES_VAL = "Keep all tmp files in GCS";

  // tests
  public static final String BIGQUERY_BASIC_CONFIG = "basic_bigquery_config";
  public static final String GCS_CONFIG = "gcs_config";

  public static final String CREDENTIAL_TYPE = "credential_type";
  public static final String HMAC_KEY_ACCESS_ID = "hmac_key_access_id";
  public static final String HMAC_KEY_ACCESS_SECRET = "hmac_key_secret";

}
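
To show how the `MiB` constant and the `big_query_client_buffer_size_mb` key might be combined, here is a minimal sketch that converts the configured buffer size to bytes. It assumes the setting holds a whole number of MiB, as the `_mb` suffix suggests. The `ChunkSizeSketch` class, the `chunkSizeBytes` helper, and the fallback value are hypothetical and are not taken from the connector.

```java
import java.util.Map;

public class ChunkSizeSketch {

  // Values mirror BigQueryConsts in this PR.
  static final int MiB = 1024 * 1024;
  static final String BIG_QUERY_CLIENT_CHUNK_SIZE = "big_query_client_buffer_size_mb";

  // Hypothetical fallback used only in this sketch.
  static final int DEFAULT_CHUNK_SIZE_MB = 15;

  // Reads the buffer size in MiB from the config and converts it to bytes.
  // Math.multiplyExact throws ArithmeticException instead of overflowing silently.
  static int chunkSizeBytes(Map<String, Object> config) {
    Object value = config.get(BIG_QUERY_CLIENT_CHUNK_SIZE);
    int sizeMb = (value instanceof Number) ? ((Number) value).intValue() : DEFAULT_CHUNK_SIZE_MB;
    return Math.multiplyExact(sizeMb, MiB);
  }

  public static void main(String[] args) {
    System.out.println(chunkSizeBytes(Map.of(BIG_QUERY_CLIENT_CHUNK_SIZE, 10)));
    System.out.println(chunkSizeBytes(Map.of()));
  }
}
```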