
Commit

docs: materialization sync schedule
Adds and updates documentation for estuary/connectors#1696
williamhbaker committed Jul 17, 2024
1 parent d229b0a commit f991039
Showing 6 changed files with 208 additions and 70 deletions.
Original file line number Diff line number Diff line change
@@ -64,8 +64,6 @@ For a complete introduction to resource organization in Bigquery, see the [BigQu
| **`/bucket`** | Bucket | Name of the GCS bucket. | String | Required |
| `/bucket_path` | Bucket path | Base path within the GCS bucket. Also called "Folder" in the GCS console. | String | |
| `/billing_project_id` | Billing project ID | The project ID to which these operations are billed in BigQuery. Typically, you want this to be the same as `project_id` (the default). | String | Same as `project_id` |
| `/advanced` | Advanced Options | Options for advanced users. You should not typically need to modify these. | object | |
| `/advanced/updateDelay` | Update Delay | Potentially reduce compute time by increasing the delay between updates. Defaults to 30 minutes if unset. | string | |

To learn more about project billing, [see the BigQuery docs](https://cloud.google.com/billing/docs/how-to/verify-billing-enabled).

@@ -98,15 +96,10 @@ materializations:
source: ${PREFIX}/${source_collection}
```
## Update Delay
## Sync Schedule
The `Update Delay` parameter in Estuary materializations offers a flexible approach to data ingestion scheduling. This advanced option allows users to control when the materialization or capture tasks pull in new data by specifying a delay period. By incorporating an update delay into your workflow, you can effectively manage and optimize your active warehouse time, leading to potentially lower costs and more efficient data processing.

An update delay is configured in the advanced settings of a materialization's configuration. It represents the amount of time the system will wait before it begins materializing the latest data. This delay is specified as a duration and can be adjusted according to the needs of your data pipeline.

For example, if an update delay is set to 2 hours, the materialization task will pause for 2 hours before processing the latest available data. This delay ensures that data is not pulled in immediately after it becomes available, allowing for batching and other optimizations that can reduce warehouse load and processing time.

To configure an update delay, navigate to the `Advanced Options` section of the materialization's configuration and select a value from the drop-down. The default value for the update delay in Estuary materializations is 30 minutes.
This connector supports configuring a schedule for sync frequency. You can read
about how to configure this [here](../../materialization-sync-schedule.md).
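As a rough sketch, a sync schedule might appear in the endpoint configuration along these lines (all field names and values below are illustrative assumptions — confirm them against the linked sync schedule documentation):

```yaml
materializations:
  ${PREFIX}/${mat_name}:
    endpoint:
      connector:
        image: ghcr.io/estuary/materialize-bigquery:dev
        config:
          # Hypothetical schedule settings; check the linked docs
          # for the actual field names and accepted values.
          syncSchedule:
            syncFrequency: 30m
            timezone: America/New_York
            fastSyncStartTime: "09:00"
            fastSyncStopTime: "17:00"
```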
## Delta updates
@@ -136,8 +136,6 @@ Use the below properties to configure a Snowflake materialization, which will di
| **`/credentials/user`** | User | Snowflake username | string | Required |
| `/credentials/password` | Password | Required if using user_password authentication | string | Required |
| `/credentials/privateKey` | Private Key | Required if using jwt authentication | string | Required |
| `/advanced` | Advanced Options | Options for advanced users. You should not typically need to modify these. | object | |
| `/advanced/updateDelay` | Update Delay | Potentially reduce active warehouse time by increasing the delay between updates. | string | |

#### Bindings

@@ -209,6 +207,30 @@ materializations:
source: ${PREFIX}/${source_collection}
```
## Sync Schedule
This connector supports configuring a schedule for sync frequency. You can read
about how to configure this [here](../../materialization-sync-schedule.md).
Snowflake compute is [priced](https://www.snowflake.com/pricing/) per second of
activity, with a minimum of 60 seconds. Inactive warehouses don't incur charges.
To keep costs down, you'll want to minimize your warehouse's active time.
To accomplish this, we recommend a two-pronged approach:
* [Configure your Snowflake warehouse to auto-suspend](https://docs.snowflake.com/en/sql-reference/sql/create-warehouse.html#:~:text=Specifies%20the%20number%20of%20seconds%20of%20inactivity%20after%20which%20a%20warehouse%20is%20automatically%20suspended.) after 60 seconds.
This ensures that after each transaction completes, you'll only be charged for one minute of compute, Snowflake's smallest granularity.
Use a query like the one shown below, being sure to substitute your warehouse name:
```sql
ALTER WAREHOUSE ESTUARY_WH SET auto_suspend = 60;
```

* Configure the materialization's **Sync Schedule** based on your requirements for data freshness.
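To verify the auto-suspend setting from the first step, you can list the warehouse's properties (again substituting your warehouse name; the `auto_suspend` column should read `60`):

```sql
-- Shows warehouse properties, including the auto_suspend value.
SHOW WAREHOUSES LIKE 'ESTUARY_WH';
```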


## Delta updates

This connector supports both standard (merge) and [delta updates](../../../concepts/materialization.md#delta-updates).
@@ -245,47 +267,6 @@ This is because most materializations tend to be roughly chronological over time
This means that updates of keys `/date, /user_id` will need to physically read far fewer rows as compared to a key like `/user_id`,
because those rows will tend to live in the same micro-partitions, and Snowflake is able to cheaply prune micro-partitions that aren't relevant to the transaction.
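As an illustration with a hypothetical table and column names, a lookup constrained by the leading `/date` key component touches only a narrow slice of micro-partitions, whereas a bare `user_id` predicate would have to scan far more broadly:

```sql
-- Hypothetical table materialized with key (/date, /user_id).
-- The date predicate lets Snowflake prune micro-partitions that
-- fall outside the one-day range before touching user_id.
SELECT *
FROM my_materialized_table
WHERE date = '2024-07-17'
  AND user_id = 12345;
```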

### Reducing active warehouse time

Snowflake compute is [priced](https://www.snowflake.com/pricing/) per second of activity, with a minimum of 60 seconds.
Inactive warehouses don't incur charges.
To keep costs down, you'll want to minimize your warehouse's active time.

Like other Estuary connectors, this is a real-time connector that materializes documents using continuous [**transactions**](../../../concepts/advanced/shards.md#transactions).
Every time a Flow materialization commits a transaction, your warehouse becomes active.

If your source data collection or collections don't change much, this shouldn't cause an issue;
Flow only commits transactions when data has changed.
However, if your source data is frequently updated, your materialization may have frequent transactions that result in
excessive active time in the warehouse, and thus a higher bill from Snowflake.

To mitigate this, we recommend a two-pronged approach:

* [Configure your Snowflake warehouse to auto-suspend](https://docs.snowflake.com/en/sql-reference/sql/create-warehouse.html#:~:text=Specifies%20the%20number%20of%20seconds%20of%20inactivity%20after%20which%20a%20warehouse%20is%20automatically%20suspended.) after 60 seconds.

This ensures that after each transaction completes, you'll only be charged for one minute of compute, Snowflake's smallest granularity.

Use a query like the one shown below, being sure to substitute your warehouse name:

```sql
ALTER WAREHOUSE ESTUARY_WH SET auto_suspend = 60;
```

* Configure the materialization's **update delay** by setting a value in the advanced configuration.

For example, if you set the warehouse to auto-suspend after 60 seconds and set the materialization's
update delay to 30 minutes, you can incur as little as 48 minutes per day of active time in the warehouse.
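The arithmetic behind that figure can be sketched directly (a worst case that assumes every 30-minute window commits a transaction and each commit bills only the 60-second minimum):

```sql
-- 24h / 30min = 48 possible syncs per day; at 60 billed seconds
-- each, worst-case active time is 48 minutes per day.
SELECT (24 * 60) / 30       AS max_syncs_per_day,    -- 48
       (24 * 60) / 30 * 60  AS max_active_seconds;   -- 2880 = 48 min
```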

### Update Delay

The `Update Delay` parameter in Estuary materializations offers a flexible approach to data ingestion scheduling. This advanced option allows users to control when the materialization or capture tasks pull in new data by specifying a delay period. By incorporating an update delay into your workflow, you can effectively manage and optimize your active warehouse time, leading to potentially lower costs and more efficient data processing.

An update delay is configured in the advanced settings of a materialization's configuration. It represents the amount of time the system will wait before it begins materializing the latest data. This delay is specified as a duration and can be adjusted according to the needs of your data pipeline.

For example, if an update delay is set to 2 hours, the materialization task will pause for 2 hours before processing the latest available data. This delay ensures that data is not pulled in immediately after it becomes available, allowing for batching and other optimizations that can reduce warehouse load and processing time.

To configure an update delay, navigate to the `Advanced Options` section of the materialization's configuration and select a value from the drop-down. The default value for the update delay in Estuary materializations is 30 minutes.

### Snowpipe

[Snowpipe](https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro) allows for loading data into target tables without waking up the warehouse, which can be cheaper and more performant. Snowpipe can be used for delta updates bindings, and it requires configuring your authentication using a private key. Instructions for configuring key-pair authentication can be found in this page: [Key-pair Authentication & Snowpipe](#key-pair-authentication--snowpipe)
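As a sketch of the Snowflake side of that setup (the user name is a placeholder and the key value is truncated for illustration; follow the linked section for the full procedure):

```sql
-- Attach an RSA public key to the connector's user; paste the
-- base64 body of your public key without the PEM header/footer.
ALTER USER estuary_user SET RSA_PUBLIC_KEY = 'MIIBIjANBgkqhki...';
```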
@@ -49,8 +49,6 @@ more of your Flow collections to your desired tables in the database.
| **`/bucket`** | S3 Staging Bucket | Name of the S3 bucket to use for staging data loads. | string | Required |
| **`/region`** | Region | Region of the S3 staging bucket. For optimal performance this should be in the same region as the Redshift database cluster. | string | Required |
| `/bucketPath` | Bucket Path | A prefix that will be used to store objects in S3. | string | |
| `/advanced` | Advanced Options | Options for advanced users. You should not typically need to modify these. | object | |
| `/advanced/updateDelay` | Update Delay | Potentially reduce active cluster time by increasing the delay between updates. Defaults to 30 minutes if unset. | string | |

#### Bindings

@@ -83,6 +81,11 @@ materializations:
source: ${PREFIX}/${COLLECTION_NAME}
```
## Sync Schedule
This connector supports configuring a schedule for sync frequency. You can read
about how to configure this [here](../../materialization-sync-schedule.md).
## Setup
You must configure your cluster to allow connections from Estuary. This can be accomplished by
@@ -27,7 +27,7 @@ If you haven't yet captured your data from its external source, start at the beg

You need to first create a SQL Warehouse if you don't already have one in your account. See [Databricks documentation](https://docs.databricks.com/en/sql/admin/create-sql-warehouse.html) on configuring a Databricks SQL Warehouse. After creating a SQL Warehouse, you can find the details necessary for connecting to it under the **Connection Details** tab.

In order to save on costs, we recommend that you set the Auto Stop parameter for your SQL warehouse to the minimum available. Estuary's Databricks connector automatically delays updates to the destination up to a configured Update Delay (see the endpoint configuration below), with a default value of 30 minutes. If your SQL warehouse is configured to have an Auto Stop of more than 15 minutes, we disable the automatic delay since the delay is not as effective in saving costs with a long Auto Stop idle period.
In order to save on costs, we recommend that you set the Auto Stop parameter for your SQL warehouse to the minimum available. Estuary's Databricks connector automatically delays updates to the destination according to the configured **Sync Schedule** (see configuration details below), with a default delay value of 30 minutes.

You also need an access token for your user to be used by our connector; see the respective [documentation](https://docs.databricks.com/en/administration-guide/access-control/tokens.html) from Databricks on how to create an access token.

@@ -49,8 +49,6 @@ Use the below properties to configure a Databricks materialization, which will d
| **`/credentials`** | Credentials | Authentication credentials | object | |
| **`/credentials/auth_type`** | Role | Authentication type, set to `PAT` for personal access token | string | Required |
| **`/credentials/personal_access_token`** | Role | Personal Access Token | string | Required |
| /advanced | Advanced | Options for advanced users. You should not typically need to modify these. | object | |
| /advanced/updateDelay | Update Delay | Potentially reduce active warehouse time by increasing the delay between updates. Defaults to 30 minutes if unset. | string | 30m |

#### Bindings

@@ -86,6 +84,11 @@ materializations:
source: ${PREFIX}/${source_collection}
```
## Sync Schedule
This connector supports configuring a schedule for sync frequency. You can read
about how to configure this [here](../../materialization-sync-schedule.md).
## Delta updates
This connector supports both standard (merge) and [delta updates](../../../concepts/materialization.md#delta-updates).
@@ -107,16 +110,6 @@ You can enable delta updates on a per-binding basis:
source: ${PREFIX}/${source_collection}
```
## Update Delay
The `Update Delay` parameter in Estuary materializations offers a flexible approach to data ingestion scheduling. This advanced option allows users to control when the materialization or capture tasks pull in new data by specifying a delay period. By incorporating an update delay into your workflow, you can effectively manage and optimize your active warehouse time, leading to potentially lower costs and more efficient data processing.

An update delay is configured in the advanced settings of a materialization's configuration. It represents the amount of time the system will wait before it begins materializing the latest data. This delay is specified as a duration and can be adjusted according to the needs of your data pipeline.

For example, if an update delay is set to 2 hours, the materialization task will pause for 2 hours before processing the latest available data. This delay ensures that data is not pulled in immediately after it becomes available, allowing for batching and other optimizations that can reduce warehouse load and processing time.

To configure an update delay, navigate to the `Advanced Options` section of the materialization's configuration and select a value from the drop-down. The default value for the update delay in Estuary materializations is 30 minutes.

## Reserved words
Databricks has a list of reserved words that must be quoted in order to be used as an identifier. Flow automatically quotes fields that are in the reserved words list. You can find this list in Databricks's documentation [here](https://docs.databricks.com/en/sql/language-manual/sql-ref-reserved-words.html) and in the table below.
@@ -45,8 +45,6 @@ Use the below properties to configure a Starburst materialization, which will di
| **`/region`** | AWS Region | Region of AWS storage | string | Required |
| **`/bucket`** | Bucket name | | string | Required |
| **`/bucketPath`** | Bucket path | A prefix that will be used to store objects in S3. | string | Required |
| /advanced | Advanced | Options for advanced users. You should not typically need to modify these. | string | |
| /advanced/updateDelay | Update Delay | Potentially reduce active warehouse time by increasing the delay between updates. Defaults to 30 minutes if unset. | string | 30m |

#### Bindings

@@ -84,6 +82,11 @@ materializations:
source: ${PREFIX}/${source_collection}
```
## Sync Schedule
This connector supports configuring a schedule for sync frequency. You can read
about how to configure this [here](../../materialization-sync-schedule.md).
## Reserved words
Starburst Galaxy has a list of reserved words that must be quoted in order to be used as an identifier. Flow automatically quotes fields that are in the reserved words list. You can find this list in Trino's documentation [here](https://trino.io/docs/current/language/reserved.html) and in the table below.