diff --git a/site/docs/reference/Connectors/materialization-connectors/BigQuery.md b/site/docs/reference/Connectors/materialization-connectors/BigQuery.md index e04eda052d..aabdc4f996 100644 --- a/site/docs/reference/Connectors/materialization-connectors/BigQuery.md +++ b/site/docs/reference/Connectors/materialization-connectors/BigQuery.md @@ -64,8 +64,6 @@ For a complete introduction to resource organization in Bigquery, see the [BigQu | **`/bucket`** | Bucket | Name of the GCS bucket. | String | Required | | `/bucket_path` | Bucket path | Base path within the GCS bucket. Also called "Folder" in the GCS console. | String | | | `/billing_project_id` | Billing project ID | The project ID to which these operations are billed in BigQuery. Typically, you want this to be the same as `project_id` (the default). | String | Same as `project_id` | -| `/advanced` | Advanced Options | Options for advanced users. You should not typically need to modify these. | object | | -| `/advanced/updateDelay` | Update Delay | Potentially reduce compute time by increasing the delay between updates. Defaults to 30 minutes if unset. | string | | To learn more about project billing, [see the BigQuery docs](https://cloud.google.com/billing/docs/how-to/verify-billing-enabled). @@ -98,15 +96,10 @@ materializations: source: ${PREFIX}/${source_collection} ``` -## Update Delay +## Sync Schedule -The `Update Delay` parameter in Estuary materializations offers a flexible approach to data ingestion scheduling. This advanced option allows users to control when the materialization or capture tasks pull in new data by specifying a delay period. By incorporating an update delay into your workflow, you can effectively manage and optimize your active warehouse time, leading to potentially lower costs and more efficient data processing. - -An update delay is configured in the advanced settings of a materialization's configuration. It represents the amount of time the system will wait before it begins materializing the latest data. This delay is specified in hours and can be adjusted according to the needs of your data pipeline. - -For example, if an update delay is set to 2 hours, the materialization task will pause for 2 hours before processing the latest available data. This delay ensures that data is not pulled in immediately after it becomes available, allowing for batching and other optimizations that can reduce warehouse load and processing time. - -To configure an update delay, navigate the `Advanced Options` section of the materialization's configuration and select a value from the drop down. The default value for the update delay in Estuary materializations is set to 30 minutes. +This connector supports configuring a schedule for sync frequency. You can read +about how to configure this [here](../../materialization-sync-schedule.md). 
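+
+As a quick orientation, a minimal sketch of a schedule in the endpoint configuration of a `flow.yaml` spec might look like the following. The grouping of the schedule properties under a `syncSchedule` key is an assumption; check the connector's generated endpoint schema for the exact property names and nesting.
+
+```yaml
+materializations:
+  ${PREFIX}/${mat_name}:
+    endpoint:
+      connector:
+        image: ghcr.io/estuary/materialize-bigquery:dev
+        config:
+          # ... other required endpoint configuration ...
+          syncSchedule:
+            # Hypothetical value: sync every 15 minutes, around the clock.
+            syncFrequency: 15m
+    # bindings are configured as usual
+```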
## Delta updates diff --git a/site/docs/reference/Connectors/materialization-connectors/Snowflake.md b/site/docs/reference/Connectors/materialization-connectors/Snowflake.md index 45829284cb..2d26c6a68b 100644 --- a/site/docs/reference/Connectors/materialization-connectors/Snowflake.md +++ b/site/docs/reference/Connectors/materialization-connectors/Snowflake.md @@ -136,8 +136,6 @@ Use the below properties to configure a Snowflake materialization, which will di | **`/credentials/user`** | User | Snowflake username | string | Required | | `/credentials/password` | Password | Required if using user_password authentication | string | Required | | `/credentials/privateKey` | Private Key | Required if using jwt authentication | string | Required | -| `/advanced` | Advanced Options | Options for advanced users. You should not typically need to modify these. | object | | -| `/advanced/updateDelay` | Update Delay | Potentially reduce active warehouse time by increasing the delay between updates. | string | | #### Bindings @@ -209,6 +207,30 @@ materializations: source: ${PREFIX}/${source_collection} ``` +## Sync Schedule + +This connector supports configuring a schedule for sync frequency. You can read +about how to configure this [here](../../materialization-sync-schedule.md). + +Snowflake compute is [priced](https://www.snowflake.com/pricing/) per second of +activity, with a minimum of 60 seconds. Inactive warehouses don't incur charges. +To keep costs down, you'll want to minimize your warehouse's active time. + +To accomplish this, we recommend a two-pronged approach: + +* [Configure your Snowflake warehouse to auto-suspend](https://docs.snowflake.com/en/sql-reference/sql/create-warehouse.html#:~:text=Specifies%20the%20number%20of%20seconds%20of%20inactivity%20after%20which%20a%20warehouse%20is%20automatically%20suspended.) after 60 seconds. + + This ensures that after each transaction completes, you'll only be charged for one minute of compute, Snowflake's smallest granularity. + + Use a query like the one shown below, being sure to substitute your warehouse name: + + ```sql + ALTER WAREHOUSE ESTUARY_WH SET auto_suspend = 60; + ``` + +* Configure the materialization's **Sync Schedule** based on your requirements for data freshness. + + ## Delta updates This connector supports both standard (merge) and [delta updates](../../../concepts/materialization.md#delta-updates). @@ -245,47 +267,6 @@ This is because most materializations tend to be roughly chronological over time This means that updates of keys `/date, /user_id` will need to physically read far fewer rows as compared to a key like `/user_id`, because those rows will tend to live in the same micro-partitions, and Snowflake is able to cheaply prune micro-partitions that aren't relevant to the transaction. -### Reducing active warehouse time - -Snowflake compute is [priced](https://www.snowflake.com/pricing/) per second of activity, with a minimum of 60 seconds. -Inactive warehouses don't incur charges. -To keep costs down, you'll want to minimize your warehouse's active time. - -Like other Estuary connectors, this is a real-time connector that materializes documents using continuous [**transactions**](../../../concepts/advanced/shards.md#transactions). -Every time a Flow materialization commits a transaction, your warehouse becomes active. - -If your source data collection or collections don't change much, this shouldn't cause an issue; -Flow only commits transactions when data has changed. 
-However, if your source data is frequently updated, your materialization may have frequent transactions that result in -excessive active time in the warehouse, and thus a higher bill from Snowflake. - -To mitigate this, we recommend a two-pronged approach: - -* [Configure your Snowflake warehouse to auto-suspend](https://docs.snowflake.com/en/sql-reference/sql/create-warehouse.html#:~:text=Specifies%20the%20number%20of%20seconds%20of%20inactivity%20after%20which%20a%20warehouse%20is%20automatically%20suspended.) after 60 seconds. - - This ensures that after each transaction completes, you'll only be charged for one minute of compute, Snowflake's smallest granularity. - - Use a query like the one shown below, being sure to substitute your warehouse name: - - ```sql - ALTER WAREHOUSE ESTUARY_WH SET auto_suspend = 60; - ``` - -* Configure the materialization's **update delay** by setting a value in the advanced configuration. - -For example, if you set the warehouse to auto-suspend after 60 seconds and set the materialization's -update delay to 30 minutes, you can incur as little as 48 minutes per day of active time in the warehouse. - -### Update Delay - -The `Update Delay` parameter in Estuary materializations offers a flexible approach to data ingestion scheduling. This advanced option allows users to control when the materialization or capture tasks pull in new data by specifying a delay period. By incorporating an update delay into your workflow, you can effectively manage and optimize your active warehouse time, leading to potentially lower costs and more efficient data processing. - -An update delay is configured in the advanced settings of a materialization's configuration. It represents the amount of time the system will wait before it begins materializing the latest data. This delay is specified in hours and can be adjusted according to the needs of your data pipeline. - -For example, if an update delay is set to 2 hours, the materialization task will pause for 2 hours before processing the latest available data. This delay ensures that data is not pulled in immediately after it becomes available, allowing for batching and other optimizations that can reduce warehouse load and processing time. - -To configure an update delay, navigate the `Advanced Options` section of the materialization's configuration and select a value from the drop down. The default value for the update delay in Estuary materializations is set to 30 minutes. - ### Snowpipe [Snowpipe](https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro) allows for loading data into target tables without waking up the warehouse, which can be cheaper and more performant. Snowpipe can be used for delta updates bindings, and it requires configuring your authentication using a private key. Instructions for configuring key-pair authentication can be found in this page: [Key-pair Authentication & Snowpipe](#key-pair-authentication--snowpipe) diff --git a/site/docs/reference/Connectors/materialization-connectors/amazon-redshift.md b/site/docs/reference/Connectors/materialization-connectors/amazon-redshift.md index 791265c90e..9911fb2612 100644 --- a/site/docs/reference/Connectors/materialization-connectors/amazon-redshift.md +++ b/site/docs/reference/Connectors/materialization-connectors/amazon-redshift.md @@ -49,8 +49,6 @@ more of your Flow collections to your desired tables in the database. | **`/bucket`** | S3 Staging Bucket | Name of the S3 bucket to use for staging data loads. 
| string | Required | | **`/region`** | Region | Region of the S3 staging bucket. For optimal performance this should be in the same region as the Redshift database cluster. | string | Required | | `/bucketPath` | Bucket Path | A prefix that will be used to store objects in S3. | string | | -| `/advanced` | Advanced Options | Options for advanced users. You should not typically need to modify these. | object | | -| `/advanced/updateDelay` | Update Delay | Potentially reduce active cluster time by increasing the delay between updates. Defaults to 30 minutes if unset. | string | | #### Bindings @@ -83,6 +81,11 @@ materializations: source: ${PREFIX}/${COLLECTION_NAME} ``` +## Sync Schedule + +This connector supports configuring a schedule for sync frequency. You can read +about how to configure this [here](../../materialization-sync-schedule.md). + ## Setup You must configure your cluster to allow connections from Estuary. This can be accomplished by diff --git a/site/docs/reference/Connectors/materialization-connectors/databricks.md b/site/docs/reference/Connectors/materialization-connectors/databricks.md index f9b99b82a5..ced733479b 100644 --- a/site/docs/reference/Connectors/materialization-connectors/databricks.md +++ b/site/docs/reference/Connectors/materialization-connectors/databricks.md @@ -27,7 +27,7 @@ If you haven't yet captured your data from its external source, start at the beg You need to first create a SQL Warehouse if you don't already have one in your account. See [Databricks documentation](https://docs.databricks.com/en/sql/admin/create-sql-warehouse.html) on configuring a Databricks SQL Warehouse. After creating a SQL Warehouse, you can find the details necessary for connecting to it under the **Connection Details** tab. -In order to save on costs, we recommend that you set the Auto Stop parameter for your SQL warehouse to the minimum available. Estuary's Databricks connector automatically delays updates to the destination up to a configured Update Delay (see the endpoint configuration below), with a default value of 30 minutes. If your SQL warehouse is configured to have an Auto Stop of more than 15 minutes, we disable the automatic delay since the delay is not as effective in saving costs with a long Auto Stop idle period. +In order to save on costs, we recommend that you set the Auto Stop parameter for your SQL warehouse to the minimum available. Estuary's Databricks connector automatically delays updates to the destination according to the configured **Sync Schedule** (see configuration details below), with a default delay value of 30 minutes. You also need an access token for your user to be used by our connector, see the respective [documentation](https://docs.databricks.com/en/administration-guide/access-control/tokens.html) from Databricks on how to create an access token. @@ -49,8 +49,6 @@ Use the below properties to configure a Databricks materialization, which will d | **`/credentials`** | Credentials | Authentication credentials | object | | | **`/credentials/auth_type`** | Role | Authentication type, set to `PAT` for personal access token | string | Required | | **`/credentials/personal_access_token`** | Role | Personal Access Token | string | Required | -| /advanced | Advanced | Options for advanced users. You should not typically need to modify these. | object | | -| /advanced/updateDelay | Update Delay | Potentially reduce active warehouse time by increasing the delay between updates. Defaults to 30 minutes if unset. 
| string | 30m | #### Bindings @@ -86,6 +84,11 @@ materializations: source: ${PREFIX}/${source_collection} ``` +## Sync Schedule + +This connector supports configuring a schedule for sync frequency. You can read +about how to configure this [here](../../materialization-sync-schedule.md). + ## Delta updates This connector supports both standard (merge) and [delta updates](../../../concepts/materialization.md#delta-updates). @@ -107,16 +110,6 @@ You can enable delta updates on a per-binding basis: source: ${PREFIX}/${source_collection} ``` -## Update Delay - -The `Update Delay` parameter in Estuary materializations offers a flexible approach to data ingestion scheduling. This advanced option allows users to control when the materialization or capture tasks pull in new data by specifying a delay period. By incorporating an update delay into your workflow, you can effectively manage and optimize your active warehouse time, leading to potentially lower costs and more efficient data processing. - -An update delay is configured in the advanced settings of a materialization's configuration. It represents the amount of time the system will wait before it begins materializing the latest data. This delay is specified in hours and can be adjusted according to the needs of your data pipeline. - -For example, if an update delay is set to 2 hours, the materialization task will pause for 2 hours before processing the latest available data. This delay ensures that data is not pulled in immediately after it becomes available, allowing for batching and other optimizations that can reduce warehouse load and processing time. - -To configure an update delay, navigate the `Advanced Options` section of the materialization's configuration and select a value from the drop down. The default value for the update delay in Estuary materializations is set to 30 minutes. - ## Reserved words Databricks has a list of reserved words that must be quoted in order to be used as an identifier. Flow automatically quotes fields that are in the reserved words list. You can find this list in Databricks's documentation [here](https://docs.databricks.com/en/sql/language-manual/sql-ref-reserved-words.html) and in the table below. diff --git a/site/docs/reference/Connectors/materialization-connectors/starburst.md b/site/docs/reference/Connectors/materialization-connectors/starburst.md index eda3de6b9f..eee6d4f7a8 100644 --- a/site/docs/reference/Connectors/materialization-connectors/starburst.md +++ b/site/docs/reference/Connectors/materialization-connectors/starburst.md @@ -45,8 +45,6 @@ Use the below properties to configure a Starburst materialization, which will di | **`/region`** | AWS Region | Region of AWS storage | string | Required | | **`/bucket`** | Bucket name | | string | Required | | **`/bucketPath`** | Bucket path | A prefix that will be used to store objects in S3. | string | Required | -| /advanced | Advanced | Options for advanced users. You should not typically need to modify these. | string | | -| /advanced/updateDelay | Update Delay | Potentially reduce active warehouse time by increasing the delay between updates. Defaults to 30 minutes if unset. | string | 30m | #### Bindings @@ -84,6 +82,11 @@ materializations: source: ${PREFIX}/${source_collection} ``` +## Sync Schedule + +This connector supports configuring a schedule for sync frequency. You can read +about how to configure this [here](../../materialization-sync-schedule.md). 
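+
+For example, a minimal sketch of what this could look like in the endpoint configuration of a `flow.yaml` spec, assuming the schedule properties are grouped under a `syncSchedule` key (verify the exact property names and nesting against the connector's generated schema):
+
+```yaml
+materializations:
+  ${PREFIX}/${mat_name}:
+    endpoint:
+      connector:
+        image: ghcr.io/estuary/materialize-starburst:dev
+        config:
+          # ... other required endpoint configuration ...
+          syncSchedule:
+            # Hypothetical value: sync once per hour, around the clock.
+            syncFrequency: 1h
+    # bindings are configured as usual
+```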
+
## Reserved words

Starburst Galaxy has a list of reserved words that must be quoted in order to be used as an identifier. Flow automatically quotes fields that are in the reserved words list. You can find this list in Trino's documentation [here](https://trino.io/docs/current/language/reserved.html) and in the table below.

diff --git a/site/docs/reference/materialization-sync-schedule.md b/site/docs/reference/materialization-sync-schedule.md
new file mode 100644
index 0000000000..809a54fd33
--- /dev/null
+++ b/site/docs/reference/materialization-sync-schedule.md
@@ -0,0 +1,165 @@
+# Materialization sync schedule
+
+For some systems you might prefer to have data synced less frequently, reducing compute costs in the destination when some delay in new data is acceptable. For example, if the destination system has a minimum compute charge per query, you could reduce your compute charges by running a single large query every 30 minutes rather than many smaller queries every few seconds.
+
+:::note
+Syncing data less frequently to your destination system does _not_ affect the cost of running the materialization connector within Estuary Flow. But it can reduce the costs incurred in the destination from the actions the connector takes to load data into it.
+:::
+
+These materialization connectors support configuring a sync schedule:
+- [materialize-bigquery](Connectors/materialization-connectors/BigQuery.md)
+- [materialize-databricks](Connectors/materialization-connectors/databricks.md)
+- [materialize-redshift](Connectors/materialization-connectors/amazon-redshift.md)
+- [materialize-snowflake](Connectors/materialization-connectors/Snowflake.md)
+- [materialize-starburst](Connectors/materialization-connectors/starburst.md)
+
+## How transactions are used to sync data to a destination
+
+Estuary Flow processes data in [transactions](../concepts/advanced/shards.md#transactions). Materialization connectors use the [materialization protocol](Connectors/materialization-protocol.md) to process transactions and sync data to the destination.
+
+When a materialization is caught up to its source collections, it runs frequent, small transactions to keep the destination up to date. In this case, every new transaction contains the latest data that needs to be updated. But when a materialization is backfilling its source collections, it runs larger transactions to efficiently load the data in bulk into the destination and catch up to the latest changes.
+
+The sync schedule is configured in terms of these **transactions**: for less frequent updates, processing of additional transactions is delayed by some amount of time. This extra delay is applied only when the materialization is fully caught up; backfills always run as fast as possible. While a transaction is delayed, Estuary Flow continues batching and combining new documents so that the next transaction contains all of the latest data.
+
+You can read about [how continuous materialization works](../concepts/materialization.md#how-continuous-materialization-works) for more background information.
+
+## Configuring a sync schedule
+
+A materialization can be configured to run on a fixed schedule 24/7, or it can have a faster sync schedule during certain times of the day and on certain days of the week.
+The following options are available for configuring the sync schedule:
+
+| Property | Title | Description | Type |
+|---|---|---|---|
+| `/syncFrequency` | Sync Frequency | Frequency at which transactions are executed when the materialization is fully caught up and streaming changes. May be enabled only for certain time periods and days of the week if configured below; otherwise it is effective 24/7. Defaults to 30 minutes if unset. | string |
+| `/timezone` | Timezone | Timezone applicable to sync time windows and active days. Must be a valid IANA time zone name or +HH:MM offset. | string |
+| `/fastSyncStartTime` | Fast Sync Start Time | Time of day that transactions begin executing at the configured Sync Frequency. Prior to this time, transactions will be executed more slowly. Must be in the form of '09:00'. | string |
+| `/fastSyncStopTime` | Fast Sync Stop Time | Time of day that transactions stop executing at the configured Sync Frequency. After this time, transactions will be executed more slowly. Must be in the form of '17:00'. | string |
+| `/fastSyncEnabledDays` | Fast Sync Enabled Days | Days of the week that the configured Sync Frequency is active. On days that are not enabled, transactions will be executed more slowly for the entire day. Examples: 'M-F' (Monday through Friday, inclusive), 'M,W,F' (Monday, Wednesday, and Friday), 'Su-T,Th-S' (Sunday through Tuesday, inclusive; Thursday through Saturday, inclusive). All days are enabled if unset. | string |
+
+:::warning
+Changes to a [materialization's specification](../concepts/materialization.md#specification) are only applied after the materialization task has completed and acknowledged all of its outstanding transactions. This means that if a task is running with a 4 hour sync frequency, it may take up to 8 hours for a change to the specification to take effect: 4 hours for the "current" transaction to complete and be acknowledged, and another 4 hours for the next "pipelined" commit to complete and be acknowledged.
+
+If you are making changes to a materialization with a **Sync Schedule** configured and would like those changes to take effect immediately, you can disable and then re-enable the materialization.
+:::
+
+#### Example: Sync data on a fixed schedule
+
+To use the same schedule for syncing data 24/7, set the value of **Sync Frequency** only and leave the other inputs empty. For example, you might set a **Sync Frequency** of `15m` to always have your destination sync every 15 minutes instead of the default 30 minutes.
+
+:::tip
+If you want the materialization to always push updated data as fast as possible, use a **Sync Frequency** of `0s`.
+:::
+
+#### Example: Sync data faster during certain times of the day
+
+If you only care about having the most up-to-date data possible during certain times of the day, you can set a start and stop time for that time period. The value you set for **Sync Frequency** will be used during that time period; otherwise, syncs will be performed every 4 hours.
+The **Fast Sync Start Time** and **Fast Sync Stop Time** values must be set as 24-hour times, and you must provide a value for **Timezone** that this time window should use. Timezones must either be [a valid IANA time zone name](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) or a +HH:MM offset. Providing a time zone name ensures that local factors like daylight saving time are considered for the schedule, whereas an offset timezone is always relative to UTC.
+
+An example configuration that syncs data as fast as possible between the hours of 9:00 AM and 5:00 PM Eastern Time (ET) would use these values:
+- **Sync Frequency**: `0s`
+- **Timezone**: `America/New_York`
+- **Fast Sync Start Time**: `09:00`
+- **Fast Sync Stop Time**: `17:00`
+
+#### Example: Sync data faster only on certain days of the week
+
+You can also set certain days of the week that the fast sync is active. On all other days, data will be synced more slowly all day.
+
+To enable this, set values for **Sync Frequency**, **Timezone**, **Fast Sync Start Time**, and **Fast Sync Stop Time** as you would for syncing data faster during certain times of the day, and also provide a value for **Fast Sync Enabled Days**.
+
+**Fast Sync Enabled Days** is a range of days, where the days of the week are abbreviated as `(Su)nday`, `(M)onday`, `(T)uesday`, `(W)ednesday`, `(Th)ursday`, `(F)riday`, `(S)aturday`.
+
+Here are some examples of valid inputs for **Fast Sync Enabled Days**:
+- `M-F` to enable fast sync on Monday through Friday.
+- `Su,T,Th,S` to enable fast sync on Sunday, Tuesday, Thursday, and Saturday.
+- `Su-M,Th-S` to enable fast sync on Thursday through Monday. Note that the days of the week must be listed in order, so `Th-M` will not work.
+
+A consolidated sketch of these properties in an endpoint configuration appears at the end of this page.
+
+## Timing of syncs
+
+In technical terms, the timing of syncs is controlled by the materialization connector sending a transaction acknowledgement to the Flow runtime at computed times. Practically, this means that at these times the prior transaction will complete and have its statistics recorded, and the next transaction will begin.
+
+This timing is computed so that it occurs at predictable instants in time. As a hypothetical example, if you have set a **Sync Frequency** of `15m`, transaction acknowledgements might be sent at times like `00:00`, `00:15`, `00:30`, `00:45`, and so on, where each acknowledgement is sent at a multiple of the **Sync Frequency** relative to the hour. This means that if the materialization [task shard](../concepts/advanced/shards.md) restarts and completes its first transaction at `00:13`, it will run its next transaction at `00:15` rather than `00:28`.
+
+In actuality, these computed points in time have some amount of [jitter](https://en.wikipedia.org/wiki/Jitter) applied to them to avoid overwhelming the system at common intervals. Setting a **Sync Frequency** to a specific value therefore ensures that transactions are predictably acknowledged that often, but makes no guarantee about the precise instants at which the acknowledgements will occur.
+
+:::info
+The `jitter` value is deterministic based on the *compute resource* for the destination system from the materialization's endpoint configuration. How this compute resource is identified varies for different systems, but it is usually something like `"account_name" + "warehouse_name"`.
+
+This means that separate materializations that use the same compute resource will synchronize their usage of that compute resource if they have the same **Sync Schedule** configured.
+:::
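+
+As a concrete reference, here is a hedged sketch of what a complete sync schedule might look like in a materialization's endpoint configuration, combining the examples above. The grouping of these properties under a `syncSchedule` key is an assumption; consult your connector's generated endpoint schema for the exact property names and nesting.
+
+```yaml
+# Hypothetical endpoint configuration fragment: sync as fast as possible
+# on weekdays between 9:00 AM and 5:00 PM Eastern Time, and every 4 hours
+# otherwise.
+syncSchedule:
+  syncFrequency: 0s
+  timezone: America/New_York
+  fastSyncStartTime: "09:00"
+  fastSyncStopTime: "17:00"
+  fastSyncEnabledDays: M-F
+```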