feat(ingest): dbt cloud integration (#6323)
hsheth2 authored Nov 21, 2022
1 parent 9c1577d commit 05a0f3e
Showing 12 changed files with 1,186 additions and 645 deletions.
58 changes: 42 additions & 16 deletions docs/how/updating-datahub.md
This file documents any backwards-incompatible changes in DataHub and assists people when migrating to a new version.

## Next

### Breaking Changes

### Potential Downtime

### Deprecations

### Other notable Changes

## 0.9.2

- LookML source will only emit views that are reachable from explores while scanning your git repo. Previous behavior can be achieved by setting `emit_reachable_views_only` to False.
- LookML source will always lowercase urns for lineage edges from views to upstream tables. There is no fallback to the previous behavior, because lower-casing was previously applied inconsistently.
- dbt config `node_type_pattern` which was previously deprecated has been removed. Use `entities_enabled` instead to control whether to emit metadata for sources, models, seeds, tests, etc.
- The dbt source will always lowercase urns for lineage edges to the underlying data platform.
- The DataHub Airflow lineage backend and plugin no longer support Airflow 1.x. You can still run DataHub ingestion in Airflow 1.x using the [PythonVirtualenvOperator](https://airflow.apache.org/docs/apache-airflow/1.10.15/_api/airflow/operators/python_operator/index.html?highlight=pythonvirtualenvoperator#airflow.operators.python_operator.PythonVirtualenvOperator).

## 0.9.0

### Breaking Changes

- Java version 11 or greater is required.
- For any of the GraphQL search queries, the input no longer supports a single `value` but instead accepts a list of `values`. These values represent an OR relationship: the field value must match any of the values.
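As a loose illustration of the new OR semantics described above (a sketch with assumed input shapes, not DataHub's actual GraphQL resolver code):

```python
# Illustrative sketch of the OR semantics of the new list-valued
# search-filter input. The dict shapes here are assumptions for
# demonstration only.

def matches(entity_field_value: str, filter_input: dict) -> bool:
    """Return True if the field value matches ANY of the filter's values."""
    return entity_field_value in filter_input["values"]

# Old style (no longer supported): {"field": "platform", "value": "snowflake"}
# New style: a list of values, OR'd together.
new_filter = {"field": "platform", "values": ["snowflake", "bigquery"]}

assert matches("snowflake", new_filter)
assert not matches("mysql", new_filter)
```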

### Potential Downtime

## `v0.8.45`

### Breaking Changes

- The `getNativeUserInviteToken` and `createNativeUserInviteToken` GraphQL endpoints have been renamed to
`getInviteToken` and `createInviteToken` respectively. Additionally, both now accept an optional `roleUrn` parameter.
Both endpoints also now require the `MANAGE_POLICIES` privilege to execute, rather than `MANAGE_USER_CREDENTIALS`
privilege.
- One of the default policies shipped with DataHub (`urn:li:dataHubPolicy:7`, or `All Users - All Platform Privileges`)
has been edited to no longer include `MANAGE_POLICIES`. Its name has consequently been changed to
`All Users - All Platform Privileges (EXCEPT MANAGE POLICIES)`. This change was made to prevent all users from
effectively acting as superusers by default.

### Potential Downtime

### Potential Downtime

- [Helm] If you're using Helm, please ensure that your version of the `datahub-actions` container is bumped to `v0.0.7` or `head`.
This version contains changes to support running ingestion in debug mode. Previous versions are not compatible with this release.
Upgrading to helm chart version `0.2.103` will ensure that you have the compatible versions by default.

### Deprecations

## `v0.8.42`

### Breaking Changes

- Python 3.6 is no longer supported for metadata ingestion
- #5451 `GMS_HOST` and `GMS_PORT` environment variables deprecated in `v0.8.39` have been removed. Use `DATAHUB_GMS_HOST` and `DATAHUB_GMS_PORT` instead.
- #5478 DataHub CLI `delete` command when used with `--hard` option will delete soft-deleted entities which match the other filters given.
- #5471 Looker now populates `userEmail` in dashboard user usage stats. This version of the Looker connector will not work with older versions of **datahub-gms** if you have the `extract_usage_history` Looker config enabled.
- #5529 - `ANALYTICS_ENABLED` environment variable in **datahub-gms** is now deprecated. Use `DATAHUB_ANALYTICS_ENABLED` instead.

### Potential Downtime
Expand All @@ -84,14 +98,16 @@ Upgrading to helm chart version `0.2.103` will ensure that you have the compatib
## `v0.8.41`

### Breaking Changes

- The `should_overwrite` flag in `csv-enricher` has been replaced with `write_semantics` to match the format used for other sources. See the [documentation](https://datahubproject.io/docs/generated/ingestion/sources/csv/) for more details
- To close an authorization hole in tag creation, a new Platform Privilege called `Create Tags` has been added. It is assigned to the `datahub` root user, along with the default All Users policy. Notice: you may need to add this privilege (or `Manage Tags`) to existing users that need the ability to create tags on the platform.
- #5329 The following profiling config parameters are now supported in `BigQuery`:

  - `profiling.profile_if_updated_since_days` (default=1)
  - `profiling.profile_table_size_limit` (default=1GB)
  - `profiling.profile_table_row_limit` (default=50000)

  Set the above parameters to `null` if you want the older behaviour.
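A rough sketch of how these three limits could gate profiling. The interaction shown (profile a table only if all checks pass, with `None`/`null` disabling a check) is an assumption based on the descriptions above, not the actual connector code:

```python
from datetime import datetime, timedelta
from typing import Optional

GB = 1024**3

def should_profile(
    last_updated: datetime,
    size_bytes: int,
    row_count: int,
    profile_if_updated_since_days: Optional[float] = 1,
    profile_table_size_limit: Optional[int] = 1 * GB,
    profile_table_row_limit: Optional[int] = 50000,
    now: Optional[datetime] = None,
) -> bool:
    """Apply the three limits; a None limit disables that check (older behaviour)."""
    now = now or datetime.now()
    # Skip tables that have not been updated recently enough.
    if profile_if_updated_since_days is not None:
        if last_updated < now - timedelta(days=profile_if_updated_since_days):
            return False
    # Skip tables that are too large or have too many rows.
    if profile_table_size_limit is not None and size_bytes > profile_table_size_limit:
        return False
    if profile_table_row_limit is not None and row_count > profile_table_row_limit:
        return False
    return True

assert should_profile(datetime.now(), size_bytes=100, row_count=10)
assert not should_profile(datetime.now(), size_bytes=2 * GB, row_count=10)
```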

### Potential Downtime
## `v0.8.40`

### Breaking Changes

- #5240 `lineage_client_project_id` in `bigquery` source is removed. Use `storage_project_id` instead.

### Potential Downtime
## `v0.8.39`

### Breaking Changes

- Refactored the `health` field of the `Dataset` GraphQL Type to be of type **list of HealthStatus** (was type **HealthStatus**). See [this PR](https://github.com/datahub-project/datahub/pull/5222/files) for more details.

### Potential Downtime

### Deprecations

- #4875 LookML view file contents will no longer be populated in `custom_properties`; instead, view definitions will always be available in the View Definitions tab.
- #5208 `GMS_HOST` and `GMS_PORT` environment variables being set in various containers are deprecated in favour of `DATAHUB_GMS_HOST` and `DATAHUB_GMS_PORT`.
- `KAFKA_TOPIC_NAME` environment variable in **datahub-mae-consumer** and **datahub-gms** is now deprecated. Use `METADATA_AUDIT_EVENT_NAME` instead.
- `KAFKA_MCE_TOPIC_NAME` environment variable in **datahub-mce-consumer** and **datahub-gms** is now deprecated. Use `METADATA_CHANGE_EVENT_NAME` instead.
- `KAFKA_FMCE_TOPIC_NAME` environment variable in **datahub-mce-consumer** and **datahub-gms** is now deprecated. Use `FAILED_METADATA_CHANGE_EVENT_NAME` instead.


### Other notable Changes

- #5132 Profile tables in the `snowflake` source only if they have been updated within the configured (default: `1`) number of days. Update the config `profiling.profile_if_updated_since_days` to match your profiling schedule, or set it to `None` if you want the older behaviour.

## `v0.8.38`
### Deprecations

### Other notable Changes

- Create & Revoke Access Tokens via the UI
- Create and Manage new users via the UI
- Improvements to Business Glossary UI
- FIX - Do not require reindexing to migrate to using the UI business glossary

## `v0.8.36`

### Breaking Changes

- In this release we introduce a brand new Business Glossary experience. With this new experience comes some new ways of indexing data in order to make viewing and traversing the different levels of your Glossary possible. Therefore, you will have to [restore your indices](https://datahubproject.io/docs/how/restore-indices/) in order for the new Glossary experience to work for users that already have existing Glossaries. If this is your first time using DataHub Glossaries, you're all set!

### Potential Downtime

### Deprecations

### Other notable Changes

- #4961 Dropped profiles are no longer reported by default, as that caused a lot of spurious logging in some cases. Set `profiling.report_dropped_profiles` to `True` if you want the older behaviour.

## `v0.8.35`
### Potential Downtime

### Deprecations

- #4875 LookML view file contents will no longer be populated in `custom_properties`; instead, view definitions will always be available in the View Definitions tab.

### Other notable Changes

## `v0.8.34`

### Breaking Changes

- #4644 Remove `database` option from `snowflake` source which was deprecated since `v0.8.5`
- #4595 Rename confusing config `report_upstream_lineage` to `upstream_lineage_in_report` in `snowflake` connector which was added in `0.8.32`

### Potential Downtime

### Deprecations

- #4644 `host_port` option of `snowflake` and `snowflake-usage` sources deprecated as the name was confusing. Use `account_id` option instead.

### Other notable Changes

- #4760 `check_role_grants` option was added in `snowflake` to disable checking roles in `snowflake` as some people were reporting long run times when checking roles.
20 changes: 20 additions & 0 deletions metadata-ingestion/docs/sources/dbt/README.md
Ingesting metadata from dbt requires using either the **dbt** module or the **dbt-cloud** module.

### Concept Mapping

| Source Concept | DataHub Concept | Notes |
| ------------------------ | ------------------------------------------------------------- | --------------------- |
| `"dbt"` | [Data Platform](../../metamodel/entities/dataPlatform.md) | |
| dbt Source | [Dataset](../../metamodel/entities/dataset.md) | Subtype `source` |
| dbt Seed | [Dataset](../../metamodel/entities/dataset.md) | Subtype `seed` |
| dbt Model - materialized | [Dataset](../../metamodel/entities/dataset.md) | Subtype `table` |
| dbt Model - view | [Dataset](../../metamodel/entities/dataset.md) | Subtype `view` |
| dbt Model - incremental | [Dataset](../../metamodel/entities/dataset.md) | Subtype `incremental` |
| dbt Model - ephemeral | [Dataset](../../metamodel/entities/dataset.md) | Subtype `ephemeral` |
| dbt Test | [Assertion](../../metamodel/entities/assertion.md) | |
| dbt Test Result | [Assertion Run Result](../../metamodel/entities/assertion.md) | |

Note:

1. The connector also generates lineage between the `dbt` nodes (e.g. ephemeral nodes that depend on other dbt sources) as well as lineage between the `dbt` nodes and the underlying (target) platform nodes (e.g. BigQuery Table -> dbt Source, dbt View -> BigQuery View).
2. We also support automated actions (like adding a tag, term, or owner) based on properties defined in dbt meta.
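As a rough illustration of the second point, the meta-based actions can be thought of as pattern rules applied to each node's `meta` dict. This is a simplified sketch with a hypothetical rule shape, loosely modeled on the connector's meta-mapping config, not the actual implementation:

```python
import re

# Hypothetical rule shape: each rule matches a meta key's value against a
# regex and, on a match, produces an action such as adding a tag or owner.
RULES = {
    "has_pii": {"match": "true", "operation": "add_tag"},
    "owner": {"match": ".+", "operation": "add_owner"},
}

def apply_meta_rules(meta: dict) -> list:
    """Return the (operation, matched_value) actions for a node's meta dict."""
    actions = []
    for key, rule in RULES.items():
        value = str(meta.get(key, ""))
        if value and re.fullmatch(rule["match"], value):
            actions.append((rule["operation"], value))
    return actions

assert apply_meta_rules({"has_pii": "true", "owner": "alice"}) == [
    ("add_tag", "true"),
    ("add_owner", "alice"),
]
```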
19 changes: 19 additions & 0 deletions metadata-ingestion/docs/sources/dbt/dbt-cloud_recipe.yml
source:
  type: "dbt-cloud"
  config:
    token: ${DBT_CLOUD_TOKEN}

    # In the URL https://cloud.getdbt.com/next/deploy/107298/projects/175705/jobs/148094,
    # 107298 is the account_id, 175705 is the project_id, and 148094 is the job_id

    account_id: # set to your dbt cloud account id
    project_id: # set to your dbt cloud project id
    job_id: # set to your dbt cloud job id
    run_id: # set to your dbt cloud run id. This is optional, and defaults to the latest run

    target_platform: postgres # e.g. bigquery/postgres/etc.

# sink configs
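The same recipe can also be assembled as a plain Python dict, e.g. for programmatic ingestion. The ids below are the example values from the URL comment in the recipe, and the `datahub-rest` sink config is an illustrative assumption:

```python
import os

# The dbt-cloud recipe above, expressed as a plain Python dict. The numeric
# ids come from the example URL in the recipe comment; the sink settings are
# an assumed example.
recipe = {
    "source": {
        "type": "dbt-cloud",
        "config": {
            "token": os.environ.get("DBT_CLOUD_TOKEN", ""),
            "account_id": 107298,
            "project_id": 175705,
            "job_id": 148094,
            # run_id omitted: defaults to the latest run
            "target_platform": "postgres",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

assert recipe["source"]["type"] == "dbt-cloud"
```

With the `acryl-datahub` package installed, a dict like this could be passed to DataHub's `Pipeline` API rather than a YAML file.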
4 changes: 3 additions & 1 deletion metadata-ingestion/setup.py
"datahub-business-glossary": set(),
"delta-lake": {*data_lake_profiling, *delta_lake},
"dbt": {"requests"} | aws_common,
"dbt-cloud": {"requests"},
"druid": sql_common | {"pydruid>=0.6.2"},
# Starting with 7.14.0 python client is checking if it is connected to elasticsearch client. If its not it throws
# UnsupportedProductError
"clickhouse-usage = datahub.ingestion.source.usage.clickhouse_usage:ClickHouseUsageSource",
"delta-lake = datahub.ingestion.source.delta_lake:DeltaLakeSource",
"s3 = datahub.ingestion.source.s3:S3Source",
"dbt = datahub.ingestion.source.dbt.dbt_core:DBTCoreSource",
"dbt-cloud = datahub.ingestion.source.dbt.dbt_cloud:DBTCloudSource",
"druid = datahub.ingestion.source.sql.druid:DruidSource",
"elasticsearch = datahub.ingestion.source.elastic_search:ElasticsearchSource",
"feast-legacy = datahub.ingestion.source.feast_legacy:FeastSource",
Expand Down
6 changes: 4 additions & 2 deletions metadata-ingestion/src/datahub/entrypoints.py


def _get_pretty_chained_message(exc: Exception) -> str:
    pretty_msg = f"{exc.__class__.__name__} {exc}"
    tmp_exc = exc.__cause__
    indent = "\n\t\t"
    while tmp_exc:
        pretty_msg = (
            f"{pretty_msg} due to {indent}{tmp_exc.__class__.__name__}{tmp_exc}"
        )
        tmp_exc = tmp_exc.__cause__
        indent += "\t"
    return pretty_msg
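A standalone sketch of the helper above, re-implemented here so it runs without DataHub installed, showing the message it builds for a chained exception:

```python
def pretty_chained_message(exc: Exception) -> str:
    # Standalone re-implementation of the helper above, for illustration only.
    pretty_msg = f"{exc.__class__.__name__} {exc}"
    tmp_exc = exc.__cause__
    indent = "\n\t\t"
    while tmp_exc:
        pretty_msg = f"{pretty_msg} due to {indent}{tmp_exc.__class__.__name__}{tmp_exc}"
        tmp_exc = tmp_exc.__cause__
        indent += "\t"
    return pretty_msg

# Build a two-level chain: a ValueError caused by a KeyError.
try:
    try:
        raise KeyError("missing-token")
    except KeyError as inner:
        raise ValueError("bad config") from inner
except ValueError as outer:
    msg = pretty_chained_message(outer)

assert msg.startswith("ValueError bad config")
assert "KeyError" in msg and "missing-token" in msg
```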