feat(ingest): add snowflake-queries source #10835

Merged
merged 35 commits into from
Jul 12, 2024

Conversation

hsheth2
Collaborator

@hsheth2 hsheth2 commented Jul 3, 2024

  • Adds the snowflake-queries source, which does lineage, usage, queries, and operations all in one go.
    • More performant than existing lineage/usage implementations.
    • Respects most configs that the existing implementations support.
    • Automatically follows lineage through temp tables and generates composite query urns.
  • Add a new QueryUsageStatistics aspect.
  • Refactor: wrap all usages of snowflake.connector.SnowflakeConnection with a new SnowflakeConnection type.
  • Refactor: use the new SnowflakeConnection via composition, removing some mixin classes like SnowflakeQueryMixin and SnowflakeConnectionMixin. self.query(...) is now self.connection.query(...). As part of this, the connection is initialized in the constructor instead of in the get_workunits_internal method.
  • Refactor: introduce a SnowsightUrlBuilder class.
  • Refactor: Move towards removing SnowflakeCommonMixin by introducing the SnowflakeFilterMixin and SnowflakeIdentifierMixin instead. I'm not fully convinced this is the best design - something with composition like SnowsightUrlBuilder might actually be better, but would increase the size of the diff even more.
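
The mixin-to-composition refactor described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `FakeNativeConnection`, `ConnectionWrapper`, and `ExampleSource` are hypothetical stand-ins, and only the `self.connection.query(...)` call shape comes from the description.

```python
from typing import Iterable, List


class FakeNativeConnection:
    """Hypothetical stand-in for a native driver connection."""

    def execute(self, sql: str) -> List[dict]:
        return [{"sql": sql}]


class ConnectionWrapper:
    """Hypothetical wrapper type that owns the native connection."""

    def __init__(self, native: FakeNativeConnection) -> None:
        self._native = native

    def query(self, sql: str) -> Iterable[dict]:
        # One place to add logging or error handling around every query.
        return self._native.execute(sql)


class ExampleSource:
    """Holds a connection instead of inheriting query() from a mixin."""

    def __init__(self, connection: ConnectionWrapper) -> None:
        # The connection is available as soon as the source is constructed.
        self.connection = connection

    def get_rows(self) -> List[dict]:
        # With a mixin this call would have been self.query(...).
        return list(self.connection.query("select 1"))
```

With composition, a test can hand the source a fake connection directly rather than patching a method inherited from a mixin.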

Follow up TODOs:

  • Add a tool extractor + initial implementations of it.
  • Integrate the SnowflakeQueriesExtractor into the main snowflake source.
  • Allow for view lineage parsing without requiring query-based lineage/usage too.
  • Add support for the known lineage mappings (e.g. snowflake external tables) to the snowflake-queries source.
  • Remove dead code from the include_view_lineage flag.
  • Add remaining missing configs (usage configs, lazy_graph)

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

  • New Features

    • Introduced Snowflake query extraction capabilities for enhanced metadata ingestion.
    • Added QueryUsageStatistics to track dataset usage statistics.
  • Improvements

    • Enhanced Snowflake connection handling and metadata extraction processes.
    • Refined methods for processing external lineage information.
  • Bug Fixes

    • Improved error handling and reporting for Snowflake connections and queries.
  • Tests

    • Updated test cases and schemas for Snowflake integration.
    • Added new entities and schema fields to unit and integration tests.
  • Chores

    • Updated dependencies and configuration options for better validation and functionality.

@hsheth2 hsheth2 marked this pull request as draft July 3, 2024 04:55
Contributor

coderabbitai bot commented Jul 3, 2024

Walkthrough

The recent updates to the metadata ingestion module focus on enhancing Snowflake integration. Key changes include adding dependencies for Snowflake queries, refining lineage mapping, and improving schema and usage statistics extraction. Additionally, new entities and test cases have been introduced to support these enhancements, ensuring robust functionality and better handling of external lineage information and query details.

Changes

| Files/Group | Summary |
| --- | --- |
| metadata-ingestion/setup.py | Added "snowflake-queries" to dependencies and sources in entry points. |
| .../snowflake/snowflake_lineage_v2.py | Added KnownLineageMapping, updated methods to return iterable KnownLineageMapping objects. |
| .../snowflake/snowflake_queries.py | Introduced functionality for querying Snowflake databases. |
| .../com/linkedin/query/QueryUsageStatistics.pdl | Added QueryUsageStatistics record for dataset usage statistics. |
| .../entity-registry.yml | Added queryUsageStatistics to aspects for entities. |
| .../snowflake/snowflake_usage_v2.py | Restructured imports and improved Snowflake connections and error handling. |
| .../snowflake/snowflake_utils.py | Modified classes and methods for Snowflake operations and error handling. |
| .../snowflake/snowflake_v2.py | Refactored Snowflake connection handling and metadata extraction. |
| .../sql/sql_config.py | Updated SQLCommonConfig and added SQLFilterConfig class. |
| .../snowflake/common.py | Adjusted function parameters and calls. |
| .../snowflake/snowflake_golden.json | Added schema fields for multiple Snowflake tables. |
| .../snowflake/snowflake_privatelink_golden.json | Added schema fields for tables and views in Snowflake. |
| .../snowflake/test_snowflake_failures.py | Modified test cases and imports. |
| .../test_add_known_query_lineage.json | Added new entities with schema fields. |
| .../test_basic_lineage.json | Added schema field entities related to datasets. |
| .../test_column_lineage_deduplication.json | Added and modified schema fields in JSON structure. |
| .../test_multistep_temp_table.json | Added new entities with different schema fields. |
| .../test_overlapping_inserts.json | Added entities for datasets and schema fields. |
| .../test_overlapping_inserts_from_temp_tables.json | Modified JSON data structure for Redshift schema entities. |
| .../test_redundant_run_skip_handler.py | Updated stateful_source fixture to return Iterable[SnowflakeV2Source]. |
| .../snowflake/snowflake_schema_gen.py | Modified imports, type annotations, class inheritance, method signatures, and property definitions. |

Poem

In the realms of Snowflake, we took a stride,
With queries crafted and lineage wide,
Usage stats and schemas bright,
Robust and clear, in the data night.
For every change, a line to sing,
Data flows, like the rabbit's spring. 🐇✨


@github-actions github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) Jul 3, 2024
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b153473 and 5ad2963.

Files selected for processing (6)
  • metadata-ingestion/setup.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (8 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1 hunks)
  • metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (21 hunks)
  • metadata-models/src/main/pegasus/com/linkedin/query/QueryUsageStatistics.pdl (1 hunks)
  • metadata-models/src/main/resources/entity-registry.yml (1 hunks)
Additional comments not posted (33)
metadata-models/src/main/pegasus/com/linkedin/query/QueryUsageStatistics.pdl (5)

18-18: Well-documented field: queryCount.

The field queryCount is well-documented and includes a TimeseriesField annotation for time series data.


24-24: Well-documented field: queryCost.

The field queryCost is well-documented and includes a TimeseriesField annotation for time series data.


30-30: Well-documented field: lastExecutedAt.

The field lastExecutedAt is well-documented and includes a TimeseriesField annotation for time series data.


36-36: Well-documented field: uniqueUserCount.

The field uniqueUserCount is well-documented and includes a TimeseriesField annotation for time series data.


42-42: Well-documented field: userCounts.

The field userCounts is well-documented and includes a TimeseriesFieldCollection annotation for time series data collection.
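
For readers without the PDL file at hand, the five fields named above could be mirrored in Python roughly like this. It is only a sketch: the field types, the `UserCount` record, and the `summarize` helper are assumptions for illustration and are not taken from the actual QueryUsageStatistics.pdl definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserCount:
    """Hypothetical per-user entry; the real PDL record may differ."""

    user: str
    count: int


@dataclass
class QueryUsageStats:
    """Illustrative mirror of the five fields named in the review."""

    query_count: Optional[int] = None
    query_cost: Optional[float] = None
    last_executed_at: Optional[int] = None  # assumed to be epoch milliseconds
    unique_user_count: Optional[int] = None
    user_counts: List[UserCount] = field(default_factory=list)


def summarize(user_counts: List[UserCount]) -> QueryUsageStats:
    # Derive the aggregate fields from the per-user breakdown.
    return QueryUsageStats(
        query_count=sum(u.count for u in user_counts),
        unique_user_count=len({u.user for u in user_counts}),
        user_counts=user_counts,
    )
```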

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (9)

49-69: Good use of configuration mixins and default values.

The SnowflakeQueriesConfig class effectively uses configuration mixins and provides sensible default values for its fields.


73-77: Well-structured report class.

The SnowflakeQueriesReport class is well-structured and extends SourceReport.


80-107: Comprehensive initialization.

The __init__ method provides a comprehensive initialization of the SnowflakeQueriesSource class, setting up the context, configuration, report, and aggregator.


108-112: Factory method for creating instances.

The create method is a factory method that parses configuration and creates an instance of SnowflakeQueriesSource.
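
The factory pattern mentioned here, where raw configuration is parsed before an instance is built, can be sketched as below. `ExampleConfig`, its fields, and `ExampleQueriesSource` are hypothetical; the real `create` signature and config class are not shown in this thread.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ExampleConfig:
    """Hypothetical config with one required and one defaulted field."""

    account_id: str
    window_days: int = 1

    @classmethod
    def parse(cls, raw: Dict[str, Any]) -> "ExampleConfig":
        if "account_id" not in raw:
            raise ValueError("account_id is required")
        return cls(
            account_id=str(raw["account_id"]),
            window_days=int(raw.get("window_days", 1)),
        )


class ExampleQueriesSource:
    def __init__(self, config: ExampleConfig) -> None:
        self.config = config

    @classmethod
    def create(cls, config_dict: Dict[str, Any]) -> "ExampleQueriesSource":
        # Validate the raw dict first, then construct the source from it.
        return cls(ExampleConfig.parse(config_dict))
```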


113-123: Efficient use of cached property for local temp path.

The local_temp_path method efficiently uses the cached_property decorator to manage the local temporary path.
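
As a rough illustration of the `cached_property` pattern the comment refers to, the sketch below computes a temp directory once per instance. The class name and the optional configured-path fallback are assumptions; only the use of `cached_property` for `local_temp_path` comes from the review.

```python
import pathlib
import tempfile
from functools import cached_property
from typing import Optional


class TempPathHolder:
    def __init__(self, configured_path: Optional[pathlib.Path] = None) -> None:
        self._configured_path = configured_path

    @cached_property
    def local_temp_path(self) -> pathlib.Path:
        # Runs once per instance; later accesses reuse the stored value.
        if self._configured_path is not None:
            self._configured_path.mkdir(parents=True, exist_ok=True)
            return self._configured_path
        return pathlib.Path(tempfile.mkdtemp())
```

Because the value is cached on the instance, repeated accesses do not create additional directories.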


124-151: Efficient handling of audit log and work units.

The get_workunits_internal method efficiently handles the audit log and generates metadata work units.


152-196: Detailed method for fetching audit log.

The fetch_audit_log method is detailed and includes TODO comments for future enhancements.


202-293: Comprehensive audit log response parsing.

The _parse_audit_log_response method provides comprehensive parsing of audit log responses, converting them into PreparsedQuery objects.


295-296: Simple method for retrieving the report.

The get_report method is simple and straightforward, returning the SnowflakeQueriesReport instance.

metadata-models/src/main/resources/entity-registry.yml (1)

507-507: Correctly added queryUsageStatistics aspect.

The queryUsageStatistics aspect has been correctly added to the list of aspects for the query entity.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (6)

33-33: Correctly imported KnownLineageMapping.

The KnownLineageMapping class has been correctly imported from datahub.sql_parsing.sql_parsing_aggregator.


268-271: Correctly updated method to return Iterable[KnownLineageMapping].

The _populate_external_lineage_from_copy_history method has been correctly updated to return an iterable of KnownLineageMapping objects.


277-280: Correctly updated method to return Iterable[KnownLineageMapping].

The _populate_external_lineage_from_show_query method has been correctly updated to return an iterable of KnownLineageMapping objects.


Line range hint 355-371: Correctly updated method to return Optional[KnownLineageMapping].

The _process_external_lineage_result_row method has been correctly updated to return an optional KnownLineageMapping object.
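
The return-type changes noted above (a per-row helper returning `Optional[...]` and callers yielding an `Iterable[...]`) follow a common generator pattern, sketched here. `LineageMapping` is a hypothetical stand-in for KnownLineageMapping, and its field names and the row shape are assumed.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class LineageMapping:
    """Hypothetical stand-in for KnownLineageMapping; field names are assumed."""

    upstream: str
    downstream: str


def process_row(row: Dict[str, str]) -> Optional[LineageMapping]:
    # Return None for rows that cannot be turned into a mapping.
    if not row.get("upstream") or not row.get("downstream"):
        return None
    return LineageMapping(row["upstream"], row["downstream"])


def mappings_from_rows(rows: Iterable[Dict[str, str]]) -> Iterable[LineageMapping]:
    # Yield results lazily so the caller decides where they are registered.
    for row in rows:
        mapping = process_row(row)
        if mapping is not None:
            yield mapping
```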


268-281: Efficient handling of external upstreams.

The _populate_external_upstreams method efficiently handles the addition of external upstreams using the updated methods.


268-281: Efficient handling of work units.

The get_workunits method efficiently handles the addition of work units using the updated methods.

metadata-ingestion/setup.py (2)

412-412: Addition of snowflake-queries plugin dependency looks good.

The new snowflake-queries plugin dependency has been added correctly.


665-665: Addition of snowflake-queries entry point looks good.

The new entry point for snowflake-queries has been added correctly.

metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (10)

439-450: New Method: Verify the usage of add method.

The add method is new. Ensure it is used appropriately throughout the code.


17-17: New Imports: Ensure they are necessary and used correctly.

The imports from datahub.configuration.time_window_config, datahub.sql_parsing.sql_parsing_common, and datahub.sql_parsing.sqlglot_lineage are new. Ensure these are necessary and used correctly in the code.

Also applies to: 34-34

Verification successful

Verified: The new imports are necessary and used correctly.

  • get_time_bucket is used in multiple files, including sql_parsing_aggregator.py.
  • QueryTypeProps is used in sql_parsing_aggregator.py and other related files.
  • ColumnLineageInfo is used extensively in sql_parsing_aggregator.py and other files.
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that the new imports are necessary and used in the code.

# Test: Search for the usage of the new imports. Expect: Usage of each import in the code.
rg --type python $'get_time_bucket'
rg --type python $'QueryTypeProps'
rg --type python $'ColumnLineageInfo'

Length of output: 198



Script:

#!/bin/bash
# Description: Verify that the new imports are necessary and used in the code.

# Test: Search for the usage of the new imports. Expect: Usage of each import in the code.
rg --type py 'get_time_bucket'
rg --type py 'QueryTypeProps'
rg --type py 'ColumnLineageInfo'

Length of output: 14733


147-169: New Class: Verify the usage of PreparsedQuery.

The PreparsedQuery class is new. Ensure it is used appropriately throughout the code.

Verification successful

Verified Usage: PreparsedQuery class

The PreparsedQuery class is used appropriately throughout the codebase, as evidenced by its presence in multiple files.

  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
  • metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of the new `PreparsedQuery` class.

# Test: Search for the usage of the `PreparsedQuery` class. Expect: Usage of the class in the code.
rg --type python $'PreparsedQuery'

Length of output: 64



Script:

#!/bin/bash
# Description: Verify the usage of the new `PreparsedQuery` class.

# Test: Search for the usage of the `PreparsedQuery` class. Expect: Usage of the class in the code.
rg --type py 'PreparsedQuery'

Length of output: 1100


140-145: New Class: Verify the usage of KnownLineageMapping.

The KnownLineageMapping class is new. Ensure it is used appropriately throughout the code.

Verification successful

Verified Usage: The KnownLineageMapping class is used appropriately throughout the codebase.

The KnownLineageMapping class is found in multiple files, indicating proper integration and usage:

  • metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of the new `KnownLineageMapping` class.

# Test: Search for the usage of the `KnownLineageMapping` class. Expect: Usage of the class in the code.
rg --type python $'KnownLineageMapping'

Length of output: 69



Script:

#!/bin/bash
# Description: Verify the usage of the new `KnownLineageMapping` class.

# Test: Search for the usage of the `KnownLineageMapping` class. Expect: Usage of the class in the code.
rg 'KnownLineageMapping'

Length of output: 1374


249-249: New Parameters and Attributes: Verify the usage of generate_query_usage_statistics and _query_usage_counts in SqlParsingAggregator.

The generate_query_usage_statistics parameter and _query_usage_counts attribute are new. Ensure they are used appropriately throughout the code.

Also applies to: 263-263, 270-271, 368-376

Verification successful

New Parameters and Attributes: Verified the usage of generate_query_usage_statistics and _query_usage_counts in SqlParsingAggregator.

The generate_query_usage_statistics parameter and _query_usage_counts attribute are used appropriately throughout the codebase.

  • generate_query_usage_statistics is used in sql_parsing_aggregator.py and snowflake_queries.py.
  • _query_usage_counts is used in sql_parsing_aggregator.py.
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of the new `generate_query_usage_statistics` parameter and `_query_usage_counts` attribute in `SqlParsingAggregator`.

# Test: Search for the usage of the `generate_query_usage_statistics` parameter. Expect: Usage of the parameter in the code.
rg --type python $'generate_query_usage_statistics'

# Test: Search for the usage of the `_query_usage_counts` attribute. Expect: Usage of the attribute in the code.
rg --type python $'_query_usage_counts'

Length of output: 151



Script:

#!/bin/bash
# Description: Verify the usage of the new `generate_query_usage_statistics` parameter and `_query_usage_counts` attribute in `SqlParsingAggregator`.

# Test: Search for the usage of the `generate_query_usage_statistics` parameter. Expect: Usage of the parameter in the code.
rg 'generate_query_usage_statistics'

# Test: Search for the usage of the `_query_usage_counts` attribute. Expect: Usage of the attribute in the code.
rg '_query_usage_counts'

Length of output: 1859
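
To make the role of an opt-in flag plus a per-query counter more concrete, here is a small sketch of counting query executions per time bucket. It is not the aggregator's implementation: `UsageCounter`, `day_bucket`, and the daily bucket size are hypothetical, loosely modeled on the `generate_query_usage_statistics` and `_query_usage_counts` names above.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import DefaultDict, Dict


def day_bucket(ts: datetime) -> datetime:
    """Hypothetical bucketing helper: truncate a timestamp to its UTC day."""
    ts = ts.astimezone(timezone.utc)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


class UsageCounter:
    def __init__(self, enabled: bool) -> None:
        self.enabled = enabled
        # query id -> bucket start -> number of executions
        self.counts: DefaultDict[str, Dict[datetime, int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def record(self, query_id: str, ts: datetime) -> None:
        # Skip all bookkeeping when the feature flag is off.
        if not self.enabled:
            return
        self.counts[query_id][day_bucket(ts)] += 1
```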


202-202: New Attributes: Verify the usage of num_preparsed_queries and num_query_usage_stats_generated in SqlAggregatorReport.

The num_preparsed_queries and num_query_usage_stats_generated attributes are new. Ensure they are used appropriately throughout the code.

Also applies to: 225-225

Verification successful

New Attributes: Verified usage of num_preparsed_queries and num_query_usage_stats_generated in SqlAggregatorReport.

The attributes num_preparsed_queries and num_query_usage_stats_generated are used appropriately in the codebase:

  • num_preparsed_queries: metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py lines 202 and 225.
  • num_query_usage_stats_generated: metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py lines 225 and 226.
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of the new `num_preparsed_queries` and `num_query_usage_stats_generated` attributes in `SqlAggregatorReport`.

# Test: Search for the usage of the `num_preparsed_queries` attribute. Expect: Usage of the attribute in the code.
rg --type python $'num_preparsed_queries'

# Test: Search for the usage of the `num_query_usage_stats_generated` attribute. Expect: Usage of the attribute in the code.
rg --type python $'num_query_usage_stats_generated'

Length of output: 153



Script:

#!/bin/bash
# Description: Verify the usage of the new `num_preparsed_queries` and `num_query_usage_stats_generated` attributes in `SqlAggregatorReport`.

# Test: Search for the usage of the `num_preparsed_queries` attribute. Expect: Usage of the attribute in the code.
rg 'num_preparsed_queries'

# Test: Search for the usage of the `num_query_usage_stats_generated` attribute. Expect: Usage of the attribute in the code.
rg 'num_query_usage_stats_generated'

Length of output: 535


107-107: New Attribute: Verify the usage of column_usage in QueryMetadata.

The column_usage attribute is new. Ensure it is used appropriately throughout the code.

Verification successful

Verified: The column_usage attribute is used appropriately throughout the codebase.

The column_usage attribute is integrated and utilized correctly in various parts of the code.

  • File: metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
  • File: metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of the new `column_usage` attribute in `QueryMetadata`.

# Test: Search for the usage of the `column_usage` attribute. Expect: Usage of the attribute in the code.
rg --type python $'column_usage'

Length of output: 62



Script:

#!/bin/bash
# Description: Verify the usage of the new `column_usage` attribute in `QueryMetadata`.

# Test: Search for the usage of the `column_usage` attribute. Expect: Usage of the attribute in the code.
rg 'column_usage'

Length of output: 1746



Script:

#!/bin/bash
# Description: Extract relevant code segments where `column_usage` is used for manual review.

# Extract lines containing `column_usage` in the relevant files for manual inspection.
rg 'column_usage' -A 3 -B 3 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
rg 'column_usage' -A 3 -B 3 metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py

Length of output: 4412


1140-1152: New Methods: Verify the usage of _gen_remaining_queries and _gen_query methods.

The _gen_remaining_queries and _gen_query methods are new. Ensure they are used appropriately throughout the code.

Also applies to: 1156-1240

Verification successful

New Methods: Verified the usage of _gen_remaining_queries and _gen_query methods.

The _gen_remaining_queries and _gen_query methods are defined and used appropriately within metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py.

  • _gen_remaining_queries is defined at line 1140 and used at line 1131.
  • _gen_query is defined at line 1156 and used at lines 1160 and 1170.
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of the new `_gen_remaining_queries` and `_gen_query` methods.

# Test: Search for the usage of the `_gen_remaining_queries` method. Expect: Usage of the method in the code.
rg --type python $'def _gen_remaining_queries'

# Test: Search for the usage of the `_gen_query` method. Expect: Usage of the method in the code.
rg --type python $'def _gen_query'

Length of output: 145



Script:

#!/bin/bash
# Description: Verify the usage of the new `_gen_remaining_queries` and `_gen_query` methods.

# Test: Search for the definition of the `_gen_remaining_queries` method.
rg 'def _gen_remaining_queries'

# Test: Search for the definition of the `_gen_query` method.
rg 'def _gen_query'

# Test: Search for the usage of the `_gen_remaining_queries` method.
rg '_gen_remaining_queries'

# Test: Search for the usage of the `_gen_query` method.
rg '_gen_query'

Length of output: 880


659-669: New Method: Verify the usage of add_preparsed_query method.

The add_preparsed_query method is new. Ensure it is used appropriately throughout the code.

Verification successful

New Method: Verify the usage of add_preparsed_query method.

The add_preparsed_query method is used appropriately in the codebase to handle PreparsedQuery objects.

  • metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py: Lines around 650-670
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of the new `add_preparsed_query` method.

# Test: Search for the usage of the `add_preparsed_query` method. Expect: Usage of the method in the code.
rg --type python $'def add_preparsed_query'

Length of output: 75



Script:

#!/bin/bash
# Description: Verify the usage of the new `add_preparsed_query` method.

# Test: Search for the usage of the `add_preparsed_query` method. Expect: Usage of the method in the code.
rg 'add_preparsed_query'

Length of output: 337



Script:

#!/bin/bash
# Description: Extract the context of `add_preparsed_query` method usage.

# Extract lines around the usage of `add_preparsed_query` method for deeper inspection.
rg -A 5 -B 5 'add_preparsed_query' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 1414


133-133: New Attribute: Verify the usage of column_usage in KnownQueryLineageInfo.

The column_usage attribute is new. Ensure it is used appropriately throughout the code.

Contributor

@treff7es treff7es left a comment

A couple of mypy issues there

@hsheth2 hsheth2 marked this pull request as ready for review July 11, 2024 01:36
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Outside diff range, codebase verification and nitpick comments (5)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_data_reader.py (1)

Line range hint 42-54: Ensure SQL query safety.

The SQL query construction should be reviewed to prevent SQL injection. Consider using parameterized queries to enhance security.

sql = 'select * from "{db_name}"."{schema_name}"."{table_name}" sample ({sample_size} rows);'
cursor.execute(sql)

You can use a parameterized query to avoid SQL injection:

sql = 'select * from "{db_name}"."{schema_name}"."{table_name}" sample (%s rows);'
cursor.execute(sql, (sample_size,))
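
Note that the suggestion above binds only the sample size; the database, schema, and table names are still interpolated into the query text. A complementary sketch, which is not part of the review or the PR, is to escape the identifiers and coerce the sample size before formatting. The helper names are hypothetical, and whether the driver accepts a bind parameter inside `sample (...)` is not confirmed in this thread.

```python
def quote_identifier(name: str) -> str:
    """Wrap a name in double quotes, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_sample_query(
    db_name: str, schema_name: str, table_name: str, sample_size: int
) -> str:
    # Coerce to int so a non-numeric value fails here instead of reaching the SQL text.
    rows = int(sample_size)
    if rows < 0:
        raise ValueError("sample_size must be non-negative")
    qualified = ".".join(
        quote_identifier(part) for part in (db_name, schema_name, table_name)
    )
    return f"select * from {qualified} sample ({rows} rows);"
```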
metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py (1)

71-73: Ensure Correct Description for database and log_schema.

The descriptions for database and log_schema fields should clearly explain their purpose related to the Fivetran connector log.

- database: str = Field(description="The fivetran connector log database.")
- log_schema: str = Field(description="The fivetran connector log schema.")
+ database: str = Field(description="The database where the Fivetran connector logs are stored.")
+ log_schema: str = Field(description="The schema within the Fivetran connector log database.")
metadata-ingestion/src/datahub/ingestion/source/redshift/lineage_v2.py (3)

Line range hint 34-50:
Consider initializing known_urns in the constructor.

To ensure all attributes are initialized in the constructor, consider initializing self.known_urns in the __init__ method.

- self.known_urns: Set[str] = set()  # will be set later
+ self.known_urns: Set[str] = set()

Line range hint 290-293:
Consider adding a detailed TODO comment.

The TODO comment should provide more details on what needs to be implemented.

- # TODO actor
+ # TODO: Implement actor extraction for lineage rows.

Line range hint 295-303:
Improve logging for filtered targets.

Consider adding more details to the log message for better debugging.

- logger.debug(
-   f"Skipping lineage for {target.urn()} as it is not in known_urns"
- )
+ logger.debug(
+   f"Skipping lineage for target URN: {target.urn()} as it is not in known_urns"
+ )
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5ad2963 and 4c4fb22.

Files selected for processing (36)
  • metadata-ingestion/setup.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/api/source.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/redshift/lineage_v2.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_assertion.py (4 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (8 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_connection.py (7 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_data_reader.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (12 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_query.py (5 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py (14 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py (19 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_summary.py (6 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_usage_v2.py (13 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_utils.py (10 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py (20 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/sql/sql_config.py (4 hunks)
  • metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (24 hunks)
  • metadata-ingestion/tests/integration/snowflake/common.py (2 hunks)
  • metadata-ingestion/tests/integration/snowflake/snowflake_golden.json (13 hunks)
  • metadata-ingestion/tests/integration/snowflake/snowflake_privatelink_golden.json (2 hunks)
  • metadata-ingestion/tests/integration/snowflake/test_snowflake_failures.py (3 hunks)
  • metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_add_known_query_lineage.json (1 hunks)
  • metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_basic_lineage.json (1 hunks)
  • metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_column_lineage_deduplication.json (2 hunks)
  • metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json (1 hunks)
  • metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts.json (2 hunks)
  • metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts_from_temp_tables.json (3 hunks)
  • metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_table_rename.json (2 hunks)
  • metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_temp_table.json (2 hunks)
  • metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_view_lineage.json (1 hunks)
  • metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (3 hunks)
  • metadata-ingestion/tests/unit/test_snowflake_source.py (9 hunks)
Files skipped from review due to trivial changes (1)
  • metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_basic_lineage.json
Additional context used
Ruff
metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py

48-48: Local variable mock_connect is assigned to but never used

Remove assignment to unused variable mock_connect

(F841)
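
A minimal sketch of the F841 fix, assuming the unused name comes from a `with mock.patch(...) as mock_connect:` binding (the review does not show the test body). The patched target `os.getcwd` and the helper names are hypothetical stand-ins, not the test's actual code.

```python
import os
from unittest import mock


def current_dir() -> str:
    # Looks up os.getcwd at call time, so an active patch is picked up.
    return os.getcwd()


def patched_dir() -> str:
    # Flagged form: `with mock.patch("os.getcwd", ...) as mock_getcwd:`
    # binds a name that is never read afterwards.
    # Suggested form: drop the `as` binding when only the patch itself matters.
    with mock.patch("os.getcwd", return_value="/tmp/fake"):
        return current_dir()
```

If the mock object is needed for assertions later, keeping the binding and actually using it also resolves the warning.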

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_utils.py

195-203: Return the negated condition directly

Inline condition

(SIM103)
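
The SIM103 suggestion can be illustrated as follows. This is a generic sketch with hypothetical function names, not the code at lines 195-203 of snowflake_utils.py.

```python
from typing import Set


def is_allowed_verbose(name: str, denied: Set[str]) -> bool:
    # Shape that SIM103 flags: an if-statement that only returns booleans.
    if name in denied:
        return False
    return True


def is_allowed(name: str, denied: Set[str]) -> bool:
    # Equivalent form: return the negated condition directly.
    return name not in denied
```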

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_connection.py

119-119: Use key not in dict instead of key not in dict.keys()

Remove .keys()

(SIM118)

metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

452-452: Use of functools.lru_cache or functools.cache on methods can lead to memory leaks

(B019)
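
B019 fires because a cache created by decorating a method lives on the class-level function and holds a strong reference to every `self` it has seen, so instances stay alive as long as their cache entries do. One common workaround, sketched below with a hypothetical `Resolver` class (not the aggregator's actual code), is to build the cache per instance in the constructor.

```python
import functools


class Resolver:
    def __init__(self) -> None:
        # The cache wraps the bound method and is stored on the instance,
        # so it becomes unreachable together with the instance and can be
        # reclaimed by the garbage collector.
        self.lookup = functools.lru_cache(maxsize=128)(self._lookup)

    def _lookup(self, key: str) -> str:
        # Placeholder for an expensive computation.
        return key.upper()
```

Whether this trade-off is worthwhile depends on how long-lived the object is; for a single long-lived aggregator the original decorator may be acceptable.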

Additional comments not posted (172)
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_view_lineage.json (1)

84-100: Verify the structure and format of new entities.

Ensure that the new entities added to the subjects array follow the correct structure and format.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_assertion.py (2)

Line range hint 63-78: Ensure SQL query safety and verify processing logic.

The SQL query construction should be reviewed to prevent SQL injection. Consider using parameterized queries to enhance security. Verify that the processing logic correctly handles the fetched data.
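
To make the parameterized-query suggestion concrete, here is a minimal sketch using the standard-library `sqlite3` module as a stand-in for the Snowflake connection. The table and column names are hypothetical, and the placeholder syntax differs between drivers.

```python
import sqlite3


def count_rows_for_table(conn: sqlite3.Connection, table_name: str) -> int:
    # The value travels separately from the SQL text, so it is never parsed as SQL.
    cur = conn.execute(
        "SELECT COUNT(*) FROM assertions WHERE table_name = ?", (table_name,)
    )
    return cur.fetchone()[0]


def demo(table_name: str) -> int:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE assertions (table_name TEXT)")
    conn.execute("INSERT INTO assertions VALUES ('orders')")
    try:
        return count_rows_for_table(conn, table_name)
    finally:
        conn.close()
```

With this form, an input such as `orders' OR '1'='1` is compared as a plain string and matches no rows.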


Line range hint 103-121: Verify row processing logic.

Ensure that the row processing logic correctly handles the data and generates the appropriate metadata change proposals.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_summary.py (2)

Line range hint 67-127: Ensure SQL query safety and verify processing logic.

The SQL query construction should be reviewed to prevent SQL injection. Consider using parameterized queries to enhance security. Verify that the processing logic correctly handles the fetched data.


Line range hint 129-131: Verify method for correctness.

Ensure that the method correctly returns the summary report.

metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_add_known_query_lineage.json (1)

116-129: Ensure Consistent Usage of URNs.

The URNs for datasets and schema fields must follow a consistent pattern. Verify that the URNs used here match the expected format and refer to the correct entities.

Verification successful

Ensure Consistent Usage of URNs.

The URNs in the JSON file follow the expected format and refer to the correct entities. The patterns for urn:li:dataset and urn:li:schemaField are consistent throughout the file.

  • Datasets:

    • urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD)
    • urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD)
  • Schema Fields:

    • urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD),a)
    • urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD),b)
    • urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.foo,PROD),c)
    • urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD),a)
    • urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD),b)
    • urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,dev.public.bar,PROD),c)
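
The same consistency check can be sketched in Python. The regular expressions below are inferred from the URNs listed above, not taken from a DataHub specification, so treat them as an assumption about the expected shape.

```python
import re

# Shape inferred from the examples: platform, dataset name, environment.
DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:[^,]+,[^,]+,[A-Z]+\)$"
)
# A schema field URN wraps a dataset URN and appends a field path.
SCHEMA_FIELD_URN = re.compile(
    r"^urn:li:schemaField:"
    r"\(urn:li:dataset:\(urn:li:dataPlatform:[^,]+,[^,]+,[A-Z]+\),[^)]+\)$"
)


def is_dataset_urn(urn: str) -> bool:
    return DATASET_URN.match(urn) is not None


def is_schema_field_urn(urn: str) -> bool:
    return SCHEMA_FIELD_URN.match(urn) is not None
```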
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the consistency of URNs in the JSON file.

# Test: Search for URNs in the file. Expect: All URNs follow the expected pattern and refer to valid entities.
rg --type json 'urn:li:dataset:|urn:li:schemaField:' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_add_known_query_lineage.json

Length of output: 1670

metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json (1)

109-125: Ensure Consistent Usage of URNs.

The URNs for datasets and schema fields must follow a consistent pattern. Verify that the URNs used here match the expected format and refer to the correct entities.

Verification successful

URNs are consistent and valid.

The URNs for datasets and schema fields in the specified JSON file follow the expected patterns and refer to the correct entities.

  • urn:li:dataset:(urn:li:dataPlatform:redshift,<namespace>,PROD)
  • urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,<namespace>,PROD),<field>)
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the consistency of URNs in the JSON file.

# Test: Search for URNs in the file. Expect: All URNs follow the expected pattern and refer to valid entities.
rg --type json 'urn:li:dataset:|urn:li:schemaField:' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json

Length of output: 1861


Script:

#!/bin/bash
# Description: Verify the consistency and validity of URNs in the JSON file.

# Search for dataset URNs and validate their patterns.
rg --type json 'urn:li:dataset:\(urn:li:dataPlatform:redshift,[^)]*,PROD\)' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json

# Search for schemaField URNs and validate their patterns.
rg --type json 'urn:li:schemaField:\(urn:li:dataset:\(urn:li:dataPlatform:redshift,[^)]*,PROD\),[^)]*\)' metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_multistep_temp_table.json

Length of output: 3184

metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py (1)

16-18: Verify Import of SnowflakeConnectionConfig.

Ensure that the SnowflakeConnectionConfig import is necessary and correctly used in the file.

Verification successful

Verified Import of SnowflakeConnectionConfig.

The import of SnowflakeConnectionConfig is necessary and correctly used in the file metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py. It is utilized in the definition of the SnowflakeDestinationConfig class, which inherits from SnowflakeConnectionConfig.

  • Line 1: SnowflakeConnectionConfig is imported.
  • Line 2: SnowflakeDestinationConfig class extends SnowflakeConnectionConfig.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `SnowflakeConnectionConfig` in the file.

# Test: Search for `SnowflakeConnectionConfig` in the file. Expect: The import is necessary and correctly used.
rg 'SnowflakeConnectionConfig' metadata-ingestion/src/datahub/ingestion/source/fivetran/config.py

Length of output: 189

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py (1)

Line range hint 167-171: Ensure Proper Handling of Database and Schema Selection.

The callable_for_db_connection method should handle database and schema selection correctly, especially when the PUBLIC schema is absent.

Verification successful

Verified: Proper Handling of Database and Schema Selection.

The callable_for_db_connection method correctly handles the database and schema selection, explicitly managing the case when the PUBLIC schema is absent by checking for a schema name and executing the USE SCHEMA command if necessary.

  • The method retrieves the native connection.
  • Executes the USE DATABASE command.
  • Checks for a schema name and executes the USE SCHEMA command if provided.
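
The three steps above can be sketched as follows. `RecordingConnection` and `select_context` are hypothetical names, and the identifier quoting is an assumption; the actual `callable_for_db_connection` implementation is not reproduced here.

```python
from typing import List, Optional


class RecordingConnection:
    """Hypothetical stand-in that records statements instead of running them."""

    def __init__(self) -> None:
        self.statements: List[str] = []

    def execute(self, sql: str) -> None:
        self.statements.append(sql)


def select_context(
    conn: RecordingConnection, db_name: str, schema_name: Optional[str] = None
) -> None:
    # Always select the database first.
    conn.execute(f'USE DATABASE "{db_name}"')
    # Only select a schema when one is given, so nothing relies on PUBLIC existing.
    if schema_name:
        conn.execute(f'USE SCHEMA "{schema_name}"')
```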
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the handling of database and schema selection in the `callable_for_db_connection` method.

# Test: Search for the method in the file. Expect: Proper handling of database and schema selection, especially when the PUBLIC schema is absent.
rg 'callable_for_db_connection' metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py

Length of output: 241


Script:

#!/bin/bash
# Description: Retrieve the full implementation of the `callable_for_db_connection` method to verify the handling of database and schema selection.

# Use ast-grep to extract the implementation of `callable_for_db_connection` method from the file.
ast-grep --lang python --pattern 'def callable_for_db_connection(self, db_name: str) -> Callable:
  $$$' metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py

Length of output: 2040

metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_column_lineage_deduplication.json (2)

96-112: Ensure correct entity formatting.

The new subjects added under the querySubjects aspect appear correctly formatted and consistent with the existing structure.


160-182: Ensure correct entity formatting.

The new subjects added under the querySubjects aspect appear correctly formatted and consistent with the existing structure.

metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts.json (2)

121-137: Ensure correct entity formatting.

The new subjects added under the querySubjects aspect appear correctly formatted and consistent with the existing structure.


185-201: Ensure correct entity formatting.

The new subjects added under the querySubjects aspect appear correctly formatted and consistent with the existing structure.

metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_table_rename.json (2)

84-100: Ensure correct entity formatting.

The new subjects added under the querySubjects aspect appear correctly formatted and consistent with the existing structure.


199-215: Ensure correct entity formatting.

The new subjects added under the querySubjects aspect appear correctly formatted and consistent with the existing structure.

metadata-ingestion/src/datahub/ingestion/source/sql/sql_config.py (3)

11-13: Correct mixin replacements.

The new mixins EnvConfigMixin and PlatformInstanceConfigMixin are correctly imported and used.


Line range hint 34-62: New class SQLFilterConfig looks good.

The new class SQLFilterConfig and its fields are correctly defined and adhere to best practices.


63-76: Updates to SQLCommonConfig class look good.

The updates to the SQLCommonConfig class and its fields are correctly defined and adhere to best practices.

metadata-ingestion/tests/integration/snowflake/test_snowflake_failures.py (8)

4-9: Imports look good!

The added imports are relevant for the tests in this file.


76-79: Test case for missing role access looks good!

The test correctly checks for the PipelineInitError when the role is not granted.


4-4: Test case for missing warehouse access looks good!

The test correctly simulates the condition and asserts the expected failure message.

Also applies to: 76-79


4-4: Test case for no databases with access looks good!

The test correctly simulates the condition and asserts the expected failure message.

Also applies to: 76-79


4-4: Test case for no tables access looks good!

The test correctly simulates the condition and asserts the expected failure message.

Also applies to: 76-79


4-4: Test case for listing columns error looks good!

The test correctly simulates the condition and asserts the expected warning message.

Also applies to: 76-79


4-4: Test case for listing primary keys error looks good!

The test correctly simulates the condition and asserts the expected warning message.

Also applies to: 76-79


4-4: Test cases for missing permissions look good!

The tests correctly simulate the conditions and assert the expected failure messages.

Also applies to: 76-79

metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (5)

2-2: Imports look good!

The added import Iterable is relevant for the tests in this file.


Line range hint 28-49: Fixture setup looks good!

The stateful_source fixture correctly sets up the SnowflakeV2Source with the necessary configurations.

Tools
Ruff

48-48: Local variable mock_connect is assigned to but never used

Remove assignment to unused variable mock_connect

(F841)


47-49: Test case for redundant run job IDs looks good!

The test correctly validates the job IDs for both lineage and usage extractors.


47-49: Test case for redundant run skip handler looks good!

The test correctly covers multiple scenarios and validates the skip logic and suggested time windows.


47-49: Utility functions and checkpoint tests look good!

The functions and tests correctly validate the checkpoint creation logic using mocks and assertions.

metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_temp_table.json (1)

84-100: JSON data for SQL parsing test cases looks good!

The structure and values are correct and consistent with the expected schema.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_utils.py (7)

22-33: Class SnowflakeStructuredReportMixin looks good!

The methods correctly use the structured_reporter for reporting warnings and errors.


Line range hint 36-63: Class SnowflakeCommonProtocol looks good!

The class defines essential methods and properties for Snowflake integration.


Line range hint 65-141: Class SnowsightUrlBuilder looks good!

The methods are well-structured and handle various scenarios for building URLs.


Line range hint 143-225: Class SnowflakeFilterMixin looks good!

The methods correctly implement the filtering logic based on the configurations.

Tools
Ruff

195-203: Return the negated condition directly

Inline condition

(SIM103)


227-258: Class SnowflakeIdentifierMixin looks good!

The methods correctly handle identifiers based on the configurations.


Line range hint 259-283: Class SnowflakeCommonMixin looks good!

The methods correctly combine the functionalities of the mixins and provide additional utilities.


Line range hint 259-283: Method warn_if_stateful_else_error looks good!

The method correctly checks the configuration and logs appropriately.

metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_overlapping_inserts_from_temp_tables.json (1)

179-193: Ensure consistency in entity representation.

The JSON structure looks correct. However, ensure that all entities are consistently represented across the dataset.

metadata-ingestion/src/datahub/ingestion/source/redshift/lineage_v2.py (5)

Line range hint 271-288: Ensure proper exception handling and logging.

The method handles exceptions and logs warnings. Ensure that the logging provides sufficient context for debugging.


Line range hint 305-320: Handle missing DDL in STL scan entries.

The method logs a warning for missing DDL. Ensure that the warning provides sufficient context for debugging.


Line range hint 322-333: Ensure consistent handling of DDL in view lineage.

The method handles DDL for views. Ensure that the handling is consistent with other methods.


Line range hint 335-348: Ensure proper handling of source and target URNs.

The method handles source and target URNs for copy commands. Ensure that the handling is consistent and correct.


Line range hint 391-393: Ensure consistent generation of metadata work units.

The method generates metadata work units. Ensure that the generation is consistent and follows best practices.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_connection.py (8)

Line range hint 135-155: Ensure comprehensive validation for OAuth configuration.

The method provides detailed validation for OAuth configuration. Ensure all edge cases are covered.


197-200: Ensure correct generation of SQLAlchemy URL.

The method correctly generates the SQLAlchemy URL with the provided parameters.


Line range hint 225-263: Ensure proper handling of private key in connection arguments.

The method correctly handles private key for connection arguments.


Line range hint 263-301: Ensure proper handling of OAuth connection.

The method correctly handles OAuth connection generation.


305-314: Ensure proper handling of key pair connection.

The method correctly handles key pair connection generation.


Line range hint 318-342: Ensure proper handling of native connection.

The method correctly handles native connection generation.


349-362: Ensure proper exception handling for connection generation.

The method handles exceptions correctly when generating a connection.


114-114: Remove unnecessary .keys() call.

Use key not in dict instead of key not in dict.keys().

- if v not in _VALID_AUTH_TYPES.keys():
+ if v not in _VALID_AUTH_TYPES:

Likely invalid or redundant comment.

metadata-ingestion/src/datahub/ingestion/api/source.py (5)

117-119: Ensure context is truncated correctly.

The method correctly truncates the context if it exceeds the maximum length.


Line range hint 142-146: Ensure correct retrieval of log entries.

The method correctly retrieves log entries of the specified type.


Line range hint 166-188: Ensure correct reporting of work units.

The method correctly reports work units and updates the relevant metrics.


Line range hint 194-199: Ensure correct reporting of warnings.

The method correctly reports warnings using structured logs.


Line range hint 225-232: Ensure correct computation of statistics.

The method correctly computes statistics for the report.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (9)

7-7: Imports look good.

The new imports from pydantic and datahub.configuration.source_common are appropriate for the added functionality.

Also applies to: 12-16


84-96: New fields in SnowflakeFilterConfig look good.

The added fields for database_pattern, schema_pattern, and match_fully_qualified_names are appropriate for filtering configurations.


103-125: Root validator logic is sound but check backward compatibility.

The root validator ensures proper configuration for schema patterns and maintains backward compatibility. Verify if the deprecation warning is communicated effectively to users.


128-134: New field in SnowflakeIdentifierConfig looks good.

The convert_urns_to_lowercase field with a default value of True is appropriate for identifier configurations.


146-167: New fields in SnowflakeConfig look good.

The added fields for including table and view lineage are appropriate for lineage configurations.


158-168: Root validator logic is sound but check dependency on include_table_lineage.

The root validator ensures that include_table_lineage is set to True when include_view_lineage is enabled. Verify if this dependency is clearly documented and communicated to users.
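
The dependency between the two flags can be sketched without pydantic, using a plain dataclass. The class name, defaults, and error message below are assumptions for illustration; the actual root validator in SnowflakeConfig may differ.

```python
from dataclasses import dataclass


@dataclass
class LineageFlags:
    # Defaults are assumed for this sketch.
    include_table_lineage: bool = True
    include_view_lineage: bool = True

    def __post_init__(self) -> None:
        # View lineage only makes sense when table lineage is also enabled.
        if self.include_view_lineage and not self.include_table_lineage:
            raise ValueError(
                "include_table_lineage must be True when include_view_lineage is set"
            )
```

Failing at construction time surfaces the misconfiguration before any query is issued, which is the same effect a root validator has at config-parse time.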


Line range hint 170-365: New fields and validators in SnowflakeV2Config look good.

The added fields and validators for usage statistics, technical schema, primary and foreign keys, column lineage, lazy schema resolver, tags, and other configurations are appropriate for Snowflake V2.


327-330: Method get_sql_alchemy_url looks good.

The method constructs a SQLAlchemy URL for Snowflake using the connection configuration.


Line range hint 371-417: Method validate_shares looks good.

The method validates the shares configuration, ensuring that platform instances and databases are correctly configured.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema.py (9)

10-10: Import looks good.

The new import from datahub.ingestion.source.snowflake.snowflake_connection is appropriate for the added functionality.


Line range hint 186-229: New methods in SnowflakeDataDictionary look good.

The added methods for showing databases, getting databases, and getting schemas for a database are appropriate for data dictionary operations.


Line range hint 270-299: Method get_tables_for_database looks good but verify error handling.

The method retrieves tables for a given database. Verify if the error handling for the query is sufficient.


Line range hint 303-313: Method get_tables_for_schema looks good.

The method retrieves tables for a given schema in a database.


Line range hint 331-361: Method get_views_for_database looks good but verify pagination logic.

The method retrieves views for a given database with pagination. Verify if the pagination logic handles large result sets correctly.
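
One way to reason about the pagination concern is marker-based paging, where each request resumes after the last name seen. The sketch below uses a hypothetical in-memory source; it is not the query logic in get_views_for_database.

```python
from typing import Callable, List, Optional

PageFetcher = Callable[[Optional[str], int], List[str]]


def fetch_all_names(fetch_page: PageFetcher, page_size: int) -> List[str]:
    names: List[str] = []
    marker: Optional[str] = None
    while True:
        page = fetch_page(marker, page_size)
        names.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            return names
        # Resume after the last name of the full page.
        marker = page[-1]


def make_fake_source(all_names: List[str]) -> PageFetcher:
    ordered = sorted(all_names)

    def fetch_page(marker: Optional[str], limit: int) -> List[str]:
        remaining = [n for n in ordered if marker is None or n > marker]
        return remaining[:limit]

    return fetch_page
```

The case worth testing is a result count that is an exact multiple of the page size, since the loop then needs one extra, empty request to terminate.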


Line range hint 424-438: Method get_pk_constraints_for_schema looks good.

The method retrieves primary key constraints for a given schema in a database.


Line range hint 443-471: Method get_fk_constraints_for_schema looks good.

The method retrieves foreign key constraints for a given schema in a database.


Line range hint 475-496: Method get_tags_for_database_without_propagation looks good.

The method retrieves tags for a database without propagation.


Line range hint 530-541: Method get_tags_on_columns_for_table looks good.

The method retrieves tags on columns for a given table in a database.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (8)

1-12: Imports look good.

The new imports from pydantic, pathlib, and typing_extensions are appropriate for the added functionality.


57-87: New fields in SnowflakeQueriesExtractorConfig look good.

The added fields for window, deny usernames, temporary tables pattern, and local temp path are appropriate for query extraction configurations.


92-94: New field in SnowflakeQueriesSourceConfig looks good.

The added field for connection configuration is appropriate for Snowflake queries.


108-148: New methods in SnowflakeQueriesExtractor look good.

The added methods for initializing the extractor, handling configurations, and managing temporary paths are appropriate for query extraction.


175-203: Method get_workunits_internal looks good but verify caching logic.

The method retrieves work units for Snowflake queries. Verify if the caching logic for the audit log is sufficient.


205-258: Method fetch_audit_log looks good but verify error handling.

The method fetches the audit log for Snowflake queries. Verify if the error handling for parsing audit log rows is sufficient.


259-365: Method _parse_audit_log_row looks good but verify JSON parsing logic.

The method parses a row from the audit log. Verify if the JSON parsing logic for specific fields is sufficient.


402-501: Function _build_enriched_audit_log_query looks good.

The function constructs a query for fetching enriched audit logs with appropriate filters and pagination.

metadata-ingestion/tests/unit/test_snowflake_source.py (5)

27-27: Import looks good.

The new import from datahub.ingestion.source.snowflake.snowflake_utils is appropriate for the added functionality.


448-460: Function test_aws_cloud_region_from_snowflake_region_id looks good.

The function correctly tests the conversion of Snowflake region ID to AWS cloud region.


470-472: Function test_google_cloud_region_from_snowflake_region_id looks good.

The function correctly tests the conversion of Snowflake region ID to Google Cloud region.


Line range hint 482-492: Function test_azure_cloud_region_from_snowflake_region_id looks good.

The function correctly tests the conversion of Snowflake region ID to Azure cloud region.


502-504: Function test_unknown_cloud_region_from_snowflake_region_id looks good.

The function correctly tests the handling of unknown Snowflake region IDs.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_lineage_v2.py (17)

10-10: Import Statement for Closeable Interface Added

The Closeable interface was added. This is necessary for ensuring that resources are properly released when the object is no longer needed.


18-19: Import Statement for SnowflakeConnection and SnowflakePermissionError Added

The SnowflakeConnection and SnowflakePermissionError imports were added, which are essential for handling Snowflake connections and related errors.


32-32: Import Statement for KnownLineageMapping Added

The KnownLineageMapping import was added. This is crucial for handling known lineage mappings in the lineage extraction process.


104-104: Class SnowflakeLineageExtractor Now Implements Closeable

The SnowflakeLineageExtractor class now implements the Closeable interface. This is important for ensuring that resources are properly released.


121-121: Connection Initialization in Constructor

The SnowflakeConnection is now initialized in the constructor, as described in the PR summary.


130-130: Use of SnowflakeConnection

The SnowflakeConnection is now assigned to self.connection in the constructor, which ensures that the connection is available throughout the class methods.


262-265: Use of KnownLineageMapping in _populate_external_upstreams

The _populate_external_upstreams method now uses KnownLineageMapping. This improves how external lineage data is processed and aggregated.


271-275: Use of KnownLineageMapping in _populate_external_upstreams

The _populate_external_upstreams method now uses KnownLineageMapping for show queries as well. This ensures consistency in handling external lineage data.


287-287: Return Type Changed to Iterable[KnownLineageMapping]

The _populate_external_lineage_from_show_query method now returns Iterable[KnownLineageMapping]. This aligns with the improved handling of lineage data.


321-321: Return Type Changed to Iterable[KnownLineageMapping]

The _populate_external_lineage_from_copy_history method now returns Iterable[KnownLineageMapping]. This aligns with the improved handling of lineage data.


329-334: Use of KnownLineageMapping in _populate_external_lineage_from_copy_history

The _populate_external_lineage_from_copy_history method now uses KnownLineageMapping. This improves how external lineage data is processed and aggregated.


349-349: Return Type Changed to Optional[KnownLineageMapping]

The _process_external_lineage_result_row method now returns Optional[KnownLineageMapping]. This aligns with the improved handling of lineage data.


355-355: Return None for Non-discovered Tables

The _process_external_lineage_result_row method now returns None if the table is not in discovered_tables. This ensures that only relevant tables are processed.


362-368: Use of KnownLineageMapping in _process_external_lineage_result_row

The _process_external_lineage_result_row method now uses KnownLineageMapping for creating lineage mappings. This improves the consistency and clarity of the lineage data.


423-428: Added Dataset Pattern Validation in map_query_result_upstreams

The map_query_result_upstreams method now includes dataset pattern validation. This ensures that only allowed datasets are processed.
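
Dataset pattern validation of this kind usually reduces to an allow/deny check on the qualified name. The function below is a generic sketch with hypothetical names and case-insensitive matching assumed; it does not reproduce the filtering code used by the extractor.

```python
import re
from typing import Sequence


def is_dataset_allowed(
    qualified_name: str, allow: Sequence[str], deny: Sequence[str]
) -> bool:
    # Deny patterns win over allow patterns.
    if any(re.match(p, qualified_name, re.IGNORECASE) for p in deny):
        return False
    return any(re.match(p, qualified_name, re.IGNORECASE) for p in allow)
```

Upstreams that fail the check would simply be skipped before any lineage edge is emitted.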


509-514: Added Dataset Pattern Validation in build_finegrained_lineage_upstreams

The build_finegrained_lineage_upstreams method now includes dataset pattern validation. This ensures that only allowed datasets are processed.


565-567: Added close Method to Implement Closeable

The close method was added to the SnowflakeLineageExtractor class to fulfill the Closeable interface requirements. This method should ensure that any resources are properly released.
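
A minimal sketch of what the close contract buys the caller, using `contextlib.closing` from the standard library. `FakeConnection` and `Extractor` are hypothetical stand-ins, and forwarding `close()` to the connection is an assumption about the intended behavior.

```python
from contextlib import closing


class FakeConnection:
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class Extractor:
    def __init__(self, connection: FakeConnection) -> None:
        # The connection is held by composition, as in the refactor described in the PR.
        self.connection = connection

    def close(self) -> None:
        # Release the underlying connection when the extractor is closed.
        self.connection.close()


def run() -> bool:
    conn = FakeConnection()
    # closing() calls extractor.close() on exit, even if the body raises.
    with closing(Extractor(conn)):
        pass
    return conn.closed
```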

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_usage_v2.py (13)

12-12: Import Statement for Closeable Interface Added

The Closeable interface was added. This is necessary for ensuring that resources are properly released when the object is no longer needed.


17-18: Import Statement for SnowflakeConnection and SnowflakePermissionError Added

The SnowflakeConnection and SnowflakePermissionError imports were added, which are essential for handling Snowflake connections and related errors.


109-109: Class SnowflakeUsageExtractor Now Implements Closeable

The SnowflakeUsageExtractor class now implements the Closeable interface. This is important for ensuring that resources are properly released.


114-114: Connection Initialization in Constructor

The SnowflakeConnection is now initialized in the constructor, as described in the PR summary.


122-122: Use of SnowflakeConnection

The SnowflakeConnection is now assigned to self.connection in the constructor, which ensures that the connection is available throughout the class methods.


203-203: Use of SnowflakeConnection in _get_workunits_internal

The _get_workunits_internal method now uses self.connection.query for querying Snowflake. This ensures consistency in how queries are executed.


235-235: Added Dataset Pattern Validation in _get_workunits_internal

The _get_workunits_internal method now includes dataset pattern validation. This ensures that only allowed datasets are processed.


289-289: Added Warning for Failed Usage Statistics Parsing

A warning is logged if parsing usage statistics fails. This helps in identifying issues during the ingestion process.


372-373: Assertion for Connection in _get_snowflake_history

An assertion is added to ensure that self.connection is not None before querying. This prevents potential runtime errors.


395-396: Assertion for Connection in _check_usage_date_ranges

An assertion is added to ensure that self.connection is not None before querying. This prevents potential runtime errors.


505-505: Added Warning for Failed Operation History Parsing

A warning is logged if parsing operation history fails. This helps in identifying issues during the ingestion process.


564-564: Added Dataset Pattern Validation in _is_object_valid

The _is_object_valid method now includes dataset pattern validation. This ensures that only allowed datasets are processed.


590-592: Added close Method to Implement Closeable

The close method was added to the SnowflakeUsageExtractor class to fulfill the Closeable interface requirements. This method should ensure that any resources are properly released.

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (1)

195-195: Handle URNs with Different Lengths in from_urn Method

The from_urn method now handles URNs with different lengths. This ensures that both standard and non-standard URNs are processed correctly.

metadata-ingestion/tests/integration/snowflake/common.py (3)

531-531: LGTM! The query condition is correctly handled.

The inclusion of view lineage and exclusion of column lineage is correctly implemented in the query.


607-610: LGTM! The query condition is correctly handled.

The time window for copying lineage history is correctly implemented in the query.


608-610: LGTM! The query condition is correctly handled.

The time window for copying lineage history is correctly implemented in the query.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py (9)

131-132: Improvement: Initialize connection in the constructor.

The connection initialization in the constructor is a good practice for better resource management.


141-141: Refactor: Use composition for connection.

Using composition for the connection (i.e., self.connection) improves code readability and reusability.
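
A minimal sketch of the composition pattern, for readers who have not seen the mixin version. ConnectionSketch and SourceSketch are hypothetical names; only the self.connection.query(...) call shape comes from the PR description.

```python
from typing import Any, Dict, List


class ConnectionSketch:
    """Hypothetical wrapper; records SQL instead of talking to Snowflake."""

    def __init__(self) -> None:
        self.executed: List[str] = []

    def query(self, sql: str) -> List[Dict[str, Any]]:
        self.executed.append(sql)
        return [{"sql": sql}]


class SourceSketch:
    """Holds a connection (composition) rather than inheriting query() from a mixin."""

    def __init__(self, connection: ConnectionSketch) -> None:
        # The connection is supplied up front, mirroring constructor-time initialization.
        self.connection = connection

    def current_version(self) -> List[Dict[str, Any]]:
        return self.connection.query("select current_version()")


source = SourceSketch(ConnectionSketch())
rows = source.current_version()
```

One practical benefit is testability: a fake connection can be passed in without patching a mixin method.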


236-238: Update: Use SnowflakeConnectionConfig for connection parsing.

Using SnowflakeConnectionConfig for connection parsing aligns with the new connection handling approach.


264-264: Update: Use SnowflakeConnection in check_capabilities.

The function now uses SnowflakeConnection for querying the Snowflake database, which aligns with the new connection handling approach.


426-426: Improvement: Reinitialize connection at the start.

Reinitializing the connection at the start of the function ensures that the latest connection settings are used.


432-434: Improvement: Use SnowsightUrlBuilder for external URL generation.

Using SnowsightUrlBuilder for external URL generation improves the handling of external URLs.


538-538: Update: Use SnowflakeConnection for session metadata queries.

The function now uses SnowflakeConnection for querying the Snowflake database for session metadata, which aligns with the new connection handling approach.


567-570: Update: Use SnowflakeConnection for Snowsight URL generation.

The function now uses SnowflakeConnection for querying the Snowflake database to generate the Snowsight URL, which aligns with the new connection handling approach.


Line range hint 618-618: Improvement: Ensure proper resource management.

The function ensures that the connection and extractors are properly closed, which improves resource management.

metadata-ingestion/setup.py (2)

414-414: Addition: Include snowflake-queries dependency.

The snowflake-queries dependency has been added, which is necessary for the new Snowflake queries source.


667-667: Addition: Register snowflake-queries source in entry points.

The snowflake-queries source has been added to the entry points, making it discoverable.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_query.py (2)

363-363: LGTM!

The upstreams_deny_pattern parameter addition is appropriate and the function logic is intact.


414-414: LGTM!

The downstreams_deny_pattern parameter addition and its usage in create_deny_regex_sql_filter are appropriate.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py (6)

Line range hint 146-176: LGTM!

The snowsight_url_builder parameter addition and its assignment are appropriate and consistent with the rest of the initialization logic.


365-368: LGTM!

The addition of the match_fully_qualified_names parameter in the is_schema_allowed function call is appropriate and consistent with the method's logic.


479-479: LGTM!

The addition of the view_pattern.allowed check is appropriate and ensures only allowed views are processed.


512-512: LGTM!

The addition of the table_pattern.allowed check is appropriate and ensures only allowed tables are processed.
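
For context, an allow/deny check of this kind can be sketched as follows. AllowDenySketch is a hypothetical class written for illustration; the deny-wins semantics and case-insensitive matching are assumptions, not a statement of how the pattern type used in this PR behaves.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class AllowDenySketch:
    """Hypothetical allow/deny filter over fully qualified table names."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, name: str) -> bool:
        # Assumption: a deny match always wins over an allow match.
        if any(re.match(p, name, re.IGNORECASE) for p in self.deny):
            return False
        return any(re.match(p, name, re.IGNORECASE) for p in self.allow)


pattern = AllowDenySketch(deny=[r".*\.staging_.*"])
kept = pattern.allowed("test_db.test_schema.table_1")
dropped = pattern.allowed("test_db.test_schema.staging_orders")
```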


Line range hint 789-799: LGTM!

The usage of snowsight_url_builder for generating external URLs is appropriate and consistent with the method's logic.


984-987: LGTM!

The usage of snowsight_url_builder for generating external URLs is appropriate and consistent with the method's logic.

metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (15)

108-108: Verify the usage of column_usage field.

Ensure that the addition of column_usage to the QueryMetadata dataclass is correctly utilized and doesn't introduce issues.


134-134: Verify the usage of column_usage field.

Ensure that the addition of column_usage to the KnownQueryLineageInfo dataclass is correctly utilized and doesn't introduce issues.

Verification successful

Verify the usage of column_usage field.

The column_usage field has been integrated into the sql_parsing_aggregator.py file and is used in various contexts such as assignments and function calls. The integration appears thorough and appropriate for tracking column usage without introducing issues.

  • Instances of column_usage are found in the following lines:
    • Assignments: Lines where column_usage is assigned values.
    • Function calls: Lines where column_usage is passed as an argument.
    • Conditionals: Lines where column_usage is used in conditions.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `column_usage` field in the file.

# Test: Search for the usage of `column_usage`. Expect: Appropriate usage for tracking column usage.
rg --type python 'column_usage' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 131


Script:

#!/bin/bash
# Description: Verify the usage of `column_usage` field in the file.

# Test: Search for the usage of `column_usage`. Expect: Appropriate usage for tracking column usage.
rg 'column_usage' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 714


203-203: Verify the usage of num_preparsed_queries field.

Ensure that the addition of num_preparsed_queries to the SqlAggregatorReport dataclass is correctly utilized and doesn't introduce issues.

Verification successful

The num_preparsed_queries field is correctly utilized for tracking preparsed queries.

  • The field is declared and initialized to 0.
  • It is incremented within the code, indicating it is used to count occurrences of preparsed queries.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `num_preparsed_queries` field in the file.

# Test: Search for the usage of `num_preparsed_queries`. Expect: Appropriate usage for tracking preparsed queries.
rg --type python 'num_preparsed_queries' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 140


Script:

#!/bin/bash
# Description: Verify the usage of `num_preparsed_queries` field in the file.

# Test: Search for the usage of `num_preparsed_queries`. Expect: Appropriate usage for tracking preparsed queries.
rg --type py 'num_preparsed_queries' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 191


222-222: Verify the usage of num_lineage_skipped_due_to_filters field.

Ensure that the addition of num_lineage_skipped_due_to_filters to the SqlAggregatorReport dataclass is correctly utilized and doesn't introduce issues.

Verification successful

Verified the usage of num_lineage_skipped_due_to_filters field.

The num_lineage_skipped_due_to_filters field is correctly utilized for tracking skipped lineage due to filters.

  • Field declaration: Line 222
  • Field usage: Incremented within the file
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `num_lineage_skipped_due_to_filters` field in the file.

# Test: Search for the usage of `num_lineage_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped lineage.
rg --type python 'num_lineage_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 153


Script:

#!/bin/bash
# Description: Verify the usage of `num_lineage_skipped_due_to_filters` field in the file.

# Test: Search for the usage of `num_lineage_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped lineage.
rg 'num_lineage_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 220


257-259: Verify the usage of generate_query_usage_statistics parameter.

Ensure that the addition of generate_query_usage_statistics to the SqlParsingAggregator class constructor is correctly utilized and doesn't introduce issues.

Verification successful

The generate_query_usage_statistics parameter is correctly utilized.

  • The parameter is declared, assigned to an instance variable, and used in logical conditions within the SqlParsingAggregator class to control the generation of query usage statistics.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `generate_query_usage_statistics` parameter in the file.

# Test: Search for the usage of `generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
rg --type python 'generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 150


Script:

#!/bin/bash
# Description: Verify the usage of `generate_query_usage_statistics` parameter in the file.

# Test: Search for the usage of `generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
rg -t py 'generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 459


230-230: Verify the usage of num_query_usage_stats_generated field.

Ensure that the addition of num_query_usage_stats_generated to the SqlAggregatorReport dataclass is correctly utilized and doesn't introduce issues.

Verification successful

The num_query_usage_stats_generated field is correctly utilized.

The field num_query_usage_stats_generated is properly used for tracking the generation of query usage statistics. It is incremented appropriately, indicating that this field is being utilized as intended.

  • Declaration and initialization: num_query_usage_stats_generated: int = 0
  • Increment operation: self.report.num_query_usage_stats_generated += 1
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `num_query_usage_stats_generated` field in the file.

# Test: Search for the usage of `num_query_usage_stats_generated`. Expect: Appropriate usage for tracking generated query usage statistics.
rg --type python 'num_query_usage_stats_generated' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 150


Script:

#!/bin/bash
# Description: Verify the usage of `num_query_usage_stats_generated` field in the file.

# Test: Search for the usage of `num_query_usage_stats_generated`. Expect: Appropriate usage for tracking generated query usage statistics.
rg --type py 'num_query_usage_stats_generated' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 225


18-18: Verify the usage of get_time_bucket.

Ensure that the usage of get_time_bucket aligns with the intended functionality and handles time-based operations correctly.

Verification successful

The usage of get_time_bucket aligns with the intended functionality.

The function is correctly used to convert timestamps into bucketed time formats based on the specified bucket duration.

  • Location: metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
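
To make "bucketed time formats" concrete, here is a small sketch of truncating a timestamp to the start of its bucket. bucket_start and its string bucket names are hypothetical; the real get_time_bucket signature is not shown in this review and may differ.

```python
from datetime import datetime, timezone


def bucket_start(ts: datetime, bucket: str) -> datetime:
    """Truncate ts to the start of its hour or day (hypothetical helper)."""
    if bucket == "HOUR":
        return ts.replace(minute=0, second=0, microsecond=0)
    if bucket == "DAY":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported bucket: {bucket}")


ts = datetime(2024, 7, 3, 15, 42, 7, tzinfo=timezone.utc)
day = bucket_start(ts, "DAY")
hour = bucket_start(ts, "HOUR")
```

All queries that fall in the same bucket then share one key, which is what allows usage counts to be aggregated per time window.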
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `get_time_bucket` in the file.

# Test: Search for the usage of `get_time_bucket`. Expect: Appropriate usage for time-based operations.
rg --type python 'get_time_bucket' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 134


Script:

#!/bin/bash
# Description: Verify the usage of `get_time_bucket` in the file.

# Test: Search for the usage of `get_time_bucket`. Expect: Appropriate usage for time-based operations.
rg --type py 'get_time_bucket' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 206


Script:

#!/bin/bash
# Description: Retrieve more lines of context around the usage of `get_time_bucket` to ensure it aligns with the intended functionality.

# Test: Display lines around the usage of `get_time_bucket`.
rg --type py -A 5 -B 5 'get_time_bucket' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 1171


273-275: Verify the usage of self.generate_query_usage_statistics.

Ensure that the assignment of generate_query_usage_statistics to self.generate_query_usage_statistics is correctly utilized in the class methods.

Verification successful

The assignment and usage of self.generate_query_usage_statistics are correctly utilized in the class methods.

  • The variable is assigned appropriately and used in multiple conditional checks to control the generation of query usage statistics.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `self.generate_query_usage_statistics` in the file.

# Test: Search for the usage of `self.generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
rg --type python 'self.generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 155


Script:

#!/bin/bash
# Description: Verify the usage of `self.generate_query_usage_statistics` in the file.

# Test: Search for the usage of `self.generate_query_usage_statistics`. Expect: Appropriate usage for controlling query usage statistics generation.
rg 'self.generate_query_usage_statistics' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 403


262-263: Verify the usage of is_temp_table and is_allowed_table parameters.

Ensure that the addition of is_temp_table and is_allowed_table parameters to the SqlParsingAggregator class constructor is correctly utilized and doesn't introduce issues.

Verification successful

Verified the usage of is_temp_table and is_allowed_table parameters.

The parameters is_temp_table and is_allowed_table have been correctly integrated into the SqlParsingAggregator class. They are assigned to instance variables and utilized in methods to control table filtering.

  • The parameters are assigned to self._is_temp_table and self._is_allowed_table.
  • Methods is_temp_table and is_allowed_table use these instance variables to perform their logic.
  • These methods are used in various parts of the class to appropriately filter tables.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `is_temp_table` and `is_allowed_table` parameters in the file.

# Test: Search for the usage of `is_temp_table` and `is_allowed_table`. Expect: Appropriate usage for controlling table filtering.
rg --type python 'is_temp_table|is_allowed_table' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 151


Script:

#!/bin/bash
# Description: Verify the usage of `is_temp_table` and `is_allowed_table` parameters in the file.

# Test: Search for the usage of `is_temp_table` and `is_allowed_table`. Expect: Appropriate usage for controlling table filtering.
rg 'is_temp_table|is_allowed_table' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 964


381-389: Verify the usage of query_usage_counts field.

Ensure that the addition of query_usage_counts to the SqlParsingAggregator class is correctly utilized and doesn't introduce issues.

Verification successful

The query_usage_counts field is correctly utilized.

The query_usage_counts field added to the SqlParsingAggregator class is appropriately used for tracking query usage statistics.

  • Lines involving query_usage_counts include initialization, checks for its presence, and usage in various operations.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `query_usage_counts` field in the file.

# Test: Search for the usage of `query_usage_counts`. Expect: Appropriate usage for tracking query usage statistics.
rg --type python 'query_usage_counts' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 137


Script:

#!/bin/bash
# Description: Verify the usage of `query_usage_counts` field in the file.

# Test: Search for the usage of `query_usage_counts`. Expect: Appropriate usage for tracking query usage statistics.
rg 'query_usage_counts' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 685


281-283: Verify the usage of the usage_config check.

Ensure that the check for usage_config when generate_usage_statistics or generate_query_usage_statistics is enabled is correctly utilized and doesn't introduce issues.

Verification successful

The usage of the usage_config check is correctly utilized.

The code contains several assertions to ensure that usage_config is not None when necessary, indicating that the check is correctly handled and does not introduce issues.

  • Line defining usage_config as Optional[BaseUsageConfig] = None
  • Line assigning usage_config to self.usage_config
  • Line checking if self.usage_config is None
  • Several assertions assert self.usage_config is not None
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of the `usage_config` check in the file.

# Test: Search for the usage of the `usage_config` check. Expect: Appropriate usage for ensuring `usage_config` is provided.
rg --type python 'usage_config' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 131


Script:

#!/bin/bash
# Description: Verify the usage of the `usage_config` check in the file.

# Test: Search for the usage of the `usage_config` check. Expect: Appropriate usage for ensuring `usage_config` is provided.
rg --type py 'usage_config' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 662


4-4: Verify the usage of functools.

Ensure that the usage of functools doesn't introduce memory leaks or other issues as warned by static analysis tools.

Verification successful

Verified the appropriate usage of functools.

The functools.lru_cache decorator is correctly applied to the _name_from_urn method, providing caching benefits without introducing memory leaks or other issues.

  • metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py:
    • import functools
    • @functools.lru_cache(maxsize=128)
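
The caching behaviour can be observed with a standalone function. name_from_urn_sketch and its parsing logic are hypothetical and only imitate what a helper like _name_from_urn might do.

```python
import functools


@functools.lru_cache(maxsize=128)
def name_from_urn_sketch(urn: str) -> str:
    # Hypothetical parsing: take the middle field of the dataset key tuple.
    key = urn[urn.index("(") + 1 : urn.rindex(")")]
    return key.split(",")[1]


urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.test_schema.table_2,PROD)"
first = name_from_urn_sketch(urn)
second = name_from_urn_sketch(urn)  # second call is served from the cache
info = name_from_urn_sketch.cache_info()
```

One caveat worth keeping in mind: when lru_cache decorates an instance method, self becomes part of the cache key, so the cache holds a reference to the instance until the entry is evicted. With maxsize=128 that retention is bounded.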
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `functools` in the file.

# Test: Search for the usage of `functools`. Expect: Appropriate usage without memory leaks.
rg --type python 'functools' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 128


Script:

#!/bin/bash
# Description: Verify the usage of `functools` in the file.

# Test: Search for the usage of `functools`. Expect: Appropriate usage without memory leaks.
rg 'functools' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 138


Script:

#!/bin/bash
# Description: Gather context around the usage of `functools.lru_cache` in the file.

# Test: Search for the function where `functools.lru_cache` is applied and get surrounding lines to understand the context.
rg -A 10 '@functools.lru_cache' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 555


148-169: Verify the usage of PreparsedQuery dataclass.

Ensure that the PreparsedQuery dataclass is correctly utilized and its fields align with the intended functionality.

Verification successful

Verified the usage of PreparsedQuery dataclass.

The PreparsedQuery dataclass is correctly utilized in the codebase. The fields and usage align with the intended functionality of representing pre-parsed queries.

  • Defined as a dataclass.
  • Used in type annotations.
  • Instantiated in the method add_preparsed_query.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `PreparsedQuery` dataclass in the file.

# Test: Search for the usage of `PreparsedQuery`. Expect: Appropriate usage for representing pre-parsed queries.
rg --type python 'PreparsedQuery' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 133


Script:

#!/bin/bash
# Description: Verify the usage of `PreparsedQuery` dataclass in the file.

# Test: Search for the usage of `PreparsedQuery`. Expect: Appropriate usage for representing pre-parsed queries.
rg 'PreparsedQuery' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 303


Script:

#!/bin/bash
# Description: Gather surrounding context of `PreparsedQuery` usages in the file.

# Test: Search for the surrounding context of `PreparsedQuery` usages.
rg 'PreparsedQuery' -A 5 -B 5 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 1751


141-146: Verify the usage of KnownLineageMapping dataclass.

Ensure that the KnownLineageMapping dataclass is correctly utilized and its fields align with the intended functionality.

Verification successful

The KnownLineageMapping dataclass is correctly utilized.

The KnownLineageMapping dataclass is used appropriately in type hints and type checks to handle lineage information, ensuring it aligns with the intended functionality.

  • Definition of KnownLineageMapping: lines 141-146
  • Usage in add method: lines following the definition
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `KnownLineageMapping` dataclass in the file.

# Test: Search for the usage of `KnownLineageMapping`. Expect: Appropriate usage for lineage mappings.
rg --type python 'KnownLineageMapping' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 138


Script:

#!/bin/bash
# Description: Verify the usage of `KnownLineageMapping` dataclass in the file.

# Test: Search for the usage of `KnownLineageMapping`. Expect: Appropriate usage for lineage mappings.
rg 'KnownLineageMapping' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 258


Script:

#!/bin/bash
# Description: Inspect the context of `KnownLineageMapping` usages in the file.

# Test: Extract the context lines around the usages of `KnownLineageMapping` to verify alignment with the expected functionality.
rg 'KnownLineageMapping' -A 5 -B 5 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 1015


226-226: Verify the usage of num_queries_skipped_due_to_filters field.

Ensure that the addition of num_queries_skipped_due_to_filters to the SqlAggregatorReport dataclass is correctly utilized and doesn't introduce issues.

Verification successful

Verified the usage of num_queries_skipped_due_to_filters field.

The addition of num_queries_skipped_due_to_filters to the SqlAggregatorReport dataclass is correctly utilized to track the number of queries skipped due to filters.

  • The field is incremented appropriately within the _gen_query method when a query doesn't involve any allowed tables.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `num_queries_skipped_due_to_filters` field in the file.

# Test: Search for the usage of `num_queries_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped queries.
rg --type python 'num_queries_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 153


Script:

#!/bin/bash
# Description: Verify the usage of `num_queries_skipped_due_to_filters` field in the file.

# Test: Search for the usage of `num_queries_skipped_due_to_filters`. Expect: Appropriate usage for tracking skipped queries.
rg 'num_queries_skipped_due_to_filters' metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 220


Script:

#!/bin/bash
# Description: Verify the context of the usage of `num_queries_skipped_due_to_filters` field in the file.

# Test: Search for the context where `self.report.num_queries_skipped_due_to_filters += 1` is used.
rg 'self\.report\.num_queries_skipped_due_to_filters \+= 1' -A 5 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py

Length of output: 355


Script:

#!/bin/bash
# Description: Verify the context of the usage of `num_queries_skipped_due_to_filters` field in the file.

# Test: Search for the full context of the method or function where `self.report.num_queries_skipped_due_to_filters += 1` occurs.
rg 'def ' -A 20 metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py | rg -B 20 'self\.report\.num_queries_skipped_due_to_filters \+= 1'

Length of output: 987

metadata-ingestion/tests/integration/snowflake/snowflake_privatelink_golden.json (2)

3895-3959: LGTM! Schema fields are correctly added to the querySubjects aspect.

The schema fields are correctly specified with appropriate field paths, types, and other properties.


4174-4238: LGTM! Schema fields are correctly added to the querySubjects aspect.

The schema fields are correctly specified with appropriate field paths, types, and other properties.

metadata-ingestion/tests/integration/snowflake/snowflake_golden.json (13)

Line range hint 1-1: Approved: Addition of new dataset entity.

The addition of the new dataset entity urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.test_schema.table_2,PROD) is consistent with the PR summary.


4524-4566: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_1 is consistent with the PR summary.


5138-5174: Approved: Addition of multiple schemaField entities.

The addition of multiple schemaField entities related to test_db.test_schema.table_2 is consistent with the PR summary.


5202-5243: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_10 is consistent with the PR summary.


5755-5796: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_4 is consistent with the PR summary.


5988-6029: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_5 is consistent with the PR summary.


6421-6462: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_3 is consistent with the PR summary.


6860-6901: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_6 is consistent with the PR summary.


7043-7084: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_8 is consistent with the PR summary.


7096-7167: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_1 and test_db.test_schema.view_1 is consistent with the PR summary.


7268-7309: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_9 is consistent with the PR summary.


7435-7476: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_7 is consistent with the PR summary.


7488-7559: Approved: Addition of multiple dataset and schemaField entities.

The addition of multiple dataset and schemaField entities related to test_db.test_schema.table_2 and test_db.test_schema.view_2 is consistent with the PR summary.

Comment on lines +108 to +116
self.aggregator._is_temp_table = (
    lambda name: DatasetUrn.create_from_ids(
        self.platform,
        name,
        env=self.config.env,
        platform_instance=self.config.platform_instance,
    ).urn()
    not in self.known_urns
)

Improve readability by using a named function for the lambda.

Using a named function improves readability and maintainability.

- self.aggregator._is_temp_table = (
-     lambda name: DatasetUrn.create_from_ids(
-         self.platform,
-         name,
-         env=self.config.env,
-         platform_instance=self.config.platform_instance,
-     ).urn()
-     not in self.known_urns
- )
+ def is_temp_table(name: str) -> bool:
+     return DatasetUrn.create_from_ids(
+         self.platform,
+         name,
+         env=self.config.env,
+         platform_instance=self.config.platform_instance,
+     ).urn() not in self.known_urns
+ self.aggregator._is_temp_table = is_temp_table

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4c4fb22 and 2e31235.

Files selected for processing (1)
  • metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (3 hunks)
Additional comments not posted (2)
metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py (2)

2-2: Correctly updated return type to Iterable[SnowflakeV2Source]

The return type of the stateful_source fixture has been correctly updated to Iterable[SnowflakeV2Source], aligning with the changes in the source code.


47-49: Proper use of context manager for SnowflakeV2Source

The addition of a context manager to manage the SnowflakeV2Source instance ensures proper resource management and cleanup.
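
The fixture pattern being praised looks roughly like the sketch below, written as a plain generator so it runs without pytest. SourceSketch and stateful_source_sketch are hypothetical; the real fixture yields a SnowflakeV2Source.

```python
from typing import Iterator


class SourceSketch:
    """Hypothetical source that supports the context-manager protocol."""

    def __init__(self) -> None:
        self.closed = False

    def __enter__(self) -> "SourceSketch":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()

    def close(self) -> None:
        self.closed = True


def stateful_source_sketch() -> Iterator[SourceSketch]:
    # Yielding inside the with-block keeps the source open for the test body
    # and closes it when the generator is resumed for teardown.
    with SourceSketch() as source:
        yield source


gen = stateful_source_sketch()
src = next(gen)              # setup: the test would use src here
open_during_test = not src.closed
next(gen, None)              # teardown: resumes the generator, leaving the with-block
```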

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Outside diff range, codebase verification and nitpick comments (2)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (2)

71-77: Local temporary path configuration for audit logs.

The local_temp_path field is configured to store the audit log locally. The TODO comments suggest improvements for caching and local testing.

Ensure the improvements for caching and local testing are tracked and implemented.


402-502: Build enriched audit log query.

The _build_enriched_audit_log_query function correctly builds the SQL query to fetch enriched audit logs from Snowflake. The TODO comments suggest improvements for generating better query fingerprints, adding table filter clauses, and dropping unnecessary columns.

Ensure the improvements suggested in the TODO comments are tracked and implemented.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 2e31235 and 9d43343.

Files selected for processing (3)
  • docs-website/vercel-setup.sh (2 hunks)
  • metadata-ingestion/scripts/install_deps.sh (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • docs-website/vercel-setup.sh
Additional comments not posted (17)
metadata-ingestion/scripts/install_deps.sh (1)

21-22: Addition of krb5-devel dependency for yum systems.

The addition of krb5-devel is correctly placed under the yum package manager section. This ensures that Kerberos development libraries are available for systems using yum.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (16)

92-94: Snowflake connection configuration.

The connection field is correctly defined to configure the Snowflake connection.


96-99: Snowflake Queries Extractor Report fields.

The fields for the time window and SQL aggregator report are correctly defined.


103-106: Snowflake Queries Source Report field.

The field for the queries extractor report is correctly defined.


108-120: Initialization of SnowflakeQueriesExtractor.

The constructor initializes the connection, configuration, reports, and SQL aggregator.


155-165: Local temporary path for audit logs.

The local_temp_path method ensures a temporary directory is created for storing audit logs. It logs the path being used.
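
A minimal sketch of what such a helper might look like, assuming the configured path is optional and a fresh temporary directory is used as the fallback. The function name `resolve_local_temp_path` and its signature are illustrative, not taken from the PR.

```python
import logging
import pathlib
import tempfile
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_local_temp_path(configured: Optional[pathlib.Path] = None) -> pathlib.Path:
    """Return a directory for the audit log cache, creating it if needed."""
    if configured is None:
        # No path configured: fall back to a new temporary directory.
        path = pathlib.Path(tempfile.mkdtemp())
    else:
        path = configured
        path.mkdir(parents=True, exist_ok=True)
    logger.info("Using local temp path: %s", path)
    return path
```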


166-170: Check for temporary tables.

The is_temp_table method checks if a table name matches any of the temporary table patterns.
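
For illustration only, a pattern-based check of this kind could be sketched as below. The regexes in `TEMP_TABLE_PATTERNS` are made-up examples, and matching case-insensitively is an assumption rather than something the PR states.

```python
import re
from typing import List

# Hypothetical example patterns; real deployments would configure their own.
TEMP_TABLE_PATTERNS: List[str] = [r".*\.tmp_.*", r".*__dbt_tmp$"]


def is_temp_table(name: str, patterns: List[str] = TEMP_TABLE_PATTERNS) -> bool:
    """Return True if the qualified table name matches any temp-table regex."""
    return any(re.match(p, name, flags=re.IGNORECASE) for p in patterns)
```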


172-174: Check for allowed tables.

The is_allowed_table method checks if a table name is allowed based on dataset patterns.


175-203: Generate work units from queries.

The get_workunits_internal method generates work units from the queries. It handles the audit log caching and iterates through the queries to add them to the SQL aggregator.


204-257: Fetch audit logs from Snowflake.

The fetch_audit_log method fetches audit logs from Snowflake. It includes TODO comments for fetching additional information and handling errors.


259-262: Generate dataset identifier from qualified name.

The get_dataset_identifier_from_qualified_name method generates a dataset identifier from a qualified name.


263-365: Parse audit log row.

The _parse_audit_log_row method parses a row from the audit log and generates a PreparsedQuery object. It includes TODO comments for filtering table names and mapping email addresses.


368-373: Initialization of SnowflakeQueriesSource.

The constructor initializes the context, configuration, reports, and queries extractor.


385-388: Create SnowflakeQueriesSource from config.

The create method creates a SnowflakeQueriesSource instance from a configuration dictionary and pipeline context.


390-392: Generate work units from queries.

The get_workunits_internal method generates work units from the queries using the queries extractor.


394-395: Get report for SnowflakeQueriesSource.

The get_report method returns the report for the SnowflakeQueriesSource.


504-515: Snowflake query type mappings.

The SNOWFLAKE_QUERY_TYPE_MAPPING constant correctly maps Snowflake query types to internal query types.

Comment on lines +57 to +59
class SnowflakeQueriesExtractorConfig(SnowflakeIdentifierConfig, SnowflakeFilterConfig):
    # TODO: Support stateful ingestion for the time windows.
    window: BaseTimeWindowConfig = BaseTimeWindowConfig()

Add support for stateful ingestion for the time windows.

The TODO comment indicates that support for stateful ingestion for the time windows is pending.

Do you want me to generate the implementation for stateful ingestion or open a GitHub issue to track this task?

Comment on lines +61 to +63
    # TODO: make this a proper allow/deny pattern
    deny_usernames: List[str] = []

Consider making this a proper allow/deny pattern.

The TODO comment suggests that the deny_usernames field should be converted to a proper allow/deny pattern.

Consider refactoring this field to support a proper allow/deny pattern.
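
To make the suggestion concrete, here is a small self-contained sketch of an allow/deny filter for usernames. `UsernamePattern` is a hypothetical stand-in written for this example. DataHub ships its own `AllowDenyPattern` config type, which is presumably what the TODO has in mind, but the class below does not claim to reproduce its exact behavior.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class UsernamePattern:
    """Hypothetical allow/deny filter: deny rules win over allow rules."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)
    ignore_case: bool = True

    def allowed(self, username: str) -> bool:
        flags = re.IGNORECASE if self.ignore_case else 0
        # Any matching deny regex rejects the username outright.
        if any(re.match(p, username, flags) for p in self.deny):
            return False
        # Otherwise the username must match at least one allow regex.
        return any(re.match(p, username, flags) for p in self.allow)
```

Compared with a plain `deny_usernames` list, this shape lets a recipe exclude service accounts by regex and also restrict ingestion to an explicit allow list.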

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 9d43343 and 95d1ea2.

Files selected for processing (3)
  • metadata-ingestion/setup.py (2 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py (19 hunks)
  • metadata-models/src/main/resources/entity-registry.yml (1 hunks)
Files skipped from review due to trivial changes (2)
  • metadata-ingestion/setup.py
  • metadata-models/src/main/resources/entity-registry.yml
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_schema_gen.py

@hsheth2 hsheth2 merged commit a4bce6a into master Jul 12, 2024
62 of 63 checks passed
@hsheth2 hsheth2 deleted the snowflake-queries-oss branch July 12, 2024 22:08
hsheth2 added a commit that referenced this pull request Jul 16, 2024
aviv-julienjehannet pushed a commit to aviv-julienjehannet/datahub that referenced this pull request Jul 25, 2024
arosanda added a commit to infobip/datahub that referenced this pull request Sep 23, 2024
* feat(forms) Handle deleting forms references when hard deleting forms (datahub-project#10820)

* refactor(ui): Misc improvements to the setup ingestion flow (ingest uplift 1/2)  (datahub-project#10764)

Co-authored-by: John Joyce <[email protected]>

* fix(ingestion/airflow-plugin): pipeline tasks discoverable in search (datahub-project#10819)

* feat(ingest/transformer): tags to terms transformer (datahub-project#10758)

Co-authored-by: Aseem Bansal <[email protected]>

* fix(ingestion/unity-catalog): fixed issue with profiling with GE turned on (datahub-project#10752)

Co-authored-by: Aseem Bansal <[email protected]>

* feat(forms) Add java SDK for form entity PATCH + CRUD examples (datahub-project#10822)

* feat(SDK) Add java SDK for structuredProperty entity PATCH + CRUD examples (datahub-project#10823)

* feat(SDK) Add StructuredPropertyPatchBuilder in python sdk and provide sample CRUD files (datahub-project#10824)

* feat(forms) Add CRUD endpoints to GraphQL for Form entities (datahub-project#10825)

* add flag for includeSoftDeleted in scroll entities API (datahub-project#10831)

* feat(deprecation) Return actor entity with deprecation aspect (datahub-project#10832)

* feat(structuredProperties) Add CRUD graphql APIs for structured property entities (datahub-project#10826)

* add scroll parameters to openapi v3 spec (datahub-project#10833)

* fix(ingest): correct profile_day_of_week implementation (datahub-project#10818)

* feat(ingest/glue): allow ingestion of empty databases from Glue (datahub-project#10666)

Co-authored-by: Harshal Sheth <[email protected]>

* feat(cli): add more details to get cli (datahub-project#10815)

* fix(ingestion/glue): ensure date formatting works on all platforms for aws glue (datahub-project#10836)

* fix(ingestion): fix datajob patcher (datahub-project#10827)

* fix(smoke-test): add suffix in temp file creation (datahub-project#10841)

* feat(ingest/glue): add helper method to permit user or group ownership (datahub-project#10784)

* feat(): Show data platform instances in policy modal if they are set on the policy (datahub-project#10645)

Co-authored-by: Hendrik Richert <[email protected]>

* docs(patch): add patch documentation for how implementation works (datahub-project#10010)

Co-authored-by: John Joyce <[email protected]>

* fix(jar): add missing custom-plugin-jar task (datahub-project#10847)

* fix(): also check exceptions/stack trace when filtering log messages (datahub-project#10391)

Co-authored-by: John Joyce <[email protected]>

* docs(): Update posts.md (datahub-project#9893)

Co-authored-by: Hyejin Yoon <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* chore(ingest): update acryl-datahub-classify version (datahub-project#10844)

* refactor(ingest): Refactor structured logging to support infos, warnings, and failures structured reporting to UI (datahub-project#10828)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(restli): log aspect-not-found as a warning rather than as an error (datahub-project#10834)

* fix(ingest/nifi): remove duplicate upstream jobs (datahub-project#10849)

* fix(smoke-test): test access to create/revoke personal access tokens (datahub-project#10848)

* fix(smoke-test): missing test for move domain (datahub-project#10837)

* ci: update usernames to not considered for community (datahub-project#10851)

* env: change defaults for data contract visibility (datahub-project#10854)

* fix(ingest/tableau): quote special characters in external URL (datahub-project#10842)

* fix(smoke-test): fix flakiness of auto complete test

* ci(ingest): pin dask dependency for feast (datahub-project#10865)

* fix(ingestion/lookml): liquid template resolution and view-to-view cll (datahub-project#10542)

* feat(ingest/audit): add client id and version in system metadata props (datahub-project#10829)

* chore(ingest): Mypy 1.10.1 pin (datahub-project#10867)

* docs: use acryl-datahub-actions as expected python package to install (datahub-project#10852)

* docs: add new js snippet (datahub-project#10846)

* refactor(ingestion): remove company domain for security reason (datahub-project#10839)

* fix(ingestion/spark): Platform instance and column level lineage fix (datahub-project#10843)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingestion/tableau): optionally ingest multiple sites and create site containers (datahub-project#10498)

Co-authored-by: Yanik Häni <[email protected]>

* fix(ingestion/looker): Add sqlglot dependency and remove unused sqlparser (datahub-project#10874)

* fix(manage-tokens): fix manage access token policy (datahub-project#10853)

* Batch get entity endpoints (datahub-project#10880)

* feat(system): support conditional write semantics (datahub-project#10868)

* fix(build): upgrade vercel builds to Node 20.x (datahub-project#10890)

* feat(ingest/lookml): shallow clone repos (datahub-project#10888)

* fix(ingest/looker): add missing dependency (datahub-project#10876)

* fix(ingest): only populate audit stamps where accurate (datahub-project#10604)

* fix(ingest/dbt): always encode tag urns (datahub-project#10799)

* fix(ingest/redshift): handle multiline alter table commands (datahub-project#10727)

* fix(ingestion/looker): column name missing in explore (datahub-project#10892)

* fix(lineage) Fix lineage source/dest filtering with explored per hop limit (datahub-project#10879)

* feat(conditional-writes): misc updates and fixes (datahub-project#10901)

* feat(ci): update outdated action (datahub-project#10899)

* feat(rest-emitter): adding async flag to rest emitter (datahub-project#10902)

Co-authored-by: Gabe Lyons <[email protected]>

* feat(ingest): add snowflake-queries source (datahub-project#10835)

* fix(ingest): improve `auto_materialize_referenced_tags_terms` error handling (datahub-project#10906)

* docs: add new company to adoption list (datahub-project#10909)

* refactor(redshift): Improve redshift error handling with new structured reporting system (datahub-project#10870)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* feat(ui) Finalize support for all entity types on forms (datahub-project#10915)

* Index ExecutionRequestResults status field (datahub-project#10811)

* feat(ingest): grafana connector (datahub-project#10891)

Co-authored-by: Shirshanka Das <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(gms) Add Form entity type to EntityTypeMapper (datahub-project#10916)

* feat(dataset): add support for external url in Dataset (datahub-project#10877)

* docs(saas-overview) added missing features to observe section (datahub-project#10913)

Co-authored-by: John Joyce <[email protected]>

* fix(ingest/spark): Fixing Micrometer warning (datahub-project#10882)

* fix(structured properties): allow application of structured properties without schema file (datahub-project#10918)

* fix(data-contracts-web) handle other schedule types (datahub-project#10919)

* fix(ingestion/tableau): human-readable message for PERMISSIONS_MODE_SWITCHED error (datahub-project#10866)

Co-authored-by: Harshal Sheth <[email protected]>

* Add feature flag for view defintions (datahub-project#10914)

Co-authored-by: Ethan Cartwright <[email protected]>

* feat(ingest/BigQuery): refactor+parallelize dataset metadata extraction (datahub-project#10884)

* fix(airflow): add error handling around render_template() (datahub-project#10907)

* feat(ingestion/sqlglot): add optional `default_dialect` parameter to sqlglot lineage (datahub-project#10830)

* feat(mcp-mutator): new mcp mutator plugin (datahub-project#10904)

* fix(ingest/bigquery): changes helper function to decode unicode scape sequences (datahub-project#10845)

* feat(ingest/postgres): fetch table sizes for profile (datahub-project#10864)

* feat(ingest/abs): Adding azure blob storage ingestion source (datahub-project#10813)

* fix(ingest/redshift): reduce severity of SQL parsing issues (datahub-project#10924)

* fix(build): fix lint fix web react (datahub-project#10896)

* fix(ingest/bigquery): handle quota exceeded for project.list requests (datahub-project#10912)

* feat(ingest): report extractor failures more loudly (datahub-project#10908)

* feat(ingest/snowflake): integrate snowflake-queries into main source (datahub-project#10905)

* fix(ingest): fix docs build (datahub-project#10926)

* fix(ingest/snowflake): fix test connection (datahub-project#10927)

* fix(ingest/lookml): add view load failures to cache (datahub-project#10923)

* docs(slack) overhauled setup instructions and screenshots (datahub-project#10922)

Co-authored-by: John Joyce <[email protected]>

* fix(airflow): Add comma parsing of owners to DataJobs (datahub-project#10903)

* fix(entityservice): fix merging sideeffects (datahub-project#10937)

* feat(ingest): Support System Ingestion Sources, Show and hide system ingestion sources with Command-S (datahub-project#10938)

Co-authored-by: John Joyce <[email protected]>

* chore() Set a default lineage filtering end time on backend when a start time is present (datahub-project#10925)

Co-authored-by: John Joyce <[email protected]>

* Added relationships APIs to V3. Added these generic APIs to V3 swagger doc. (datahub-project#10939)

* docs: add learning center to docs (datahub-project#10921)

* doc: Update hubspot form id (datahub-project#10943)

* chore(airflow): add python 3.11 w/ Airflow 2.9 to CI (datahub-project#10941)

* fix(ingest/Glue): column upstream lineage between S3 and Glue (datahub-project#10895)

* fix(ingest/abs): split abs utils into multiple files (datahub-project#10945)

* doc(ingest/looker): fix doc for sql parsing documentation (datahub-project#10883)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(ingest/bigquery): Adding missing BigQuery types (datahub-project#10950)

* fix(ingest/setup): feast and abs source setup (datahub-project#10951)

* fix(connections) Harden adding /gms to connections in backend (datahub-project#10942)

* feat(siblings) Add flag to prevent combining siblings in the UI (datahub-project#10952)

* fix(docs): make graphql doc gen more automated (datahub-project#10953)

* feat(ingest/athena): Add option for Athena partitioned profiling (datahub-project#10723)

* fix(spark-lineage): default timeout for future responses (datahub-project#10947)

* feat(datajob/flow): add environment filter using info aspects (datahub-project#10814)

* fix(ui/ingest): correct privilege used to show tab (datahub-project#10483)

Co-authored-by: Kunal-kankriya <[email protected]>

* feat(ingest/looker): include dashboard urns in browse v2 (datahub-project#10955)

* add a structured type to batchGet in OpenAPI V3 spec (datahub-project#10956)

* fix(ui): scroll on the domain sidebar to show all domains (datahub-project#10966)

* fix(ingest/sagemaker): resolve incorrect variable assignment for SageMaker API call (datahub-project#10965)

* fix(airflow/build): Pinning mypy (datahub-project#10972)

* Fixed a bug where the OpenAPI V3 spec was incorrect. The bug was introduced in datahub-project#10939. (datahub-project#10974)

* fix(ingest/test): Fix for mssql integration tests (datahub-project#10978)

* fix(entity-service) exist check correctly extracts status (datahub-project#10973)

* fix(structuredProps) casing bug in StructuredPropertiesValidator (datahub-project#10982)

* bugfix: use anyOf instead of allOf when creating references in openapi v3 spec (datahub-project#10986)

* fix(ui): Remove ant less imports (datahub-project#10988)

* feat(ingest/graph): Add get_results_by_filter to DataHubGraph (datahub-project#10987)

* feat(ingest/cli): init does not actually support environment variables (datahub-project#10989)

* fix(ingest/graph): Update get_results_by_filter graphql query (datahub-project#10991)

* feat(ingest/spark): Promote beta plugin (datahub-project#10881)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingest): support domains in meta -> "datahub" section (datahub-project#10967)

* feat(ingest): add `check server-config` command (datahub-project#10990)

* feat(cli): Make consistent use of DataHubGraphClientConfig (datahub-project#10466)

Deprecates get_url_and_token() in favor of a more complete option: load_graph_config() that returns a full DatahubClientConfig.
This change was then propagated across previous usages of get_url_and_token so that connections to DataHub server from the client respect the full breadth of configuration specified by DatahubClientConfig.

I.e: You can now specify disable_ssl_verification: true in your ~/.datahubenv file so that all cli functions to the server work when ssl certification is disabled.

Fixes datahub-project#9705

* fix(ingest/s3): Fixing container creation when there is no folder in path (datahub-project#10993)

* fix(ingest/looker): support platform instance for dashboards & charts (datahub-project#10771)

* feat(ingest/bigquery): improve handling of information schema in sql parser (datahub-project#10985)

* feat(ingest): improve `ingest deploy` command (datahub-project#10944)

* fix(backend): allow excluding soft-deleted entities in relationship-queries; exclude soft-deleted members of groups (datahub-project#10920)

- allow excluding soft-deleted entities in relationship-queries
- exclude soft-deleted members of groups

* fix(ingest/looker): downgrade missing chart type log level (datahub-project#10996)

* doc(acryl-cloud): release docs for 0.3.4.x (datahub-project#10984)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: RyanHolstien <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Pedro Silva <[email protected]>

* fix(protobuf/build): Fix protobuf check jar script (datahub-project#11006)

* fix(ui/ingest): Support invalid cron jobs (datahub-project#10998)

* fix(ingest): fix graph config loading (datahub-project#11002)

Co-authored-by: Pedro Silva <[email protected]>

* feat(docs): Document __DATAHUB_TO_FILE_ directive (datahub-project#10968)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(graphql/upsertIngestionSource): Validate cron schedule; parse error in CLI (datahub-project#11011)

* feat(ece): support custom ownership type urns in ECE generation (datahub-project#10999)

* feat(assertion-v2): changed Validation tab to Quality and created new Governance tab (datahub-project#10935)

* fix(ingestion/glue): Add support for missing config options for profiling in Glue (datahub-project#10858)

* feat(propagation): Add models for schema field docs, tags, terms (datahub-project#2959) (datahub-project#11016)

Co-authored-by: Chris Collins <[email protected]>

* docs: standardize terminology to DataHub Cloud (datahub-project#11003)

* fix(ingestion/transformer): replace the externalUrl container (datahub-project#11013)

* docs(slack) troubleshoot docs (datahub-project#11014)

* feat(propagation): Add graphql API (datahub-project#11030)

Co-authored-by: Chris Collins <[email protected]>

* feat(propagation):  Add models for Action feature settings (datahub-project#11029)

* docs(custom properties): Remove duplicate from sidebar (datahub-project#11033)

* feat(models): Introducing Dataset Partitions Aspect (datahub-project#10997)

Co-authored-by: John Joyce <[email protected]>

* feat(propagation): Add Documentation Propagation Settings (datahub-project#11038)

* fix(models): chart schema fields mapping, add dataHubAction entity, t… (datahub-project#11040)

* fix(ci): smoke test lint failures (datahub-project#11044)

* docs: fix learning center color scheme & typo (datahub-project#11043)

* feat: add cloud main page (datahub-project#11017)

Co-authored-by: Jay <[email protected]>

* feat(restore-indices): add additional step to also clear system metadata service (datahub-project#10662)

Co-authored-by: John Joyce <[email protected]>

* docs: fix typo (datahub-project#11046)

* fix(lint): apply spotless (datahub-project#11050)

* docs(airflow): example query to get datajobs for a dataflow (datahub-project#11034)

* feat(cli): Add run-id option to put sub-command (datahub-project#11023)

Adds an option to assign run-id to a given put command execution. 
This is useful when transformers do not exist for a given ingestion payload, we can follow up with custom metadata and assign it to an ingestion pipeline.

* fix(ingest): improve sql error reporting calls (datahub-project#11025)

* fix(airflow): fix CI setup (datahub-project#11031)

* feat(ingest/dbt): add experimental `prefer_sql_parser_lineage` flag (datahub-project#11039)

* fix(ingestion/lookml): enable stack-trace in lookml logs (datahub-project#10971)

* (chore): Linting fix (datahub-project#11015)

* chore(ci): update deprecated github actions (datahub-project#10977)

* Fix ALB configuration example (datahub-project#10981)

* chore(ingestion-base): bump base image packages (datahub-project#11053)

* feat(cli): Trim report of dataHubExecutionRequestResult to max GMS size (datahub-project#11051)

* fix(ingestion/lookml): emit dummy sql condition for lookml custom condition tag (datahub-project#11008)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(ingestion/powerbi): fix issue with broken report lineage (datahub-project#10910)

* feat(ingest/tableau): add retry on timeout (datahub-project#10995)

* change generate kafka connect properties from env (datahub-project#10545)

Co-authored-by: david-leifker <[email protected]>

* fix(ingest): fix oracle cronjob ingestion (datahub-project#11001)

Co-authored-by: david-leifker <[email protected]>

* chore(ci): revert update deprecated github actions (datahub-project#10977) (datahub-project#11062)

* feat(ingest/dbt-cloud): update metadata_endpoint inference (datahub-project#11041)

* build: Reduce size of datahub-frontend-react image by 50-ish% (datahub-project#10878)

Co-authored-by: david-leifker <[email protected]>

* fix(ci): Fix lint issue in datahub_ingestion_run_summary_provider.py (datahub-project#11063)

* docs(ingest): update developing-a-transformer.md (datahub-project#11019)

* feat(search-test): update search tests from datahub-project#10408 (datahub-project#11056)

* feat(cli): add aspects parameter to DataHubGraph.get_entity_semityped (datahub-project#11009)

Co-authored-by: Harshal Sheth <[email protected]>

* docs(airflow): update min version for plugin v2 (datahub-project#11065)

* doc(ingestion/tableau): doc update for derived permission (datahub-project#11054)

Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(py): remove dep on types-pkg_resources (datahub-project#11076)

* feat(ingest/mode): add option to exclude restricted (datahub-project#11081)

* fix(ingest): set lastObserved in sdk when unset (datahub-project#11071)

* doc(ingest): Update capabilities (datahub-project#11072)

* chore(vulnerability): Log Injection (datahub-project#11090)

* chore(vulnerability): Information exposure through a stack trace (datahub-project#11091)

* chore(vulnerability): Comparison of narrow type with wide type in loop condition (datahub-project#11089)

* chore(vulnerability): Insertion of sensitive information into log files (datahub-project#11088)

* chore(vulnerability): Risky Cryptographic Algorithm (datahub-project#11059)

* chore(vulnerability): Overly permissive regex range (datahub-project#11061)

Co-authored-by: Harshal Sheth <[email protected]>

* fix: update customer data (datahub-project#11075)

* fix(models): fixing the datasetPartition models (datahub-project#11085)

Co-authored-by: John Joyce <[email protected]>

* fix(ui): Adding view, forms GraphQL query, remove showing a fallback error message on unhandled GraphQL error (datahub-project#11084)

Co-authored-by: John Joyce <[email protected]>

* feat(docs-site): hiding learn more from cloud page (datahub-project#11097)

* fix(docs): Add correct usage of orFilters in search API docs (datahub-project#11082)

Co-authored-by: Jay <[email protected]>

* fix(ingest/mode): Regexp in mode name matcher didn't allow underscore (datahub-project#11098)

* docs: Refactor customer stories section (datahub-project#10869)

Co-authored-by: Jeff Merrick <[email protected]>

* fix(release): fix full/slim suffix on tag (datahub-project#11087)

* feat(config): support alternate hashing algorithm for doc id (datahub-project#10423)

Co-authored-by: david-leifker <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* fix(emitter): fix typo in get method of java kafka emitter (datahub-project#11007)

* fix(ingest): use correct native data type in all SQLAlchemy sources by compiling data type using dialect (datahub-project#10898)

Co-authored-by: Harshal Sheth <[email protected]>

* chore: Update contributors list in PR labeler (datahub-project#11105)

* feat(ingest): tweak stale entity removal messaging (datahub-project#11064)

* fix(ingestion): enforce lastObserved timestamps in SystemMetadata (datahub-project#11104)

* fix(ingest/powerbi): fix broken lineage between chart and dataset (datahub-project#11080)

* feat(ingest/lookml): CLL support for sql set in sql_table_name attribute of lookml view (datahub-project#11069)

* docs: update graphql docs on forms & structured properties (datahub-project#11100)

* test(search): search openAPI v3 test (datahub-project#11049)

* fix(ingest/tableau): prevent empty site content urls (datahub-project#11057)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(entity-client): implement client batch interface (datahub-project#11106)

* fix(snowflake): avoid reporting warnings/info for sys tables (datahub-project#11114)

* fix(ingest): downgrade column type mapping warning to info (datahub-project#11115)

* feat(api): add AuditStamp to the V3 API entity/aspect response (datahub-project#11118)

* fix(ingest/redshift): replace r'\n' with '\n' to avoid token error redshift serverless… (datahub-project#11111)

* fix(entiy-client): handle null entityUrn case for restli (datahub-project#11122)

* fix(sql-parser): prevent bad urns from alter table lineage (datahub-project#11092)

* fix(ingest/bigquery): use small batch size if use_tables_list_query_v2 is set (datahub-project#11121)

* fix(graphql): add missing entities to EntityTypeMapper and EntityTypeUrnMapper (datahub-project#10366)

* feat(ui): Changes to allow editable dataset name (datahub-project#10608)

Co-authored-by: Jay Kadambi <[email protected]>

* fix: remove saxo (datahub-project#11127)

* feat(mcl-processor): Update mcl processor hooks (datahub-project#11134)

* fix(openapi): fix openapi v2 endpoints & v3 documentation update

* Revert "fix(openapi): fix openapi v2 endpoints & v3 documentation update"

This reverts commit 573c1cb.

* docs(policies): updates to policies documentation (datahub-project#11073)

* fix(openapi): fix openapi v2 and v3 docs update (datahub-project#11139)

* feat(auth): grant type and acr values custom oidc parameters support (datahub-project#11116)

* fix(mutator): mutator hook fixes (datahub-project#11140)

* feat(search): support sorting on multiple fields (datahub-project#10775)

* feat(ingest): various logging improvements (datahub-project#11126)

* fix(ingestion/lookml): fix for sql parsing error (datahub-project#11079)

Co-authored-by: Harshal Sheth <[email protected]>

* feat(docs-site) cloud page spacing and content polishes (datahub-project#11141)

* feat(ui) Enable editing structured props on fields (datahub-project#11042)

* feat(tests): add md5 and last computed to testResult model (datahub-project#11117)

* test(openapi): openapi regression smoke tests (datahub-project#11143)

* fix(airflow): fix tox tests + update docs (datahub-project#11125)

* docs: add chime to adoption stories (datahub-project#11142)

* fix(ingest/databricks): Updating code to work with Databricks sdk 0.30 (datahub-project#11158)

* fix(kafka-setup): add missing script to image (datahub-project#11190)

* fix(config): fix hash algo config (datahub-project#11191)

* test(smoke-test): updates to smoke-tests (datahub-project#11152)

* fix(elasticsearch): refactor idHashAlgo setting (datahub-project#11193)

* chore(kafka): kafka version bump (datahub-project#11211)

* readd UsageStatsWorkUnit

* fix merge problems

* change logo

---------

Co-authored-by: Chris Collins <[email protected]>
Co-authored-by: John Joyce <[email protected]>
Co-authored-by: dushayntAW <[email protected]>
Co-authored-by: sagar-salvi-apptware <[email protected]>
Co-authored-by: Aseem Bansal <[email protected]>
Co-authored-by: Kevin Chun <[email protected]>
Co-authored-by: jordanjeremy <[email protected]>
Co-authored-by: skrydal <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>
Co-authored-by: david-leifker <[email protected]>
Co-authored-by: sid-acryl <[email protected]>
Co-authored-by: Julien Jehannet <[email protected]>
Co-authored-by: Hendrik Richert <[email protected]>
Co-authored-by: RyanHolstien <[email protected]>
Co-authored-by: Felix Lüdin <[email protected]>
Co-authored-by: Pirry <[email protected]>
Co-authored-by: Hyejin Yoon <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: cburroughs <[email protected]>
Co-authored-by: ksrinath <[email protected]>
Co-authored-by: Mayuri Nehate <[email protected]>
Co-authored-by: Kunal-kankriya <[email protected]>
Co-authored-by: Shirshanka Das <[email protected]>
Co-authored-by: ipolding-cais <[email protected]>
Co-authored-by: Tamas Nemeth <[email protected]>
Co-authored-by: Shubham Jagtap <[email protected]>
Co-authored-by: haeniya <[email protected]>
Co-authored-by: Yanik Häni <[email protected]>
Co-authored-by: Gabe Lyons <[email protected]>
Co-authored-by: 808OVADOZE <[email protected]>
Co-authored-by: noggi <[email protected]>
Co-authored-by: Nicholas Pena <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: ethan-cartwright <[email protected]>
Co-authored-by: Ethan Cartwright <[email protected]>
Co-authored-by: Nadav Gross <[email protected]>
Co-authored-by: Patrick Franco Braz <[email protected]>
Co-authored-by: pie1nthesky <[email protected]>
Co-authored-by: Joel Pinto Mata (KPN-DSH-DEX team) <[email protected]>
Co-authored-by: Ellie O'Neil <[email protected]>
Co-authored-by: Ajoy Majumdar <[email protected]>
Co-authored-by: deepgarg-visa <[email protected]>
Co-authored-by: Tristan Heisler <[email protected]>
Co-authored-by: Andrew Sikowitz <[email protected]>
Co-authored-by: Davi Arnaut <[email protected]>
Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: amit-apptware <[email protected]>
Co-authored-by: Sam Black <[email protected]>
Co-authored-by: Raj Tekal <[email protected]>
Co-authored-by: Steffen Grohsschmiedt <[email protected]>
Co-authored-by: jaegwon.seo <[email protected]>
Co-authored-by: Renan F. Lima <[email protected]>
Co-authored-by: Matt Exchange <[email protected]>
Co-authored-by: Jonny Dixon <[email protected]>
Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: Pinaki Bhattacharjee <[email protected]>
Co-authored-by: Jeff Merrick <[email protected]>
Co-authored-by: skrydal <[email protected]>
Co-authored-by: AndreasHegerNuritas <[email protected]>
Co-authored-by: jayasimhankv <[email protected]>
Co-authored-by: Jay Kadambi <[email protected]>
Co-authored-by: David Leifker <[email protected]>
Labels
ingestion PR or Issue related to the ingestion of metadata