fix(ingest/snowflake): add limits on tables/columns/queries in lineage #10804
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_query.py (3 hunks)
Additional comments not posted (3)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_query.py (3)
670-672: Introduce constants for limits on tables, columns, and queries.

Adding these constants is good practice: it centralizes control over the limits, which makes the code easier to maintain and configure. However, ensure that the values are derived from either configuration settings or environment variables so they can be adjusted without code changes.
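
One way to make such limits adjustable without a code change is to read them from the environment with a safe fallback. This is a minimal sketch, not the PR's implementation: the constant names, the environment variable names, and the default values below are all hypothetical, since the review does not show the actual ones.

```python
import os


def read_limit(env_var: str, default: int) -> int:
    """Return a positive integer limit from the environment, else the default."""
    raw = os.environ.get(env_var)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Ignore malformed values rather than failing at import time.
        return default
    return value if value > 0 else default


# Hypothetical names and defaults, for illustration only.
MAX_UPSTREAM_TABLES = read_limit("LINEAGE_MAX_UPSTREAM_TABLES", 10)
MAX_UPSTREAM_COLUMNS = read_limit("LINEAGE_MAX_UPSTREAM_COLUMNS", 100)
MAX_QUERIES = read_limit("LINEAGE_MAX_QUERIES", 10)
```

Falling back to the default on malformed or non-positive input keeps a bad environment value from disabling the limit entirely, which is the failure mode the limits are meant to prevent.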
793-812: Review the implementation of slicing for upstream tables, columns, and queries.

The use of `ARRAY_SLICE` to enforce limits on the number of upstream tables, columns, and queries is effective and clear. It keeps the result size bounded and prevents performance degradation on large datasets. However, consider adding a comment explaining why these specific limits were chosen (e.g., based on performance tests or user feedback), so that maintainers understand the reasoning behind the values.
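
As an illustration of the slicing technique, the sketch below builds an `ARRAY_SLICE` expression in Python. It relies on Snowflake's `ARRAY_SLICE(array, from, to)` semantics, where `to` is exclusive, so slicing from 0 to N keeps at most the first N elements. The helper name, the limit value, and the column expression are hypothetical and are not taken from the PR.

```python
def slice_array_sql(array_expr: str, limit: int) -> str:
    """Wrap a SQL array expression so that it yields at most `limit` elements."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # ARRAY_SLICE's upper bound is exclusive: indices 0 .. limit - 1 are kept.
    return f"ARRAY_SLICE({array_expr}, 0, {limit})"


# Hypothetical limit and column expression, for illustration only.
_MAX_UPSTREAM_TABLES = 10
upstreams_sql = slice_array_sql(
    "ARRAY_AGG(DISTINCT upstream_table_name)", _MAX_UPSTREAM_TABLES
)
print(upstreams_sql)
```

Because the slice keeps whichever elements happen to come first, a comment next to the limit (as the review suggests) is also a good place to note that the truncation is positional rather than ranked.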
Line range hint 670-812: Ensure comprehensive testing on the modified lineage query logic.

Given the complexity of the `table_upstreams_with_column_lineage` function and its critical role in lineage tracking, this functionality should be thoroughly tested. That includes testing with datasets of various sizes to verify that the slicing does not omit critical data and that performance remains acceptable under load.
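
A size-boundary check of the kind described could start from something like the following. This is only a sketch under stated assumptions: `cap` is a hypothetical pure-Python stand-in for slicing an array from 0 to a limit, and the table names are made up. It does not call `table_upstreams_with_column_lineage` or Snowflake, so a real test would still need to run the generated query against representative data.

```python
from typing import List, Sequence


def cap(items: Sequence[str], limit: int) -> List[str]:
    """Keep the first `limit` items (stand-in for slicing an array from 0 to limit)."""
    return list(items[:limit])


def check_cap(limit: int) -> None:
    """Exercise sizes below, at, and above the limit."""
    for size in (0, 1, limit - 1, limit, limit + 1, 10 * limit):
        upstreams = [f"db.schema.table_{i}" for i in range(size)]
        capped = cap(upstreams, limit)
        # Never more than the limit, and nothing dropped when under it.
        assert len(capped) == min(size, limit)
        # Truncation keeps a prefix, so the original order is preserved.
        assert capped == upstreams[: len(capped)]


check_cap(limit=5)
```

The sizes just below, at, and just above the limit are where an off-by-one in the slice bounds would show up first.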