-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest/abs): Adding azure blob storage ingestion source #10813
feat(ingest/abs): Adding azure blob storage ingestion source #10813
Conversation
WalkthroughThe latest updates introduce a new connector for ingesting Azure Blob Storage (ABS) datasets into DataHub. This enhancement includes comprehensive configurations, profiling options, and utilities for managing ABS URIs. Additionally, the documentation has been updated to reflect the new features, along with minor adjustments for AWS S3. Changes
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 23
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (14)
- metadata-ingestion/docs/sources/abs/README.md (1 hunks)
- metadata-ingestion/docs/sources/abs/abs.md (1 hunks)
- metadata-ingestion/docs/sources/abs/abs_recipe.yml (1 hunks)
- metadata-ingestion/setup.py (3 hunks)
- metadata-ingestion/src/datahub/ingestion/source/abs/config.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (4 hunks)
- metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (4 hunks)
Files not summarized due to errors (7)
- metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py: Error: Server error. Please try again later.
- metadata-ingestion/src/datahub/ingestion/source/abs/report.py: Error: Server error. Please try again later.
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py: Error: Server error. Please try again later.
- metadata-ingestion/setup.py: Error: Server error. Please try again later.
- metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py: Error: Server error. Please try again later.
- metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py: Error: Server error. Please try again later.
- metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py: Error: Server error. Please try again later.
Additional context used
LanguageTool
metadata-ingestion/docs/sources/abs/README.md
[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
metadata-ingestion/docs/sources/abs/abs.md
[uncategorized] ~4-~4: Possible missing comma found.
Context: ...) is a list of Path Spec (
path_spec) objects where each individual
path_spec` repre...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~4-~4: Possible missing article found.
Context: ...e path (path_spec.include
) represents formatted path to the dataset. This path must end...(AI_HYDRA_LEO_MISSING_A)
[uncategorized] ~4-~4: Possible missing comma found.
Context: ...o represent leaf level. If*.[ext]
is provided then files with only specified extensio...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~4-~4: Possible missing article found.
Context: ...[ext]is provided then files with only specified extension type will be scanned. "
.[ext...(AI_HYDRA_LEO_MISSING_THE)
[uncategorized] ~6-~6: Possible missing article found.
Context: ...ent a folder level and avoid specifying exact folder name. To map folder as a dataset...(AI_HYDRA_LEO_MISSING_AN)
[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use{table}
placeholder...(AI_HYDRA_LEO_MISSING_A)
[uncategorized] ~6-~6: Possible missing article found.
Context: ... use{table}
placeholder to represent folder level for which dataset is to be create...(AI_HYDRA_LEO_MISSING_THE)
[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...(AI_HYDRA_LEO_MISSING_THE)
[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...(REPEATED_VERBS)
Markdownlint
metadata-ingestion/docs/sources/abs/README.md
40-40: null
Files should end with a single newline character(MD047, single-trailing-newline)
33-33: null
Link fragments should be valid(MD051, link-fragments)
metadata-ingestion/docs/sources/abs/abs.md
55-55: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
202-202: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
16-16: Expected: 1; Actual: 2
Multiple consecutive blank lines(MD012, no-multiple-blanks)
107-107: Expected: 1; Actual: 2
Multiple consecutive blank lines(MD012, no-multiple-blanks)
169-169: Expected: 1; Actual: 2
Multiple consecutive blank lines(MD012, no-multiple-blanks)
170-170: Expected: 1; Actual: 3
Multiple consecutive blank lines(MD012, no-multiple-blanks)
17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
30-30: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
34-34: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
40-40: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
49-49: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
59-59: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
74-74: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
84-84: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
100-100: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
111-111: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
131-131: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
13-13: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
141-141: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
164-164: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
165-165: null
Bare URL used(MD034, no-bare-urls)
167-167: null
Bare URL used(MD034, no-bare-urls)
12-12: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
147-147: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
163-163: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
22-22: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
30-30: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
40-40: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
49-49: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
59-59: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
74-74: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
84-84: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
100-100: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
111-111: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
131-131: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
204-204: null
Files should end with a single newline character(MD047, single-trailing-newline)
4-4: null
Link fragments should be valid(MD051, link-fragments)
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/config.py
148-150: Use
dict.get()
without default valueRemove default value
(SIM910)
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
2-2: Redefinition of unused
logging
from line 1Remove definition:
logging
(F811)
155-155: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
280-280: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py
126-130: Use a single
if
statement instead of nestedif
statements(SIM102)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
693-693: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
Additional comments not posted (42)
metadata-ingestion/docs/sources/abs/abs_recipe.yml (1)
1-11
: Ensure sensitive information is secured.The configuration contains sensitive information such as
account_name
andsas_token
. Ensure these are secured and not hard-coded in production environments.
[SECURITY]
Consider using environment variables or a secrets management service.azure_config: account_name: "${{ secrets.AZURE_ACCOUNT_NAME }}" sas_token: "${{ secrets.AZURE_SAS_TOKEN }}"metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1)
1-19
: LGTM!The data class
DataLakeSourceReport
and its methodsreport_file_scanned
andreport_file_dropped
are well-defined and follow best practices.metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py (1)
38-38
: LGTM!The new subtype
ABS_CONTAINER
has been correctly added to theDatasetContainerSubTypes
enum.metadata-ingestion/docs/sources/abs/README.md (3)
1-35
: Consider rephrasing for conciseness.The phrase "on the basis of" in line 34 might be wordy. Consider using "based on" instead.
- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance. + JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.Tools
LanguageTool
[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
Markdownlint
33-33: null
Link fragments should be valid(MD051, link-fragments)
33-33
: Ensure link fragments are valid.The link fragment in line 33 should be valid.
Please ensure the
max_rows
link fragment is correct and points to the appropriate section.Tools
Markdownlint
33-33: null
Link fragments should be valid(MD051, link-fragments)
40-40
: Add a newline at the end of the file.Files should end with a single newline character.
- Profiling is not available in the current release. + Profiling is not available in the current release. +Tools
Markdownlint
40-40: null
Files should end with a single newline character(MD047, single-trailing-newline)
metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (2)
26-28
: Consider renaming_allow_deny_patterns
.The private attribute
_allow_deny_patterns
could be more descriptive, indicating it is used for profiling patterns.-_allow_deny_patterns: AllowDenyPattern = pydantic.PrivateAttr( +_profiling_allow_deny_patterns: AllowDenyPattern = pydantic.PrivateAttr(
75-92
: Improve the root validator method name.The root validator method
ensure_field_level_settings_are_normalized
could be renamed to better reflect its purpose.- def ensure_field_level_settings_are_normalized( + def normalize_field_level_settings(metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (1)
38-38
: Use consistent naming for platform constants.The platform constants should use consistent naming conventions.
-PLATFORM_ABS = "abs" +PLATFORM_AZURE_BLOB_STORAGE = "abs"metadata-ingestion/src/datahub/ingestion/source/abs/config.py (1)
99-101
: Usedict.get()
without default value.Remove the default value in the
dict.get()
method.- _rename_path_spec_to_plural = pydantic_renamed_field( - "path_spec", "path_specs", lambda path_spec: [path_spec] - ) + _rename_path_spec_to_plural = pydantic_renamed_field("path_spec", "path_specs", lambda path_spec: [path_spec])Likely invalid or redundant comment.
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (1)
218-220
: LGTM!The code changes are approved.
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (7)
31-32
: LGTM!The code changes are approved.
35-39
: LGTM!The code changes are approved.
42-50
: LGTM!The code changes are approved.
53-65
: LGTM!The code changes are approved.
68-73
: LGTM!The code changes are approved.
76-81
: LGTM!The code changes are approved.
162-195
: LGTM!The code changes are approved.
metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1)
55-57
: LGTM!The code changes are approved.
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (19)
204-204
: ClassABSSource
Implementation Looks GoodThe class is well-structured and adheres to the project conventions.
210-227
: Constructor Initialization and Telemetry DataThe constructor correctly initializes the attributes and sends telemetry data.
229-233
: Class Methodcreate
Looks GoodThe method correctly parses the configuration dictionary and creates an instance of
ABSSource
.
235-299
: Methodget_fields
Looks GoodThe method correctly handles both ABS and local file sources and infers the schema accurately. Error handling is also properly implemented.
301-324
: Methodadd_partition_columns_to_schema
Looks GoodThe method correctly derives partition keys and adds them to the schema fields.
325-334
: Method_create_table_operation_aspect
Looks GoodThe method correctly creates the operation aspect with the appropriate timestamps and operation type.
336-437
: Methodingest_table
Looks GoodThe method correctly handles metadata extraction, schema inference, and aspect creation.
439-444
: Methodget_prefix
Looks GoodThe method correctly extracts the prefix from a relative path using regex.
446-449
: Methodextract_table_name
Looks GoodThe method correctly extracts the table name from the path specification and named variables.
451-472
: Methodextract_table_data
Looks GoodThe method correctly extracts table data including display name, path, timestamp, and size.
474-490
: Methodresolve_templated_folders
Looks GoodThe method correctly resolves templated folders recursively.
491-520
: Methodget_dir_to_process
Looks GoodThe method correctly retrieves the directory to process, handling nested directories and sorting them.
522-615
: Methodabs_browser
Looks GoodThe method correctly handles ABS browsing, including error handling and sampling logic.
616-623
: Methodcreate_abs_path
Looks GoodThe method correctly constructs the ABS path using the account name and container name.
625-648
: Methodlocal_browser
Looks GoodThe method correctly handles local file browsing, including file and directory scanning.
649-695
: Methodget_workunits_internal
Looks GoodThe method correctly handles work unit generation, including ABS and local file sources.
Tools
Ruff
693-693: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
696-702
: Methodget_workunit_processors
Looks GoodThe method correctly retrieves the work unit processors, including the stale entity removal handler.
704-705
: Methodis_abs_platform
Looks GoodThe method correctly checks if the platform is ABS.
707-708
: Methodget_report
Looks GoodThe method correctly retrieves the report.
metadata-ingestion/setup.py (4)
261-266
: Addition of ABS Base Dependencies Looks GoodThe ABS base dependencies are correctly added to the
abs_base
dictionary.
275-275
: Addition of ABS Base to Delta Lake Dependencies Looks GoodThe ABS base dependencies are correctly included in the
delta_lake
dictionary.
416-416
: Addition of ABS Plugin Dependencies Looks GoodThe ABS plugin dependencies are correctly added to the
plugins
dictionary.
693-693
: Addition of ABS Source Entry Point Looks GoodThe ABS source entry point is correctly added to the
entry_points
dictionary.
base_path: str = Field( | ||
default="/", | ||
description="Base folder in hierarchical namespaces to start from.", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a validator for base_path
.
The base_path
field could benefit from a validator to ensure it starts with a /
.
@pydantic.validator("base_path")
def validate_base_path(cls, value):
if not value.startswith("/"):
raise ValueError("base_path must start with a '/'")
return value
def get_abfss_url(self, folder_path: str = "") -> str: | ||
if not folder_path.startswith("/"): | ||
folder_path = f"/{folder_path}" | ||
return f"abfss://{self.container_name}@{self.account_name}.dfs.core.windows.net{folder_path}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplify get_abfss_url
method.
The get_abfss_url
method can be simplified by using str.lstrip
.
- if not folder_path.startswith("/"):
- folder_path = f"/{folder_path}"
+ folder_path = folder_path.lstrip("/")
Committable suggestion was skipped due to low confidence.
elif is_abs_uri(path): | ||
return get_container_name(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle invalid ABS URIs gracefully.
Add error handling for invalid ABS URIs in get_bucket_name
.
- elif is_abs_uri(path):
- return get_container_name(path)
+ elif is_abs_uri(path):
+ try:
+ return get_container_name(path)
+ except ValueError:
+ raise ValueError(f"Invalid ABS URI: {path}")
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
elif is_abs_uri(path): | |
return get_container_name(path) | |
elif is_abs_uri(path): | |
try: | |
return get_container_name(path) | |
except ValueError: | |
raise ValueError(f"Invalid ABS URI: {path}") |
elif is_abs_uri(path): | ||
protocol = get_abs_prefix(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle invalid ABS URIs gracefully.
Add error handling for invalid ABS URIs in get_protocol
.
- elif is_abs_uri(path):
- protocol = get_abs_prefix(path)
+ elif is_abs_uri(path):
+ try:
+ protocol = get_abs_prefix(path)
+ except ValueError:
+ raise ValueError(f"Invalid ABS URI: {path}")
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
elif is_abs_uri(path): | |
protocol = get_abs_prefix(path) | |
elif is_abs_uri(path): | |
try: | |
protocol = get_abs_prefix(path) | |
except ValueError: | |
raise ValueError(f"Invalid ABS URI: {path}") |
def get_sub_types(self) -> str: | ||
if self.platform == "s3": | ||
return DatasetContainerSubTypes.S3_BUCKET | ||
elif self.platform == "gcs": | ||
return DatasetContainerSubTypes.GCS_BUCKET | ||
elif self.platform == "abs": | ||
return DatasetContainerSubTypes.ABS_CONTAINER | ||
raise ValueError(f"Unable to sub type for platform: {self.platform}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use constants for platform strings.
Replace hardcoded platform strings with constants.
- if self.platform == "s3":
+ if self.platform == PLATFORM_S3:
- elif self.platform == "gcs":
+ elif self.platform == PLATFORM_GCS:
- elif self.platform == "abs":
+ elif self.platform == PLATFORM_ABS:
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_sub_types(self) -> str: | |
if self.platform == "s3": | |
return DatasetContainerSubTypes.S3_BUCKET | |
elif self.platform == "gcs": | |
return DatasetContainerSubTypes.GCS_BUCKET | |
elif self.platform == "abs": | |
return DatasetContainerSubTypes.ABS_CONTAINER | |
raise ValueError(f"Unable to sub type for platform: {self.platform}") | |
def get_sub_types(self) -> str: | |
if self.platform == PLATFORM_S3: | |
return DatasetContainerSubTypes.S3_BUCKET | |
elif self.platform == PLATFORM_GCS: | |
return DatasetContainerSubTypes.GCS_BUCKET | |
elif self.platform == PLATFORM_ABS: | |
return DatasetContainerSubTypes.ABS_CONTAINER | |
raise ValueError(f"Unable to sub type for platform: {self.platform}") |
|
||
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details. | ||
|
||
All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix grammatical issues.
Add the missing articles for better readability.
- All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
+ All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details. | |
All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details. |
Tools
LanguageTool
[uncategorized] ~6-~6: Possible missing article found.
Context: ...ent a folder level and avoid specifying exact folder name. To map folder as a dataset...(AI_HYDRA_LEO_MISSING_AN)
[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use{table}
placeholder...(AI_HYDRA_LEO_MISSING_A)
[uncategorized] ~6-~6: Possible missing article found.
Context: ... use{table}
placeholder to represent folder level for which dataset is to be create...(AI_HYDRA_LEO_MISSING_THE)
[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...(AI_HYDRA_LEO_MISSING_THE)
is_abs = is_abs_uri(values.get("include") or "") | ||
if not is_s3 and not is_gcs and not is_abs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tip
Codebase Verification
The turn_off_sampling_for_non_s3
method appears to be unused.
The method turn_off_sampling_for_non_s3
is defined in metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py
but is not invoked anywhere in the codebase. This suggests that it might be dead code.
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py
:def turn_off_sampling_for_non_s3(cls, v, values)
Consider removing this method if it is no longer needed.
Analysis chain
Ensure sampling is correctly handled for Azure Blob Storage.
The sampling logic has been updated to handle Azure Blob Storage URIs. Ensure that this change is consistent with the rest of the codebase.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator in the codebase.
# Test: Search for the usage of the `turn_off_sampling_for_non_s3` validator. Expect: Consistent handling of Azure Blob Storage URIs.
rg --type python $'turn_off_sampling_for_non_s3'
Length of output: 78
Script:
#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator in the codebase.
# Test: Search for the usage of the `turn_off_sampling_for_non_s3` validator. Expect: Consistent handling of Azure Blob Storage URIs.
rg -t py 'turn_off_sampling_for_non_s3'
Length of output: 171
Script:
#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator within the path_spec.py file.
# Test: Search for invocations of the `turn_off_sampling_for_non_s3` method within the path_spec.py file.
rg 'turn_off_sampling_for_non_s3' metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py
Length of output: 165
Script:
#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator across the entire codebase.
# Test: Search for invocations of the `turn_off_sampling_for_non_s3` method across the entire codebase.
rg 'turn_off_sampling_for_non_s3'
Length of output: 165
continue | ||
|
||
folder_name = f"{prefix}{folder_name}" | ||
if folder_name in this_dict.keys(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unnecessary .keys()
method.
Use key in dict
instead of key in dict.keys()
for better performance.
- if folder_name in this_dict.keys():
+ if folder_name in this_dict:
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if folder_name in this_dict.keys(): | |
if folder_name in this_dict: |
Tools
Ruff
280-280: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
if azure_config is None: | ||
raise ValueError("No container_client available.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improve the error message.
The error message for azure_config
being None
could be more descriptive.
- raise ValueError("No container_client available.")
+ raise ValueError("Azure configuration is not provided. Cannot retrieve container client.")
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if azure_config is None: | |
raise ValueError("No container_client available.") | |
if azure_config is None: | |
raise ValueError("Azure configuration is not provided. Cannot retrieve container client.") |
def add_property( | ||
key: str, value: str, custom_properties: Dict[str, str], resource_name: str | ||
) -> Dict[str, str]: | ||
if key in custom_properties.keys(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unnecessary .keys()
method.
Use key in dict
instead of key in dict.keys()
for better performance.
- if key in custom_properties.keys():
+ if key in custom_properties:
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if key in custom_properties.keys(): | |
if key in custom_properties: |
Tools
Ruff
155-155: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 18
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (14)
- metadata-ingestion/docs/sources/abs/README.md (1 hunks)
- metadata-ingestion/docs/sources/abs/abs.md (1 hunks)
- metadata-ingestion/docs/sources/abs/abs_recipe.yml (1 hunks)
- metadata-ingestion/setup.py (3 hunks)
- metadata-ingestion/src/datahub/ingestion/source/abs/config.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (4 hunks)
- metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (4 hunks)
Files not summarized due to errors (4)
- metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py: Error: Server error. Please try again later.
- metadata-ingestion/setup.py: Error: Server error. Please try again later.
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py: Error: Server error. Please try again later.
- metadata-ingestion/docs/sources/abs/README.md: Error: Server error. Please try again later.
Files skipped from review due to trivial changes (2)
- metadata-ingestion/docs/sources/abs/abs_recipe.yml
- metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py
Additional context used
LanguageTool
metadata-ingestion/docs/sources/abs/README.md
[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, usepath_specs
configuration in ingestion r...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, usepath_specs
configuration in ingestion recipe. Refer section [Path Specs](http...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
metadata-ingestion/docs/sources/abs/abs.md
[uncategorized] ~4-~4: Possible missing comma found.
Context: ...) is a list of Path Spec (
path_spec) objects where each individual
path_spec` repre...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~4-~4: Possible missing article found.
Context: ...e path (path_spec.include
) represents formatted path to the dataset. This path must end...(AI_HYDRA_LEO_MISSING_A)
[uncategorized] ~4-~4: Possible missing comma found.
Context: ...o represent leaf level. If*.[ext]
is provided then files with only specified extensio...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~4-~4: Possible missing article found.
Context: ...[ext]is provided then files with only specified extension type will be scanned. "
.[ext...(AI_HYDRA_LEO_MISSING_THE)
[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use{table}
placeholder...(AI_HYDRA_LEO_MISSING_A)
[uncategorized] ~6-~6: Possible missing article found.
Context: ... use{table}
placeholder to represent folder level for which dataset is to be create...(AI_HYDRA_LEO_MISSING_THE)
[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...(AI_HYDRA_LEO_MISSING_THE)
[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...(REPEATED_VERBS)
[uncategorized] ~54-~54: Possible missing article found.
Context: ...table}/*.avro ```{table}
represents folder for which dataset will be created. ##...(AI_HYDRA_LEO_MISSING_A)
[uncategorized] ~54-~54: Possible missing article found.
Context: ...{table}
represents folder for which dataset will be created. #### Example 3 - Fol...(AI_HYDRA_LEO_MISSING_A)
Markdownlint
metadata-ingestion/docs/sources/abs/README.md
40-40: null
Files should end with a single newline character(MD047, single-trailing-newline)
33-33: null
Link fragments should be valid(MD051, link-fragments)
metadata-ingestion/docs/sources/abs/abs.md
55-55: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
202-202: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
16-16: Expected: 1; Actual: 2
Multiple consecutive blank lines(MD012, no-multiple-blanks)
107-107: Expected: 1; Actual: 2
Multiple consecutive blank lines(MD012, no-multiple-blanks)
169-169: Expected: 1; Actual: 2
Multiple consecutive blank lines(MD012, no-multiple-blanks)
170-170: Expected: 1; Actual: 3
Multiple consecutive blank lines(MD012, no-multiple-blanks)
17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
30-30: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
34-34: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
40-40: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
49-49: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
59-59: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
74-74: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
84-84: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
100-100: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
111-111: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
131-131: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
13-13: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
141-141: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
164-164: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
165-165: null
Bare URL used(MD034, no-bare-urls)
167-167: null
Bare URL used(MD034, no-bare-urls)
12-12: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
147-147: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
163-163: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
22-22: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
30-30: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
40-40: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
49-49: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
59-59: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
74-74: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
84-84: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
100-100: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
111-111: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
131-131: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
204-204: null
Files should end with a single newline character(MD047, single-trailing-newline)
4-4: null
Link fragments should be valid(MD051, link-fragments)
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/config.py
148-150: Use
dict.get()
without default valueRemove default value
(SIM910)
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
2-2: Redefinition of unused
logging
from line 1Remove definition:
logging
(F811)
155-155: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
280-280: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py
126-130: Use a single
if
statement instead of nestedif
statements(SIM102)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
693-693: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
Additional comments not posted (28)
metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1)
10-19
: LGTM!The class
DataLakeSourceReport
correctly extendsStaleEntityRemovalSourceReport
and adds functionality to track scanned and filtered files.metadata-ingestion/docs/sources/abs/README.md (1)
34-34
: Simplify the phrase "on the basis of".The phrase "on the basis of" can be simplified to "based on".
- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance. + JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.Tools
LanguageTool
[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (1)
75-92
: LGTM!The root validator
ensure_field_level_settings_are_normalized
correctly ensures that field-level settings are normalized based on theprofile_table_level_only
flag.metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py (1)
84-98
: LGTM!The root validator
_check_credential_values
correctly ensures that the necessary credential values are provided.metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (7)
19-24
: Imports look good!The import statements for Azure Blob Storage utilities are correctly added.
38-38
: Constant declaration looks good!The
PLATFORM_ABS
constant is correctly introduced.
95-96
: ABS URI handling looks good!The
get_protocol
method is correctly updated to handle ABS URIs.
111-113
: ABS URI handling looks good!The
get_bucket_name
method is correctly updated to handle ABS URIs.
115-122
: ABS container subtype handling looks good!The
get_sub_types
method is correctly updated to return ABS container subtype.
124-129
: ABS URI handling looks good!The
get_base_full_path
method is correctly updated to handle ABS URIs.
137-148
: ABS platform handling looks good!The
create_container_hierarchy
method is correctly updated to handle ABS platform.metadata-ingestion/src/datahub/ingestion/source/abs/config.py (4)
1-17
: Imports look good!The import statements for Azure Blob Storage configurations and utilities are correctly added.
28-107
: Class definition looks good!The
DataLakeSourceConfig
class is correctly introduced with necessary fields and methods.
108-145
: ABS platform handling looks good!The
check_path_specs_and_infer_platform
method is correctly updated to handle ABS platform.
156-163
: Profiling pattern handling looks good!The
ensure_profiling_pattern_is_passed_to_profiling
method is correctly implemented to handle profiling patterns.metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (3)
14-14
: Imports look good!The import statement for Azure Blob Storage utility is correctly added.
173-174
: ABS URI handling looks good!The
sample_files
validator is correctly updated to handle ABS URIs.
218-221
: ABS URI check looks good!The
is_abs
cached property is correctly implemented to check for ABS URIs.metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (6)
31-32
: LGTM!The function correctly checks if a URI matches the Azure Blob Storage pattern.
35-39
: LGTM!The function correctly extracts the prefix from an Azure Blob Storage URI.
42-50
: LGTM!The function correctly removes the Azure Blob Storage prefix from a URI.
53-65
: LGTM!The function correctly creates a URN for an Azure Blob Storage URI.
68-73
: LGTM!The function correctly extracts the container name from an Azure Blob Storage URI.
76-81
: LGTM!The function correctly extracts the key prefix from an Azure Blob Storage URI.
metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1)
60-73
: LGTM!The class correctly defines the specification for profiling a single column.
metadata-ingestion/setup.py (3)
261-266
: LGTM! Adding Azure Blob Storage dependencies.The new dependencies for Azure Blob Storage are appropriate for the intended functionality.
275-275
: LGTM! Including Azure Blob Storage dependencies indelta_lake
plugin.The addition of
abs_base
to thedelta_lake
plugin is logical and aligns with the objective of adding Azure Blob ingestion functionality.
416-416
: LGTM! Addingabs
plugin for Azure Blob Storage ingestion.The new
abs
plugin withabs_base
dependencies aligns with the PR objective of adding Azure Blob ingestion functionality.
inferred_platform = values.get( | ||
"platform", None | ||
) # we may have inferred it above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplify the platform inference logic.
Use dict.get()
without the default value as suggested by static analysis tools.
- inferred_platform = values.get("platform", None)
+ inferred_platform = values.get("platform")
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
inferred_platform = values.get( | |
"platform", None | |
) # we may have inferred it above | |
inferred_platform = values.get("platform") |
Tools
Ruff
148-150: Use
dict.get()
without default valueRemove default value
(SIM910)
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details. | ||
|
||
All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details. | ||
|
||
Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to current `path_spec`. This path cannot have named variables ( `{}` ). Exclude path can have `**` to represent multiple folder levels. Refer [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix grammatical issues and formatting inconsistencies.
There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.
- Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details.
+ Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets. The include path (`path_spec.include`) represents a formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent the leaf level. If `*.[ext]` is provided, then only files with the specified extension type will be scanned. "`.[ext]`" can be any of the [supported file types](#supported-file-types). Refer to [example 1](#example-1---individual-file-as-dataset) below for more details.
- All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
+ All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [examples 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
- Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to current `path_spec`. This path cannot have named variables ( `{}` ). Exclude path can have `**` to represent multiple folder levels. Refer [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.
+ Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to the current `path_spec`. This path cannot have named variables (`{}`). The exclude path can have `**` to represent multiple folder levels. Refer to [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.
- Refer [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has more complex dataset representation.
+ Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has a more complex dataset representation.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details. | |
All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details. | |
Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to current `path_spec`. This path cannot have named variables ( `{}` ). Exclude path can have `**` to represent multiple folder levels. Refer [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details. | |
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets. The include path (`path_spec.include`) represents a formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent the leaf level. If `*.[ext]` is provided, then only files with the specified extension type will be scanned. "`.[ext]`" can be any of the [supported file types](#supported-file-types). Refer to [example 1](#example-1---individual-file-as-dataset) below for more details. | |
All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [examples 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details. | |
Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to the current `path_spec`. This path cannot have named variables (`{}`). The exclude path can have `**` to represent multiple folder levels. Refer to [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details. | |
Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has a more complex dataset representation. |
Tools
LanguageTool
[uncategorized] ~4-~4: Possible missing comma found.
Context: ...) is a list of Path Spec (
path_spec) objects where each individual
path_spec` repre...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~4-~4: Possible missing article found.
Context: ...e path (path_spec.include
) represents formatted path to the dataset. This path must end...(AI_HYDRA_LEO_MISSING_A)
[uncategorized] ~4-~4: Possible missing comma found.
Context: ...o represent leaf level. If*.[ext]
is provided then files with only specified extensio...(AI_HYDRA_LEO_MISSING_COMMA)
[uncategorized] ~4-~4: Possible missing article found.
Context: ...[ext]is provided then files with only specified extension type will be scanned. "
.[ext...(AI_HYDRA_LEO_MISSING_THE)
[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use{table}
placeholder...(AI_HYDRA_LEO_MISSING_A)
[uncategorized] ~6-~6: Possible missing article found.
Context: ... use{table}
placeholder to represent folder level for which dataset is to be create...(AI_HYDRA_LEO_MISSING_THE)
[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...(AI_HYDRA_LEO_MISSING_THE)
Markdownlint
4-4: null
Link fragments should be valid(MD051, link-fragments)
**Valid path_specs.include** | ||
|
||
```python | ||
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/bar.avro # single file table | ||
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/*.* # mulitple file level tables | ||
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/{table}/*.avro #table without partition | ||
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/{table}/*/*.avro #table where partitions are not specified | ||
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/{table}/*.* # table where no partitions as well as data type specified | ||
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/*.avro # specifying keywords to be used in display name | ||
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format | ||
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format | ||
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions | ||
https://storageaccountname.blob.core.windows.net/my-container/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 2 levels down in container | ||
https://storageaccountname.blob.core.windows.net/my-container/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 3 levels down in container | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix grammatical issues and formatting inconsistencies.
There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.
- https://storageaccountname.blob.core.windows.net/my-container/foo/tests/bar.avro # single file table
- https://storageaccountname.blob.core.windows.net/my-container/foo/tests/*.* # mulitple file level tables
+ https://storageaccountname.blob.core.windows.net/my-container/foo/tests/bar.avro # single file table
+ https://storageaccountname.blob.core.windows.net/my-container/foo/tests/*.* # multiple file level tables
Committable suggestion was skipped due to low confidence.
Tools
Markdownlint
147-147: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
### Path Specs - Examples | ||
#### Example 1 - Individual file as Dataset | ||
|
||
Container structure: | ||
|
||
``` | ||
test-container | ||
├── employees.csv | ||
├── departments.json | ||
└── food_items.csv | ||
``` | ||
|
||
Path specs config to ingest `employees.csv` and `food_items.csv` as datasets: | ||
``` | ||
path_specs: | ||
- include: https://storageaccountname.blob.core.windows.net/test-container/*.csv | ||
|
||
``` | ||
This will automatically ignore `departments.json` file. To include it, use `*.*` instead of `*.csv`. | ||
|
||
#### Example 2 - Folder of files as Dataset (without Partitions) | ||
|
||
Container structure: | ||
``` | ||
test-container | ||
└── offers | ||
├── 1.avro | ||
└── 2.avro | ||
|
||
``` | ||
|
||
Path specs config to ingest folder `offers` as dataset: | ||
``` | ||
path_specs: | ||
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/*.avro | ||
``` | ||
|
||
`{table}` represents folder for which dataset will be created. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix grammatical issues and formatting inconsistencies.
There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.
- `{table}` represents folder for which dataset will be created.
+ `{table}` represents the folder for which the dataset will be created.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
### Path Specs - Examples | |
#### Example 1 - Individual file as Dataset | |
Container structure: | |
``` | |
test-container | |
├── employees.csv | |
├── departments.json | |
└── food_items.csv | |
``` | |
Path specs config to ingest `employees.csv` and `food_items.csv` as datasets: | |
``` | |
path_specs: | |
- include: https://storageaccountname.blob.core.windows.net/test-container/*.csv | |
``` | |
This will automatically ignore `departments.json` file. To include it, use `*.*` instead of `*.csv`. | |
#### Example 2 - Folder of files as Dataset (without Partitions) | |
Container structure: | |
``` | |
test-container | |
└── offers | |
├── 1.avro | |
└── 2.avro | |
``` | |
Path specs config to ingest folder `offers` as dataset: | |
``` | |
path_specs: | |
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/*.avro | |
``` | |
`{table}` represents folder for which dataset will be created. | |
`{table}` represents the folder for which the dataset will be created. |
Tools
LanguageTool
[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...(REPEATED_VERBS)
[uncategorized] ~54-~54: Possible missing article found.
Context: ...table}/*.avro ```{table}
represents folder for which dataset will be created. ##...(AI_HYDRA_LEO_MISSING_A)
[uncategorized] ~54-~54: Possible missing article found.
Context: ...{table}
represents folder for which dataset will be created. #### Example 3 - Fol...(AI_HYDRA_LEO_MISSING_A)
Markdownlint
55-55: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
30-30: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
34-34: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
40-40: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
49-49: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
22-22: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
30-30: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
40-40: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
49-49: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
@@ -0,0 +1,289 @@ | |||
import logging | |||
import logging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove redundant import.
The logging
module is imported twice.
-import logging
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
import logging |
Tools
Ruff
2-2: Redefinition of unused
logging
from line 1Remove definition:
logging
(F811)
@@ -0,0 +1,40 @@ | |||
This connector ingests Azure Blob Storage (abbreviated to abs) datasets into DataHub. It allows mapping an individual | |||
file or a folder of files to a dataset in DataHub. | |||
To specify the group of files that form a dataset, use `path_specs` configuration in ingestion recipe. Refer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add missing article "the".
The sentence is missing the article "the" before "section."
- Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
+ Refer to the section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
To specify the group of files that form a dataset, use `path_specs` configuration in ingestion recipe. Refer | |
To specify the group of files that form a dataset, use `path_specs` configuration in ingestion recipe. Refer to the section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details. |
Tools
LanguageTool
[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, usepath_specs
configuration in ingestion r...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, usepath_specs
configuration in ingestion recipe. Refer section [Path Specs](http...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
#### Example 5 - Advanced - Either Individual file OR Folder of files as Dataset | ||
|
||
Container structure: | ||
``` | ||
test-container | ||
├── customers | ||
│ ├── part1.json | ||
│ ├── part2.json | ||
│ ├── part3.json | ||
│ └── part4.json | ||
├── employees.csv | ||
├── food_items.csv | ||
├── tmp_10101000.csv | ||
└── orders | ||
└── year=2022 | ||
└── month=2 | ||
├── 1.parquet | ||
├── 2.parquet | ||
└── 3.parquet | ||
|
||
``` | ||
|
||
Path specs config: | ||
``` | ||
path_specs: | ||
- include: https://storageaccountname.blob.core.windows.net/test-container/*.csv | ||
exclude: | ||
- **/tmp_10101000.csv | ||
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/*.json | ||
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet | ||
``` | ||
|
||
Above config has 3 path_specs and will ingest following datasets | ||
- `employees.csv` - Single File as Dataset | ||
- `food_items.csv` - Single File as Dataset | ||
- `customers` - Folder as Dataset | ||
- `orders` - Folder as Dataset | ||
and will ignore file `tmp_10101000.csv` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix grammatical issues and formatting inconsistencies.
There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.
- Above config has 3 path_specs and will ingest following datasets
+ The above config has 3 path_specs and will ingest the following datasets:
Tools
Markdownlint
111-111: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
131-131: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
141-141: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
111-111: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
131-131: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
|
||
### Profiling | ||
|
||
Profiling is not available in the current release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure the file ends with a newline.
Markdown files should end with a single newline character.
+
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Profiling is not available in the current release. | |
Profiling is not available in the current release. | |
Tools
Markdownlint
40-40: null
Files should end with a single newline character(MD047, single-trailing-newline)
from typing import Any, Dict, Optional | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unused import.
The Optional
import from typing
is not used in the file.
- from typing import Any, Dict, Optional
+ from typing import Any, Dict
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from typing import Any, Dict, Optional | |
from typing import Any, Dict | |
folder_path = f"/{folder_path}" | ||
return f"abfss://{self.container_name}@{self.account_name}.dfs.core.windows.net{folder_path}" | ||
|
||
# TODO DEX-1010 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove TODO comment or create an issue.
The TODO comment should either be removed or an issue should be created to track this task.
Do you want me to open a GitHub issue to track this task?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
2-2: Redefinition of unused
logging
from line 1Remove definition:
logging
(F811)
146-146: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
272-272: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
Additional comments not posted (26)
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (9)
22-23
: LGTM!The function correctly checks if a URI matches the Azure Blob Storage pattern.
26-30
: LGTM!The function correctly extracts the prefix from an Azure Blob Storage URI.
33-41
: LGTM!The function correctly removes the Azure Blob Storage prefix from a URI.
44-56
: LGTM!The function correctly creates a URN for an Azure Blob Storage dataset.
59-64
: LGTM!The function correctly extracts the container name from an Azure Blob Storage URI.
67-72
: LGTM!The function correctly extracts the key prefix from an Azure Blob Storage URI.
153-187
: LGTM!The function correctly processes properties and adds them to the custom properties dictionary.
189-237
: LGTM!The function correctly retrieves tags from an Azure Blob Storage object or container.
86-87
: Improve the error message.The error message for
azure_config
beingNone
could be more descriptive.- raise ValueError("No container_client available.") + raise ValueError("Azure configuration is not provided. Cannot retrieve container client.")Likely invalid or redundant comment.
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (17)
127-147
: LGTM!The function correctly maps Spark column types to DataHub types.
235-299
: LGTM!The function correctly retrieves fields from a table based on its file extension.
301-324
: LGTM!The function correctly adds partition columns to the schema.
325-334
: LGTM!The function correctly creates an operation aspect for a table.
336-437
: LGTM!The function correctly ingests a table and generates metadata work units.
439-444
: LGTM!The function correctly extracts the prefix from a relative path.
446-449
: LGTM!The function correctly extracts the table name from a path specification and named variables.
451-472
: LGTM!The function correctly extracts table data from a path specification.
474-490
: LGTM!The function correctly resolves templated folders in a container.
491-520
: LGTM!The function correctly retrieves the directory to process based on a path specification.
522-606
: LGTM!The function correctly browses Azure Blob Storage and retrieves metadata for files.
608-615
: LGTM!The function correctly creates an Azure Blob Storage path from a key.
617-639
: LGTM!The function correctly browses the local file system and retrieves metadata for files.
641-687
: LGTM!The function correctly retrieves metadata work units.
Tools
Ruff
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
688-694
: LGTM!The function correctly retrieves work unit processors.
696-697
: LGTM!The function correctly checks if the platform is Azure Blob Storage.
699-700
: LGTM!The function correctly retrieves the report.
def list_folders( | ||
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig] | ||
) -> Iterable[str]: | ||
if azure_config is None: | ||
raise ValueError("azure_config not set. Cannot browse Azure Blob Storage") | ||
|
||
abs_blob_service_client = azure_config.get_blob_service_client() | ||
container_client = abs_blob_service_client.get_container_client(container_name) | ||
|
||
current_level = prefix.count("/") | ||
blob_list = container_client.list_blobs(name_starts_with=prefix) | ||
|
||
this_dict = {} | ||
for blob in blob_list: | ||
blob_name = blob.name[: blob.name.rfind("/") + 1] | ||
folder_structure_arr = blob_name.split("/") | ||
|
||
folder_name = "" | ||
if len(folder_structure_arr) > current_level: | ||
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}" | ||
else: | ||
continue | ||
|
||
folder_name = folder_name[1 : len(folder_name)] | ||
|
||
if folder_name.endswith("/"): | ||
folder_name = folder_name[:-1] | ||
|
||
if folder_name == "": | ||
continue | ||
|
||
folder_name = f"{prefix}{folder_name}" | ||
if folder_name in this_dict.keys(): | ||
continue | ||
else: | ||
this_dict[folder_name] = folder_name | ||
|
||
yield f"{folder_name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unnecessary .keys()
method.
Use key in dict
instead of key in dict.keys()
for better performance.
- if folder_name in this_dict.keys():
+ if folder_name in this_dict:
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def list_folders( | |
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig] | |
) -> Iterable[str]: | |
if azure_config is None: | |
raise ValueError("azure_config not set. Cannot browse Azure Blob Storage") | |
abs_blob_service_client = azure_config.get_blob_service_client() | |
container_client = abs_blob_service_client.get_container_client(container_name) | |
current_level = prefix.count("/") | |
blob_list = container_client.list_blobs(name_starts_with=prefix) | |
this_dict = {} | |
for blob in blob_list: | |
blob_name = blob.name[: blob.name.rfind("/") + 1] | |
folder_structure_arr = blob_name.split("/") | |
folder_name = "" | |
if len(folder_structure_arr) > current_level: | |
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}" | |
else: | |
continue | |
folder_name = folder_name[1 : len(folder_name)] | |
if folder_name.endswith("/"): | |
folder_name = folder_name[:-1] | |
if folder_name == "": | |
continue | |
folder_name = f"{prefix}{folder_name}" | |
if folder_name in this_dict.keys(): | |
continue | |
else: | |
this_dict[folder_name] = folder_name | |
yield f"{folder_name}" | |
if folder_name in this_dict: |
Tools
Ruff
272-272: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
def get_abs_properties( | ||
container_name: str, | ||
blob_name: Optional[str], | ||
full_path: str, | ||
number_of_files: int, | ||
size_in_bytes: int, | ||
sample_files: bool, | ||
azure_config: Optional[AzureConnectionConfig], | ||
use_abs_container_properties: Optional[bool] = False, | ||
use_abs_blob_properties: Optional[bool] = False, | ||
) -> Dict[str, str]: | ||
if azure_config is None: | ||
raise ValueError("No container_client available.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling for get_blob_properties
and get_container_properties
.
The function can be improved by adding error handling for the get_blob_properties
and get_container_properties
calls to handle potential errors gracefully.
- blob_properties = blob_client.get_blob_properties()
+ try:
+ blob_properties = blob_client.get_blob_properties()
+ except Exception as e:
+ logger.error(f"Failed to get blob properties: {e}")
+ blob_properties = None
- container_properties = container_client.get_container_properties()
+ try:
+ container_properties = container_client.get_container_properties()
+ except Exception as e:
+ logger.error(f"Failed to get container properties: {e}")
+ container_properties = None
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_abs_properties( | |
container_name: str, | |
blob_name: Optional[str], | |
full_path: str, | |
number_of_files: int, | |
size_in_bytes: int, | |
sample_files: bool, | |
azure_config: Optional[AzureConnectionConfig], | |
use_abs_container_properties: Optional[bool] = False, | |
use_abs_blob_properties: Optional[bool] = False, | |
) -> Dict[str, str]: | |
if azure_config is None: | |
raise ValueError("No container_client available.") | |
def get_abs_properties( | |
container_name: str, | |
blob_name: Optional[str], | |
full_path: str, | |
number_of_files: int, | |
size_in_bytes: int, | |
sample_files: bool, | |
azure_config: Optional[AzureConnectionConfig], | |
use_abs_container_properties: Optional[bool] = False, | |
use_abs_blob_properties: Optional[bool] = False, | |
) -> Dict[str, str]: | |
if azure_config is None: | |
raise ValueError("No container_client available.") | |
try: | |
blob_properties = blob_client.get_blob_properties() | |
except Exception as e: | |
logger.error(f"Failed to get blob properties: {e}") | |
blob_properties = None | |
try: | |
container_properties = container_client.get_container_properties() | |
except Exception as e: | |
logger.error(f"Failed to get container properties: {e}") | |
container_properties = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Outside diff range and nitpick comments (2)
metadata-ingestion/docs/sources/s3/README.md (2)
Line range hint
41-44
: Fix unordered list indentation.Ensure proper indentation for unordered lists.
- - Row and column counts for each dataset - - For each column, if profiling is enabled: - - null counts and proportions - - distinct counts and proportions - - minimum, maximum, mean, median, standard deviation, some quantile values - - histograms or frequencies of unique values + - Row and column counts for each dataset + - For each column, if profiling is enabled: + - null counts and proportions + - distinct counts and proportions + - minimum, maximum, mean, median, standard deviation, some quantile values + - histograms or frequencies of unique valuesTools
Markdownlint
1-1: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
Line range hint
21-21
: Fix invalid link fragments.The link fragments should be valid.
- max_rows` recipe parameter (see [below](#config-details)) + max_rows` recipe parameter (see [below](#profiling)) - see [compatibility](#compatibility) for more details + see [compatibility](#compatibility) for more detailsAlso applies to: 46-46
Tools
Markdownlint
1-1: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- metadata-ingestion/docs/sources/abs/README.md (1 hunks)
- metadata-ingestion/docs/sources/abs/abs.md (1 hunks)
- metadata-ingestion/docs/sources/s3/README.md (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (4 hunks)
Files not reviewed due to errors (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (no review received)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py
Additional context used
LanguageTool
metadata-ingestion/docs/sources/abs/README.md
[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, usepath_specs
configuration in ingestion r...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, usepath_specs
configuration in ingestion recipe. Refer section [Path Specs](http...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
metadata-ingestion/docs/sources/s3/README.md
[style] ~22-~22: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
[style] ~46-~46: This phrase is redundant. Consider using “outside”.
Context: ...to interface with AWS (schema inference outside of profiling requires s3:// access). Enabl...(OUTSIDE_OF)
metadata-ingestion/docs/sources/abs/abs.md
[grammar] ~4-~4: The verb ‘include’ does not usually follow articles like ‘The’. Check that ‘include’ is spelled correctly; using ‘include’ as a noun may be non-standard.
Context: ..._specrepresents one or more datasets. The include path (
path_spec.include`) represents a...(A_INFINITIVE)
[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...(REPEATED_VERBS)
[uncategorized] ~202-~202: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ames that don't start with a letter and contains other character than letters, number, a...(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
[uncategorized] ~202-~202: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ... contains other character than letters, number, and underscore. [https://github.com/ap...(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)
Markdownlint
metadata-ingestion/docs/sources/abs/README.md
33-33: null
Link fragments should be valid(MD051, link-fragments)
metadata-ingestion/docs/sources/s3/README.md
41-41: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
42-42: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
43-43: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
44-44: Expected: 2; Actual: 4
Unordered list indentation(MD007, ul-indent)
1-1: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
9-9: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
39-39: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
21-21: null
Link fragments should be valid(MD051, link-fragments)
46-46: null
Link fragments should be valid(MD051, link-fragments)
metadata-ingestion/docs/sources/abs/abs.md
55-55: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
202-202: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
16-16: Expected: 1; Actual: 2
Multiple consecutive blank lines(MD012, no-multiple-blanks)
107-107: Expected: 1; Actual: 2
Multiple consecutive blank lines(MD012, no-multiple-blanks)
169-169: Expected: 1; Actual: 2
Multiple consecutive blank lines(MD012, no-multiple-blanks)
170-170: Expected: 1; Actual: 3
Multiple consecutive blank lines(MD012, no-multiple-blanks)
17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
30-30: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
34-34: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
40-40: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
49-49: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
59-59: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
74-74: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
84-84: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
100-100: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
111-111: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
131-131: null
Fenced code blocks should be surrounded by blank lines(MD031, blanks-around-fences)
13-13: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
141-141: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
164-164: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
165-165: null
Bare URL used(MD034, no-bare-urls)
167-167: null
Bare URL used(MD034, no-bare-urls)
12-12: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
147-147: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
163-163: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
22-22: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
30-30: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
40-40: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
49-49: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
59-59: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
74-74: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
84-84: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
100-100: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
111-111: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
131-131: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
204-204: null
Files should end with a single newline character(MD047, single-trailing-newline)
4-4: null
Link fragments should be valid(MD051, link-fragments)
Ruff
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
145-145: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
271-271: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
Additional comments not posted (10)
metadata-ingestion/docs/sources/abs/README.md (2)
3-3
: Add missing article "the".The sentence is missing the article "the" before "section."
- Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details. + Refer to the section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.Tools
LanguageTool
[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, usepath_specs
configuration in ingestion r...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, usepath_specs
configuration in ingestion recipe. Refer section [Path Specs](http...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
34-34
: Consider a shorter alternative to "on the basis of".The phrase "on the basis of" might be wordy. Consider using "based on" instead.
- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance. + JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.Tools
LanguageTool
[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)
metadata-ingestion/docs/sources/s3/README.md (5)
Line range hint
22-22
: Consider a shorter alternative to "on the basis of".The phrase "on the basis of" might be wordy. Consider using "based on" instead.
- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance. + JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.Tools
Markdownlint
1-1: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
Line range hint
46-46
: Consider using "outside" instead of "outside of".The phrase "outside of" is redundant. Consider using "outside" instead.
- schema inference outside of profiling requires s3:// access + schema inference outside profiling requires s3:// accessTools
Markdownlint
1-1: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
1-1
: Remove trailing spaces.Trailing spaces should be removed for better formatting.
- This connector ingests AWS S3 datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub. + This connector ingests AWS S3 datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub.Tools
Markdownlint
1-1: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
Line range hint
9-9
: Add blank lines around headings.Headings should be surrounded by blank lines for better readability.
- ### Supported file types + + ### Supported file typesTools
Markdownlint
1-1: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
Line range hint
39-39
: Add blank lines around lists.Lists should be surrounded by blank lines for better readability.
- - Row and column counts for each dataset + + - Row and column counts for each datasetTools
Markdownlint
1-1: Expected: 0 or 2; Actual: 1
Trailing spaces(MD009, no-trailing-spaces)
metadata-ingestion/docs/sources/abs/abs.md (3)
17-18
: Add blank lines around headings.Headings should be surrounded by blank lines for better readability.
- ### Path Specs - Examples - #### Example 1 - Individual file as Dataset + ### Path Specs - Examples + #### Example 1 - Individual file as DatasetTools
LanguageTool
[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...(REPEATED_VERBS)
Markdownlint
17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines(MD022, blanks-around-headings)
147-147
: Use headings instead of emphasis for better structure.The emphasis should be replaced with a heading for better readability and structure.
- **Valid path_specs.include** + ### Valid path_specs.include - **Valid path_specs.exclude** + ### Valid path_specs.excludeAlso applies to: 163-163
Tools
Markdownlint
147-147: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
164-168
: Add blank lines around lists.Lists should be surrounded by blank lines for better readability.
- - \**/tests/** + + - \**/tests/**Tools
Markdownlint
164-164: null
Lists should be surrounded by blank lines(MD032, blanks-around-lists)
165-165: null
Bare URL used(MD034, no-bare-urls)
167-167: null
Bare URL used(MD034, no-bare-urls)
Schemas for Parquet and Avro files are extracted as provided. | ||
|
||
Schemas for schemaless formats (CSV, TSV, JSONL, JSON) are inferred. For CSV, TSV and JSONL files, we consider the first | ||
100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix invalid link fragment.
The link fragment #config-details
should be valid.
- max_rows` recipe parameter (see [below](#config-details))
+ max_rows` recipe parameter (see [below](#profiling))
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details)) | |
100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#profiling)) |
Tools
Markdownlint
33-33: null
Link fragments should be valid(MD051, link-fragments)
|
||
Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has a more complex dataset representation. | ||
|
||
**Additional points to note** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use headings instead of emphasis for better structure.
The emphasis should be replaced with a heading for better readability and structure.
- **Additional points to note**
+ ### Additional Points to Note
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
**Additional points to note** | |
### Additional Points to Note |
Tools
Markdownlint
12-12: null
Emphasis used instead of a heading(MD036, no-emphasis-as-heading)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
147-147: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
277-277: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
Additional comments not posted (8)
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (8)
21-22
: LGTM!The function correctly checks if a given URI matches the Azure Blob Storage pattern.
25-29
: LGTM!The function correctly retrieves the prefix of an Azure Blob Storage URI.
32-40
: LGTM!The function correctly removes the Azure Blob Storage prefix from a URI.
43-55
: LGTM!The function correctly generates a URN for an Azure Blob Storage dataset.
58-63
: LGTM!The function correctly retrieves the container name from an Azure Blob Storage URI.
66-71
: LGTM!The function correctly retrieves the key prefix from an Azure Blob Storage URI.
154-188
: LGTM!The function correctly creates properties recursively.
285-286
: LGTM!The function correctly retrieves the relative path of a container.
def list_folders( | ||
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig] | ||
) -> Iterable[str]: | ||
if azure_config is None: | ||
raise ValueError( | ||
"Azure configuration is not provided. Cannot retrieve container client." | ||
) | ||
|
||
abs_blob_service_client = azure_config.get_blob_service_client() | ||
container_client = abs_blob_service_client.get_container_client(container_name) | ||
|
||
current_level = prefix.count("/") | ||
blob_list = container_client.list_blobs(name_starts_with=prefix) | ||
|
||
this_dict = {} | ||
for blob in blob_list: | ||
blob_name = blob.name[: blob.name.rfind("/") + 1] | ||
folder_structure_arr = blob_name.split("/") | ||
|
||
folder_name = "" | ||
if len(folder_structure_arr) > current_level: | ||
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}" | ||
else: | ||
continue | ||
|
||
folder_name = folder_name[1 : len(folder_name)] | ||
|
||
if folder_name.endswith("/"): | ||
folder_name = folder_name[:-1] | ||
|
||
if folder_name == "": | ||
continue | ||
|
||
folder_name = f"{prefix}{folder_name}" | ||
if folder_name in this_dict.keys(): | ||
continue | ||
else: | ||
this_dict[folder_name] = folder_name | ||
|
||
yield f"{folder_name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unnecessary .keys()
method.
Use key in dict
instead of key in dict.keys()
for better performance.
- if folder_name in this_dict.keys():
+ if folder_name in this_dict:
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def list_folders( | |
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig] | |
) -> Iterable[str]: | |
if azure_config is None: | |
raise ValueError( | |
"Azure configuration is not provided. Cannot retrieve container client." | |
) | |
abs_blob_service_client = azure_config.get_blob_service_client() | |
container_client = abs_blob_service_client.get_container_client(container_name) | |
current_level = prefix.count("/") | |
blob_list = container_client.list_blobs(name_starts_with=prefix) | |
this_dict = {} | |
for blob in blob_list: | |
blob_name = blob.name[: blob.name.rfind("/") + 1] | |
folder_structure_arr = blob_name.split("/") | |
folder_name = "" | |
if len(folder_structure_arr) > current_level: | |
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}" | |
else: | |
continue | |
folder_name = folder_name[1 : len(folder_name)] | |
if folder_name.endswith("/"): | |
folder_name = folder_name[:-1] | |
if folder_name == "": | |
continue | |
folder_name = f"{prefix}{folder_name}" | |
if folder_name in this_dict.keys(): | |
continue | |
else: | |
this_dict[folder_name] = folder_name | |
yield f"{folder_name}" | |
if folder_name in this_dict: | |
continue | |
else: | |
this_dict[folder_name] = folder_name |
Tools
Ruff
277-277: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
def get_abs_tags( | ||
container_name: str, | ||
key_name: Optional[str], | ||
dataset_urn: str, | ||
azure_config: Optional[AzureConnectionConfig], | ||
ctx: PipelineContext, | ||
use_abs_blob_tags: Optional[bool] = False, | ||
) -> Optional[GlobalTagsClass]: | ||
# Todo add the service_client, when building out this get_abs_tags | ||
if azure_config is None: | ||
raise ValueError( | ||
"Azure configuration is not provided. Cannot retrieve container client." | ||
) | ||
|
||
tags_to_add: List[str] = [] | ||
blob_service_client = azure_config.get_blob_service_client() | ||
container_client = blob_service_client.get_container_client(container_name) | ||
blob_client = container_client.get_blob_client(blob=key_name) | ||
|
||
if use_abs_blob_tags and key_name is not None: | ||
tag_set = blob_client.get_blob_tags() | ||
if tag_set: | ||
tags_to_add.extend( | ||
make_tag_urn(f"""{key}:{value}""") for key, value in tag_set.items() | ||
) | ||
else: | ||
# Unlike container tags, if an object does not have tags, it will just return an empty array | ||
# as opposed to an exception. | ||
logger.info(f"No tags found for container={container_name} key={key_name}") | ||
|
||
if len(tags_to_add) == 0: | ||
return None | ||
|
||
if ctx.graph is not None: | ||
logger.debug("Connected to DatahubApi, grabbing current tags to maintain.") | ||
current_tags: Optional[GlobalTagsClass] = ctx.graph.get_aspect( | ||
entity_urn=dataset_urn, | ||
aspect_type=GlobalTagsClass, | ||
) | ||
if current_tags: | ||
tags_to_add.extend([current_tag.tag for current_tag in current_tags.tags]) | ||
else: | ||
logger.warning("Could not connect to DatahubApi. No current tags to maintain") | ||
|
||
# Sort existing tags | ||
tags_to_add = sorted(list(set(tags_to_add))) | ||
# Remove duplicate tags | ||
new_tags = GlobalTagsClass( | ||
tags=[TagAssociationClass(tag_to_add) for tag_to_add in tags_to_add] | ||
) | ||
return new_tags |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling for get_blob_tags
.
The function can be improved by adding error handling for the get_blob_tags
call to handle potential errors gracefully.
- tag_set = blob_client.get_blob_tags()
+ try:
+ tag_set = blob_client.get_blob_tags()
+ except Exception as e:
+ logger.error(f"Failed to get blob tags: {e}")
+ tag_set = None
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_abs_tags( | |
container_name: str, | |
key_name: Optional[str], | |
dataset_urn: str, | |
azure_config: Optional[AzureConnectionConfig], | |
ctx: PipelineContext, | |
use_abs_blob_tags: Optional[bool] = False, | |
) -> Optional[GlobalTagsClass]: | |
# Todo add the service_client, when building out this get_abs_tags | |
if azure_config is None: | |
raise ValueError( | |
"Azure configuration is not provided. Cannot retrieve container client." | |
) | |
tags_to_add: List[str] = [] | |
blob_service_client = azure_config.get_blob_service_client() | |
container_client = blob_service_client.get_container_client(container_name) | |
blob_client = container_client.get_blob_client(blob=key_name) | |
if use_abs_blob_tags and key_name is not None: | |
tag_set = blob_client.get_blob_tags() | |
if tag_set: | |
tags_to_add.extend( | |
make_tag_urn(f"""{key}:{value}""") for key, value in tag_set.items() | |
) | |
else: | |
# Unlike container tags, if an object does not have tags, it will just return an empty array | |
# as opposed to an exception. | |
logger.info(f"No tags found for container={container_name} key={key_name}") | |
if len(tags_to_add) == 0: | |
return None | |
if ctx.graph is not None: | |
logger.debug("Connected to DatahubApi, grabbing current tags to maintain.") | |
current_tags: Optional[GlobalTagsClass] = ctx.graph.get_aspect( | |
entity_urn=dataset_urn, | |
aspect_type=GlobalTagsClass, | |
) | |
if current_tags: | |
tags_to_add.extend([current_tag.tag for current_tag in current_tags.tags]) | |
else: | |
logger.warning("Could not connect to DatahubApi. No current tags to maintain") | |
# Sort existing tags | |
tags_to_add = sorted(list(set(tags_to_add))) | |
# Remove duplicate tags | |
new_tags = GlobalTagsClass( | |
tags=[TagAssociationClass(tag_to_add) for tag_to_add in tags_to_add] | |
) | |
return new_tags | |
tag_set = None | |
try: | |
tag_set = blob_client.get_blob_tags() | |
except Exception as e: | |
logger.error(f"Failed to get blob tags: {e}") |
def add_property( | ||
key: str, value: str, custom_properties: Dict[str, str], resource_name: str | ||
) -> Dict[str, str]: | ||
if key in custom_properties.keys(): | ||
key = f"{key}_{resource_name}" | ||
if value is not None: | ||
custom_properties[key] = str(value) | ||
return custom_properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove unnecessary .keys()
method.
Use key in dict
instead of key in dict.keys()
for better performance.
- if key in custom_properties.keys():
+ if key in custom_properties:
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def add_property( | |
key: str, value: str, custom_properties: Dict[str, str], resource_name: str | |
) -> Dict[str, str]: | |
if key in custom_properties.keys(): | |
key = f"{key}_{resource_name}" | |
if value is not None: | |
custom_properties[key] = str(value) | |
return custom_properties | |
def add_property( | |
key: str, value: str, custom_properties: Dict[str, str], resource_name: str | |
) -> Dict[str, str]: | |
if key in custom_properties: | |
key = f"{key}_{resource_name}" | |
if value is not None: | |
custom_properties[key] = str(value) | |
return custom_properties |
Tools
Ruff
147-147: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
def get_abs_properties( | ||
container_name: str, | ||
blob_name: Optional[str], | ||
full_path: str, | ||
number_of_files: int, | ||
size_in_bytes: int, | ||
sample_files: bool, | ||
azure_config: Optional[AzureConnectionConfig], | ||
use_abs_container_properties: Optional[bool] = False, | ||
use_abs_blob_properties: Optional[bool] = False, | ||
) -> Dict[str, str]: | ||
if azure_config is None: | ||
raise ValueError( | ||
"Azure configuration is not provided. Cannot retrieve container client." | ||
) | ||
|
||
blob_service_client = azure_config.get_blob_service_client() | ||
container_client = blob_service_client.get_container_client( | ||
container=container_name | ||
) | ||
|
||
custom_properties = {"schema_inferred_from": full_path} | ||
if not sample_files: | ||
custom_properties.update( | ||
{ | ||
"number_of_files": str(number_of_files), | ||
"size_in_bytes": str(size_in_bytes), | ||
} | ||
) | ||
|
||
if use_abs_blob_properties and blob_name is not None: | ||
blob_client = container_client.get_blob_client(blob=blob_name) | ||
blob_properties = blob_client.get_blob_properties() | ||
if blob_properties: | ||
create_properties( | ||
data=blob_properties, | ||
prefix="blob", | ||
custom_properties=custom_properties, | ||
resource_name=blob_name, | ||
json_properties=[ | ||
"metadata", | ||
"content_settings", | ||
"lease", | ||
"copy", | ||
"immutability_policy", | ||
], | ||
) | ||
else: | ||
logger.warning( | ||
f"No blob properties found for container={container_name}, blob={blob_name}." | ||
) | ||
|
||
if use_abs_container_properties: | ||
container_properties = container_client.get_container_properties() | ||
if container_properties: | ||
create_properties( | ||
data=container_properties, | ||
prefix="container", | ||
custom_properties=custom_properties, | ||
resource_name=container_name, | ||
json_properties=["metadata"], | ||
) | ||
else: | ||
logger.warning( | ||
f"No container properties found for container={container_name}." | ||
) | ||
|
||
return custom_properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add error handling for get_blob_properties
and get_container_properties
.
The function can be improved by adding error handling for the get_blob_properties
and get_container_properties
calls to handle potential errors gracefully.
- blob_properties = blob_client.get_blob_properties()
+ try:
+ blob_properties = blob_client.get_blob_properties()
+ except Exception as e:
+ logger.error(f"Failed to get blob properties: {e}")
+ blob_properties = None
- container_properties = container_client.get_container_properties()
+ try:
+ container_properties = container_client.get_container_properties()
+ except Exception as e:
+ logger.error(f"Failed to get container properties: {e}")
+ container_properties = None
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_abs_properties( | |
container_name: str, | |
blob_name: Optional[str], | |
full_path: str, | |
number_of_files: int, | |
size_in_bytes: int, | |
sample_files: bool, | |
azure_config: Optional[AzureConnectionConfig], | |
use_abs_container_properties: Optional[bool] = False, | |
use_abs_blob_properties: Optional[bool] = False, | |
) -> Dict[str, str]: | |
if azure_config is None: | |
raise ValueError( | |
"Azure configuration is not provided. Cannot retrieve container client." | |
) | |
blob_service_client = azure_config.get_blob_service_client() | |
container_client = blob_service_client.get_container_client( | |
container=container_name | |
) | |
custom_properties = {"schema_inferred_from": full_path} | |
if not sample_files: | |
custom_properties.update( | |
{ | |
"number_of_files": str(number_of_files), | |
"size_in_bytes": str(size_in_bytes), | |
} | |
) | |
if use_abs_blob_properties and blob_name is not None: | |
blob_client = container_client.get_blob_client(blob=blob_name) | |
blob_properties = blob_client.get_blob_properties() | |
if blob_properties: | |
create_properties( | |
data=blob_properties, | |
prefix="blob", | |
custom_properties=custom_properties, | |
resource_name=blob_name, | |
json_properties=[ | |
"metadata", | |
"content_settings", | |
"lease", | |
"copy", | |
"immutability_policy", | |
], | |
) | |
else: | |
logger.warning( | |
f"No blob properties found for container={container_name}, blob={blob_name}." | |
) | |
if use_abs_container_properties: | |
container_properties = container_client.get_container_properties() | |
if container_properties: | |
create_properties( | |
data=container_properties, | |
prefix="container", | |
custom_properties=custom_properties, | |
resource_name=container_name, | |
json_properties=["metadata"], | |
) | |
else: | |
logger.warning( | |
f"No container properties found for container={container_name}." | |
) | |
return custom_properties | |
def get_abs_properties( | |
container_name: str, | |
blob_name: Optional[str], | |
full_path: str, | |
number_of_files: int, | |
size_in_bytes: int, | |
sample_files: bool, | |
azure_config: Optional[AzureConnectionConfig], | |
use_abs_container_properties: Optional[bool] = False, | |
use_abs_blob_properties: Optional[bool] = False, | |
) -> Dict[str, str]: | |
if azure_config is None: | |
raise ValueError( | |
"Azure configuration is not provided. Cannot retrieve container client." | |
) | |
blob_service_client = azure_config.get_blob_service_client() | |
container_client = blob_service_client.get_container_client( | |
container=container_name | |
) | |
custom_properties = {"schema_inferred_from": full_path} | |
if not sample_files: | |
custom_properties.update( | |
{ | |
"number_of_files": str(number_of_files), | |
"size_in_bytes": str(size_in_bytes), | |
} | |
) | |
if use_abs_blob_properties and blob_name is not None: | |
blob_client = container_client.get_blob_client(blob=blob_name) | |
try: | |
blob_properties = blob_client.get_blob_properties() | |
except Exception as e: | |
logger.error(f"Failed to get blob properties: {e}") | |
blob_properties = None | |
if blob_properties: | |
create_properties( | |
data=blob_properties, | |
prefix="blob", | |
custom_properties=custom_properties, | |
resource_name=blob_name, | |
json_properties=[ | |
"metadata", | |
"content_settings", | |
"lease", | |
"copy", | |
"immutability_policy", | |
], | |
) | |
else: | |
logger.warning( | |
f"No blob properties found for container={container_name}, blob={blob_name}." | |
) | |
if use_abs_container_properties: | |
try: | |
container_properties = container_client.get_container_properties() | |
except Exception as e: | |
logger.error(f"Failed to get container properties: {e}") | |
container_properties = None | |
if container_properties: | |
create_properties( | |
data=container_properties, | |
prefix="container", | |
custom_properties=custom_properties, | |
resource_name=container_name, | |
json_properties=["metadata"], | |
) | |
else: | |
logger.warning( | |
f"No container properties found for container={container_name}." | |
) | |
return custom_properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py
163-164: Use a single
if
statement instead of nestedif
statements(SIM102)
686-686: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
Additional comments not posted (35)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (35)
128-148
: LGTM!The function correctly maps Spark types to DataHub types.
159-179
: Simplify nested if statements.Use a single
if
statement instead of nestedif
statements.- if "=" in folder1 and "=" in folder2: - if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]: + if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:Tools
Ruff
163-164: Use a single
if
statement instead of nestedif
statements(SIM102)
182-193
: LGTM!The
TableData
dataclass correctly represents table metadata.
195-700
: LGTM!The
ABSSource
class is well-structured and follows the expected pattern for DataHub sources.Tools
Ruff
686-686: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
642-688
: LGTM!The function correctly generates work units for the ingestion process.
Tools
Ruff
686-686: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
689-695
: LGTM!The function correctly returns a list of work unit processors.
697-698
: LGTM!The function correctly checks if the platform is Azure Blob Storage.
700-701
: LGTM!The function correctly returns the report object.
236-300
: LGTM!The function correctly handles different file formats and compression types.
302-325
: LGTM!The function correctly adds partition columns to the schema.
326-335
: LGTM!The function correctly creates an operation aspect for the table.
337-438
: LGTM!The function is well-structured and handles various aspects of table ingestion.
523-607
: LGTM!The function is well-structured and handles various scenarios for browsing Azure Blob Storage.
609-616
: LGTM!The function correctly creates an ABS path from a given key.
618-641
: LGTM!The function is well-structured and handles various scenarios for browsing local files.
475-491
: LGTM!The function correctly resolves folders with templates in the prefix.
492-521
: LGTM!The function correctly retrieves the directory to process based on the path specification.
447-450
: LGTM!The function correctly extracts the table name from the path specification.
452-473
: LGTM!The function correctly extracts table data from the path specification and file metadata.
440-445
: LGTM!The function correctly retrieves the prefix from the relative path.
337-438
: LGTM!The function is well-structured and handles various aspects of table ingestion.
523-607
: LGTM!The function is well-structured and handles various scenarios for browsing Azure Blob Storage.
609-616
: LGTM!The function correctly creates an ABS path from a given key.
618-641
: LGTM!The function is well-structured and handles various scenarios for browsing local files.
475-491
: LGTM!The function correctly resolves folders with templates in the prefix.
492-521
: LGTM!The function correctly retrieves the directory to process based on the path specification.
447-450
: LGTM!The function correctly extracts the table name from the path specification.
452-473
: LGTM!The function correctly extracts table data from the path specification and file metadata.
440-445
: LGTM!The function correctly retrieves the prefix from the relative path.
642-688
: LGTM!The function correctly generates work units for the ingestion process.
Tools
Ruff
686-686: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
689-695
: LGTM!The function correctly returns a list of work unit processors.
697-698
: LGTM!The function correctly checks if the platform is Azure Blob Storage.
700-701
: LGTM!The function correctly returns the report object.
236-300
: LGTM!The function correctly handles different file formats and compression types.
302-325
: LGTM!The function correctly adds partition columns to the schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
Additional comments not posted (24)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (24)
1-95
: Imports look good.The imports are necessary for the new Azure Blob Storage functionality.
127-147
: Function looks good.The
get_column_type
function correctly maps Spark types to DataHub types and includes error handling.
158-178
: Simplify nested if statements.Use a single
if
statement instead of nestedif
statements.- if "=" in folder1 and "=" in folder2: - if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]: + if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:Tools
Ruff
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
181-191
: Dataclass looks good.The
TableData
dataclass correctly represents table data and includes necessary fields.
210-227
: Constructor looks good.The
__init__
method correctly initializes theABSSource
class and includes telemetry reporting.
229-233
: Factory method looks good.The
create
method correctly creates an instance of theABSSource
class.
235-299
: Method looks good.The
get_fields
method correctly extracts fields from a table and includes error handling.
301-324
: Method looks good.The
add_partition_columns_to_schema
method correctly adds partition columns to the schema and includes necessary logging.
325-334
: Method looks good.The
_create_table_operation_aspect
method correctly creates a table operation aspect and follows the expected structure.
336-437
: Method looks good.The
ingest_table
method correctly ingests a table and yields metadata work units. It includes error handling and supports various aspects.
439-444
: Method looks good.The
get_prefix
method correctly gets the prefix from a relative path and includes necessary regex handling.
446-449
: Method looks good.The
extract_table_name
method correctly extracts the table name from a path spec and includes necessary validation.
451-472
: Method looks good.The
extract_table_data
method correctly extracts table data from a path spec and includes necessary logging.
474-490
: Method looks good.The
resolve_templated_folders
method correctly resolves templated folders and includes necessary recursion.
491-520
: Method looks good.The
get_dir_to_process
method correctly gets the directory to process and includes necessary sorting and recursion.
522-606
: Method looks good.The
abs_browser
method correctly browses Azure Blob Storage and includes error handling. It supports templated folders.
608-615
: Method looks good.The
create_abs_path
method correctly creates an Azure Blob Storage path and includes necessary validation.
617-640
: Method looks good.The
local_browser
method correctly browses local files and includes necessary sorting and recursion.
641-687
: Method looks good.The
get_workunits_internal
method correctly generates metadata work units and includes error handling. It supports various path specs.Tools
Ruff
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
688-694
: Method looks good.The
get_workunit_processors
method correctly returns a list of work unit processors and includes the stale entity removal handler.
696-697
: Method looks good.The
is_abs_platform
method correctly checks if the platform is Azure Blob Storage and includes necessary validation.
699-700
: Method looks good.The
get_report
method correctly returns the report and includes necessary validation.
162-163
: Simplify nested if statements.Use a single
if
statement instead of nestedif
statements.- if "=" in folder1 and "=" in folder2: - if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]: + if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:Tools
Ruff
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
685-685
: Rename unused loop variable.Rename the unused loop variable
guid
to_guid
.- for guid, table_data in table_dict.items(): + for _guid, table_data in table_dict.items():Tools
Ruff
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
Additional comments not posted (6)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (6)
235-299
: LGTM!The function is well-structured and handles different file formats appropriately.
301-324
: LGTM!The function is well-structured and performs the intended operations correctly.
325-334
: LGTM!The function is well-structured and performs the intended operations correctly.
336-437
: LGTM!The function is well-structured and performs the intended operations correctly.
522-607
: LGTM!The function is well-structured and performs the intended operations correctly.
617-640
: LGTM!The function is well-structured and performs the intended operations correctly.
if "=" in folder1 and "=" in folder2: | ||
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplify nested if statements.
Use a single if
statement instead of nested if
statements.
- if "=" in folder1 and "=" in folder2:
- if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+ if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if "=" in folder1 and "=" in folder2: | |
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]: | |
if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]: |
Tools
Ruff
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/setup.py (3 hunks)
Additional comments not posted (3)
metadata-ingestion/setup.py (3)
261-266
: Addition ofabs_base
set looks good!The dependencies listed are necessary for Azure Blob Storage functionality. Ensure that the specified versions are compatible with the rest of the project.
275-275
: Modification todelta_lake
set looks good!Including
abs_base
ensures that Azure Blob Storage dependencies are available for Delta Lake functionality.
417-417
: Addition ofabs
source plugin looks good!The plugin is correctly referenced in the
plugins
dictionary andentry_points
. Ensure that the plugin is properly integrated and tested.Also applies to: 696-696
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
Additional comments not posted (4)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (4)
158-178
: Simplify nested if statements.The nested if statements can be simplified to improve readability. This change was previously suggested and is still valid.
- if "=" in folder1 and "=" in folder2: - if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]: + if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:Tools
Ruff
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
685-685
: Rename unused loop variable.The loop variable
guid
is not used within the loop body. It should be renamed to_guid
to indicate that it is unused.- for guid, table_data in table_dict.items(): + for _guid, table_data in table_dict.items():Tools
Ruff
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
235-299
: Review file format handling in schema extraction.This method handles various file formats for schema extraction. Ensure that all expected file formats are supported and that the error handling is robust.
522-607
: Review browsing logic and error handling in Azure Blob Storage.This method handles browsing in Azure Blob Storage. Ensure that the path handling and error management are correctly implemented, especially the handling of specific exceptions like
NoSuchBucket
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
Additional comments not posted (2)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (2)
127-148
: Functionget_column_type
is well-implemented.The function correctly maps Spark types to DataHub types and handles unmapped types gracefully.
158-178
: Skip: Simplification of nested if statements already addressed.The previous comment about simplifying nested if statements has been addressed.
Tools
Ruff
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
prefix=f"{prefix}", results_per_page=PAGE_SIZE | ||
): | ||
abs_path = self.create_abs_path(obj.name) | ||
logger.debug(f"Path: {abs_path}") | ||
# the following line if using the file_system_client | ||
# yield abs_path, obj.last_modified, obj.content_length, | ||
yield abs_path, obj.name, obj.last_modified, obj.size | ||
|
||
def create_abs_path(self, key: str) -> str: | ||
if self.source_config.azure_config: | ||
account_name = self.source_config.azure_config.account_name | ||
container_name = self.source_config.azure_config.container_name | ||
return ( | ||
f"https://{account_name}.blob.core.windows.net/{container_name}/{key}" | ||
) | ||
return "" | ||
|
||
def local_browser( | ||
self, path_spec: PathSpec | ||
) -> Iterable[Tuple[str, str, datetime, int]]: | ||
prefix = self.get_prefix(path_spec.include) | ||
if os.path.isfile(prefix): | ||
logger.debug(f"Scanning single local file: {prefix}") | ||
file_name = prefix | ||
yield prefix, file_name, datetime.utcfromtimestamp( | ||
os.path.getmtime(prefix) | ||
), os.path.getsize(prefix) | ||
else: | ||
logger.debug(f"Scanning files under local folder: {prefix}") | ||
for root, dirs, files in os.walk(prefix): | ||
dirs.sort(key=functools.cmp_to_key(partitioned_folder_comparator)) | ||
|
||
for file in sorted(files): | ||
# We need to make sure the path is in posix style which is not true on windows | ||
full_path = PurePath( | ||
os.path.normpath(os.path.join(root, file)) | ||
).as_posix() | ||
yield full_path, file, datetime.utcfromtimestamp( | ||
os.path.getmtime(full_path) | ||
), os.path.getsize(full_path) | ||
|
||
def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]: | ||
self.container_WU_creator = ContainerWUCreator( | ||
self.source_config.platform, | ||
self.source_config.platform_instance, | ||
self.source_config.env, | ||
) | ||
with PerfTimer(): | ||
assert self.source_config.path_specs | ||
for path_spec in self.source_config.path_specs: | ||
file_browser = ( | ||
self.abs_browser( | ||
path_spec, self.source_config.number_of_files_to_sample | ||
) | ||
if self.is_abs_platform() | ||
else self.local_browser(path_spec) | ||
) | ||
table_dict: Dict[str, TableData] = {} | ||
for file, name, timestamp, size in file_browser: | ||
if not path_spec.allowed(file): | ||
continue | ||
table_data = self.extract_table_data( | ||
path_spec, file, name, timestamp, size | ||
) | ||
if table_data.table_path not in table_dict: | ||
table_dict[table_data.table_path] = table_data | ||
else: | ||
table_dict[table_data.table_path].number_of_files = ( | ||
table_dict[table_data.table_path].number_of_files + 1 | ||
) | ||
table_dict[table_data.table_path].size_in_bytes = ( | ||
table_dict[table_data.table_path].size_in_bytes | ||
+ table_data.size_in_bytes | ||
) | ||
if ( | ||
table_dict[table_data.table_path].timestamp | ||
< table_data.timestamp | ||
) and (table_data.size_in_bytes > 0): | ||
table_dict[ | ||
table_data.table_path | ||
].full_path = table_data.full_path | ||
table_dict[ | ||
table_data.table_path | ||
].timestamp = table_data.timestamp | ||
|
||
for guid, table_data in table_dict.items(): | ||
yield from self.ingest_table(table_data, path_spec) | ||
|
||
def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]: | ||
return [ | ||
*super().get_workunit_processors(), | ||
StaleEntityRemovalHandler.create( | ||
self, self.source_config, self.ctx | ||
).workunit_processor, | ||
] | ||
|
||
def is_abs_platform(self): | ||
return self.source_config.platform == "abs" | ||
|
||
def get_report(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider refactoring the get_fields
method for better readability.
The get_fields
method is functional but complex. Breaking it into smaller, more focused methods could improve readability and maintainability.
Tools
Ruff
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
table_data.table_path | ||
].timestamp = table_data.timestamp | ||
|
||
for guid, table_data in table_dict.items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename unused loop variable for clarity.
The loop variable guid
is unused and should be renamed to _guid
to indicate it's intentionally unused.
- for guid, table_data in table_table_dict.items():
+ for _guid, table_data in table_dict.items():
Committable suggestion was skipped due to low confidence.
Tools
Ruff
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Outside diff range, codebase verification and nitpick comments (2)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (2)
123-124
: Optimization of Compression RegistrationThe hack to support the
.gzip
extension withsmart_open
is clear, but consider adding a comment explaining why this specific registration is necessary if it's not commonly understood.
336-438
: Method:ingest_table
This method orchestrates the ingestion process and handles various metadata aspects. Ensure that all operations are atomic and transactional where necessary, especially when dealing with external systems like databases or cloud services.
Consider adding more detailed logging to trace the ingestion process step-by-step.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
685-685: Loop control variable
guid
not used within loop bodyRename unused
guid
to_guid
(B007)
Additional comments not posted (3)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (3)
1-33
: Review of Imports and Global ConstantsThe imports and global constants are well-organized and relevant to the functionality being implemented. However, ensure that all these libraries and modules are necessary and used within the file to avoid unnecessary dependencies.
181-192
: Data Class:TableData
The
TableData
class is well-defined with appropriate type hints. Ensure that all fields are necessary and used effectively throughout the codebase to prevent data bloat.
194-203
: Class Decorators forABSSource
The use of decorators to define capabilities and configuration requirements is clear and enhances modularity. Ensure that all capabilities like
DATA_PROFILING
andTAGS
are implemented correctly within the class methods.
def get_column_type( | ||
report: SourceReport, dataset_name: str, column_type: str | ||
) -> SchemaFieldDataType: | ||
""" | ||
Maps known Spark types to datahub types | ||
""" | ||
TypeClass: Any = None | ||
|
||
for field_type, type_class in _field_type_mapping.items(): | ||
if isinstance(column_type, field_type): | ||
TypeClass = type_class | ||
break | ||
|
||
# if still not found, report the warning | ||
if TypeClass is None: | ||
report.report_warning( | ||
dataset_name, f"unable to map type {column_type} to metadata schema" | ||
) | ||
TypeClass = NullTypeClass | ||
|
||
return SchemaFieldDataType(type=TypeClass()) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function: get_column_type
This function maps Spark types to DataHub types. It's crucial to ensure that all possible Spark types are handled, and any unmapped types are reported correctly. The use of a warning when a type cannot be mapped is a good practice.
Consider adding unit tests to cover all branches of this function, especially the default case where a type cannot be mapped.
Would you like me to help create unit tests for this function?
def __init__(self, config: DataLakeSourceConfig, ctx: PipelineContext): | ||
super().__init__(config, ctx) | ||
self.source_config = config | ||
self.report = DataLakeSourceReport() | ||
self.profiling_times_taken = [] | ||
config_report = { | ||
config_option: config.dict().get(config_option) | ||
for config_option in config_options_to_report | ||
} | ||
config_report = { | ||
**config_report, | ||
"profiling_enabled": config.is_profiling_enabled(), | ||
} | ||
|
||
telemetry.telemetry_instance.ping( | ||
"data_lake_config", | ||
config_report, | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constructor of ABSSource
The constructor initializes several configurations and reports telemetry. It's important to ensure that telemetry data is handled securely and does not contain sensitive information.
Consider reviewing what data is being sent in telemetry to ensure compliance with privacy laws.
def abs_browser( | ||
self, path_spec: PathSpec, sample_size: int | ||
) -> Iterable[Tuple[str, str, datetime, int]]: | ||
if self.source_config.azure_config is None: | ||
raise ValueError("azure_config not set. Cannot browse Azure Blob Storage") | ||
abs_blob_service_client = ( | ||
self.source_config.azure_config.get_blob_service_client() | ||
) | ||
container_client = abs_blob_service_client.get_container_client( | ||
self.source_config.azure_config.container_name | ||
) | ||
|
||
container_name = self.source_config.azure_config.container_name | ||
logger.debug(f"Scanning container: {container_name}") | ||
|
||
prefix = self.get_prefix(get_container_relative_path(path_spec.include)) | ||
logger.debug(f"Scanning objects with prefix:{prefix}") | ||
|
||
matches = re.finditer(r"{\s*\w+\s*}", path_spec.include, re.MULTILINE) | ||
matches_list = list(matches) | ||
if matches_list and path_spec.sample_files: | ||
max_start: int = -1 | ||
include: str = path_spec.include | ||
max_match: str = "" | ||
for match in matches_list: | ||
pos = include.find(match.group()) | ||
if pos > max_start: | ||
if max_match: | ||
include = include.replace(max_match, "*") | ||
max_start = match.start() | ||
max_match = match.group() | ||
|
||
table_index = include.find(max_match) | ||
|
||
for folder in self.resolve_templated_folders( | ||
container_name, | ||
get_container_relative_path(include[:table_index]), | ||
): | ||
try: | ||
for f in list_folders( | ||
container_name, f"{folder}", self.source_config.azure_config | ||
): | ||
logger.info(f"Processing folder: {f}") | ||
protocol = ContainerWUCreator.get_protocol(path_spec.include) | ||
dir_to_process = self.get_dir_to_process( | ||
container_name=container_name, | ||
folder=f + "/", | ||
path_spec=path_spec, | ||
protocol=protocol, | ||
) | ||
logger.info(f"Getting files from folder: {dir_to_process}") | ||
dir_to_process = dir_to_process.rstrip("\\") | ||
for obj in container_client.list_blobs( | ||
name_starts_with=f"{dir_to_process}", | ||
results_per_page=PAGE_SIZE, | ||
): | ||
abs_path = self.create_abs_path(obj.name) | ||
logger.debug(f"Sampling file: {abs_path}") | ||
yield abs_path, obj.name, obj.last_modified, obj.size, | ||
except Exception as e: | ||
# This odd check if being done because boto does not have a proper exception to catch | ||
# The exception that appears in stacktrace cannot actually be caught without a lot more work | ||
# https://github.com/boto/boto3/issues/1195 | ||
if "NoSuchBucket" in repr(e): | ||
logger.debug( | ||
f"Got NoSuchBucket exception for {container_name}", e | ||
) | ||
self.get_report().report_warning( | ||
"Missing bucket", f"No bucket found {container_name}" | ||
) | ||
else: | ||
raise e | ||
else: | ||
logger.debug( | ||
"No template in the pathspec can't do sampling, fallbacking to do full scan" | ||
) | ||
path_spec.sample_files = False | ||
for obj in container_client.list_blobs( | ||
prefix=f"{prefix}", results_per_page=PAGE_SIZE | ||
): | ||
abs_path = self.create_abs_path(obj.name) | ||
logger.debug(f"Path: {abs_path}") | ||
# the following line if using the file_system_client | ||
# yield abs_path, obj.last_modified, obj.content_length, | ||
yield abs_path, obj.name, obj.last_modified, obj.size | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method: abs_browser
This method handles the browsing of ABS paths and is crucial for the ingestion process. Ensure that error handling is robust, especially for network-related errors or API limitations.
The handling of exceptions related to NoSuchBucket
could be improved by using more specific exception types if available.
Consider using more specific exception types for better error handling.
def partitioned_folder_comparator(folder1: str, folder2: str) -> int: | ||
# Try to convert to number and compare if the folder name is a number | ||
try: | ||
# Stripping = from the folder names as it most probably partition name part like year=2021 | ||
if "=" in folder1 and "=" in folder2: | ||
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]: | ||
folder1 = folder1.rsplit("=", 1)[-1] | ||
folder2 = folder2.rsplit("=", 1)[-1] | ||
|
||
num_folder1 = int(folder1) | ||
num_folder2 = int(folder2) | ||
if num_folder1 == num_folder2: | ||
return 0 | ||
else: | ||
return 1 if num_folder1 > num_folder2 else -1 | ||
except Exception: | ||
# If folder name is not a number then do string comparison | ||
if folder1 == folder2: | ||
return 0 | ||
else: | ||
return 1 if folder1 > folder2 else -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function: partitioned_folder_comparator
Review
The nested if
statements have been flagged by both a previous reviewer and static analysis tools. The suggestion to simplify these into a single if
statement is valid and should be implemented to improve readability.
- if "=" in folder1 and "=" in folder2:
- if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+ if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def partitioned_folder_comparator(folder1: str, folder2: str) -> int: | |
# Try to convert to number and compare if the folder name is a number | |
try: | |
# Stripping = from the folder names as it most probably partition name part like year=2021 | |
if "=" in folder1 and "=" in folder2: | |
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]: | |
folder1 = folder1.rsplit("=", 1)[-1] | |
folder2 = folder2.rsplit("=", 1)[-1] | |
num_folder1 = int(folder1) | |
num_folder2 = int(folder2) | |
if num_folder1 == num_folder2: | |
return 0 | |
else: | |
return 1 if num_folder1 > num_folder2 else -1 | |
except Exception: | |
# If folder name is not a number then do string comparison | |
if folder1 == folder2: | |
return 0 | |
else: | |
return 1 if folder1 > folder2 else -1 | |
def partitioned_folder_comparator(folder1: str, folder2: str) -> int: | |
# Try to convert to number and compare if the folder name is a number | |
try: | |
# Stripping = from the folder names as it most probably partition name part like year=2021 | |
if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]: | |
folder1 = folder1.rsplit("=", 1)[-1] | |
folder2 = folder2.rsplit("=", 1)[-1] | |
num_folder1 = int(folder1) | |
num_folder2 = int(folder2) | |
if num_folder1 == num_folder2: | |
return 0 | |
else: | |
return 1 if num_folder1 > num_folder2 else -1 | |
except Exception: | |
# If folder name is not a number then do string comparison | |
if folder1 == folder2: | |
return 0 | |
else: | |
return 1 if folder1 > folder2 else -1 |
Tools
Ruff
162-163: Use a single
if
statement instead of nestedif
statements(SIM102)
def get_fields(self, table_data: TableData, path_spec: PathSpec) -> List: | ||
if self.is_abs_platform(): | ||
if self.source_config.azure_config is None: | ||
raise ValueError("Azure config is required for ABS file sources") | ||
|
||
abs_client = self.source_config.azure_config.get_blob_service_client() | ||
file = smart_open( | ||
f"azure://{self.source_config.azure_config.container_name}/{table_data.rel_path}", | ||
"rb", | ||
transport_params={"client": abs_client}, | ||
) | ||
else: | ||
# We still use smart_open here to take advantage of the compression | ||
# capabilities of smart_open. | ||
file = smart_open(table_data.full_path, "rb") | ||
|
||
fields = [] | ||
|
||
extension = pathlib.Path(table_data.full_path).suffix | ||
from datahub.ingestion.source.data_lake_common.path_spec import ( | ||
SUPPORTED_COMPRESSIONS, | ||
) | ||
|
||
if path_spec.enable_compression and (extension[1:] in SUPPORTED_COMPRESSIONS): | ||
# Removing the compression extension and using the one before that like .json.gz -> .json | ||
extension = pathlib.Path(table_data.full_path).with_suffix("").suffix | ||
if extension == "" and path_spec.default_extension: | ||
extension = f".{path_spec.default_extension}" | ||
|
||
try: | ||
if extension == ".parquet": | ||
fields = parquet.ParquetInferrer().infer_schema(file) | ||
elif extension == ".csv": | ||
fields = csv_tsv.CsvInferrer( | ||
max_rows=self.source_config.max_rows | ||
).infer_schema(file) | ||
elif extension == ".tsv": | ||
fields = csv_tsv.TsvInferrer( | ||
max_rows=self.source_config.max_rows | ||
).infer_schema(file) | ||
elif extension == ".json": | ||
fields = json.JsonInferrer().infer_schema(file) | ||
elif extension == ".avro": | ||
fields = avro.AvroInferrer().infer_schema(file) | ||
else: | ||
self.report.report_warning( | ||
table_data.full_path, | ||
f"file {table_data.full_path} has unsupported extension", | ||
) | ||
file.close() | ||
except Exception as e: | ||
self.report.report_warning( | ||
table_data.full_path, | ||
f"could not infer schema for file {table_data.full_path}: {e}", | ||
) | ||
file.close() | ||
logger.debug(f"Extracted fields in schema: {fields}") | ||
fields = sorted(fields, key=lambda f: f.fieldPath) | ||
|
||
if self.source_config.add_partition_columns_to_schema: | ||
self.add_partition_columns_to_schema( | ||
fields=fields, path_spec=path_spec, full_path=table_data.full_path | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method: get_fields
This method is complex and handles various file types. Refactoring to break this method into smaller, more focused methods could improve readability and maintainability. Additionally, ensure that error handling is robust, especially around file operations.
Consider the following refactor suggestion:
- try:
- if extension == ".parquet":
- fields = parquet.ParquetInferrer().infer_schema(file)
- elif extension == ".csv":
- fields = csv_tsv.CsvInferrer(
- max_rows=self.source_config.max_rows
- ).infer_schema(file)
- elif extension == ".tsv":
- fields = csv_tsv.TsvInferrer(
- max_rows=self.source_config.max_rows
- ).infer_schema(file)
- elif extension == ".json":
- fields = json.JsonInferrer().infer_schema(file)
- elif extension == ".avro":
- fields = avro.AvroInferrer().infer_schema(file)
- else:
- self.report.report_warning(
- table_data.full_path,
- f"file {table_data.full_path} has unsupported extension",
- )
- file.close()
- except Exception as e:
- self.report.report_warning(
- table_data.full_path,
- f"could not infer schema for file {table_data.full_path}: {e}",
- )
- file.close()
+ fields = self.handle_file_inference(file, extension, table_data)
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def get_fields(self, table_data: TableData, path_spec: PathSpec) -> List: | |
if self.is_abs_platform(): | |
if self.source_config.azure_config is None: | |
raise ValueError("Azure config is required for ABS file sources") | |
abs_client = self.source_config.azure_config.get_blob_service_client() | |
file = smart_open( | |
f"azure://{self.source_config.azure_config.container_name}/{table_data.rel_path}", | |
"rb", | |
transport_params={"client": abs_client}, | |
) | |
else: | |
# We still use smart_open here to take advantage of the compression | |
# capabilities of smart_open. | |
file = smart_open(table_data.full_path, "rb") | |
fields = [] | |
extension = pathlib.Path(table_data.full_path).suffix | |
from datahub.ingestion.source.data_lake_common.path_spec import ( | |
SUPPORTED_COMPRESSIONS, | |
) | |
if path_spec.enable_compression and (extension[1:] in SUPPORTED_COMPRESSIONS): | |
# Removing the compression extension and using the one before that like .json.gz -> .json | |
extension = pathlib.Path(table_data.full_path).with_suffix("").suffix | |
if extension == "" and path_spec.default_extension: | |
extension = f".{path_spec.default_extension}" | |
try: | |
if extension == ".parquet": | |
fields = parquet.ParquetInferrer().infer_schema(file) | |
elif extension == ".csv": | |
fields = csv_tsv.CsvInferrer( | |
max_rows=self.source_config.max_rows | |
).infer_schema(file) | |
elif extension == ".tsv": | |
fields = csv_tsv.TsvInferrer( | |
max_rows=self.source_config.max_rows | |
).infer_schema(file) | |
elif extension == ".json": | |
fields = json.JsonInferrer().infer_schema(file) | |
elif extension == ".avro": | |
fields = avro.AvroInferrer().infer_schema(file) | |
else: | |
self.report.report_warning( | |
table_data.full_path, | |
f"file {table_data.full_path} has unsupported extension", | |
) | |
file.close() | |
except Exception as e: | |
self.report.report_warning( | |
table_data.full_path, | |
f"could not infer schema for file {table_data.full_path}: {e}", | |
) | |
file.close() | |
logger.debug(f"Extracted fields in schema: {fields}") | |
fields = sorted(fields, key=lambda f: f.fieldPath) | |
if self.source_config.add_partition_columns_to_schema: | |
self.add_partition_columns_to_schema( | |
fields=fields, path_spec=path_spec, full_path=table_data.full_path | |
) | |
def get_fields(self, table_data: TableData, path_spec: PathSpec) -> List: | |
if self.is_abs_platform(): | |
if self.source_config.azure_config is None: | |
raise ValueError("Azure config is required for ABS file sources") | |
abs_client = self.source_config.azure_config.get_blob_service_client() | |
file = smart_open( | |
f"azure://{self.source_config.azure_config.container_name}/{table_data.rel_path}", | |
"rb", | |
transport_params={"client": abs_client}, | |
) | |
else: | |
# We still use smart_open here to take advantage of the compression | |
# capabilities of smart_open. | |
file = smart_open(table_data.full_path, "rb") | |
fields = [] | |
extension = pathlib.Path(table_data.full_path).suffix | |
from datahub.ingestion.source.data_lake_common.path_spec import ( | |
SUPPORTED_COMPRESSIONS, | |
) | |
if path_spec.enable_compression and (extension[1:] in SUPPORTED_COMPRESSIONS): | |
# Removing the compression extension and using the one before that like .json.gz -> .json | |
extension = pathlib.Path(table_data.full_path).with_suffix("").suffix | |
if extension == "" and path_spec.default_extension: | |
extension = f".{path_spec.default_extension}" | |
fields = self.handle_file_inference(file, extension, table_data) | |
logger.debug(f"Extracted fields in schema: {fields}") | |
fields = sorted(fields, key=lambda f: f.fieldPath) | |
if self.source_config.add_partition_columns_to_schema: | |
self.add_partition_columns_to_schema( | |
fields=fields, path_spec=path_spec, full_path=table_data.full_path | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks for the contribution
* feat(forms) Handle deleting forms references when hard deleting forms (datahub-project#10820) * refactor(ui): Misc improvements to the setup ingestion flow (ingest uplift 1/2) (datahub-project#10764) Co-authored-by: John Joyce <[email protected]> Co-authored-by: John Joyce <[email protected]> * fix(ingestion/airflow-plugin): pipeline tasks discoverable in search (datahub-project#10819) * feat(ingest/transformer): tags to terms transformer (datahub-project#10758) Co-authored-by: Aseem Bansal <[email protected]> * fix(ingestion/unity-catalog): fixed issue with profiling with GE turned on (datahub-project#10752) Co-authored-by: Aseem Bansal <[email protected]> * feat(forms) Add java SDK for form entity PATCH + CRUD examples (datahub-project#10822) * feat(SDK) Add java SDK for structuredProperty entity PATCH + CRUD examples (datahub-project#10823) * feat(SDK) Add StructuredPropertyPatchBuilder in python sdk and provide sample CRUD files (datahub-project#10824) * feat(forms) Add CRUD endpoints to GraphQL for Form entities (datahub-project#10825) * add flag for includeSoftDeleted in scroll entities API (datahub-project#10831) * feat(deprecation) Return actor entity with deprecation aspect (datahub-project#10832) * feat(structuredProperties) Add CRUD graphql APIs for structured property entities (datahub-project#10826) * add scroll parameters to openapi v3 spec (datahub-project#10833) * fix(ingest): correct profile_day_of_week implementation (datahub-project#10818) * feat(ingest/glue): allow ingestion of empty databases from Glue (datahub-project#10666) Co-authored-by: Harshal Sheth <[email protected]> * feat(cli): add more details to get cli (datahub-project#10815) * fix(ingestion/glue): ensure date formatting works on all platforms for aws glue (datahub-project#10836) * fix(ingestion): fix datajob patcher (datahub-project#10827) * fix(smoke-test): add suffix in temp file creation (datahub-project#10841) * feat(ingest/glue): add helper method to permit user or group ownership (datahub-project#10784) * feat(): Show data platform instances in policy modal if they are set on the policy (datahub-project#10645) Co-authored-by: Hendrik Richert <[email protected]> * docs(patch): add patch documentation for how implementation works (datahub-project#10010) Co-authored-by: John Joyce <[email protected]> * fix(jar): add missing custom-plugin-jar task (datahub-project#10847) * fix(): also check exceptions/stack trace when filtering log messages (datahub-project#10391) Co-authored-by: John Joyce <[email protected]> * docs(): Update posts.md (datahub-project#9893) Co-authored-by: Hyejin Yoon <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * chore(ingest): update acryl-datahub-classify version (datahub-project#10844) * refactor(ingest): Refactor structured logging to support infos, warnings, and failures structured reporting to UI (datahub-project#10828) Co-authored-by: John Joyce <[email protected]> Co-authored-by: Harshal Sheth <[email protected]> * fix(restli): log aspect-not-found as a warning rather than as an error (datahub-project#10834) * fix(ingest/nifi): remove duplicate upstream jobs (datahub-project#10849) * fix(smoke-test): test access to create/revoke personal access tokens (datahub-project#10848) * fix(smoke-test): missing test for move domain (datahub-project#10837) * ci: update usernames to not considered for community (datahub-project#10851) * env: change defaults for data contract visibility (datahub-project#10854) * fix(ingest/tableau): quote special characters in external URL (datahub-project#10842) * fix(smoke-test): fix flakiness of auto complete test * ci(ingest): pin dask dependency for feast (datahub-project#10865) * fix(ingestion/lookml): liquid template resolution and view-to-view cll (datahub-project#10542) * feat(ingest/audit): add client id and version in system metadata props (datahub-project#10829) * chore(ingest): Mypy 1.10.1 pin (datahub-project#10867) * docs: use acryl-datahub-actions as expected python package to install (datahub-project#10852) * docs: add new js snippet (datahub-project#10846) * refactor(ingestion): remove company domain for security reason (datahub-project#10839) * fix(ingestion/spark): Platform instance and column level lineage fix (datahub-project#10843) Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat(ingestion/tableau): optionally ingest multiple sites and create site containers (datahub-project#10498) Co-authored-by: Yanik Häni <[email protected]> * fix(ingestion/looker): Add sqlglot dependency and remove unused sqlparser (datahub-project#10874) * fix(manage-tokens): fix manage access token policy (datahub-project#10853) * Batch get entity endpoints (datahub-project#10880) * feat(system): support conditional write semantics (datahub-project#10868) * fix(build): upgrade vercel builds to Node 20.x (datahub-project#10890) * feat(ingest/lookml): shallow clone repos (datahub-project#10888) * fix(ingest/looker): add missing dependency (datahub-project#10876) * fix(ingest): only populate audit stamps where accurate (datahub-project#10604) * fix(ingest/dbt): always encode tag urns (datahub-project#10799) * fix(ingest/redshift): handle multiline alter table commands (datahub-project#10727) * fix(ingestion/looker): column name missing in explore (datahub-project#10892) * fix(lineage) Fix lineage source/dest filtering with explored per hop limit (datahub-project#10879) * feat(conditional-writes): misc updates and fixes (datahub-project#10901) * feat(ci): update outdated action (datahub-project#10899) * feat(rest-emitter): adding async flag to rest emitter (datahub-project#10902) Co-authored-by: Gabe Lyons <[email protected]> * feat(ingest): add snowflake-queries source (datahub-project#10835) * fix(ingest): improve `auto_materialize_referenced_tags_terms` error handling (datahub-project#10906) * docs: add new company to adoption list (datahub-project#10909) * refactor(redshift): Improve redshift error handling with new structured reporting system (datahub-project#10870) Co-authored-by: John Joyce <[email protected]> Co-authored-by: Harshal Sheth <[email protected]> * feat(ui) Finalize support for all entity types on forms (datahub-project#10915) * Index ExecutionRequestResults status field (datahub-project#10811) * feat(ingest): grafana connector (datahub-project#10891) Co-authored-by: Shirshanka Das <[email protected]> Co-authored-by: Harshal Sheth <[email protected]> * fix(gms) Add Form entity type to EntityTypeMapper (datahub-project#10916) * feat(dataset): add support for external url in Dataset (datahub-project#10877) * docs(saas-overview) added missing features to observe section (datahub-project#10913) Co-authored-by: John Joyce <[email protected]> * fix(ingest/spark): Fixing Micrometer warning (datahub-project#10882) * fix(structured properties): allow application of structured properties without schema file (datahub-project#10918) * fix(data-contracts-web) handle other schedule types (datahub-project#10919) * fix(ingestion/tableau): human-readable message for PERMISSIONS_MODE_SWITCHED error (datahub-project#10866) Co-authored-by: Harshal Sheth <[email protected]> * Add feature flag for view defintions (datahub-project#10914) Co-authored-by: Ethan Cartwright <[email protected]> * feat(ingest/BigQuery): refactor+parallelize dataset metadata extraction (datahub-project#10884) * fix(airflow): add error handling around render_template() (datahub-project#10907) * feat(ingestion/sqlglot): add optional `default_dialect` parameter to sqlglot lineage (datahub-project#10830) * feat(mcp-mutator): new mcp mutator plugin (datahub-project#10904) * fix(ingest/bigquery): changes helper function to decode unicode scape sequences (datahub-project#10845) * feat(ingest/postgres): fetch table sizes for profile (datahub-project#10864) * feat(ingest/abs): Adding azure blob storage ingestion source (datahub-project#10813) * fix(ingest/redshift): reduce severity of SQL parsing issues (datahub-project#10924) * fix(build): fix lint fix web react (datahub-project#10896) * fix(ingest/bigquery): handle quota exceeded for project.list requests (datahub-project#10912) * feat(ingest): report extractor failures more loudly (datahub-project#10908) * feat(ingest/snowflake): integrate snowflake-queries into main source (datahub-project#10905) * fix(ingest): fix docs build (datahub-project#10926) * fix(ingest/snowflake): fix test connection (datahub-project#10927) * fix(ingest/lookml): add view load failures to cache (datahub-project#10923) * docs(slack) overhauled setup instructions and screenshots (datahub-project#10922) Co-authored-by: John Joyce <[email protected]> * fix(airflow): Add comma parsing of owners to DataJobs (datahub-project#10903) * fix(entityservice): fix merging sideeffects (datahub-project#10937) * feat(ingest): Support System Ingestion Sources, Show and hide system ingestion sources with Command-S (datahub-project#10938) Co-authored-by: John Joyce <[email protected]> * chore() Set a default lineage filtering end time on backend when a start time is present (datahub-project#10925) Co-authored-by: John Joyce <[email protected]> Co-authored-by: John Joyce <[email protected]> * Added relationships APIs to V3. Added these generic APIs to V3 swagger doc. (datahub-project#10939) * docs: add learning center to docs (datahub-project#10921) * doc: Update hubspot form id (datahub-project#10943) * chore(airflow): add python 3.11 w/ Airflow 2.9 to CI (datahub-project#10941) * fix(ingest/Glue): column upstream lineage between S3 and Glue (datahub-project#10895) * fix(ingest/abs): split abs utils into multiple files (datahub-project#10945) * doc(ingest/looker): fix doc for sql parsing documentation (datahub-project#10883) Co-authored-by: Harshal Sheth <[email protected]> * fix(ingest/bigquery): Adding missing BigQuery types (datahub-project#10950) * fix(ingest/setup): feast and abs source setup (datahub-project#10951) * fix(connections) Harden adding /gms to connections in backend (datahub-project#10942) * feat(siblings) Add flag to prevent combining siblings in the UI (datahub-project#10952) * fix(docs): make graphql doc gen more automated (datahub-project#10953) * feat(ingest/athena): Add option for Athena partitioned profiling (datahub-project#10723) * fix(spark-lineage): default timeout for future responses (datahub-project#10947) * feat(datajob/flow): add environment filter using info aspects (datahub-project#10814) * fix(ui/ingest): correct privilege used to show tab (datahub-project#10483) Co-authored-by: Kunal-kankriya <[email protected]> * feat(ingest/looker): include dashboard urns in browse v2 (datahub-project#10955) * add a structured type to batchGet in OpenAPI V3 spec (datahub-project#10956) * fix(ui): scroll on the domain sidebar to show all domains (datahub-project#10966) * fix(ingest/sagemaker): resolve incorrect variable assignment for SageMaker API call (datahub-project#10965) * fix(airflow/build): Pinning mypy (datahub-project#10972) * Fixed a bug where the OpenAPI V3 spec was incorrect. The bug was introduced in datahub-project#10939. (datahub-project#10974) * fix(ingest/test): Fix for mssql integration tests (datahub-project#10978) * fix(entity-service) exist check correctly extracts status (datahub-project#10973) * fix(structuredProps) casing bug in StructuredPropertiesValidator (datahub-project#10982) * bugfix: use anyOf instead of allOf when creating references in openapi v3 spec (datahub-project#10986) * fix(ui): Remove ant less imports (datahub-project#10988) * feat(ingest/graph): Add get_results_by_filter to DataHubGraph (datahub-project#10987) * feat(ingest/cli): init does not actually support environment variables (datahub-project#10989) * fix(ingest/graph): Update get_results_by_filter graphql query (datahub-project#10991) * feat(ingest/spark): Promote beta plugin (datahub-project#10881) Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat(ingest): support domains in meta -> "datahub" section (datahub-project#10967) * feat(ingest): add `check server-config` command (datahub-project#10990) * feat(cli): Make consistent use of DataHubGraphClientConfig (datahub-project#10466) Deprecates get_url_and_token() in favor of a more complete option: load_graph_config() that returns a full DatahubClientConfig. This change was then propagated across previous usages of get_url_and_token so that connections to DataHub server from the client respect the full breadth of configuration specified by DatahubClientConfig. I.e: You can now specify disable_ssl_verification: true in your ~/.datahubenv file so that all cli functions to the server work when ssl certification is disabled. Fixes datahub-project#9705 * fix(ingest/s3): Fixing container creation when there is no folder in path (datahub-project#10993) * fix(ingest/looker): support platform instance for dashboards & charts (datahub-project#10771) * feat(ingest/bigquery): improve handling of information schema in sql parser (datahub-project#10985) * feat(ingest): improve `ingest deploy` command (datahub-project#10944) * fix(backend): allow excluding soft-deleted entities in relationship-queries; exclude soft-deleted members of groups (datahub-project#10920) - allow excluding soft-deleted entities in relationship-queries - exclude soft-deleted members of groups * fix(ingest/looker): downgrade missing chart type log level (datahub-project#10996) * doc(acryl-cloud): release docs for 0.3.4.x (datahub-project#10984) Co-authored-by: John Joyce <[email protected]> Co-authored-by: RyanHolstien <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Pedro Silva <[email protected]> * fix(protobuf/build): Fix protobuf check jar script (datahub-project#11006) * fix(ui/ingest): Support invalid cron jobs (datahub-project#10998) * fix(ingest): fix graph config loading (datahub-project#11002) Co-authored-by: Pedro Silva <[email protected]> * feat(docs): Document __DATAHUB_TO_FILE_ directive (datahub-project#10968) Co-authored-by: Harshal Sheth <[email protected]> * fix(graphql/upsertIngestionSource): Validate cron schedule; parse error in CLI (datahub-project#11011) * feat(ece): support custom ownership type urns in ECE generation (datahub-project#10999) * feat(assertion-v2): changed Validation tab to Quality and created new Governance tab (datahub-project#10935) * fix(ingestion/glue): Add support for missing config options for profiling in Glue (datahub-project#10858) * feat(propagation): Add models for schema field docs, tags, terms (datahub-project#2959) (datahub-project#11016) Co-authored-by: Chris Collins <[email protected]> * docs: standardize terminology to DataHub Cloud (datahub-project#11003) * fix(ingestion/transformer): replace the externalUrl container (datahub-project#11013) * docs(slack) troubleshoot docs (datahub-project#11014) * feat(propagation): Add graphql API (datahub-project#11030) Co-authored-by: Chris Collins <[email protected]> * feat(propagation): Add models for Action feature settings (datahub-project#11029) * docs(custom properties): Remove duplicate from sidebar (datahub-project#11033) * feat(models): Introducing Dataset Partitions Aspect (datahub-project#10997) Co-authored-by: John Joyce <[email protected]> Co-authored-by: John Joyce <[email protected]> * feat(propagation): Add Documentation Propagation Settings (datahub-project#11038) * fix(models): chart schema fields mapping, add dataHubAction entity, t… (datahub-project#11040) * fix(ci): smoke test lint failures (datahub-project#11044) * docs: fix learning center color scheme & typo (datahub-project#11043) * feat: add cloud main page (datahub-project#11017) Co-authored-by: Jay <[email protected]> * feat(restore-indices): add additional step to also clear system metadata service (datahub-project#10662) Co-authored-by: John Joyce <[email protected]> * docs: fix typo (datahub-project#11046) * fix(lint): apply spotless (datahub-project#11050) * docs(airflow): example query to get datajobs for a dataflow (datahub-project#11034) * feat(cli): Add run-id option to put sub-command (datahub-project#11023) Adds an option to assign run-id to a given put command execution. This is useful when transformers do not exist for a given ingestion payload, we can follow up with custom metadata and assign it to an ingestion pipeline. * fix(ingest): improve sql error reporting calls (datahub-project#11025) * fix(airflow): fix CI setup (datahub-project#11031) * feat(ingest/dbt): add experimental `prefer_sql_parser_lineage` flag (datahub-project#11039) * fix(ingestion/lookml): enable stack-trace in lookml logs (datahub-project#10971) * (chore): Linting fix (datahub-project#11015) * chore(ci): update deprecated github actions (datahub-project#10977) * Fix ALB configuration example (datahub-project#10981) * chore(ingestion-base): bump base image packages (datahub-project#11053) * feat(cli): Trim report of dataHubExecutionRequestResult to max GMS size (datahub-project#11051) * fix(ingestion/lookml): emit dummy sql condition for lookml custom condition tag (datahub-project#11008) Co-authored-by: Harshal Sheth <[email protected]> * fix(ingestion/powerbi): fix issue with broken report lineage (datahub-project#10910) * feat(ingest/tableau): add retry on timeout (datahub-project#10995) * change generate kafka connect properties from env (datahub-project#10545) Co-authored-by: david-leifker <[email protected]> * fix(ingest): fix oracle cronjob ingestion (datahub-project#11001) Co-authored-by: david-leifker <[email protected]> * chore(ci): revert update deprecated github actions (datahub-project#10977) (datahub-project#11062) * feat(ingest/dbt-cloud): update metadata_endpoint inference (datahub-project#11041) * build: Reduce size of datahub-frontend-react image by 50-ish% (datahub-project#10878) Co-authored-by: david-leifker <[email protected]> * fix(ci): Fix lint issue in datahub_ingestion_run_summary_provider.py (datahub-project#11063) * docs(ingest): update developing-a-transformer.md (datahub-project#11019) * feat(search-test): update search tests from datahub-project#10408 (datahub-project#11056) * feat(cli): add aspects parameter to DataHubGraph.get_entity_semityped (datahub-project#11009) Co-authored-by: Harshal Sheth <[email protected]> * docs(airflow): update min version for plugin v2 (datahub-project#11065) * doc(ingestion/tableau): doc update for derived permission (datahub-project#11054) Co-authored-by: Pedro Silva <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Harshal Sheth <[email protected]> * fix(py): remove dep on types-pkg_resources (datahub-project#11076) * feat(ingest/mode): add option to exclude restricted (datahub-project#11081) * fix(ingest): set lastObserved in sdk when unset (datahub-project#11071) * doc(ingest): Update capabilities (datahub-project#11072) * chore(vulnerability): Log Injection (datahub-project#11090) * chore(vulnerability): Information exposure through a stack trace (datahub-project#11091) * chore(vulnerability): Comparison of narrow type with wide type in loop condition (datahub-project#11089) * chore(vulnerability): Insertion of sensitive information into log files (datahub-project#11088) * chore(vulnerability): Risky Cryptographic Algorithm (datahub-project#11059) * chore(vulnerability): Overly permissive regex range (datahub-project#11061) Co-authored-by: Harshal Sheth <[email protected]> * fix: update customer data (datahub-project#11075) * fix(models): fixing the datasetPartition models (datahub-project#11085) Co-authored-by: John Joyce <[email protected]> * fix(ui): Adding view, forms GraphQL query, remove showing a fallback error message on unhandled GraphQL error (datahub-project#11084) Co-authored-by: John Joyce <[email protected]> * feat(docs-site): hiding learn more from cloud page (datahub-project#11097) * fix(docs): Add correct usage of orFilters in search API docs (datahub-project#11082) Co-authored-by: Jay <[email protected]> * fix(ingest/mode): Regexp in mode name matcher didn't allow underscore (datahub-project#11098) * docs: Refactor customer stories section (datahub-project#10869) Co-authored-by: Jeff Merrick <[email protected]> * fix(release): fix full/slim suffix on tag (datahub-project#11087) * feat(config): support alternate hashing algorithm for doc id (datahub-project#10423) Co-authored-by: david-leifker <[email protected]> Co-authored-by: John Joyce <[email protected]> * fix(emitter): fix typo in get method of java kafka emitter (datahub-project#11007) * fix(ingest): use correct native data type in all SQLAlchemy sources by compiling data type using dialect (datahub-project#10898) Co-authored-by: Harshal Sheth <[email protected]> * chore: Update contributors list in PR labeler (datahub-project#11105) * feat(ingest): tweak stale entity removal messaging (datahub-project#11064) * fix(ingestion): enforce lastObserved timestamps in SystemMetadata (datahub-project#11104) * fix(ingest/powerbi): fix broken lineage between chart and dataset (datahub-project#11080) * feat(ingest/lookml): CLL support for sql set in sql_table_name attribute of lookml view (datahub-project#11069) * docs: update graphql docs on forms & structured properties (datahub-project#11100) * test(search): search openAPI v3 test (datahub-project#11049) * fix(ingest/tableau): prevent empty site content urls (datahub-project#11057) Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * feat(entity-client): implement client batch interface (datahub-project#11106) * fix(snowflake): avoid reporting warnings/info for sys tables (datahub-project#11114) * fix(ingest): downgrade column type mapping warning to info (datahub-project#11115) * feat(api): add AuditStamp to the V3 API entity/aspect response (datahub-project#11118) * fix(ingest/redshift): replace r'\n' with '\n' to avoid token error redshift serverless… (datahub-project#11111) * fix(entiy-client): handle null entityUrn case for restli (datahub-project#11122) * fix(sql-parser): prevent bad urns from alter table lineage (datahub-project#11092) * fix(ingest/bigquery): use small batch size if use_tables_list_query_v2 is set (datahub-project#11121) * fix(graphql): add missing entities to EntityTypeMapper and EntityTypeUrnMapper (datahub-project#10366) * feat(ui): Changes to allow editable dataset name (datahub-project#10608) Co-authored-by: Jay Kadambi <[email protected]> * fix: remove saxo (datahub-project#11127) * feat(mcl-processor): Update mcl processor hooks (datahub-project#11134) * fix(openapi): fix openapi v2 endpoints & v3 documentation update * Revert "fix(openapi): fix openapi v2 endpoints & v3 documentation update" This reverts commit 573c1cb. * docs(policies): updates to policies documentation (datahub-project#11073) * fix(openapi): fix openapi v2 and v3 docs update (datahub-project#11139) * feat(auth): grant type and acr values custom oidc parameters support (datahub-project#11116) * fix(mutator): mutator hook fixes (datahub-project#11140) * feat(search): support sorting on multiple fields (datahub-project#10775) * feat(ingest): various logging improvements (datahub-project#11126) * fix(ingestion/lookml): fix for sql parsing error (datahub-project#11079) Co-authored-by: Harshal Sheth <[email protected]> * feat(docs-site) cloud page spacing and content polishes (datahub-project#11141) * feat(ui) Enable editing structured props on fields (datahub-project#11042) * feat(tests): add md5 and last computed to testResult model (datahub-project#11117) * test(openapi): openapi regression smoke tests (datahub-project#11143) * fix(airflow): fix tox tests + update docs (datahub-project#11125) * docs: add chime to adoption stories (datahub-project#11142) * fix(ingest/databricks): Updating code to work with Databricks sdk 0.30 (datahub-project#11158) * fix(kafka-setup): add missing script to image (datahub-project#11190) * fix(config): fix hash algo config (datahub-project#11191) * test(smoke-test): updates to smoke-tests (datahub-project#11152) * fix(elasticsearch): refactor idHashAlgo setting (datahub-project#11193) * chore(kafka): kafka version bump (datahub-project#11211) * readd UsageStatsWorkUnit * fix merge problems * change logo --------- Co-authored-by: Chris Collins <[email protected]> Co-authored-by: John Joyce <[email protected]> Co-authored-by: John Joyce <[email protected]> Co-authored-by: John Joyce <[email protected]> Co-authored-by: dushayntAW <[email protected]> Co-authored-by: sagar-salvi-apptware <[email protected]> Co-authored-by: Aseem Bansal <[email protected]> Co-authored-by: Kevin Chun <[email protected]> Co-authored-by: jordanjeremy <[email protected]> Co-authored-by: skrydal <[email protected]> Co-authored-by: Harshal Sheth <[email protected]> Co-authored-by: david-leifker <[email protected]> Co-authored-by: sid-acryl <[email protected]> Co-authored-by: Julien Jehannet <[email protected]> Co-authored-by: Hendrik Richert <[email protected]> Co-authored-by: Hendrik Richert <[email protected]> Co-authored-by: RyanHolstien <[email protected]> Co-authored-by: Felix Lüdin <[email protected]> Co-authored-by: Pirry <[email protected]> Co-authored-by: Hyejin Yoon <[email protected]> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: cburroughs <[email protected]> Co-authored-by: ksrinath <[email protected]> Co-authored-by: Mayuri Nehate <[email protected]> Co-authored-by: Kunal-kankriya <[email protected]> Co-authored-by: Shirshanka Das <[email protected]> Co-authored-by: ipolding-cais <[email protected]> Co-authored-by: Tamas Nemeth <[email protected]> Co-authored-by: Shubham Jagtap <[email protected]> Co-authored-by: haeniya <[email protected]> Co-authored-by: Yanik Häni <[email protected]> Co-authored-by: Gabe Lyons <[email protected]> Co-authored-by: Gabe Lyons <[email protected]> Co-authored-by: 808OVADOZE <[email protected]> Co-authored-by: noggi <[email protected]> Co-authored-by: Nicholas Pena <[email protected]> Co-authored-by: Jay <[email protected]> Co-authored-by: ethan-cartwright <[email protected]> Co-authored-by: Ethan Cartwright <[email protected]> Co-authored-by: Nadav Gross <[email protected]> Co-authored-by: Patrick Franco Braz <[email protected]> Co-authored-by: pie1nthesky <[email protected]> Co-authored-by: Joel Pinto Mata (KPN-DSH-DEX team) <[email protected]> Co-authored-by: Ellie O'Neil <[email protected]> Co-authored-by: Ajoy Majumdar <[email protected]> Co-authored-by: deepgarg-visa <[email protected]> Co-authored-by: Tristan Heisler <[email protected]> Co-authored-by: Andrew Sikowitz <[email protected]> Co-authored-by: Davi Arnaut <[email protected]> Co-authored-by: Pedro Silva <[email protected]> Co-authored-by: amit-apptware <[email protected]> Co-authored-by: Sam Black <[email protected]> Co-authored-by: Raj Tekal <[email protected]> Co-authored-by: Steffen Grohsschmiedt <[email protected]> Co-authored-by: jaegwon.seo <[email protected]> Co-authored-by: Renan F. Lima <[email protected]> Co-authored-by: Matt Exchange <[email protected]> Co-authored-by: Jonny Dixon <[email protected]> Co-authored-by: Pedro Silva <[email protected]> Co-authored-by: Pinaki Bhattacharjee <[email protected]> Co-authored-by: Jeff Merrick <[email protected]> Co-authored-by: skrydal <[email protected]> Co-authored-by: AndreasHegerNuritas <[email protected]> Co-authored-by: jayasimhankv <[email protected]> Co-authored-by: Jay Kadambi <[email protected]> Co-authored-by: David Leifker <[email protected]>
Checklist
Summary by CodeRabbit
New Features
Documentation
Chores