
feat(ingest/abs): Adding azure blob storage ingestion source #10813

Merged 15 commits on Jul 17, 2024

Conversation


@joelmataKPN joelmataKPN commented Jul 1, 2024

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

  • New Features

    • Introduced a connector for ingesting Azure Blob Storage datasets into DataHub.
    • Added support for defining datasets using Path Specs.
    • Implemented utility functions for handling Azure Blob Storage operations.
  • Documentation

    • Updated documentation for ingesting AWS S3 datasets to include references to "Path Specs."
    • Added detailed guides for configuring and using the Azure Blob Storage connector.
  • Chores

    • Updated dependencies to include Azure-specific libraries.

@github-actions bot added the ingestion (PR or Issue related to the ingestion of metadata) and community-contribution (PR or Issue raised by member(s) of DataHub Community) labels on Jul 1, 2024

coderabbitai bot commented Jul 1, 2024

Walkthrough

The latest updates introduce a new connector for ingesting Azure Blob Storage (ABS) datasets into DataHub. This enhancement includes comprehensive configurations, profiling options, and utilities for managing ABS URIs. Additionally, the documentation has been updated to reflect the new features, along with minor adjustments for AWS S3.

Changes

File/Path and change summary:

  • metadata-ingestion/docs/sources/abs/README.md: Introduces a new connector for ingesting Azure Blob Storage datasets into DataHub and maps files/folders to datasets, supporting various file types.
  • metadata-ingestion/docs/sources/abs/abs.md: Introduces the concept of Path Specs for defining datasets, including structure, inclusion/exclusion of paths, and examples.
  • metadata-ingestion/docs/sources/s3/README.md: Minor modification in the documentation for ingesting AWS S3 datasets into DataHub, referencing "Path Specs."
  • metadata-ingestion/src/datahub/ingestion/source/abs/config.py: Introduces a DataLakeSourceConfig class for defining configurations for a data lake source, including platform information, Azure configuration, ingestion settings, profiling options, Spark configurations, schema inference, and path specification validation logic.
  • metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py: Introduces a DataLakeProfilerConfig model for data lake profiling, including settings for enabling profiling, specifying operation configs, defining various profiling options, and logic to normalize field-level settings based on table-level profiling.
  • metadata-ingestion/src/datahub/ingestion/source/abs/report.py: Introduces a DataLakeSourceReport class to track files scanned and filtered, along with methods to report file scanning and dropped files.
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py: Introduces the ABSSource class for ingesting data from Azure Blob Storage (ABS), providing capabilities for browsing containers, extracting table schema information, creating metadata change proposals, handling telemetry, and supporting schema inference for various file formats.
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py: Introduces utility functions for handling Azure Blob Storage URIs, extracting metadata properties, managing tags, and listing folders within containers. It includes functions for URI validation, prefix extraction, URN creation, property retrieval, tag addition, and folder listing.
  • metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py: Adds new subtype ABS_CONTAINER to the DatasetContainerSubTypes enum.
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py: Introduces support for Azure Blob Storage (ABS) by adding functions to handle ABS URIs, retrieve container names, and determine container subtypes. These changes enable identifying ABS platforms, retrieving ABS-specific information, and creating container hierarchies for ABS datasets.
  • metadata-ingestion/setup.py: Adds dependencies for the abs_base set, modifies the delta_lake set to include *abs_base, and includes "abs" as a source in the source mapping.
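The report class described above tracks scanned-file counts and the files dropped by filtering. A minimal sketch of that shape (field and method names follow the summary, but this is illustrative, not the PR's exact code):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataLakeSourceReport:
    # Running count of files considered by the scan.
    files_scanned: int = 0
    # Paths filtered out by the path specs, kept for diagnostics.
    filtered: List[str] = field(default_factory=list)

    def report_file_scanned(self) -> None:
        self.files_scanned += 1

    def report_file_dropped(self, file: str) -> None:
        self.filtered.append(file)
```

In DataHub, such reports are surfaced at the end of an ingestion run so operators can see how many files were processed versus excluded.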

Poem

In the cloud where data flows,
Azure Blob now clearly shows,
With DataHub we sync and weave,
Files and folders, a web we leave.

AWS paths still align,
Together they form a grand design,
Here’s to code that gleams and streams,
In the realm of data dreams!




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 23

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR, between commits 8edc94d and af3e161.

Files selected for processing (14)
  • metadata-ingestion/docs/sources/abs/README.md (1 hunks)
  • metadata-ingestion/docs/sources/abs/abs.md (1 hunks)
  • metadata-ingestion/docs/sources/abs/abs_recipe.yml (1 hunks)
  • metadata-ingestion/setup.py (3 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/config.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (4 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (4 hunks)
Files not summarized due to errors (7)
  • metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/abs/report.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py: Error: Server error. Please try again later.
  • metadata-ingestion/setup.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py: Error: Server error. Please try again later.
Additional context used
LanguageTool
metadata-ingestion/docs/sources/abs/README.md

[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

metadata-ingestion/docs/sources/abs/abs.md

[uncategorized] ~4-~4: Possible missing comma found.
Context: ...) is a list of Path Spec (path_spec) objects where each individual path_spec` repre...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...e path (path_spec.include) represents formatted path to the dataset. This path must end...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~4-~4: Possible missing comma found.
Context: ...o represent leaf level. If *.[ext] is provided then files with only specified extensio...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...[ext] is provided then files with only specified extension type will be scanned. ".[ext...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...ent a folder level and avoid specifying exact folder name. To map folder as a dataset...

(AI_HYDRA_LEO_MISSING_AN)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use {table} placeholder...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~6-~6: Possible missing article found.
Context: ... use {table} placeholder to represent folder level for which dataset is to be create...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...

(AI_HYDRA_LEO_MISSING_THE)


[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...

(REPEATED_VERBS)

Markdownlint
metadata-ingestion/docs/sources/abs/README.md

40-40: null
Files should end with a single newline character

(MD047, single-trailing-newline)


33-33: null
Link fragments should be valid

(MD051, link-fragments)

metadata-ingestion/docs/sources/abs/abs.md

55-55: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


202-202: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


16-16: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


107-107: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


169-169: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


170-170: Expected: 1; Actual: 3
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


30-30: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


34-34: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


40-40: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


49-49: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


59-59: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


74-74: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


84-84: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


100-100: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


111-111: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


131-131: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


13-13: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


141-141: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


164-164: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


165-165: null
Bare URL used

(MD034, no-bare-urls)


167-167: null
Bare URL used

(MD034, no-bare-urls)


12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


147-147: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


163-163: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


30-30: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


40-40: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


49-49: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


59-59: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


74-74: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


84-84: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


100-100: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


111-111: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


204-204: null
Files should end with a single newline character

(MD047, single-trailing-newline)


4-4: null
Link fragments should be valid

(MD051, link-fragments)

Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/config.py

148-150: Use dict.get() without default value

Remove default value

(SIM910)

metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

2-2: Redefinition of unused logging from line 1

Remove definition: logging

(F811)


155-155: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


280-280: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py

126-130: Use a single if statement instead of nested if statements

(SIM102)

metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


693-693: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (42)
metadata-ingestion/docs/sources/abs/abs_recipe.yml (1)

1-11: Ensure sensitive information is secured.

The configuration contains sensitive information such as account_name and sas_token. Ensure these are secured and not hard-coded in production environments.
[SECURITY]
Consider using environment variables or a secrets management service.

azure_config:
  account_name: "${AZURE_ACCOUNT_NAME}"
  sas_token: "${AZURE_SAS_TOKEN}"
metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1)

1-19: LGTM!

The data class DataLakeSourceReport and its methods report_file_scanned and report_file_dropped are well-defined and follow best practices.

metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py (1)

38-38: LGTM!

The new subtype ABS_CONTAINER has been correctly added to the DatasetContainerSubTypes enum.

metadata-ingestion/docs/sources/abs/README.md (3)

1-35: Consider rephrasing for conciseness.

The phrase "on the basis of" in line 34 might be wordy. Consider using "based on" instead.

- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
+ JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
Tools
LanguageTool

[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

Markdownlint

33-33: null
Link fragments should be valid

(MD051, link-fragments)


33-33: Ensure link fragments are valid.

The link fragment in line 33 should be valid.

Please ensure the max_rows link fragment is correct and points to the appropriate section.

Tools
Markdownlint

33-33: null
Link fragments should be valid

(MD051, link-fragments)


40-40: Add a newline at the end of the file.

Files should end with a single newline character.

- Profiling is not available in the current release.
+ Profiling is not available in the current release.
+
Tools
Markdownlint

40-40: null
Files should end with a single newline character

(MD047, single-trailing-newline)

metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (2)

26-28: Consider renaming _allow_deny_patterns.

The private attribute _allow_deny_patterns could be more descriptive, indicating it is used for profiling patterns.

-_allow_deny_patterns: AllowDenyPattern = pydantic.PrivateAttr(
+_profiling_allow_deny_patterns: AllowDenyPattern = pydantic.PrivateAttr(

75-92: Improve the root validator method name.

The root validator method ensure_field_level_settings_are_normalized could be renamed to better reflect its purpose.

-    def ensure_field_level_settings_are_normalized(
+    def normalize_field_level_settings(
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (1)

38-38: Use consistent naming for platform constants.

The platform constants should use consistent naming conventions.

-PLATFORM_ABS = "abs"
+PLATFORM_AZURE_BLOB_STORAGE = "abs"
metadata-ingestion/src/datahub/ingestion/source/abs/config.py (1)

99-101: Use dict.get() without default value.

Remove the default value in the dict.get() method.

-    _rename_path_spec_to_plural = pydantic_renamed_field(
-        "path_spec", "path_specs", lambda path_spec: [path_spec]
-    )
+    _rename_path_spec_to_plural = pydantic_renamed_field("path_spec", "path_specs", lambda path_spec: [path_spec])

Likely invalid or redundant comment.

metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (1)

218-220: LGTM!

The code changes are approved.

metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (7)

31-32: LGTM!

The code changes are approved.


35-39: LGTM!

The code changes are approved.


42-50: LGTM!

The code changes are approved.


53-65: LGTM!

The code changes are approved.


68-73: LGTM!

The code changes are approved.


76-81: LGTM!

The code changes are approved.


162-195: LGTM!

The code changes are approved.

metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1)

55-57: LGTM!

The code changes are approved.

metadata-ingestion/src/datahub/ingestion/source/abs/source.py (19)

204-204: Class ABSSource Implementation Looks Good

The class is well-structured and adheres to the project conventions.


210-227: Constructor Initialization and Telemetry Data

The constructor correctly initializes the attributes and sends telemetry data.


229-233: Class Method create Looks Good

The method correctly parses the configuration dictionary and creates an instance of ABSSource.


235-299: Method get_fields Looks Good

The method correctly handles both ABS and local file sources and infers the schema accurately. Error handling is also properly implemented.


301-324: Method add_partition_columns_to_schema Looks Good

The method correctly derives partition keys and adds them to the schema fields.


325-334: Method _create_table_operation_aspect Looks Good

The method correctly creates the operation aspect with the appropriate timestamps and operation type.


336-437: Method ingest_table Looks Good

The method correctly handles metadata extraction, schema inference, and aspect creation.


439-444: Method get_prefix Looks Good

The method correctly extracts the prefix from a relative path using regex.


446-449: Method extract_table_name Looks Good

The method correctly extracts the table name from the path specification and named variables.


451-472: Method extract_table_data Looks Good

The method correctly extracts table data including display name, path, timestamp, and size.


474-490: Method resolve_templated_folders Looks Good

The method correctly resolves templated folders recursively.


491-520: Method get_dir_to_process Looks Good

The method correctly retrieves the directory to process, handling nested directories and sorting them.


522-615: Method abs_browser Looks Good

The method correctly handles ABS browsing, including error handling and sampling logic.


616-623: Method create_abs_path Looks Good

The method correctly constructs the ABS path using the account name and container name.


625-648: Method local_browser Looks Good

The method correctly handles local file browsing, including file and directory scanning.


649-695: Method get_workunits_internal Looks Good

The method correctly handles work unit generation, including ABS and local file sources.

Tools
Ruff

693-693: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


696-702: Method get_workunit_processors Looks Good

The method correctly retrieves the work unit processors, including the stale entity removal handler.


704-705: Method is_abs_platform Looks Good

The method correctly checks if the platform is ABS.


707-708: Method get_report Looks Good

The method correctly retrieves the report.

metadata-ingestion/setup.py (4)

261-266: Addition of ABS Base Dependencies Looks Good

The ABS base dependencies are correctly added to the abs_base dictionary.


275-275: Addition of ABS Base to Delta Lake Dependencies Looks Good

The ABS base dependencies are correctly included in the delta_lake dictionary.


416-416: Addition of ABS Plugin Dependencies Looks Good

The ABS plugin dependencies are correctly added to the plugins dictionary.


693-693: Addition of ABS Source Entry Point Looks Good

The ABS source entry point is correctly added to the entry_points dictionary.

Comment on lines +19 to +22
base_path: str = Field(
default="/",
description="Base folder in hierarchical namespaces to start from.",
)

Consider adding a validator for base_path.

The base_path field could benefit from a validator to ensure it starts with a /.

@pydantic.validator("base_path")
def validate_base_path(cls, value):
    if not value.startswith("/"):
        raise ValueError("base_path must start with a '/'")
    return value
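Setting the pydantic wiring aside, the suggested rule reduces to a small standalone check, sketched here so it can be tested in isolation (a hypothetical helper, not part of the PR):

```python
def validate_base_path(value: str) -> str:
    # Mirrors the proposed pydantic validator: reject relative base paths.
    if not value.startswith("/"):
        raise ValueError("base_path must start with a '/'")
    return value
```

With the validator attached to the model, a config such as `base_path: data` would fail fast at parse time instead of producing malformed URLs later.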

Comment on lines +50 to +53
def get_abfss_url(self, folder_path: str = "") -> str:
if not folder_path.startswith("/"):
folder_path = f"/{folder_path}"
return f"abfss://{self.container_name}@{self.account_name}.dfs.core.windows.net{folder_path}"

Simplify get_abfss_url method.

The get_abfss_url method can be simplified with str.lstrip, which also normalizes any number of leading slashes. Note that the leading slash must be re-added, since the format string does not include a separator of its own.

-        if not folder_path.startswith("/"):
-            folder_path = f"/{folder_path}"
+        folder_path = "/" + folder_path.lstrip("/")

Committable suggestion was skipped due to low confidence.
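For illustration, here is the method rewritten as a standalone function with the slash handling made explicit (parameter names stand in for the instance attributes; a sketch, not the committed code):

```python
def get_abfss_url(container_name: str, account_name: str, folder_path: str = "") -> str:
    # Normalize to exactly one leading slash. Using lstrip alone would drop
    # the separator, because the format string below does not include one.
    folder_path = "/" + folder_path.lstrip("/")
    return f"abfss://{container_name}@{account_name}.dfs.core.windows.net{folder_path}"
```

Both `raw/events` and `/raw/events` then produce the same `abfss://` URL.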

Comment on lines +111 to +112
elif is_abs_uri(path):
return get_container_name(path)

Handle invalid ABS URIs gracefully.

Add error handling for invalid ABS URIs in get_bucket_name.

-        elif is_abs_uri(path):
-            return get_container_name(path)
+        elif is_abs_uri(path):
+            try:
+                return get_container_name(path)
+            except ValueError:
+                raise ValueError(f"Invalid ABS URI: {path}")
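A toy sketch of the dispatch-plus-guard pattern the review asks for in get_bucket_name; the URI parsing here is deliberately simplified and the helper is illustrative, not DataHub's implementation:

```python
def get_bucket_name(path: str) -> str:
    # S3: bucket is the first path segment after the scheme.
    if path.startswith("s3://"):
        return path[len("s3://"):].split("/", 1)[0]
    # ABS: container is the part before '@' in abfss://container@account...
    if path.startswith("abfss://"):
        head = path[len("abfss://"):]
        if "@" not in head:
            raise ValueError(f"Invalid ABS URI: {path}")
        return head.split("@", 1)[0]
    raise ValueError(f"Unsupported URI: {path}")
```

Raising a descriptive ValueError at this boundary means a malformed URI in a recipe fails with a clear message instead of a cryptic downstream error.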

Comment on lines +95 to +96
elif is_abs_uri(path):
protocol = get_abs_prefix(path)

Handle invalid ABS URIs gracefully.

Add error handling for invalid ABS URIs in get_protocol.

-        elif is_abs_uri(path):
-            protocol = get_abs_prefix(path)
+        elif is_abs_uri(path):
+            try:
+                protocol = get_abs_prefix(path)
+            except ValueError:
+                raise ValueError(f"Invalid ABS URI: {path}")

Comment on lines 115 to 122
def get_sub_types(self) -> str:
if self.platform == "s3":
return DatasetContainerSubTypes.S3_BUCKET
elif self.platform == "gcs":
return DatasetContainerSubTypes.GCS_BUCKET
elif self.platform == "abs":
return DatasetContainerSubTypes.ABS_CONTAINER
raise ValueError(f"Unable to sub type for platform: {self.platform}")

Use constants for platform strings.

Replace hardcoded platform strings with constants.

-        if self.platform == "s3":
+        if self.platform == PLATFORM_S3:
-        elif self.platform == "gcs":
+        elif self.platform == PLATFORM_GCS:
-        elif self.platform == "abs":
+        elif self.platform == PLATFORM_ABS:
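As an alternative to the if/elif chain, the same mapping can be table-driven. The constant names mirror the suggestion above, while the subtype strings here are placeholders rather than the real DatasetContainerSubTypes enum values:

```python
PLATFORM_S3 = "s3"
PLATFORM_GCS = "gcs"
PLATFORM_ABS = "abs"

# One place to extend when a new platform is added.
SUBTYPE_BY_PLATFORM = {
    PLATFORM_S3: "S3 bucket",
    PLATFORM_GCS: "GCS bucket",
    PLATFORM_ABS: "ABS container",
}


def get_sub_type(platform: str) -> str:
    try:
        return SUBTYPE_BY_PLATFORM[platform]
    except KeyError:
        raise ValueError(f"Unable to find sub type for platform: {platform}") from None
```

A dict lookup keeps the error path in one spot and avoids repeating the comparison logic as more platforms are supported.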


Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details.

All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
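As a toy illustration of how an include path with a `{table}` placeholder and `*.*` leaf can be matched against concrete blob paths (this is not DataHub's actual path_spec engine, just a sketch of the idea):

```python
import re


def include_to_regex(include: str) -> "re.Pattern":
    # Escape literal parts of the include path, then restore the wildcard
    # and placeholder tokens as regex fragments.
    pattern = re.escape(include)
    pattern = pattern.replace(re.escape("{table}"), r"(?P<table>[^/]+)")
    pattern = pattern.replace(re.escape("*.*"), r"[^/]+\.[^/]+")
    pattern = pattern.replace(re.escape("/*/"), r"/[^/]+/")
    return re.compile(pattern + r"$")
```

Matching `.../container/orders/part-000.parquet` against `.../container/{table}/*.*` captures `table == "orders"`, while a file nested one level deeper does not match, since each token only spans a single path segment.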

Fix grammatical issues.

Add the missing articles for better readability.

- All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
+ All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
Tools
LanguageTool

[uncategorized] ~6-~6: Possible missing article found.
Context: ...ent a folder level and avoid specifying exact folder name. To map folder as a dataset...

(AI_HYDRA_LEO_MISSING_AN)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use {table} placeholder...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~6-~6: Possible missing article found.
Context: ... use {table} placeholder to represent folder level for which dataset is to be create...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...

(AI_HYDRA_LEO_MISSING_THE)

Comment on lines +173 to +174
is_abs = is_abs_uri(values.get("include") or "")
if not is_s3 and not is_gcs and not is_abs:

Tip

Codebase Verification

The turn_off_sampling_for_non_s3 method appears to be unused.

The method turn_off_sampling_for_non_s3 is defined in metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py but is not invoked anywhere in the codebase. This suggests that it might be dead code.

  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py: def turn_off_sampling_for_non_s3(cls, v, values)

Consider removing this method if it is no longer needed.

Analysis chain

Ensure sampling is correctly handled for Azure Blob Storage.

The sampling logic has been updated to handle Azure Blob Storage URIs. Ensure that this change is consistent with the rest of the codebase.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator in the codebase.

# Test: Search for the usage of the `turn_off_sampling_for_non_s3` validator. Expect: Consistent handling of Azure Blob Storage URIs.
rg --type python $'turn_off_sampling_for_non_s3'

Length of output: 78



Script:

#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator in the codebase.

# Test: Search for the usage of the `turn_off_sampling_for_non_s3` validator. Expect: Consistent handling of Azure Blob Storage URIs.
rg -t py 'turn_off_sampling_for_non_s3'

Length of output: 171



Script:

#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator within the path_spec.py file.

# Test: Search for invocations of the `turn_off_sampling_for_non_s3` method within the path_spec.py file.
rg 'turn_off_sampling_for_non_s3' metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py

Length of output: 165



Script:

#!/bin/bash
# Description: Verify the usage of the `turn_off_sampling_for_non_s3` validator across the entire codebase.

# Test: Search for invocations of the `turn_off_sampling_for_non_s3` method across the entire codebase.
rg 'turn_off_sampling_for_non_s3'

Length of output: 165

continue

folder_name = f"{prefix}{folder_name}"
if folder_name in this_dict.keys():

Remove unnecessary .keys() method.

Use key in dict instead of key in dict.keys() for better performance.

- if folder_name in this_dict.keys():
+ if folder_name in this_dict:
Tools
Ruff

280-280: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)
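For reference, the two membership forms are equivalent in Python 3 — `in` on a dict already tests keys, so the `.keys()` call is pure overhead, which is why Ruff's SIM118 flags it:

```python
this_dict = {"folder_a": 1, "folder_b": 2}

# `in` on a dict tests key membership directly; .keys() adds nothing.
assert ("folder_a" in this_dict) == ("folder_a" in this_dict.keys())
assert ("missing" in this_dict) == ("missing" in this_dict.keys())

# The idiomatic form:
if "folder_a" in this_dict:
    print("found")  # prints "found"
```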

Comment on lines 95 to 96
if azure_config is None:
raise ValueError("No container_client available.")

Improve the error message.

The error message for azure_config being None could be more descriptive.

- raise ValueError("No container_client available.")
+ raise ValueError("Azure configuration is not provided. Cannot retrieve container client.")

def add_property(
key: str, value: str, custom_properties: Dict[str, str], resource_name: str
) -> Dict[str, str]:
if key in custom_properties.keys():

Remove unnecessary .keys() method.

Use key in dict instead of key in dict.keys() for better performance.

- if key in custom_properties.keys():
+ if key in custom_properties:
Tools
Ruff

155-155: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 18

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8edc94d and af3e161.

Files selected for processing (14)
  • metadata-ingestion/docs/sources/abs/README.md (1 hunks)
  • metadata-ingestion/docs/sources/abs/abs.md (1 hunks)
  • metadata-ingestion/docs/sources/abs/abs_recipe.yml (1 hunks)
  • metadata-ingestion/setup.py (3 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/config.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (4 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (4 hunks)
Files not summarized due to errors (4)
  • metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py: Error: Server error. Please try again later.
  • metadata-ingestion/setup.py: Error: Server error. Please try again later.
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py: Error: Server error. Please try again later.
  • metadata-ingestion/docs/sources/abs/README.md: Error: Server error. Please try again later.
Files skipped from review due to trivial changes (2)
  • metadata-ingestion/docs/sources/abs/abs_recipe.yml
  • metadata-ingestion/src/datahub/ingestion/source/common/subtypes.py
Additional context used
LanguageTool
metadata-ingestion/docs/sources/abs/README.md

[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, use path_specs configuration in ingestion r...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, use path_specs configuration in ingestion recipe. Refer section [Path Specs](http...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

metadata-ingestion/docs/sources/abs/abs.md

[uncategorized] ~4-~4: Possible missing comma found.
Context: ...) is a list of Path Spec (path_spec) objects where each individual path_spec` repre...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...e path (path_spec.include) represents formatted path to the dataset. This path must end...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~4-~4: Possible missing comma found.
Context: ...o represent leaf level. If *.[ext] is provided then files with only specified extensio...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...[ext] is provided then files with only specified extension type will be scanned. ".[ext...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use {table} placeholder...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~6-~6: Possible missing article found.
Context: ... use {table} placeholder to represent folder level for which dataset is to be create...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...

(AI_HYDRA_LEO_MISSING_THE)


[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...

(REPEATED_VERBS)


[uncategorized] ~54-~54: Possible missing article found.
Context: ...table}/*.avro ``` {table} represents folder for which dataset will be created. ##...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~54-~54: Possible missing article found.
Context: ... {table} represents folder for which dataset will be created. #### Example 3 - Fol...

(AI_HYDRA_LEO_MISSING_A)

Markdownlint
metadata-ingestion/docs/sources/abs/README.md

40-40: null
Files should end with a single newline character

(MD047, single-trailing-newline)


33-33: null
Link fragments should be valid

(MD051, link-fragments)

metadata-ingestion/docs/sources/abs/abs.md

55-55: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


202-202: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


16-16: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


107-107: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


169-169: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


170-170: Expected: 1; Actual: 3
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


30-30: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


34-34: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


40-40: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


49-49: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


59-59: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


74-74: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


84-84: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


100-100: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


111-111: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


131-131: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


13-13: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


141-141: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


164-164: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


165-165: null
Bare URL used

(MD034, no-bare-urls)


167-167: null
Bare URL used

(MD034, no-bare-urls)


12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


147-147: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


163-163: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


30-30: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


40-40: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


49-49: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


59-59: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


74-74: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


84-84: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


100-100: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


111-111: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


204-204: null
Files should end with a single newline character

(MD047, single-trailing-newline)


4-4: null
Link fragments should be valid

(MD051, link-fragments)

Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/config.py

148-150: Use dict.get() without default value

Remove default value

(SIM910)

metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

2-2: Redefinition of unused logging from line 1

Remove definition: logging

(F811)


155-155: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


280-280: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py

126-130: Use a single if statement instead of nested if statements

(SIM102)

metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


693-693: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (28)
metadata-ingestion/src/datahub/ingestion/source/abs/report.py (1)

10-19: LGTM!

The class DataLakeSourceReport correctly extends StaleEntityRemovalSourceReport and adds functionality to track scanned and filtered files.

metadata-ingestion/docs/sources/abs/README.md (1)

34-34: Simplify the phrase "on the basis of".

The phrase "on the basis of" can be simplified to "based on".

- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
+ JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
Tools
LanguageTool

[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

metadata-ingestion/src/datahub/ingestion/source/abs/datalake_profiler_config.py (1)

75-92: LGTM!

The root validator ensure_field_level_settings_are_normalized correctly ensures that field-level settings are normalized based on the profile_table_level_only flag.

metadata-ingestion/src/datahub/ingestion/source/azure/azure_common.py (1)

84-98: LGTM!

The root validator _check_credential_values correctly ensures that the necessary credential values are provided.

metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (7)

19-24: Imports look good!

The import statements for Azure Blob Storage utilities are correctly added.


38-38: Constant declaration looks good!

The PLATFORM_ABS constant is correctly introduced.


95-96: ABS URI handling looks good!

The get_protocol method is correctly updated to handle ABS URIs.


111-113: ABS URI handling looks good!

The get_bucket_name method is correctly updated to handle ABS URIs.


115-122: ABS container subtype handling looks good!

The get_sub_types method is correctly updated to return ABS container subtype.


124-129: ABS URI handling looks good!

The get_base_full_path method is correctly updated to handle ABS URIs.


137-148: ABS platform handling looks good!

The create_container_hierarchy method is correctly updated to handle ABS platform.

metadata-ingestion/src/datahub/ingestion/source/abs/config.py (4)

1-17: Imports look good!

The import statements for Azure Blob Storage configurations and utilities are correctly added.


28-107: Class definition looks good!

The DataLakeSourceConfig class is correctly introduced with necessary fields and methods.


108-145: ABS platform handling looks good!

The check_path_specs_and_infer_platform method is correctly updated to handle ABS platform.


156-163: Profiling pattern handling looks good!

The ensure_profiling_pattern_is_passed_to_profiling method is correctly implemented to handle profiling patterns.

metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (3)

14-14: Imports look good!

The import statement for Azure Blob Storage utility is correctly added.


173-174: ABS URI handling looks good!

The sample_files validator is correctly updated to handle ABS URIs.


218-221: ABS URI check looks good!

The is_abs cached property is correctly implemented to check for ABS URIs.

metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (6)

31-32: LGTM!

The function correctly checks if a URI matches the Azure Blob Storage pattern.
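A minimal illustration of such a check follows. The actual regex in `abs_util.py` may differ, and the helper name `looks_like_abs_uri` is invented here; the 3-24 lowercase-alphanumeric account-name constraint follows Azure's storage account naming rules:

```python
import re

# Illustrative only -- abs_util.py's real pattern may differ.
# Azure Blob Storage URIs have the shape:
#   https://<account>.blob.core.windows.net/<container>/<path>
ABS_URI_RE = re.compile(r"^https?://[a-z0-9]{3,24}\.blob\.core\.windows\.net/")

def looks_like_abs_uri(uri: str) -> bool:
    return bool(ABS_URI_RE.match(uri))

print(looks_like_abs_uri("https://acct.blob.core.windows.net/my-container/foo.avro"))  # True
print(looks_like_abs_uri("s3://bucket/foo.avro"))  # False
```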


35-39: LGTM!

The function correctly extracts the prefix from an Azure Blob Storage URI.


42-50: LGTM!

The function correctly removes the Azure Blob Storage prefix from a URI.


53-65: LGTM!

The function correctly creates a URN for an Azure Blob Storage URI.


68-73: LGTM!

The function correctly extracts the container name from an Azure Blob Storage URI.


76-81: LGTM!

The function correctly extracts the key prefix from an Azure Blob Storage URI.

metadata-ingestion/src/datahub/ingestion/source/abs/profiling.py (1)

60-73: LGTM!

The class correctly defines the specification for profiling a single column.

metadata-ingestion/setup.py (3)

261-266: LGTM! Adding Azure Blob Storage dependencies.

The new dependencies for Azure Blob Storage are appropriate for the intended functionality.


275-275: LGTM! Including Azure Blob Storage dependencies in delta_lake plugin.

The addition of abs_base to the delta_lake plugin is logical and aligns with the objective of adding Azure Blob ingestion functionality.


416-416: LGTM! Adding abs plugin for Azure Blob Storage ingestion.

The new abs plugin with abs_base dependencies aligns with the PR objective of adding Azure Blob ingestion functionality.

Comment on lines +148 to +150
inferred_platform = values.get(
"platform", None
) # we may have inferred it above

Simplify the platform inference logic.

Use dict.get() without the default value as suggested by static analysis tools.

-    inferred_platform = values.get("platform", None)
+    inferred_platform = values.get("platform")
Tools
Ruff

148-150: Use dict.get() without default value

Remove default value

(SIM910)
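The two forms are behaviorally identical — `dict.get` already returns `None` when the key is absent, so the explicit default is redundant, which is what SIM910 flags:

```python
values = {"path_specs": ["spec"]}  # note: no "platform" key

# dict.get defaults to None; passing None explicitly changes nothing.
assert values.get("platform") is None
assert values.get("platform", None) is None
assert values.get("platform") == values.get("platform", None)
```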

Comment on lines 4 to 8
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details.

All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.

Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to current `path_spec`. This path cannot have named variables ( `{}` ). Exclude path can have `**` to represent multiple folder levels. Refer [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.

Fix grammatical issues and formatting inconsistencies.

There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.

- Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects where each individual `path_spec` represents one or more datasets. Include path (`path_spec.include`) represents formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent leaf level. If `*.[ext]` is provided then files with only specified extension type will be scanned. "`.[ext]`" can be any of [supported file types](#supported-file-types). Refer [example 1](#example-1---individual-file-as-dataset) below for more details.
+ Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets. The include path (`path_spec.include`) represents a formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent the leaf level. If `*.[ext]` is provided, then only files with the specified extension type will be scanned. "`.[ext]`" can be any of the [supported file types](#supported-file-types). Refer to [example 1](#example-1---individual-file-as-dataset) below for more details.

- All folder levels need to be specified in include path. You can use `/*/` to represent a folder level and avoid specifying exact folder name. To map folder as a dataset, use `{table}` placeholder to represent folder level for which dataset is to be created. For a partitioned dataset, you can use placeholder `{partition_key[i]}` to represent name of `i`th partition and `{partition[i]}` to represent value of `i`th partition. During ingestion, `i` will be used to match partition_key to partition. Refer [example 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
+ All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [examples 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.

- Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to current `path_spec`. This path cannot have named variables ( `{}` ). Exclude path can have `**` to represent multiple folder levels. Refer [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.
+ Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to the current `path_spec`. This path cannot have named variables (`{}`). The exclude path can have `**` to represent multiple folder levels. Refer to [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.

- Refer [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has more complex dataset representation.
+ Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has a more complex dataset representation.
Tools
LanguageTool

[uncategorized] ~4-~4: Possible missing comma found.
Context: ...) is a list of Path Spec (path_spec) objects where each individual path_spec` repre...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...e path (path_spec.include) represents formatted path to the dataset. This path must end...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~4-~4: Possible missing comma found.
Context: ...o represent leaf level. If *.[ext] is provided then files with only specified extensio...

(AI_HYDRA_LEO_MISSING_COMMA)


[uncategorized] ~4-~4: Possible missing article found.
Context: ...[ext] is provided then files with only specified extension type will be scanned. ".[ext...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...id specifying exact folder name. To map folder as a dataset, use {table} placeholder...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~6-~6: Possible missing article found.
Context: ... use {table} placeholder to represent folder level for which dataset is to be create...

(AI_HYDRA_LEO_MISSING_THE)


[uncategorized] ~6-~6: Possible missing article found.
Context: ...der to represent folder level for which dataset is to be created. For a partitioned dat...

(AI_HYDRA_LEO_MISSING_THE)

Markdownlint

4-4: null
Link fragments should be valid

(MD051, link-fragments)
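The include/exclude template rules described above (literal folders, `/*/` wildcards, and `{table}`-style placeholders each matching one path segment) can be sketched as a translation into a regular expression. This is an illustrative sketch only, not the connector's actual matching logic; the function name `template_to_regex` is made up for this example.

```python
import re


def template_to_regex(include: str) -> "re.Pattern":
    """Turn a path_spec include template into a regex.

    Illustrative only: '*' and '{placeholder}' each match exactly one
    path segment, and '*.*' matches a file name with an extension.
    """
    # Escape literal parts first, then re-enable the template tokens.
    pattern = re.escape(include)
    pattern = pattern.replace(r"\*\.\*", r"[^/]+\.[^/]+")  # *.* -> file with any extension
    pattern = pattern.replace(r"\*", r"[^/]+")             # *   -> one path segment
    # Placeholders such as {table} or {partition[0]} also match one segment.
    pattern = re.sub(r"\\\{[^}]+\\\}", r"[^/]+", pattern)
    return re.compile(f"^{pattern}$")


spec = template_to_regex(
    "https://storageaccountname.blob.core.windows.net/my-container/foo/tests/{table}/*.avro"
)
# Matches a file one folder below tests/, but not one nested two folders down.
print(bool(spec.match(
    "https://storageaccountname.blob.core.windows.net/my-container/foo/tests/offers/part1.avro"
)))
```

Under this reading, a template with a trailing `*.avro` accepts `offers/part1.avro` but rejects `offers/sub/part1.avro`, which is why every folder level must be spelled out in the include path.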

Comment on lines +147 to +161
**Valid path_specs.include**

```python
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/bar.avro # single file table
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/*.* # mulitple file level tables
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/{table}/*.avro #table without partition
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/{table}/*/*.avro #table where partitions are not specified
https://storageaccountname.blob.core.windows.net/my-container/foo/tests/{table}/*.* # table where no partitions as well as data type specified
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/*.avro # specifying keywords to be used in display name
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format
https://storageaccountname.blob.core.windows.net/my-container/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions
https://storageaccountname.blob.core.windows.net/my-container/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 2 levels down in container
https://storageaccountname.blob.core.windows.net/my-container/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 3 levels down in container
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix grammatical issues and formatting inconsistencies.

There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.

- https://storageaccountname.blob.core.windows.net/my-container/foo/tests/bar.avro # single file table
- https://storageaccountname.blob.core.windows.net/my-container/foo/tests/*.* # mulitple file level tables
+ https://storageaccountname.blob.core.windows.net/my-container/foo/tests/bar.avro # single file table
+ https://storageaccountname.blob.core.windows.net/my-container/foo/tests/*.* # multiple file level tables

Committable suggestion was skipped due to low confidence.

Tools
Markdownlint

147-147: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

Comment on lines +17 to +55
### Path Specs - Examples
#### Example 1 - Individual file as Dataset

Container structure:

```
test-container
├── employees.csv
├── departments.json
└── food_items.csv
```

Path specs config to ingest `employees.csv` and `food_items.csv` as datasets:
```
path_specs:
- include: https://storageaccountname.blob.core.windows.net/test-container/*.csv

```
This will automatically ignore `departments.json` file. To include it, use `*.*` instead of `*.csv`.

#### Example 2 - Folder of files as Dataset (without Partitions)

Container structure:
```
test-container
└── offers
├── 1.avro
└── 2.avro

```

Path specs config to ingest folder `offers` as dataset:
```
path_specs:
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/*.avro
```

`{table}` represents folder for which dataset will be created.


Fix grammatical issues and formatting inconsistencies.

There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.

- `{table}` represents folder for which dataset will be created.
+ `{table}` represents the folder for which the dataset will be created.
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### Path Specs - Examples
#### Example 1 - Individual file as Dataset
Container structure:
```
test-container
├── employees.csv
├── departments.json
└── food_items.csv
```
Path specs config to ingest `employees.csv` and `food_items.csv` as datasets:
```
path_specs:
- include: https://storageaccountname.blob.core.windows.net/test-container/*.csv
```
This will automatically ignore `departments.json` file. To include it, use `*.*` instead of `*.csv`.
#### Example 2 - Folder of files as Dataset (without Partitions)
Container structure:
```
test-container
└── offers
├── 1.avro
└── 2.avro
```
Path specs config to ingest folder `offers` as dataset:
```
path_specs:
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/*.avro
```
`{table}` represents folder for which dataset will be created.
`{table}` represents the folder for which the dataset will be created.
Tools
LanguageTool

[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...

(REPEATED_VERBS)


[uncategorized] ~54-~54: Possible missing article found.
Context: ...table}/*.avro ``` {table} represents folder for which dataset will be created. ##...

(AI_HYDRA_LEO_MISSING_A)


[uncategorized] ~54-~54: Possible missing article found.
Context: ... {table} represents folder for which dataset will be created. #### Example 3 - Fol...

(AI_HYDRA_LEO_MISSING_A)

Markdownlint

55-55: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


30-30: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


34-34: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


40-40: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


49-49: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


22-22: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


30-30: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


40-40: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


49-49: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

@@ -0,0 +1,289 @@
import logging
import logging

Remove redundant import.

The logging module is imported twice.

-import logging
Committable suggestion

Suggested change
import logging
Tools
Ruff

2-2: Redefinition of unused logging from line 1

Remove definition: logging

(F811)

@@ -0,0 +1,40 @@
This connector ingests Azure Blob Storage (abbreviated to abs) datasets into DataHub. It allows mapping an individual
file or a folder of files to a dataset in DataHub.
To specify the group of files that form a dataset, use `path_specs` configuration in ingestion recipe. Refer

Add missing article "the".

The sentence is missing the article "the" before "section."

- Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
+ Refer to the section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
Committable suggestion

Suggested change
To specify the group of files that form a dataset, use `path_specs` configuration in ingestion recipe. Refer
To specify the group of files that form a dataset, use `path_specs` configuration in ingestion recipe. Refer to the section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
Tools
LanguageTool

[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, use path_specs configuration in ingestion r...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, use path_specs configuration in ingestion recipe. Refer section [Path Specs](http...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)

Comment on lines +108 to +146
#### Example 5 - Advanced - Either Individual file OR Folder of files as Dataset

Container structure:
```
test-container
├── customers
│ ├── part1.json
│ ├── part2.json
│ ├── part3.json
│ └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└── orders
└── year=2022
└── month=2
├── 1.parquet
├── 2.parquet
└── 3.parquet

```

Path specs config:
```
path_specs:
- include: https://storageaccountname.blob.core.windows.net/test-container/*.csv
exclude:
- **/tmp_10101000.csv
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/*.json
- include: https://storageaccountname.blob.core.windows.net/test-container/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```

Above config has 3 path_specs and will ingest following datasets
- `employees.csv` - Single File as Dataset
- `food_items.csv` - Single File as Dataset
- `customers` - Folder as Dataset
- `orders` - Folder as Dataset
and will ignore file `tmp_10101000.csv`


Fix grammatical issues and formatting inconsistencies.

There are grammatical issues and formatting inconsistencies that need to be addressed for better readability.

- Above config has 3 path_specs and will ingest following datasets
+ The above config has 3 path_specs and will ingest the following datasets:
Tools
Markdownlint

111-111: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


131-131: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


141-141: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


111-111: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


### Profiling

Profiling is not available in the current release.

Ensure the file ends with a newline.

Markdown files should end with a single newline character.

+ 
Committable suggestion

Suggested change
Profiling is not available in the current release.
Profiling is not available in the current release.
Tools
Markdownlint

40-40: null
Files should end with a single newline character

(MD047, single-trailing-newline)

Comment on lines +1 to +2
from typing import Any, Dict, Optional


Remove unused import.

The Optional import from typing is not used in the file.

- from typing import Any, Dict, Optional
+ from typing import Any, Dict
Committable suggestion

Suggested change
from typing import Any, Dict, Optional
from typing import Any, Dict

folder_path = f"/{folder_path}"
return f"abfss://{self.container_name}@{self.account_name}.dfs.core.windows.net{folder_path}"

# TODO DEX-1010

Remove TODO comment or create an issue.

The TODO comment should either be removed or an issue should be created to track this task.

Do you want me to open a GitHub issue to track this task?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between af3e161 and c7f59d3.

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

2-2: Redefinition of unused logging from line 1

Remove definition: logging

(F811)


146-146: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


272-272: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (26)
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (9)

22-23: LGTM!

The function correctly checks if a URI matches the Azure Blob Storage pattern.


26-30: LGTM!

The function correctly extracts the prefix from an Azure Blob Storage URI.


33-41: LGTM!

The function correctly removes the Azure Blob Storage prefix from a URI.


44-56: LGTM!

The function correctly creates a URN for an Azure Blob Storage dataset.


59-64: LGTM!

The function correctly extracts the container name from an Azure Blob Storage URI.


67-72: LGTM!

The function correctly extracts the key prefix from an Azure Blob Storage URI.


153-187: LGTM!

The function correctly processes properties and adds them to the custom properties dictionary.


189-237: LGTM!

The function correctly retrieves tags from an Azure Blob Storage object or container.


86-87: Improve the error message.

The error message for azure_config being None could be more descriptive.

- raise ValueError("No container_client available.")
+ raise ValueError("Azure configuration is not provided. Cannot retrieve container client.")

Likely invalid or redundant comment.
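The URI helpers approved above (matching an ABS URI, extracting the container name, and extracting the key prefix) all reduce to splitting the URI at the `.blob.core.windows.net` host suffix and the first path separator. A minimal sketch of that split, with an illustrative function name that is not necessarily the PR's actual API:

```python
from urllib.parse import urlparse

ABS_HOST_SUFFIX = ".blob.core.windows.net"


def parse_abs_uri(uri: str) -> tuple:
    """Split an ABS URI into (account, container, key prefix).

    Sketch only; the PR's helpers may differ in names and behavior.
    """
    parsed = urlparse(uri)
    if not parsed.netloc.endswith(ABS_HOST_SUFFIX):
        raise ValueError(f"Not an Azure Blob Storage URI: {uri}")
    account = parsed.netloc[: -len(ABS_HOST_SUFFIX)]
    # First path segment is the container; the rest is the blob key prefix.
    container, _, key = parsed.path.lstrip("/").partition("/")
    return account, container, key


print(parse_abs_uri(
    "https://storageaccountname.blob.core.windows.net/my-container/foo/tests/bar.avro"
))
```

With this decomposition, a dataset URN can then be built from the container and key, which is what the reviewed `make_abs_urn`-style helper appears to do.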

metadata-ingestion/src/datahub/ingestion/source/abs/source.py (17)

127-147: LGTM!

The function correctly maps Spark column types to DataHub types.


235-299: LGTM!

The function correctly retrieves fields from a table based on its file extension.


301-324: LGTM!

The function correctly adds partition columns to the schema.


325-334: LGTM!

The function correctly creates an operation aspect for a table.


336-437: LGTM!

The function correctly ingests a table and generates metadata work units.


439-444: LGTM!

The function correctly extracts the prefix from a relative path.


446-449: LGTM!

The function correctly extracts the table name from a path specification and named variables.


451-472: LGTM!

The function correctly extracts table data from a path specification.


474-490: LGTM!

The function correctly resolves templated folders in a container.


491-520: LGTM!

The function correctly retrieves the directory to process based on a path specification.


522-606: LGTM!

The function correctly browses Azure Blob Storage and retrieves metadata for files.


608-615: LGTM!

The function correctly creates an Azure Blob Storage path from a key.


617-639: LGTM!

The function correctly browses the local file system and retrieves metadata for files.


641-687: LGTM!

The function correctly retrieves metadata work units.

Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


688-694: LGTM!

The function correctly retrieves work unit processors.


696-697: LGTM!

The function correctly checks if the platform is Azure Blob Storage.


699-700: LGTM!

The function correctly retrieves the report.

Comment on lines 240 to 277
def list_folders(
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig]
) -> Iterable[str]:
if azure_config is None:
raise ValueError("azure_config not set. Cannot browse Azure Blob Storage")

abs_blob_service_client = azure_config.get_blob_service_client()
container_client = abs_blob_service_client.get_container_client(container_name)

current_level = prefix.count("/")
blob_list = container_client.list_blobs(name_starts_with=prefix)

this_dict = {}
for blob in blob_list:
blob_name = blob.name[: blob.name.rfind("/") + 1]
folder_structure_arr = blob_name.split("/")

folder_name = ""
if len(folder_structure_arr) > current_level:
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}"
else:
continue

folder_name = folder_name[1 : len(folder_name)]

if folder_name.endswith("/"):
folder_name = folder_name[:-1]

if folder_name == "":
continue

folder_name = f"{prefix}{folder_name}"
if folder_name in this_dict.keys():
continue
else:
this_dict[folder_name] = folder_name

yield f"{folder_name}"

Remove unnecessary .keys() method.

Use key in dict instead of key in dict.keys() for better performance.

- if folder_name in this_dict.keys():
+ if folder_name in this_dict:
Committable suggestion

Suggested change
def list_folders(
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig]
) -> Iterable[str]:
if azure_config is None:
raise ValueError("azure_config not set. Cannot browse Azure Blob Storage")
abs_blob_service_client = azure_config.get_blob_service_client()
container_client = abs_blob_service_client.get_container_client(container_name)
current_level = prefix.count("/")
blob_list = container_client.list_blobs(name_starts_with=prefix)
this_dict = {}
for blob in blob_list:
blob_name = blob.name[: blob.name.rfind("/") + 1]
folder_structure_arr = blob_name.split("/")
folder_name = ""
if len(folder_structure_arr) > current_level:
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}"
else:
continue
folder_name = folder_name[1 : len(folder_name)]
if folder_name.endswith("/"):
folder_name = folder_name[:-1]
if folder_name == "":
continue
folder_name = f"{prefix}{folder_name}"
if folder_name in this_dict.keys():
continue
else:
this_dict[folder_name] = folder_name
yield f"{folder_name}"
if folder_name in this_dict:
Tools
Ruff

272-272: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)
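Beyond dropping `.keys()`, the dict in `list_folders` is used purely for de-duplication, so a set expresses the intent more directly. A sketch of just that loop logic over plain blob names (the function name and inputs are illustrative, not the PR's actual code):

```python
from typing import Iterable, List


def folders_at_level(blob_names: List[str], prefix: str) -> Iterable[str]:
    """Yield unique folder names one level below `prefix`.

    Mirrors the de-duplication pattern in list_folders, using a set
    instead of a dict. Sketch only.
    """
    current_level = prefix.count("/")
    seen = set()
    for name in blob_names:
        # Keep only the directory portion of the blob name.
        parts = name[: name.rfind("/") + 1].split("/")
        if len(parts) <= current_level:
            continue
        folder = parts[current_level]
        if not folder:
            continue
        folder = f"{prefix}{folder}"
        if folder in seen:  # plain membership test, no .keys()
            continue
        seen.add(folder)
        yield folder


print(list(folders_at_level(
    ["data/2022/a.avro", "data/2022/b.avro", "data/2023/c.avro"], "data/"
)))
```

Set membership and dict-key membership are both average O(1), so this is a readability change rather than a performance one.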

Comment on lines 75 to 87
def get_abs_properties(
container_name: str,
blob_name: Optional[str],
full_path: str,
number_of_files: int,
size_in_bytes: int,
sample_files: bool,
azure_config: Optional[AzureConnectionConfig],
use_abs_container_properties: Optional[bool] = False,
use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
if azure_config is None:
raise ValueError("No container_client available.")

Add error handling for get_blob_properties and get_container_properties.

The function can be improved by adding error handling for the get_blob_properties and get_container_properties calls to handle potential errors gracefully.

-        blob_properties = blob_client.get_blob_properties()
+        try:
+            blob_properties = blob_client.get_blob_properties()
+        except Exception as e:
+            logger.error(f"Failed to get blob properties: {e}")
+            blob_properties = None
-        container_properties = container_client.get_container_properties()
+        try:
+            container_properties = container_client.get_container_properties()
+        except Exception as e:
+            logger.error(f"Failed to get container properties: {e}")
+            container_properties = None
Committable suggestion

Suggested change
def get_abs_properties(
container_name: str,
blob_name: Optional[str],
full_path: str,
number_of_files: int,
size_in_bytes: int,
sample_files: bool,
azure_config: Optional[AzureConnectionConfig],
use_abs_container_properties: Optional[bool] = False,
use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
if azure_config is None:
raise ValueError("No container_client available.")
def get_abs_properties(
container_name: str,
blob_name: Optional[str],
full_path: str,
number_of_files: int,
size_in_bytes: int,
sample_files: bool,
azure_config: Optional[AzureConnectionConfig],
use_abs_container_properties: Optional[bool] = False,
use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
if azure_config is None:
raise ValueError("No container_client available.")
try:
blob_properties = blob_client.get_blob_properties()
except Exception as e:
logger.error(f"Failed to get blob properties: {e}")
blob_properties = None
try:
container_properties = container_client.get_container_properties()
except Exception as e:
logger.error(f"Failed to get container properties: {e}")
container_properties = None

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Outside diff range and nitpick comments (2)
metadata-ingestion/docs/sources/s3/README.md (2)

Line range hint 41-44: Fix unordered list indentation.

Ensure proper indentation for unordered lists.

- - Row and column counts for each dataset
- - For each column, if profiling is enabled:
-     - null counts and proportions
-     - distinct counts and proportions
-     - minimum, maximum, mean, median, standard deviation, some quantile values
-     - histograms or frequencies of unique values
+ - Row and column counts for each dataset
+ - For each column, if profiling is enabled:
+   - null counts and proportions
+   - distinct counts and proportions
+   - minimum, maximum, mean, median, standard deviation, some quantile values
+   - histograms or frequencies of unique values
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


Line range hint 21-21: Fix invalid link fragments.

The link fragments should be valid.

- max_rows` recipe parameter (see [below](#config-details))
+ max_rows` recipe parameter (see [below](#profiling))

- see [compatibility](#compatibility) for more details
+ see [compatibility](#compatibility) for more details

Also applies to: 46-46

Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between c7f59d3 and f6025a1.

Files selected for processing (5)
  • metadata-ingestion/docs/sources/abs/README.md (1 hunks)
  • metadata-ingestion/docs/sources/abs/abs.md (1 hunks)
  • metadata-ingestion/docs/sources/s3/README.md (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py (4 hunks)
Files not reviewed due to errors (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (no review received)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/data_lake_common/data_lake_utils.py
Additional context used
LanguageTool
metadata-ingestion/docs/sources/abs/README.md

[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, use path_specs configuration in ingestion r...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, use path_specs configuration in ingestion recipe. Refer section [Path Specs](http...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

metadata-ingestion/docs/sources/s3/README.md

[style] ~22-~22: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)


[style] ~46-~46: This phrase is redundant. Consider using “outside”.
Context: ...to interface with AWS (schema inference outside of profiling requires s3:// access). Enabl...

(OUTSIDE_OF)

metadata-ingestion/docs/sources/abs/abs.md

[grammar] ~4-~4: The verb ‘include’ does not usually follow articles like ‘The’. Check that ‘include’ is spelled correctly; using ‘include’ as a noun may be non-standard.
Context: ..._spec represents one or more datasets. The include path (path_spec.include`) represents a...

(A_INFINITIVE)


[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...

(REPEATED_VERBS)


[uncategorized] ~202-~202: This verb does not appear to agree with the subject. Consider using a different form.
Context: ...ames that don't start with a letter and contains other character than letters, number, a...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)


[uncategorized] ~202-~202: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ... contains other character than letters, number, and underscore. [https://github.com/ap...

(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)

Markdownlint
metadata-ingestion/docs/sources/abs/README.md

33-33: null
Link fragments should be valid

(MD051, link-fragments)

metadata-ingestion/docs/sources/s3/README.md

41-41: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


42-42: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


43-43: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


44-44: Expected: 2; Actual: 4
Unordered list indentation

(MD007, ul-indent)


1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


9-9: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


39-39: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


21-21: null
Link fragments should be valid

(MD051, link-fragments)


46-46: null
Link fragments should be valid

(MD051, link-fragments)

metadata-ingestion/docs/sources/abs/abs.md

55-55: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


202-202: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


16-16: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


107-107: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


169-169: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


170-170: Expected: 1; Actual: 3
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


30-30: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


34-34: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


40-40: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


49-49: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


59-59: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


74-74: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


84-84: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


100-100: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


111-111: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


131-131: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


13-13: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


141-141: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


164-164: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


165-165: null
Bare URL used

(MD034, no-bare-urls)


167-167: null
Bare URL used

(MD034, no-bare-urls)


12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


147-147: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


163-163: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


22-22: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


30-30: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


40-40: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


49-49: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


59-59: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


74-74: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


84-84: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


100-100: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


111-111: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


204-204: null
Files should end with a single newline character

(MD047, single-trailing-newline)


4-4: null
Link fragments should be valid

(MD051, link-fragments)
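Most of the flagged markdownlint rules (MD022, MD031, MD040) share one fix: blank lines around headings and fences, plus a language tag on every fence. A compliant sketch (the heading and recipe content are illustrative, not taken from the docs under review):

````markdown
### Path Specs - Examples

Some introductory text, separated from the fence by a blank line.

```yml
path_specs:
  - include: "https://storageaccount.blob.core.windows.net/container/*.parquet"
```

Text after the fence, again separated by a blank line.
````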

Ruff
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

145-145: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


271-271: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Additional comments not posted (10)
metadata-ingestion/docs/sources/abs/README.md (2)

3-3: Add missing article "the".

The sentence is missing the article "the" before "section."

- Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
+ Refer to the section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for more details.
Tools
LanguageTool

[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...group of files that form a dataset, use path_specs configuration in ingestion r...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~3-~3: You might be missing the article “the” here.
Context: ...aset, use path_specs configuration in ingestion recipe. Refer section [Path Specs](http...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


34-34: Consider a shorter alternative to "on the basis of".

The phrase "on the basis of" might be wordy. Consider using "based on" instead.

- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
+ JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
Tools
LanguageTool

[style] ~34-~34: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ...etails)) JSON file schemas are inferred on the basis of the entire file (given the difficulty i...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)

metadata-ingestion/docs/sources/s3/README.md (5)

Line range hint 22-22: Consider a shorter alternative to "on the basis of".

The phrase "on the basis of" might be wordy. Consider using "based on" instead.

- JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
+ JSON file schemas are inferred based on the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


Line range hint 46-46: Consider using "outside" instead of "outside of".

The phrase "outside of" is redundant. Consider using "outside" instead.

- schema inference outside of profiling requires s3:// access
+ schema inference outside profiling requires s3:// access
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


1-1: Remove trailing spaces.

Trailing spaces should be removed for better formatting.

- This connector ingests AWS S3 datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub. 
+ This connector ingests AWS S3 datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub.
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


Line range hint 9-9: Add blank lines around headings.

Headings should be surrounded by blank lines for better readability.

- ### Supported file types
+ 
+ ### Supported file types
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)


Line range hint 39-39: Add blank lines around lists.

Lists should be surrounded by blank lines for better readability.

- - Row and column counts for each dataset
+ 
+ - Row and column counts for each dataset
Tools
Markdownlint

1-1: Expected: 0 or 2; Actual: 1
Trailing spaces

(MD009, no-trailing-spaces)

metadata-ingestion/docs/sources/abs/abs.md (3)

17-18: Add blank lines around headings.

Headings should be surrounded by blank lines for better readability.

- ### Path Specs -  Examples
- #### Example 1 - Individual file as Dataset
+ ### Path Specs - Examples

+ #### Example 1 - Individual file as Dataset
Tools
LanguageTool

[grammar] ~17-~17: You’ve repeated a verb. Did you mean to only write one of them?
Context: ...in named variables. ### Path Specs - Examples #### Example 1 - Individual file as Dataset Contain...

(REPEATED_VERBS)

Markdownlint

17-17: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


18-18: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


147-147: Use headings instead of emphasis for better structure.

The emphasis should be replaced with a heading for better readability and structure.

- **Valid path_specs.include**
+ ### Valid path_specs.include

- **Valid path_specs.exclude**
+ ### Valid path_specs.exclude

Also applies to: 163-163

Tools
Markdownlint

147-147: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


164-168: Add blank lines around lists.

Lists should be surrounded by blank lines for better readability.

- - \**/tests/**
+ 
+ - \**/tests/**
Tools
Markdownlint

164-164: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


165-165: null
Bare URL used

(MD034, no-bare-urls)


167-167: null
Bare URL used

(MD034, no-bare-urls)

Schemas for Parquet and Avro files are extracted as provided.

Schemas for schemaless formats (CSV, TSV, JSONL, JSON) are inferred. For CSV, TSV and JSONL files, we consider the first
100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details))


Fix invalid link fragment.

The link fragment #config-details should be valid.

- max_rows` recipe parameter (see [below](#config-details))
+ max_rows` recipe parameter (see [below](#profiling))
Tools
Markdownlint

33-33: null
Link fragments should be valid

(MD051, link-fragments)


Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your container has a more complex dataset representation.

**Additional points to note**

Use headings instead of emphasis for better structure.

The emphasis should be replaced with a heading for better readability and structure.

- **Additional points to note**
+ ### Additional Points to Note
Tools
Markdownlint

12-12: null
Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f6025a1 and 3f0d854.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

147-147: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


277-277: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

Additional comments not posted (8)
metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (8)

21-22: LGTM!

The function correctly checks if a given URI matches the Azure Blob Storage pattern.


25-29: LGTM!

The function correctly retrieves the prefix of an Azure Blob Storage URI.


32-40: LGTM!

The function correctly removes the Azure Blob Storage prefix from a URI.


43-55: LGTM!

The function correctly generates a URN for an Azure Blob Storage dataset.


58-63: LGTM!

The function correctly retrieves the container name from an Azure Blob Storage URI.


66-71: LGTM!

The function correctly retrieves the key prefix from an Azure Blob Storage URI.
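The URI helpers reviewed above are not quoted here, but the container/key-prefix split they perform can be sketched as follows. The function names mirror the review; the accepted URI shape and the regex are assumptions, not the connector's actual implementation:

```python
import re

# Assumed ABS URI shape: https://<account>.blob.core.windows.net/<container>/<key...>
ABS_PREFIX_RE = re.compile(r"^https?://[a-z0-9]+\.blob\.core\.windows\.net/")

def is_abs_uri(uri: str) -> bool:
    """Check whether a URI matches the Azure Blob Storage pattern."""
    return bool(ABS_PREFIX_RE.match(uri))

def strip_abs_prefix(uri: str) -> str:
    """Remove the account prefix, leaving '<container>/<key...>'."""
    match = ABS_PREFIX_RE.match(uri)
    if not match:
        raise ValueError(f"Not an ABS URI: {uri}")
    return uri[match.end():]

def get_container_name(uri: str) -> str:
    return strip_abs_prefix(uri).split("/", 1)[0]

def get_key_prefix(uri: str) -> str:
    return strip_abs_prefix(uri).split("/", 1)[1]

uri = "https://myaccount.blob.core.windows.net/mycontainer/data/2024/file.parquet"
assert is_abs_uri(uri)
assert get_container_name(uri) == "mycontainer"
assert get_key_prefix(uri) == "data/2024/file.parquet"
```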


154-188: LGTM!

The function correctly creates properties recursively.


285-286: LGTM!

The function correctly retrieves the relative path of a container.

Comment on lines 243 to 282
def list_folders(
container_name: str, prefix: str, azure_config: Optional[AzureConnectionConfig]
) -> Iterable[str]:
if azure_config is None:
raise ValueError(
"Azure configuration is not provided. Cannot retrieve container client."
)

abs_blob_service_client = azure_config.get_blob_service_client()
container_client = abs_blob_service_client.get_container_client(container_name)

current_level = prefix.count("/")
blob_list = container_client.list_blobs(name_starts_with=prefix)

this_dict = {}
for blob in blob_list:
blob_name = blob.name[: blob.name.rfind("/") + 1]
folder_structure_arr = blob_name.split("/")

folder_name = ""
if len(folder_structure_arr) > current_level:
folder_name = f"{folder_name}/{folder_structure_arr[current_level]}"
else:
continue

folder_name = folder_name[1 : len(folder_name)]

if folder_name.endswith("/"):
folder_name = folder_name[:-1]

if folder_name == "":
continue

folder_name = f"{prefix}{folder_name}"
if folder_name in this_dict.keys():
continue
else:
this_dict[folder_name] = folder_name

yield f"{folder_name}"

Remove unnecessary .keys() method.

Use key in dict instead of key in dict.keys() for better performance.

- if folder_name in this_dict.keys():
+ if folder_name in this_dict:
Tools
Ruff

277-277: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)
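The first-level-folder derivation in the `list_folders` code above can be exercised without Azure by feeding blob names directly. A minimal pure-Python sketch of the same logic, using a set and a plain `in` test as the Ruff suggestion asks (the function name and simplified length check are illustrative):

```python
from typing import Iterable, List

def list_folders_from_blob_names(blob_names: List[str], prefix: str) -> Iterable[str]:
    """Yield unique first-level 'folders' under prefix, mirroring the reviewed logic."""
    current_level = prefix.count("/")
    seen = set()
    for name in blob_names:
        parts = name.split("/")
        if len(parts) <= current_level + 1:
            continue  # blob sits directly at this level; there is no deeper folder
        folder = f"{prefix}{parts[current_level]}"
        if folder not in seen:  # plain membership test, no .keys()
            seen.add(folder)
            yield folder

blobs = [
    "raw/2023/a.parquet",
    "raw/2023/b.parquet",
    "raw/2024/a.parquet",
    "raw/readme.txt",
]
assert list(list_folders_from_blob_names(blobs, "raw/")) == ["raw/2023", "raw/2024"]
```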

Comment on lines +190 to +240
def get_abs_tags(
container_name: str,
key_name: Optional[str],
dataset_urn: str,
azure_config: Optional[AzureConnectionConfig],
ctx: PipelineContext,
use_abs_blob_tags: Optional[bool] = False,
) -> Optional[GlobalTagsClass]:
# Todo add the service_client, when building out this get_abs_tags
if azure_config is None:
raise ValueError(
"Azure configuration is not provided. Cannot retrieve container client."
)

tags_to_add: List[str] = []
blob_service_client = azure_config.get_blob_service_client()
container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(blob=key_name)

if use_abs_blob_tags and key_name is not None:
tag_set = blob_client.get_blob_tags()
if tag_set:
tags_to_add.extend(
make_tag_urn(f"""{key}:{value}""") for key, value in tag_set.items()
)
else:
# Unlike container tags, if an object does not have tags, it will just return an empty array
# as opposed to an exception.
logger.info(f"No tags found for container={container_name} key={key_name}")

if len(tags_to_add) == 0:
return None

if ctx.graph is not None:
logger.debug("Connected to DatahubApi, grabbing current tags to maintain.")
current_tags: Optional[GlobalTagsClass] = ctx.graph.get_aspect(
entity_urn=dataset_urn,
aspect_type=GlobalTagsClass,
)
if current_tags:
tags_to_add.extend([current_tag.tag for current_tag in current_tags.tags])
else:
logger.warning("Could not connect to DatahubApi. No current tags to maintain")

# Sort existing tags
tags_to_add = sorted(list(set(tags_to_add)))
# Remove duplicate tags
new_tags = GlobalTagsClass(
tags=[TagAssociationClass(tag_to_add) for tag_to_add in tags_to_add]
)
return new_tags

Add error handling for get_blob_tags.

The function can be improved by adding error handling for the get_blob_tags call to handle potential errors gracefully.

-        tag_set = blob_client.get_blob_tags()
+        try:
+            tag_set = blob_client.get_blob_tags()
+        except Exception as e:
+            logger.error(f"Failed to get blob tags: {e}")
+            tag_set = None
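The guarded-call pattern in the diff is easy to verify in isolation. A hedged sketch with a stub callable standing in for the Azure SDK client (the helper name and stub are illustrative, not part of the connector):

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger(__name__)

def safe_get_tags(fetch: Callable[[], Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return tags from fetch(), or None on failure (mirrors the suggested try/except)."""
    try:
        return fetch()
    except Exception as e:
        logger.error(f"Failed to get blob tags: {e}")
        return None

def failing_fetch() -> Dict[str, str]:
    raise RuntimeError("simulated Azure error")

assert safe_get_tags(lambda: {"env": "prod"}) == {"env": "prod"}
assert safe_get_tags(failing_fetch) is None  # error is logged, not raised
```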

Comment on lines 144 to 151
def add_property(
key: str, value: str, custom_properties: Dict[str, str], resource_name: str
) -> Dict[str, str]:
if key in custom_properties.keys():
key = f"{key}_{resource_name}"
if value is not None:
custom_properties[key] = str(value)
return custom_properties

Remove unnecessary .keys() method.

Use key in dict instead of key in dict.keys() for better performance.

- if key in custom_properties.keys():
+ if key in custom_properties:
Tools
Ruff

147-147: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)
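With the SIM118 fix applied, the reviewed `add_property` helper reads as below, and its collision-renaming behavior is easy to assert:

```python
from typing import Dict

def add_property(
    key: str, value: str, custom_properties: Dict[str, str], resource_name: str
) -> Dict[str, str]:
    if key in custom_properties:  # plain membership test instead of .keys()
        key = f"{key}_{resource_name}"
    if value is not None:
        custom_properties[key] = str(value)
    return custom_properties

props = {"last_modified": "2024-07-01"}
add_property("last_modified", "2024-07-02", props, "myblob")
# The colliding key is renamed with the resource name as a suffix.
assert props == {"last_modified": "2024-07-01", "last_modified_myblob": "2024-07-02"}
```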

Comment on lines +74 to +141
def get_abs_properties(
container_name: str,
blob_name: Optional[str],
full_path: str,
number_of_files: int,
size_in_bytes: int,
sample_files: bool,
azure_config: Optional[AzureConnectionConfig],
use_abs_container_properties: Optional[bool] = False,
use_abs_blob_properties: Optional[bool] = False,
) -> Dict[str, str]:
if azure_config is None:
raise ValueError(
"Azure configuration is not provided. Cannot retrieve container client."
)

blob_service_client = azure_config.get_blob_service_client()
container_client = blob_service_client.get_container_client(
container=container_name
)

custom_properties = {"schema_inferred_from": full_path}
if not sample_files:
custom_properties.update(
{
"number_of_files": str(number_of_files),
"size_in_bytes": str(size_in_bytes),
}
)

if use_abs_blob_properties and blob_name is not None:
blob_client = container_client.get_blob_client(blob=blob_name)
blob_properties = blob_client.get_blob_properties()
if blob_properties:
create_properties(
data=blob_properties,
prefix="blob",
custom_properties=custom_properties,
resource_name=blob_name,
json_properties=[
"metadata",
"content_settings",
"lease",
"copy",
"immutability_policy",
],
)
else:
logger.warning(
f"No blob properties found for container={container_name}, blob={blob_name}."
)

if use_abs_container_properties:
container_properties = container_client.get_container_properties()
if container_properties:
create_properties(
data=container_properties,
prefix="container",
custom_properties=custom_properties,
resource_name=container_name,
json_properties=["metadata"],
)
else:
logger.warning(
f"No container properties found for container={container_name}."
)

return custom_properties

Add error handling for get_blob_properties and get_container_properties.

The function can be improved by adding error handling for the get_blob_properties and get_container_properties calls to handle potential errors gracefully.

-        blob_properties = blob_client.get_blob_properties()
+        try:
+            blob_properties = blob_client.get_blob_properties()
+        except Exception as e:
+            logger.error(f"Failed to get blob properties: {e}")
+            blob_properties = None
-        container_properties = container_client.get_container_properties()
+        try:
+            container_properties = container_client.get_container_properties()
+        except Exception as e:
+            logger.error(f"Failed to get container properties: {e}")
+            container_properties = None

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 3f0d854 and d315ecd.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d315ecd and b16d626.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

163-164: Use a single if statement instead of nested if statements

(SIM102)


686-686: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)
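Ruff's B007 simply asks for an underscore prefix when a loop variable is intentionally unused; a minimal illustration (the dict contents are made up):

```python
container_mapping = {"guid-1": "containerA", "guid-2": "containerB"}

# Before: `guid` would be bound but never read inside the body (triggers B007).
# After: the underscore prefix signals the key is intentionally ignored.
names = []
for _guid, container in container_mapping.items():
    names.append(container)

assert names == ["containerA", "containerB"]
```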

Additional comments not posted (35)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (35)

128-148: LGTM!

The function correctly maps Spark types to DataHub types.
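A dictionary-with-fallback sketch of such a type mapping is below. The type names here are assumptions for illustration; the real connector maps to DataHub schema classes rather than strings:

```python
from typing import Dict

# Illustrative Spark -> DataHub type-name mapping with a NullType fallback.
_SPARK_TO_DATAHUB: Dict[str, str] = {
    "StringType": "StringTypeClass",
    "IntegerType": "NumberTypeClass",
    "DoubleType": "NumberTypeClass",
    "BooleanType": "BooleanTypeClass",
    "TimestampType": "TimeTypeClass",
}

def get_column_type(spark_type: str) -> str:
    """Return the mapped type name, falling back to a null type for unknowns."""
    return _SPARK_TO_DATAHUB.get(spark_type, "NullTypeClass")

assert get_column_type("StringType") == "StringTypeClass"
assert get_column_type("MapType") == "NullTypeClass"  # unmapped types fall back
```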


159-179: Simplify nested if statements.

Use a single if statement instead of nested if statements.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Tools
Ruff

163-164: Use a single if statement instead of nested if statements

(SIM102)
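The merged condition compares partition-key names (the part before the last `=`). A quick check of the collapsed condition, wrapped in an illustrative helper:

```python
def same_partition_key(folder1: str, folder2: str) -> bool:
    """True when both folders are key=value partitions sharing the same key."""
    return (
        "=" in folder1
        and "=" in folder2
        and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]
    )

assert same_partition_key("year=2023", "year=2024")
assert not same_partition_key("year=2023", "month=07")  # different keys
assert not same_partition_key("year=2023", "archive")   # not a partition folder
```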


182-193: LGTM!

The TableData dataclass correctly represents table metadata.


195-700: LGTM!

The ABSSource class is well-structured and follows the expected pattern for DataHub sources.

Tools
Ruff

686-686: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


642-688: LGTM!

The function correctly generates work units for the ingestion process.

Tools
Ruff

686-686: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


689-695: LGTM!

The function correctly returns a list of work unit processors.


697-698: LGTM!

The function correctly checks if the platform is Azure Blob Storage.


700-701: LGTM!

The function correctly returns the report object.


236-300: LGTM!

The function correctly handles different file formats and compression types.


302-325: LGTM!

The function correctly adds partition columns to the schema.


326-335: LGTM!

The function correctly creates an operation aspect for the table.


337-438: LGTM!

The function is well-structured and handles various aspects of table ingestion.


523-607: LGTM!

The function is well-structured and handles various scenarios for browsing Azure Blob Storage.


609-616: LGTM!

The function correctly creates an ABS path from a given key.


618-641: LGTM!

The function is well-structured and handles various scenarios for browsing local files.


475-491: LGTM!

The function correctly resolves folders with templates in the prefix.


492-521: LGTM!

The function correctly retrieves the directory to process based on the path specification.


447-450: LGTM!

The function correctly extracts the table name from the path specification.


452-473: LGTM!

The function correctly extracts table data from the path specification and file metadata.


440-445: LGTM!

The function correctly retrieves the prefix from the relative path.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b16d626 and d45710c.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (24)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (24)

1-95: Imports look good.

The imports are necessary for the new Azure Blob Storage functionality.


127-147: Function looks good.

The get_column_type function correctly maps Spark types to DataHub types and includes error handling.


158-178: Simplify nested if statements.

Use a single if statement instead of nested if statements.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)


181-191: Dataclass looks good.

The TableData dataclass correctly represents table data and includes necessary fields.


210-227: Constructor looks good.

The __init__ method correctly initializes the ABSSource class and includes telemetry reporting.


229-233: Factory method looks good.

The create method correctly creates an instance of the ABSSource class.


235-299: Method looks good.

The get_fields method correctly extracts fields from a table and includes error handling.


301-324: Method looks good.

The add_partition_columns_to_schema method correctly adds partition columns to the schema and includes necessary logging.


325-334: Method looks good.

The _create_table_operation_aspect method correctly creates a table operation aspect and follows the expected structure.


336-437: Method looks good.

The ingest_table method correctly ingests a table and yields metadata work units. It includes error handling and supports various aspects.


439-444: Method looks good.

The get_prefix method correctly gets the prefix from a relative path and includes necessary regex handling.


446-449: Method looks good.

The extract_table_name method correctly extracts the table name from a path spec and includes necessary validation.


451-472: Method looks good.

The extract_table_data method correctly extracts table data from a path spec and includes necessary logging.


474-490: Method looks good.

The resolve_templated_folders method correctly resolves templated folders and includes necessary recursion.


491-520: Method looks good.

The get_dir_to_process method correctly gets the directory to process and includes necessary sorting and recursion.


522-606: Method looks good.

The abs_browser method correctly browses Azure Blob Storage and includes error handling. It supports templated folders.


608-615: Method looks good.

The create_abs_path method correctly creates an Azure Blob Storage path and includes necessary validation.
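The URL shape produced by create_abs_path is simple enough to check standalone. A minimal sketch of the construction, using made-up account and container names (the real values come from azure_config):

```python
# Sketch of the blob-URL construction in create_abs_path; the account and
# container names below are illustrative, not from the PR.
def create_abs_path(account_name: str, container_name: str, key: str) -> str:
    # Azure Blob Storage object URLs follow https://<account>.blob.core.windows.net/<container>/<key>
    return f"https://{account_name}.blob.core.windows.net/{container_name}/{key}"

url = create_abs_path("myaccount", "raw-data", "year=2024/data.parquet")
assert url == "https://myaccount.blob.core.windows.net/raw-data/year=2024/data.parquet"
```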


617-640: Method looks good.

The local_browser method correctly browses local files and includes necessary sorting and recursion.


641-687: Method looks good.

The get_workunits_internal method correctly generates metadata work units and includes error handling. It supports various path specs.

Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


688-694: Method looks good.

The get_workunit_processors method correctly returns a list of work unit processors and includes the stale entity removal handler.


696-697: Method looks good.

The is_abs_platform method correctly checks if the platform is Azure Blob Storage and includes necessary validation.


699-700: Method looks good.

The get_report method correctly returns the report and includes necessary validation.


162-163: Simplify nested if statements.

Use a single if statement instead of nested if statements.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Rename unused loop variable.

Rename the unused loop variable guid to _guid.

-                for guid, table_data in table_dict.items():
+                for _guid, table_data in table_dict.items():
Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between d45710c and b4bf0a4.

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (6)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (6)

235-299: LGTM!

The function is well-structured and handles different file formats appropriately.


301-324: LGTM!

The function is well-structured and performs the intended operations correctly.


325-334: LGTM!

The function is well-structured and performs the intended operations correctly.


336-437: LGTM!

The function is well-structured and performs the intended operations correctly.


522-607: LGTM!

The function is well-structured and performs the intended operations correctly.


617-640: LGTM!

The function is well-structured and performs the intended operations correctly.

Comment on lines +162 to +163
if "=" in folder1 and "=" in folder2:
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Contributor

Simplify nested if statements.

Use a single if statement instead of nested if statements.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if "=" in folder1 and "=" in folder2:
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b4bf0a4 and b2174cc.

Files selected for processing (1)
  • metadata-ingestion/setup.py (3 hunks)
Additional comments not posted (3)
metadata-ingestion/setup.py (3)

261-266: Addition of abs_base set looks good!

The dependencies listed are necessary for Azure Blob Storage functionality. Ensure that the specified versions are compatible with the rest of the project.


275-275: Modification to delta_lake set looks good!

Including abs_base ensures that Azure Blob Storage dependencies are available for Delta Lake functionality.


417-417: Addition of abs source plugin looks good!

The plugin is correctly referenced in the plugins dictionary and entry_points. Ensure that the plugin is properly integrated and tested.

Also applies to: 696-696

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b2174cc and 8f312da.

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (4)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (4)

158-178: Simplify nested if statements.

The nested if statements can be simplified to improve readability. This change was previously suggested and is still valid.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Rename unused loop variable.

The loop variable guid is not used within the loop body. It should be renamed to _guid to indicate that it is unused.

-                for guid, table_data in table_dict.items():
+                for _guid, table_data in table_dict.items():
Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)


235-299: Review file format handling in schema extraction.

This method handles various file formats for schema extraction. Ensure that all expected file formats are supported and that the error handling is robust.


522-607: Review browsing logic and error handling in Azure Blob Storage.

This method handles browsing in Azure Blob Storage. Ensure that the path handling and error management are correctly implemented, especially the handling of specific exceptions like NoSuchBucket.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8f312da and 6963078.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 6963078 and 9b2d0cb.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 9b2d0cb and 7e61de6.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (2)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (2)

127-148: Function get_column_type is well-implemented.

The function correctly maps Spark types to DataHub types and handles unmapped types gracefully.


158-178: Skip: Simplification of nested if statements already addressed.

The previous comment about simplifying nested if statements has been addressed.

Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)

Comment on lines 204 to 699
prefix=f"{prefix}", results_per_page=PAGE_SIZE
):
abs_path = self.create_abs_path(obj.name)
logger.debug(f"Path: {abs_path}")
# the following line if using the file_system_client
# yield abs_path, obj.last_modified, obj.content_length,
yield abs_path, obj.name, obj.last_modified, obj.size

def create_abs_path(self, key: str) -> str:
if self.source_config.azure_config:
account_name = self.source_config.azure_config.account_name
container_name = self.source_config.azure_config.container_name
return (
f"https://{account_name}.blob.core.windows.net/{container_name}/{key}"
)
return ""

def local_browser(
self, path_spec: PathSpec
) -> Iterable[Tuple[str, str, datetime, int]]:
prefix = self.get_prefix(path_spec.include)
if os.path.isfile(prefix):
logger.debug(f"Scanning single local file: {prefix}")
file_name = prefix
yield prefix, file_name, datetime.utcfromtimestamp(
os.path.getmtime(prefix)
), os.path.getsize(prefix)
else:
logger.debug(f"Scanning files under local folder: {prefix}")
for root, dirs, files in os.walk(prefix):
dirs.sort(key=functools.cmp_to_key(partitioned_folder_comparator))

for file in sorted(files):
# We need to make sure the path is in posix style which is not true on windows
full_path = PurePath(
os.path.normpath(os.path.join(root, file))
).as_posix()
yield full_path, file, datetime.utcfromtimestamp(
os.path.getmtime(full_path)
), os.path.getsize(full_path)

def get_workunits_internal(self) -> Iterable[MetadataWorkUnit]:
self.container_WU_creator = ContainerWUCreator(
self.source_config.platform,
self.source_config.platform_instance,
self.source_config.env,
)
with PerfTimer():
assert self.source_config.path_specs
for path_spec in self.source_config.path_specs:
file_browser = (
self.abs_browser(
path_spec, self.source_config.number_of_files_to_sample
)
if self.is_abs_platform()
else self.local_browser(path_spec)
)
table_dict: Dict[str, TableData] = {}
for file, name, timestamp, size in file_browser:
if not path_spec.allowed(file):
continue
table_data = self.extract_table_data(
path_spec, file, name, timestamp, size
)
if table_data.table_path not in table_dict:
table_dict[table_data.table_path] = table_data
else:
table_dict[table_data.table_path].number_of_files = (
table_dict[table_data.table_path].number_of_files + 1
)
table_dict[table_data.table_path].size_in_bytes = (
table_dict[table_data.table_path].size_in_bytes
+ table_data.size_in_bytes
)
if (
table_dict[table_data.table_path].timestamp
< table_data.timestamp
) and (table_data.size_in_bytes > 0):
table_dict[
table_data.table_path
].full_path = table_data.full_path
table_dict[
table_data.table_path
].timestamp = table_data.timestamp

for guid, table_data in table_dict.items():
yield from self.ingest_table(table_data, path_spec)

def get_workunit_processors(self) -> List[Optional[MetadataWorkUnitProcessor]]:
return [
*super().get_workunit_processors(),
StaleEntityRemovalHandler.create(
self, self.source_config, self.ctx
).workunit_processor,
]

def is_abs_platform(self):
return self.source_config.platform == "abs"

def get_report(self):
Contributor

Consider refactoring the get_fields method for better readability.

The get_fields method is functional but complex. Breaking it into smaller, more focused methods could improve readability and maintainability.

Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)
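One way to act on the refactoring suggestion above is to replace the extension if/elif chain with a dispatch table. The sketch below is a hypothetical shape only; ParquetInferrer and JsonInferrer are stand-ins, not the datahub inferrer classes:

```python
# Hypothetical dispatch-table refactor of the if/elif chain in get_fields.
# The inferrer classes are stand-ins so the sketch is self-contained.
class ParquetInferrer:
    def infer_schema(self, file):
        return ["parquet_field"]

class JsonInferrer:
    def infer_schema(self, file):
        return ["json_field"]

_INFERRERS = {".parquet": ParquetInferrer, ".json": JsonInferrer}

def infer_fields(extension, file, warnings):
    # Look up the inferrer for the extension; warn and return no fields
    # when the extension is unsupported, mirroring the reviewed method.
    inferrer_cls = _INFERRERS.get(extension)
    if inferrer_cls is None:
        warnings.append(f"unsupported extension {extension}")
        return []
    return inferrer_cls().infer_schema(file)

warnings = []
fields = infer_fields(".json", None, warnings)
unsupported = infer_fields(".xml", None, warnings)
```

Adding a new format then becomes a one-line table entry instead of another elif branch.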

table_data.table_path
].timestamp = table_data.timestamp

for guid, table_data in table_dict.items():
Contributor

Rename unused loop variable for clarity.

The loop variable guid is unused and should be renamed to _guid to indicate it's intentionally unused.

-                for guid, table_data in table_dict.items():
+                for _guid, table_data in table_dict.items():

Committable suggestion was skipped due to low confidence.

Tools
Ruff

685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Outside diff range, codebase verification and nitpick comments (2)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (2)

123-124: Optimization of Compression Registration

The workaround that registers the .gzip extension with smart_open is clear, but consider adding a comment explaining why this registration is necessary, since the reason may not be obvious to readers.


336-438: Method: ingest_table

This method orchestrates the ingestion process and handles various metadata aspects. Ensure that all operations are atomic and transactional where necessary, especially when dealing with external systems like databases or cloud services.

Consider adding more detailed logging to trace the ingestion process step-by-step.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7e61de6 and 7a2c3fa.

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/abs/source.py (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/abs/source.py

162-163: Use a single if statement instead of nested if statements

(SIM102)


685-685: Loop control variable guid not used within loop body

Rename unused guid to _guid

(B007)

Additional comments not posted (3)
metadata-ingestion/src/datahub/ingestion/source/abs/source.py (3)

1-33: Review of Imports and Global Constants

The imports and global constants are well-organized and relevant to the functionality being implemented. However, ensure that all these libraries and modules are necessary and used within the file to avoid unnecessary dependencies.


181-192: Data Class: TableData

The TableData class is well-defined with appropriate type hints. Ensure that all fields are necessary and used effectively throughout the codebase to prevent data bloat.


194-203: Class Decorators for ABSSource

The use of decorators to define capabilities and configuration requirements is clear and enhances modularity. Ensure that all capabilities like DATA_PROFILING and TAGS are implemented correctly within the class methods.

Comment on lines +127 to +148
def get_column_type(
report: SourceReport, dataset_name: str, column_type: str
) -> SchemaFieldDataType:
"""
Maps known Spark types to datahub types
"""
TypeClass: Any = None

for field_type, type_class in _field_type_mapping.items():
if isinstance(column_type, field_type):
TypeClass = type_class
break

# if still not found, report the warning
if TypeClass is None:
report.report_warning(
dataset_name, f"unable to map type {column_type} to metadata schema"
)
TypeClass = NullTypeClass

return SchemaFieldDataType(type=TypeClass())

Contributor

Function: get_column_type

This function maps Spark types to DataHub types. It's crucial to ensure that all possible Spark types are handled, and any unmapped types are reported correctly. The use of a warning when a type cannot be mapped is a good practice.

Consider adding unit tests to cover all branches of this function, especially the default case where a type cannot be mapped.

Would you like me to help create unit tests for this function?
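A self-contained sketch of what such a test could look like. The type classes here are stand-ins that mirror the names of the datahub schema classes, and the mapping is illustrative, so this shows the mapping-with-fallback pattern rather than the actual datahub code:

```python
# Stand-ins for datahub's type classes (names mirror the real ones,
# but these are not the actual datahub classes).
class NullTypeClass: ...
class NumberTypeClass: ...
class StringTypeClass: ...

# Illustrative mapping; the real _field_type_mapping covers Spark types.
_field_type_mapping = {int: NumberTypeClass, str: StringTypeClass}

def get_column_type(column_type, warnings):
    # Walk the mapping; fall back to NullTypeClass and record a warning,
    # mirroring the default branch of the reviewed function.
    for field_type, type_class in _field_type_mapping.items():
        if isinstance(column_type, field_type):
            return type_class()
    warnings.append(f"unable to map type {type(column_type).__name__}")
    return NullTypeClass()

warnings = []
assert isinstance(get_column_type(42, warnings), NumberTypeClass)
assert isinstance(get_column_type("name", warnings), StringTypeClass)
assert isinstance(get_column_type(3.5, warnings), NullTypeClass)
assert warnings == ["unable to map type float"]
```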

Comment on lines +210 to +228
def __init__(self, config: DataLakeSourceConfig, ctx: PipelineContext):
super().__init__(config, ctx)
self.source_config = config
self.report = DataLakeSourceReport()
self.profiling_times_taken = []
config_report = {
config_option: config.dict().get(config_option)
for config_option in config_options_to_report
}
config_report = {
**config_report,
"profiling_enabled": config.is_profiling_enabled(),
}

telemetry.telemetry_instance.ping(
"data_lake_config",
config_report,
)

Contributor

Constructor of ABSSource

The constructor initializes several configurations and reports telemetry. It's important to ensure that telemetry data is handled securely and does not contain sensitive information.

Consider reviewing what data is being sent in telemetry to ensure compliance with privacy laws.

Comment on lines +522 to +607
def abs_browser(
self, path_spec: PathSpec, sample_size: int
) -> Iterable[Tuple[str, str, datetime, int]]:
if self.source_config.azure_config is None:
raise ValueError("azure_config not set. Cannot browse Azure Blob Storage")
abs_blob_service_client = (
self.source_config.azure_config.get_blob_service_client()
)
container_client = abs_blob_service_client.get_container_client(
self.source_config.azure_config.container_name
)

container_name = self.source_config.azure_config.container_name
logger.debug(f"Scanning container: {container_name}")

prefix = self.get_prefix(get_container_relative_path(path_spec.include))
logger.debug(f"Scanning objects with prefix:{prefix}")

matches = re.finditer(r"{\s*\w+\s*}", path_spec.include, re.MULTILINE)
matches_list = list(matches)
if matches_list and path_spec.sample_files:
max_start: int = -1
include: str = path_spec.include
max_match: str = ""
for match in matches_list:
pos = include.find(match.group())
if pos > max_start:
if max_match:
include = include.replace(max_match, "*")
max_start = match.start()
max_match = match.group()

table_index = include.find(max_match)

for folder in self.resolve_templated_folders(
container_name,
get_container_relative_path(include[:table_index]),
):
try:
for f in list_folders(
container_name, f"{folder}", self.source_config.azure_config
):
logger.info(f"Processing folder: {f}")
protocol = ContainerWUCreator.get_protocol(path_spec.include)
dir_to_process = self.get_dir_to_process(
container_name=container_name,
folder=f + "/",
path_spec=path_spec,
protocol=protocol,
)
logger.info(f"Getting files from folder: {dir_to_process}")
dir_to_process = dir_to_process.rstrip("\\")
for obj in container_client.list_blobs(
name_starts_with=f"{dir_to_process}",
results_per_page=PAGE_SIZE,
):
abs_path = self.create_abs_path(obj.name)
logger.debug(f"Sampling file: {abs_path}")
yield abs_path, obj.name, obj.last_modified, obj.size,
except Exception as e:
# This odd check if being done because boto does not have a proper exception to catch
# The exception that appears in stacktrace cannot actually be caught without a lot more work
# https://github.com/boto/boto3/issues/1195
if "NoSuchBucket" in repr(e):
logger.debug(
f"Got NoSuchBucket exception for {container_name}", e
)
self.get_report().report_warning(
"Missing bucket", f"No bucket found {container_name}"
)
else:
raise e
else:
logger.debug(
"No template in the pathspec can't do sampling, fallbacking to do full scan"
)
path_spec.sample_files = False
for obj in container_client.list_blobs(
prefix=f"{prefix}", results_per_page=PAGE_SIZE
):
abs_path = self.create_abs_path(obj.name)
logger.debug(f"Path: {abs_path}")
# the following line if using the file_system_client
# yield abs_path, obj.last_modified, obj.content_length,
yield abs_path, obj.name, obj.last_modified, obj.size

Contributor

Method: abs_browser

This method handles the browsing of ABS paths and is crucial for the ingestion process. Ensure that error handling is robust, especially for network-related errors or API limitations.

The handling of exceptions related to NoSuchBucket could be improved by using more specific exception types if available.

Consider using more specific exception types for better error handling.
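A minimal sketch of that approach, assuming azure.core.exceptions.ResourceNotFoundError is what the SDK raises for a missing container. A stand-in exception class is defined here so the snippet runs without the Azure SDK; real code would import it from azure.core.exceptions:

```python
# Stand-in for azure.core.exceptions.ResourceNotFoundError so this runs
# without the Azure SDK installed.
class ResourceNotFoundError(Exception):
    pass

def list_blobs(container_name):
    # Simulates the SDK call failing for a nonexistent container.
    raise ResourceNotFoundError(f"container {container_name} not found")

def browse(container_name, warnings):
    try:
        return list(list_blobs(container_name))
    except ResourceNotFoundError:
        # A specific catch replaces the string match on repr(e).
        warnings.append(f"No container found: {container_name}")
        return []

warnings = []
result = browse("missing-container", warnings)
```

Catching the concrete exception type keeps unrelated errors propagating instead of being swallowed by a substring check.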

Comment on lines +158 to +178
def partitioned_folder_comparator(folder1: str, folder2: str) -> int:
# Try to convert to number and compare if the folder name is a number
try:
# Stripping = from the folder names as it most probably partition name part like year=2021
if "=" in folder1 and "=" in folder2:
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
folder1 = folder1.rsplit("=", 1)[-1]
folder2 = folder2.rsplit("=", 1)[-1]

num_folder1 = int(folder1)
num_folder2 = int(folder2)
if num_folder1 == num_folder2:
return 0
else:
return 1 if num_folder1 > num_folder2 else -1
except Exception:
# If folder name is not a number then do string comparison
if folder1 == folder2:
return 0
else:
return 1 if folder1 > folder2 else -1
Contributor

Function: partitioned_folder_comparator Review

The nested if statements have been flagged by both a previous reviewer and static analysis tools. The suggestion to simplify these into a single if statement is valid and should be implemented to improve readability.

-        if "=" in folder1 and "=" in folder2:
-            if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
+        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
Committable suggestion


Suggested change
def partitioned_folder_comparator(folder1: str, folder2: str) -> int:
# Try to convert to number and compare if the folder name is a number
try:
# Stripping = from the folder names as it most probably partition name part like year=2021
if "=" in folder1 and "=" in folder2:
if folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
folder1 = folder1.rsplit("=", 1)[-1]
folder2 = folder2.rsplit("=", 1)[-1]
num_folder1 = int(folder1)
num_folder2 = int(folder2)
if num_folder1 == num_folder2:
return 0
else:
return 1 if num_folder1 > num_folder2 else -1
except Exception:
# If folder name is not a number then do string comparison
if folder1 == folder2:
return 0
else:
return 1 if folder1 > folder2 else -1
def partitioned_folder_comparator(folder1: str, folder2: str) -> int:
# Try to convert to number and compare if the folder name is a number
try:
# Stripping = from the folder names as it most probably partition name part like year=2021
if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
folder1 = folder1.rsplit("=", 1)[-1]
folder2 = folder2.rsplit("=", 1)[-1]
num_folder1 = int(folder1)
num_folder2 = int(folder2)
if num_folder1 == num_folder2:
return 0
else:
return 1 if num_folder1 > num_folder2 else -1
except Exception:
# If folder name is not a number then do string comparison
if folder1 == folder2:
return 0
else:
return 1 if folder1 > folder2 else -1
Tools
Ruff

162-163: Use a single if statement instead of nested if statements

(SIM102)
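For reference, the comparator with the suggested fix applied is small enough to run standalone. This is a condensed restatement of the function under review (the return expressions are compacted, and ValueError is caught instead of a bare Exception), shown with the functools.cmp_to_key usage that motivates it:

```python
import functools

def partitioned_folder_comparator(folder1: str, folder2: str) -> int:
    # Compare partition folders like year=2021 numerically when they share
    # a key; fall back to plain string comparison otherwise.
    try:
        if "=" in folder1 and "=" in folder2 and folder1.rsplit("=", 1)[0] == folder2.rsplit("=", 1)[0]:
            folder1 = folder1.rsplit("=", 1)[-1]
            folder2 = folder2.rsplit("=", 1)[-1]
        num1, num2 = int(folder1), int(folder2)
        return (num1 > num2) - (num1 < num2)
    except ValueError:
        # Non-numeric folder names: lexicographic comparison.
        return (folder1 > folder2) - (folder1 < folder2)

folders = ["month=10", "month=2", "month=1"]
ordered = sorted(folders, key=functools.cmp_to_key(partitioned_folder_comparator))
```

A plain string sort would order these as month=1, month=10, month=2; the numeric comparison yields month=1, month=2, month=10.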

Comment on lines +235 to +298
def get_fields(self, table_data: TableData, path_spec: PathSpec) -> List:
if self.is_abs_platform():
if self.source_config.azure_config is None:
raise ValueError("Azure config is required for ABS file sources")

abs_client = self.source_config.azure_config.get_blob_service_client()
file = smart_open(
f"azure://{self.source_config.azure_config.container_name}/{table_data.rel_path}",
"rb",
transport_params={"client": abs_client},
)
else:
# We still use smart_open here to take advantage of the compression
# capabilities of smart_open.
file = smart_open(table_data.full_path, "rb")

fields = []

extension = pathlib.Path(table_data.full_path).suffix
from datahub.ingestion.source.data_lake_common.path_spec import (
SUPPORTED_COMPRESSIONS,
)

if path_spec.enable_compression and (extension[1:] in SUPPORTED_COMPRESSIONS):
# Removing the compression extension and using the one before that like .json.gz -> .json
extension = pathlib.Path(table_data.full_path).with_suffix("").suffix
if extension == "" and path_spec.default_extension:
extension = f".{path_spec.default_extension}"

try:
if extension == ".parquet":
fields = parquet.ParquetInferrer().infer_schema(file)
elif extension == ".csv":
fields = csv_tsv.CsvInferrer(
max_rows=self.source_config.max_rows
).infer_schema(file)
elif extension == ".tsv":
fields = csv_tsv.TsvInferrer(
max_rows=self.source_config.max_rows
).infer_schema(file)
elif extension == ".json":
fields = json.JsonInferrer().infer_schema(file)
elif extension == ".avro":
fields = avro.AvroInferrer().infer_schema(file)
else:
self.report.report_warning(
table_data.full_path,
f"file {table_data.full_path} has unsupported extension",
)
file.close()
except Exception as e:
self.report.report_warning(
table_data.full_path,
f"could not infer schema for file {table_data.full_path}: {e}",
)
file.close()
logger.debug(f"Extracted fields in schema: {fields}")
fields = sorted(fields, key=lambda f: f.fieldPath)

if self.source_config.add_partition_columns_to_schema:
self.add_partition_columns_to_schema(
fields=fields, path_spec=path_spec, full_path=table_data.full_path
)

Contributor

Method: get_fields

This method is complex and handles various file types. Refactoring to break this method into smaller, more focused methods could improve readability and maintainability. Additionally, ensure that error handling is robust, especially around file operations.

Consider the following refactor suggestion:

-        try:
-            if extension == ".parquet":
-                fields = parquet.ParquetInferrer().infer_schema(file)
-            elif extension == ".csv":
-                fields = csv_tsv.CsvInferrer(
-                    max_rows=self.source_config.max_rows
-                ).infer_schema(file)
-            elif extension == ".tsv":
-                fields = csv_tsv.TsvInferrer(
-                    max_rows=self.source_config.max_rows
-                ).infer_schema(file)
-            elif extension == ".json":
-                fields = json.JsonInferrer().infer_schema(file)
-            elif extension == ".avro":
-                fields = avro.AvroInferrer().infer_schema(file)
-            else:
-                self.report.report_warning(
-                    table_data.full_path,
-                    f"file {table_data.full_path} has unsupported extension",
-                )
-            file.close()
-        except Exception as e:
-            self.report.report_warning(
-                table_data.full_path,
-                f"could not infer schema for file {table_data.full_path}: {e}",
-            )
-            file.close()
+        fields = self.handle_file_inference(file, extension, table_data)
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_fields(self, table_data: TableData, path_spec: PathSpec) -> List:
if self.is_abs_platform():
if self.source_config.azure_config is None:
raise ValueError("Azure config is required for ABS file sources")
abs_client = self.source_config.azure_config.get_blob_service_client()
file = smart_open(
f"azure://{self.source_config.azure_config.container_name}/{table_data.rel_path}",
"rb",
transport_params={"client": abs_client},
)
else:
# We still use smart_open here to take advantage of the compression
# capabilities of smart_open.
file = smart_open(table_data.full_path, "rb")
fields = []
extension = pathlib.Path(table_data.full_path).suffix
from datahub.ingestion.source.data_lake_common.path_spec import (
SUPPORTED_COMPRESSIONS,
)
if path_spec.enable_compression and (extension[1:] in SUPPORTED_COMPRESSIONS):
# Removing the compression extension and using the one before that like .json.gz -> .json
extension = pathlib.Path(table_data.full_path).with_suffix("").suffix
if extension == "" and path_spec.default_extension:
extension = f".{path_spec.default_extension}"
try:
if extension == ".parquet":
fields = parquet.ParquetInferrer().infer_schema(file)
elif extension == ".csv":
fields = csv_tsv.CsvInferrer(
max_rows=self.source_config.max_rows
).infer_schema(file)
elif extension == ".tsv":
fields = csv_tsv.TsvInferrer(
max_rows=self.source_config.max_rows
).infer_schema(file)
elif extension == ".json":
fields = json.JsonInferrer().infer_schema(file)
elif extension == ".avro":
fields = avro.AvroInferrer().infer_schema(file)
else:
self.report.report_warning(
table_data.full_path,
f"file {table_data.full_path} has unsupported extension",
)
file.close()
except Exception as e:
self.report.report_warning(
table_data.full_path,
f"could not infer schema for file {table_data.full_path}: {e}",
)
file.close()
logger.debug(f"Extracted fields in schema: {fields}")
fields = sorted(fields, key=lambda f: f.fieldPath)
if self.source_config.add_partition_columns_to_schema:
self.add_partition_columns_to_schema(
fields=fields, path_spec=path_spec, full_path=table_data.full_path
)
def get_fields(self, table_data: TableData, path_spec: PathSpec) -> List:
if self.is_abs_platform():
if self.source_config.azure_config is None:
raise ValueError("Azure config is required for ABS file sources")
abs_client = self.source_config.azure_config.get_blob_service_client()
file = smart_open(
f"azure://{self.source_config.azure_config.container_name}/{table_data.rel_path}",
"rb",
transport_params={"client": abs_client},
)
else:
# We still use smart_open here to take advantage of the compression
# capabilities of smart_open.
file = smart_open(table_data.full_path, "rb")
fields = []
extension = pathlib.Path(table_data.full_path).suffix
from datahub.ingestion.source.data_lake_common.path_spec import (
SUPPORTED_COMPRESSIONS,
)
if path_spec.enable_compression and (extension[1:] in SUPPORTED_COMPRESSIONS):
# Removing the compression extension and using the one before that like .json.gz -> .json
extension = pathlib.Path(table_data.full_path).with_suffix("").suffix
if extension == "" and path_spec.default_extension:
extension = f".{path_spec.default_extension}"
fields = self.handle_file_inference(file, extension, table_data)
logger.debug(f"Extracted fields in schema: {fields}")
fields = sorted(fields, key=lambda f: f.fieldPath)
if self.source_config.add_partition_columns_to_schema:
self.add_partition_columns_to_schema(
fields=fields, path_spec=path_spec, full_path=table_data.full_path
)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7a2c3fa and 240d116.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/azure/abs_util.py

@treff7es treff7es left a comment

lgtm, thanks for the contribution

@treff7es treff7es changed the title from "Adding azure blob ingestion" to "feat(ingest/abs): Adding azure blob storage ingestion source" on Jul 17, 2024
@treff7es treff7es merged commit 13b6feb into datahub-project:master Jul 17, 2024
56 of 58 checks passed
aviv-julienjehannet pushed a commit to aviv-julienjehannet/datahub that referenced this pull request Jul 25, 2024
arosanda added a commit to infobip/datahub that referenced this pull request Sep 23, 2024
* feat(forms) Handle deleting forms references when hard deleting forms (datahub-project#10820)

* refactor(ui): Misc improvements to the setup ingestion flow (ingest uplift 1/2)  (datahub-project#10764)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* fix(ingestion/airflow-plugin): pipeline tasks discoverable in search (datahub-project#10819)

* feat(ingest/transformer): tags to terms transformer (datahub-project#10758)

Co-authored-by: Aseem Bansal <[email protected]>

* fix(ingestion/unity-catalog): fixed issue with profiling with GE turned on (datahub-project#10752)

Co-authored-by: Aseem Bansal <[email protected]>

* feat(forms) Add java SDK for form entity PATCH + CRUD examples (datahub-project#10822)

* feat(SDK) Add java SDK for structuredProperty entity PATCH + CRUD examples (datahub-project#10823)

* feat(SDK) Add StructuredPropertyPatchBuilder in python sdk and provide sample CRUD files (datahub-project#10824)

* feat(forms) Add CRUD endpoints to GraphQL for Form entities (datahub-project#10825)

* add flag for includeSoftDeleted in scroll entities API (datahub-project#10831)

* feat(deprecation) Return actor entity with deprecation aspect (datahub-project#10832)

* feat(structuredProperties) Add CRUD graphql APIs for structured property entities (datahub-project#10826)

* add scroll parameters to openapi v3 spec (datahub-project#10833)

* fix(ingest): correct profile_day_of_week implementation (datahub-project#10818)

* feat(ingest/glue): allow ingestion of empty databases from Glue (datahub-project#10666)

Co-authored-by: Harshal Sheth <[email protected]>

* feat(cli): add more details to get cli (datahub-project#10815)

* fix(ingestion/glue): ensure date formatting works on all platforms for aws glue (datahub-project#10836)

* fix(ingestion): fix datajob patcher (datahub-project#10827)

* fix(smoke-test): add suffix in temp file creation (datahub-project#10841)

* feat(ingest/glue): add helper method to permit user or group ownership (datahub-project#10784)

* feat(): Show data platform instances in policy modal if they are set on the policy (datahub-project#10645)

Co-authored-by: Hendrik Richert <[email protected]>

* docs(patch): add patch documentation for how implementation works (datahub-project#10010)

Co-authored-by: John Joyce <[email protected]>

* fix(jar): add missing custom-plugin-jar task (datahub-project#10847)

* fix(): also check exceptions/stack trace when filtering log messages (datahub-project#10391)

Co-authored-by: John Joyce <[email protected]>

* docs(): Update posts.md (datahub-project#9893)

Co-authored-by: Hyejin Yoon <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* chore(ingest): update acryl-datahub-classify version (datahub-project#10844)

* refactor(ingest): Refactor structured logging to support infos, warnings, and failures structured reporting to UI (datahub-project#10828)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(restli): log aspect-not-found as a warning rather than as an error (datahub-project#10834)

* fix(ingest/nifi): remove duplicate upstream jobs (datahub-project#10849)

* fix(smoke-test): test access to create/revoke personal access tokens (datahub-project#10848)

* fix(smoke-test): missing test for move domain (datahub-project#10837)

* ci: update usernames to not considered for community (datahub-project#10851)

* env: change defaults for data contract visibility (datahub-project#10854)

* fix(ingest/tableau): quote special characters in external URL (datahub-project#10842)

* fix(smoke-test): fix flakiness of auto complete test

* ci(ingest): pin dask dependency for feast (datahub-project#10865)

* fix(ingestion/lookml): liquid template resolution and view-to-view cll (datahub-project#10542)

* feat(ingest/audit): add client id and version in system metadata props (datahub-project#10829)

* chore(ingest): Mypy 1.10.1 pin (datahub-project#10867)

* docs: use acryl-datahub-actions as expected python package to install (datahub-project#10852)

* docs: add new js snippet (datahub-project#10846)

* refactor(ingestion): remove company domain for security reason (datahub-project#10839)

* fix(ingestion/spark): Platform instance and column level lineage fix (datahub-project#10843)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingestion/tableau): optionally ingest multiple sites and create site containers (datahub-project#10498)

Co-authored-by: Yanik Häni <[email protected]>

* fix(ingestion/looker): Add sqlglot dependency and remove unused sqlparser (datahub-project#10874)

* fix(manage-tokens): fix manage access token policy (datahub-project#10853)

* Batch get entity endpoints (datahub-project#10880)

* feat(system): support conditional write semantics (datahub-project#10868)

* fix(build): upgrade vercel builds to Node 20.x (datahub-project#10890)

* feat(ingest/lookml): shallow clone repos (datahub-project#10888)

* fix(ingest/looker): add missing dependency (datahub-project#10876)

* fix(ingest): only populate audit stamps where accurate (datahub-project#10604)

* fix(ingest/dbt): always encode tag urns (datahub-project#10799)

* fix(ingest/redshift): handle multiline alter table commands (datahub-project#10727)

* fix(ingestion/looker): column name missing in explore (datahub-project#10892)

* fix(lineage) Fix lineage source/dest filtering with explored per hop limit (datahub-project#10879)

* feat(conditional-writes): misc updates and fixes (datahub-project#10901)

* feat(ci): update outdated action (datahub-project#10899)

* feat(rest-emitter): adding async flag to rest emitter (datahub-project#10902)

Co-authored-by: Gabe Lyons <[email protected]>

* feat(ingest): add snowflake-queries source (datahub-project#10835)

* fix(ingest): improve `auto_materialize_referenced_tags_terms` error handling (datahub-project#10906)

* docs: add new company to adoption list (datahub-project#10909)

* refactor(redshift): Improve redshift error handling with new structured reporting system (datahub-project#10870)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* feat(ui) Finalize support for all entity types on forms (datahub-project#10915)

* Index ExecutionRequestResults status field (datahub-project#10811)

* feat(ingest): grafana connector (datahub-project#10891)

Co-authored-by: Shirshanka Das <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(gms) Add Form entity type to EntityTypeMapper (datahub-project#10916)

* feat(dataset): add support for external url in Dataset (datahub-project#10877)

* docs(saas-overview) added missing features to observe section (datahub-project#10913)

Co-authored-by: John Joyce <[email protected]>

* fix(ingest/spark): Fixing Micrometer warning (datahub-project#10882)

* fix(structured properties): allow application of structured properties without schema file (datahub-project#10918)

* fix(data-contracts-web) handle other schedule types (datahub-project#10919)

* fix(ingestion/tableau): human-readable message for PERMISSIONS_MODE_SWITCHED error (datahub-project#10866)

Co-authored-by: Harshal Sheth <[email protected]>

* Add feature flag for view defintions (datahub-project#10914)

Co-authored-by: Ethan Cartwright <[email protected]>

* feat(ingest/BigQuery): refactor+parallelize dataset metadata extraction (datahub-project#10884)

* fix(airflow): add error handling around render_template() (datahub-project#10907)

* feat(ingestion/sqlglot): add optional `default_dialect` parameter to sqlglot lineage (datahub-project#10830)

* feat(mcp-mutator): new mcp mutator plugin (datahub-project#10904)

* fix(ingest/bigquery): changes helper function to decode unicode scape sequences (datahub-project#10845)

* feat(ingest/postgres): fetch table sizes for profile (datahub-project#10864)

* feat(ingest/abs): Adding azure blob storage ingestion source (datahub-project#10813)

* fix(ingest/redshift): reduce severity of SQL parsing issues (datahub-project#10924)

* fix(build): fix lint fix web react (datahub-project#10896)

* fix(ingest/bigquery): handle quota exceeded for project.list requests (datahub-project#10912)

* feat(ingest): report extractor failures more loudly (datahub-project#10908)

* feat(ingest/snowflake): integrate snowflake-queries into main source (datahub-project#10905)

* fix(ingest): fix docs build (datahub-project#10926)

* fix(ingest/snowflake): fix test connection (datahub-project#10927)

* fix(ingest/lookml): add view load failures to cache (datahub-project#10923)

* docs(slack) overhauled setup instructions and screenshots (datahub-project#10922)

Co-authored-by: John Joyce <[email protected]>

* fix(airflow): Add comma parsing of owners to DataJobs (datahub-project#10903)

* fix(entityservice): fix merging sideeffects (datahub-project#10937)

* feat(ingest): Support System Ingestion Sources, Show and hide system ingestion sources with Command-S (datahub-project#10938)

Co-authored-by: John Joyce <[email protected]>

* chore() Set a default lineage filtering end time on backend when a start time is present (datahub-project#10925)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* Added relationships APIs to V3. Added these generic APIs to V3 swagger doc. (datahub-project#10939)

* docs: add learning center to docs (datahub-project#10921)

* doc: Update hubspot form id (datahub-project#10943)

* chore(airflow): add python 3.11 w/ Airflow 2.9 to CI (datahub-project#10941)

* fix(ingest/Glue): column upstream lineage between S3 and Glue (datahub-project#10895)

* fix(ingest/abs): split abs utils into multiple files (datahub-project#10945)

* doc(ingest/looker): fix doc for sql parsing documentation (datahub-project#10883)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(ingest/bigquery): Adding missing BigQuery types (datahub-project#10950)

* fix(ingest/setup): feast and abs source setup (datahub-project#10951)

* fix(connections) Harden adding /gms to connections in backend (datahub-project#10942)

* feat(siblings) Add flag to prevent combining siblings in the UI (datahub-project#10952)

* fix(docs): make graphql doc gen more automated (datahub-project#10953)

* feat(ingest/athena): Add option for Athena partitioned profiling (datahub-project#10723)

* fix(spark-lineage): default timeout for future responses (datahub-project#10947)

* feat(datajob/flow): add environment filter using info aspects (datahub-project#10814)

* fix(ui/ingest): correct privilege used to show tab (datahub-project#10483)

Co-authored-by: Kunal-kankriya <[email protected]>

* feat(ingest/looker): include dashboard urns in browse v2 (datahub-project#10955)

* add a structured type to batchGet in OpenAPI V3 spec (datahub-project#10956)

* fix(ui): scroll on the domain sidebar to show all domains (datahub-project#10966)

* fix(ingest/sagemaker): resolve incorrect variable assignment for SageMaker API call (datahub-project#10965)

* fix(airflow/build): Pinning mypy (datahub-project#10972)

* Fixed a bug where the OpenAPI V3 spec was incorrect. The bug was introduced in datahub-project#10939. (datahub-project#10974)

* fix(ingest/test): Fix for mssql integration tests (datahub-project#10978)

* fix(entity-service) exist check correctly extracts status (datahub-project#10973)

* fix(structuredProps) casing bug in StructuredPropertiesValidator (datahub-project#10982)

* bugfix: use anyOf instead of allOf when creating references in openapi v3 spec (datahub-project#10986)

* fix(ui): Remove ant less imports (datahub-project#10988)

* feat(ingest/graph): Add get_results_by_filter to DataHubGraph (datahub-project#10987)

* feat(ingest/cli): init does not actually support environment variables (datahub-project#10989)

* fix(ingest/graph): Update get_results_by_filter graphql query (datahub-project#10991)

* feat(ingest/spark): Promote beta plugin (datahub-project#10881)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingest): support domains in meta -> "datahub" section (datahub-project#10967)

* feat(ingest): add `check server-config` command (datahub-project#10990)

* feat(cli): Make consistent use of DataHubGraphClientConfig (datahub-project#10466)

Deprecates get_url_and_token() in favor of a more complete option: load_graph_config() that returns a full DatahubClientConfig.
This change was then propagated across previous usages of get_url_and_token so that connections to DataHub server from the client respect the full breadth of configuration specified by DatahubClientConfig.

I.e., you can now specify disable_ssl_verification: true in your ~/.datahubenv file so that all CLI functions against the server work when SSL certificate verification is disabled.

Fixes datahub-project#9705

* fix(ingest/s3): Fixing container creation when there is no folder in path (datahub-project#10993)

* fix(ingest/looker): support platform instance for dashboards & charts (datahub-project#10771)

* feat(ingest/bigquery): improve handling of information schema in sql parser (datahub-project#10985)

* feat(ingest): improve `ingest deploy` command (datahub-project#10944)

* fix(backend): allow excluding soft-deleted entities in relationship-queries; exclude soft-deleted members of groups (datahub-project#10920)

- allow excluding soft-deleted entities in relationship-queries
- exclude soft-deleted members of groups

* fix(ingest/looker): downgrade missing chart type log level (datahub-project#10996)

* doc(acryl-cloud): release docs for 0.3.4.x (datahub-project#10984)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: RyanHolstien <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Pedro Silva <[email protected]>

* fix(protobuf/build): Fix protobuf check jar script (datahub-project#11006)

* fix(ui/ingest): Support invalid cron jobs (datahub-project#10998)

* fix(ingest): fix graph config loading (datahub-project#11002)

Co-authored-by: Pedro Silva <[email protected]>

* feat(docs): Document __DATAHUB_TO_FILE_ directive (datahub-project#10968)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(graphql/upsertIngestionSource): Validate cron schedule; parse error in CLI (datahub-project#11011)

* feat(ece): support custom ownership type urns in ECE generation (datahub-project#10999)

* feat(assertion-v2): changed Validation tab to Quality and created new Governance tab (datahub-project#10935)

* fix(ingestion/glue): Add support for missing config options for profiling in Glue (datahub-project#10858)

* feat(propagation): Add models for schema field docs, tags, terms (datahub-project#2959) (datahub-project#11016)

Co-authored-by: Chris Collins <[email protected]>

* docs: standardize terminology to DataHub Cloud (datahub-project#11003)

* fix(ingestion/transformer): replace the externalUrl container (datahub-project#11013)

* docs(slack) troubleshoot docs (datahub-project#11014)

* feat(propagation): Add graphql API (datahub-project#11030)

Co-authored-by: Chris Collins <[email protected]>

* feat(propagation):  Add models for Action feature settings (datahub-project#11029)

* docs(custom properties): Remove duplicate from sidebar (datahub-project#11033)

* feat(models): Introducing Dataset Partitions Aspect (datahub-project#10997)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* feat(propagation): Add Documentation Propagation Settings (datahub-project#11038)

* fix(models): chart schema fields mapping, add dataHubAction entity, t… (datahub-project#11040)

* fix(ci): smoke test lint failures (datahub-project#11044)

* docs: fix learning center color scheme & typo (datahub-project#11043)

* feat: add cloud main page (datahub-project#11017)

Co-authored-by: Jay <[email protected]>

* feat(restore-indices): add additional step to also clear system metadata service (datahub-project#10662)

Co-authored-by: John Joyce <[email protected]>

* docs: fix typo (datahub-project#11046)

* fix(lint): apply spotless (datahub-project#11050)

* docs(airflow): example query to get datajobs for a dataflow (datahub-project#11034)

* feat(cli): Add run-id option to put sub-command (datahub-project#11023)

Adds an option to assign run-id to a given put command execution. 
This is useful when transformers do not exist for a given ingestion payload, we can follow up with custom metadata and assign it to an ingestion pipeline.

* fix(ingest): improve sql error reporting calls (datahub-project#11025)

* fix(airflow): fix CI setup (datahub-project#11031)

* feat(ingest/dbt): add experimental `prefer_sql_parser_lineage` flag (datahub-project#11039)

* fix(ingestion/lookml): enable stack-trace in lookml logs (datahub-project#10971)

* (chore): Linting fix (datahub-project#11015)

* chore(ci): update deprecated github actions (datahub-project#10977)

* Fix ALB configuration example (datahub-project#10981)

* chore(ingestion-base): bump base image packages (datahub-project#11053)

* feat(cli): Trim report of dataHubExecutionRequestResult to max GMS size (datahub-project#11051)

* fix(ingestion/lookml): emit dummy sql condition for lookml custom condition tag (datahub-project#11008)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(ingestion/powerbi): fix issue with broken report lineage (datahub-project#10910)

* feat(ingest/tableau): add retry on timeout (datahub-project#10995)

* change generate kafka connect properties from env (datahub-project#10545)

Co-authored-by: david-leifker <[email protected]>

* fix(ingest): fix oracle cronjob ingestion (datahub-project#11001)

Co-authored-by: david-leifker <[email protected]>

* chore(ci): revert update deprecated github actions (datahub-project#10977) (datahub-project#11062)

* feat(ingest/dbt-cloud): update metadata_endpoint inference (datahub-project#11041)

* build: Reduce size of datahub-frontend-react image by 50-ish% (datahub-project#10878)

Co-authored-by: david-leifker <[email protected]>

* fix(ci): Fix lint issue in datahub_ingestion_run_summary_provider.py (datahub-project#11063)

* docs(ingest): update developing-a-transformer.md (datahub-project#11019)

* feat(search-test): update search tests from datahub-project#10408 (datahub-project#11056)

* feat(cli): add aspects parameter to DataHubGraph.get_entity_semityped (datahub-project#11009)

Co-authored-by: Harshal Sheth <[email protected]>

* docs(airflow): update min version for plugin v2 (datahub-project#11065)

* doc(ingestion/tableau): doc update for derived permission (datahub-project#11054)

Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(py): remove dep on types-pkg_resources (datahub-project#11076)

* feat(ingest/mode): add option to exclude restricted (datahub-project#11081)

* fix(ingest): set lastObserved in sdk when unset (datahub-project#11071)

* doc(ingest): Update capabilities (datahub-project#11072)

* chore(vulnerability): Log Injection (datahub-project#11090)

* chore(vulnerability): Information exposure through a stack trace (datahub-project#11091)

* chore(vulnerability): Comparison of narrow type with wide type in loop condition (datahub-project#11089)

* chore(vulnerability): Insertion of sensitive information into log files (datahub-project#11088)

* chore(vulnerability): Risky Cryptographic Algorithm (datahub-project#11059)

* chore(vulnerability): Overly permissive regex range (datahub-project#11061)

Co-authored-by: Harshal Sheth <[email protected]>

* fix: update customer data (datahub-project#11075)

* fix(models): fixing the datasetPartition models (datahub-project#11085)

Co-authored-by: John Joyce <[email protected]>

* fix(ui): Adding view, forms GraphQL query, remove showing a fallback error message on unhandled GraphQL error (datahub-project#11084)

Co-authored-by: John Joyce <[email protected]>

* feat(docs-site): hiding learn more from cloud page (datahub-project#11097)

* fix(docs): Add correct usage of orFilters in search API docs (datahub-project#11082)

Co-authored-by: Jay <[email protected]>

* fix(ingest/mode): Regexp in mode name matcher didn't allow underscore (datahub-project#11098)

* docs: Refactor customer stories section (datahub-project#10869)

Co-authored-by: Jeff Merrick <[email protected]>

* fix(release): fix full/slim suffix on tag (datahub-project#11087)

* feat(config): support alternate hashing algorithm for doc id (datahub-project#10423)

Co-authored-by: david-leifker <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* fix(emitter): fix typo in get method of java kafka emitter (datahub-project#11007)

* fix(ingest): use correct native data type in all SQLAlchemy sources by compiling data type using dialect (datahub-project#10898)

Co-authored-by: Harshal Sheth <[email protected]>

* chore: Update contributors list in PR labeler (datahub-project#11105)

* feat(ingest): tweak stale entity removal messaging (datahub-project#11064)

* fix(ingestion): enforce lastObserved timestamps in SystemMetadata (datahub-project#11104)

* fix(ingest/powerbi): fix broken lineage between chart and dataset (datahub-project#11080)

* feat(ingest/lookml): CLL support for sql set in sql_table_name attribute of lookml view (datahub-project#11069)

* docs: update graphql docs on forms & structured properties (datahub-project#11100)

* test(search): search openAPI v3 test (datahub-project#11049)

* fix(ingest/tableau): prevent empty site content urls (datahub-project#11057)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(entity-client): implement client batch interface (datahub-project#11106)

* fix(snowflake): avoid reporting warnings/info for sys tables (datahub-project#11114)

* fix(ingest): downgrade column type mapping warning to info (datahub-project#11115)

* feat(api): add AuditStamp to the V3 API entity/aspect response (datahub-project#11118)

* fix(ingest/redshift): replace r'\n' with '\n' to avoid token error redshift serverless… (datahub-project#11111)

* fix(entiy-client): handle null entityUrn case for restli (datahub-project#11122)

* fix(sql-parser): prevent bad urns from alter table lineage (datahub-project#11092)

* fix(ingest/bigquery): use small batch size if use_tables_list_query_v2 is set (datahub-project#11121)

* fix(graphql): add missing entities to EntityTypeMapper and EntityTypeUrnMapper (datahub-project#10366)

* feat(ui): Changes to allow editable dataset name (datahub-project#10608)

Co-authored-by: Jay Kadambi <[email protected]>

* fix: remove saxo (datahub-project#11127)

* feat(mcl-processor): Update mcl processor hooks (datahub-project#11134)

* fix(openapi): fix openapi v2 endpoints & v3 documentation update

* Revert "fix(openapi): fix openapi v2 endpoints & v3 documentation update"

This reverts commit 573c1cb.

* docs(policies): updates to policies documentation (datahub-project#11073)

* fix(openapi): fix openapi v2 and v3 docs update (datahub-project#11139)

* feat(auth): grant type and acr values custom oidc parameters support (datahub-project#11116)

* fix(mutator): mutator hook fixes (datahub-project#11140)

* feat(search): support sorting on multiple fields (datahub-project#10775)

* feat(ingest): various logging improvements (datahub-project#11126)

* fix(ingestion/lookml): fix for sql parsing error (datahub-project#11079)

Co-authored-by: Harshal Sheth <[email protected]>

* feat(docs-site) cloud page spacing and content polishes (datahub-project#11141)

* feat(ui) Enable editing structured props on fields (datahub-project#11042)

* feat(tests): add md5 and last computed to testResult model (datahub-project#11117)

* test(openapi): openapi regression smoke tests (datahub-project#11143)

* fix(airflow): fix tox tests + update docs (datahub-project#11125)

* docs: add chime to adoption stories (datahub-project#11142)

* fix(ingest/databricks): Updating code to work with Databricks sdk 0.30 (datahub-project#11158)

* fix(kafka-setup): add missing script to image (datahub-project#11190)

* fix(config): fix hash algo config (datahub-project#11191)

* test(smoke-test): updates to smoke-tests (datahub-project#11152)

* fix(elasticsearch): refactor idHashAlgo setting (datahub-project#11193)

* chore(kafka): kafka version bump (datahub-project#11211)

* readd UsageStatsWorkUnit

* fix merge problems

* change logo

---------

Co-authored-by: Chris Collins <[email protected]>
Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>
Co-authored-by: dushayntAW <[email protected]>
Co-authored-by: sagar-salvi-apptware <[email protected]>
Co-authored-by: Aseem Bansal <[email protected]>
Co-authored-by: Kevin Chun <[email protected]>
Co-authored-by: jordanjeremy <[email protected]>
Co-authored-by: skrydal <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>
Co-authored-by: david-leifker <[email protected]>
Co-authored-by: sid-acryl <[email protected]>
Co-authored-by: Julien Jehannet <[email protected]>
Co-authored-by: Hendrik Richert <[email protected]>
Co-authored-by: Hendrik Richert <[email protected]>
Co-authored-by: RyanHolstien <[email protected]>
Co-authored-by: Felix Lüdin <[email protected]>
Co-authored-by: Pirry <[email protected]>
Co-authored-by: Hyejin Yoon <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: cburroughs <[email protected]>
Co-authored-by: ksrinath <[email protected]>
Co-authored-by: Mayuri Nehate <[email protected]>
Co-authored-by: Kunal-kankriya <[email protected]>
Co-authored-by: Shirshanka Das <[email protected]>
Co-authored-by: ipolding-cais <[email protected]>
Co-authored-by: Tamas Nemeth <[email protected]>
Co-authored-by: Shubham Jagtap <[email protected]>
Co-authored-by: haeniya <[email protected]>
Co-authored-by: Yanik Häni <[email protected]>
Co-authored-by: Gabe Lyons <[email protected]>
Co-authored-by: Gabe Lyons <[email protected]>
Co-authored-by: 808OVADOZE <[email protected]>
Co-authored-by: noggi <[email protected]>
Co-authored-by: Nicholas Pena <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: ethan-cartwright <[email protected]>
Co-authored-by: Ethan Cartwright <[email protected]>
Co-authored-by: Nadav Gross <[email protected]>
Co-authored-by: Patrick Franco Braz <[email protected]>
Co-authored-by: pie1nthesky <[email protected]>
Co-authored-by: Joel Pinto Mata (KPN-DSH-DEX team) <[email protected]>
Co-authored-by: Ellie O'Neil <[email protected]>
Co-authored-by: Ajoy Majumdar <[email protected]>
Co-authored-by: deepgarg-visa <[email protected]>
Co-authored-by: Tristan Heisler <[email protected]>
Co-authored-by: Andrew Sikowitz <[email protected]>
Co-authored-by: Davi Arnaut <[email protected]>
Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: amit-apptware <[email protected]>
Co-authored-by: Sam Black <[email protected]>
Co-authored-by: Raj Tekal <[email protected]>
Co-authored-by: Steffen Grohsschmiedt <[email protected]>
Co-authored-by: jaegwon.seo <[email protected]>
Co-authored-by: Renan F. Lima <[email protected]>
Co-authored-by: Matt Exchange <[email protected]>
Co-authored-by: Jonny Dixon <[email protected]>
Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: Pinaki Bhattacharjee <[email protected]>
Co-authored-by: Jeff Merrick <[email protected]>
Co-authored-by: skrydal <[email protected]>
Co-authored-by: AndreasHegerNuritas <[email protected]>
Co-authored-by: jayasimhankv <[email protected]>
Co-authored-by: Jay Kadambi <[email protected]>
Co-authored-by: David Leifker <[email protected]>