-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Artifact Store isolation #2490
Improve Artifact Store isolation #2490
Conversation
Important Auto Review SkippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the To trigger a single review, invoke the WalkthroughThe recent updates to ZenML's artifact handling introduce enhanced input verification and path validation within artifact stores. These changes ensure that operations are securely contained within the bounds of the artifact stores, enhancing the robustness and security of data management. Additionally, a specific test suite has been added to validate these improvements, ensuring that artifact operations outside the designated bounds are correctly restricted. Changes
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
@coderabbitai review |
Quickstart template updates in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 2
Configuration used: .coderabbit.yaml
Files selected for processing (3)
- src/zenml/artifact_stores/base_artifact_store.py (4 hunks)
- src/zenml/artifacts/utils.py (2 hunks)
- tests/integration/functional/artifacts/test_base_artifact_store.py (1 hunks)
Additional comments: 5
src/zenml/artifact_stores/base_artifact_store.py (4)
- 15-18: The addition of import statements for
inspect
andPath
modules is appropriate given their usage in the newly introduced_verify
and_inner_verify
functions. These imports are necessary for the functionality being implemented and follow Python best practices.- 113-160: The
_verify
function is a significant addition aimed at enhancing the security of the artifact store by ensuring that operations are performed within its bounds. A few points to consider:
- The use of
inspect.signature
to dynamically inspect function arguments is a clever way to implement this verification generically for any function.- The method of determining if an argument is a
PathType
and then verifying it with_inner_verify
is sound. However, it's important to ensure that all relevant methods in subclasses ofBaseArtifactStore
correctly annotate their path arguments asPathType
for this mechanism to work effectively.- The distinction between handling
self
and other arguments is handled correctly, ensuring compatibility with both instance and class methods.
- 459-470: The modification to the
__init_subclass__
method to automatically wrap abstract method implementations with the_verify
function is a proactive approach to enforce path verification across all subclasses. This ensures that any subclass implementing these abstract methods will inherit the path verification logic without additional effort from the developer. This is a good example of leveraging Python's dynamic capabilities to enforce a security feature across a class hierarchy.- 471-486: The
_inner_verify
method provides the core functionality for verifying that a given path is within the bounds of the artifact store. This method is crucial for the isolation mechanism being implemented. A few observations:
- The method correctly sanitizes the input path and checks if it starts with the artifact store's path, raising an
IOError
if not. This is a straightforward and effective way to enforce the isolation constraint.- It's important to ensure that the
self.path
property always returns an absolute and sanitized path to avoid bypassing this verification due to path traversal or similar issues.src/zenml/artifacts/utils.py (1)
- 762-769: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [755-766]
The update to
_load_file_from_artifact_store
to explicitly raise anIOError
if the artifact store rejects the request is a good practice for robust error handling. This change makes the error more explicit and helps calling functions to handle this specific case more gracefully. However, it's important to ensure that this change is consistently documented in the function's docstring and that any calling functions are updated to handle this exception if necessary.
tests/integration/functional/artifacts/test_base_artifact_store.py
Outdated
Show resolved
Hide resolved
tests/integration/functional/artifacts/test_base_artifact_store.py
Outdated
Show resolved
Hide resolved
NLP template updates in |
…a-step-endpoint' of https://github.com/zenml-io/zenml into bugfix/OSSK-462-directory-traversal-via-logs-object-of-a-step-endpoint
…ogs-object-of-a-step-endpoint
…a-step-endpoint' of https://github.com/zenml-io/zenml into bugfix/OSSK-462-directory-traversal-via-logs-object-of-a-step-endpoint
E2E template updates in |
…a-step-endpoint' of https://github.com/zenml-io/zenml into bugfix/OSSK-462-directory-traversal-via-logs-object-of-a-step-endpoint
@coderabbitai review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 4
Configuration used: .coderabbit.yaml
Files selected for processing (11)
- src/zenml/artifact_stores/base_artifact_store.py (5 hunks)
- src/zenml/artifacts/utils.py (5 hunks)
- src/zenml/logging/step_logging.py (4 hunks)
- src/zenml/materializers/base_materializer.py (1 hunks)
- src/zenml/materializers/built_in_materializer.py (7 hunks)
- src/zenml/materializers/cloudpickle_materializer.py (5 hunks)
- src/zenml/materializers/numpy_materializer.py (6 hunks)
- src/zenml/materializers/pandas_materializer.py (5 hunks)
- src/zenml/materializers/service_materializer.py (3 hunks)
- src/zenml/materializers/structured_string_materializer.py (3 hunks)
- src/zenml/orchestrators/output_utils.py (3 hunks)
Files skipped from review as they are similar to previous changes (1)
- src/zenml/artifacts/utils.py
Additional comments: 25
src/zenml/materializers/service_materializer.py (2)
- 49-51: The use of
Client().active_stack.artifact_store
to access the artifact store in theload
method is a good practice as it ensures that the operation is performed within the context of the active stack. This change enhances the isolation and security of artifact operations.- 64-66: The update in the
save
method to useClient().active_stack.artifact_store
for file operations aligns with the objective of improving artifact store isolation. This approach ensures that the service configuration is saved within the boundaries of the active stack's artifact store, which is crucial for maintaining data integrity and security.src/zenml/orchestrators/output_utils.py (2)
- 75-85: The update to use
Client().active_stack.artifact_store
for artifact URI preparation and directory operations inprepare_output_artifact_uris
is a significant improvement. It ensures that artifact operations are performed within the context of the active stack's artifact store, enhancing isolation and security. However, it's important to ensure that error handling is robust, especially for cases where artifact URIs already exist or directory creation fails.- 96-99: The use of
Client().active_stack.artifact_store
for directory removal inremove_artifact_dirs
is consistent with the PR's goal of improving artifact store isolation. This change ensures that operations are confined within the active stack's artifact store, which is crucial for maintaining data integrity and security. It's important to verify that error handling is in place for cases where directory removal fails.src/zenml/materializers/structured_string_materializer.py (2)
- 50-51: The update in the
load
method to useClient().active_stack.artifact_store
for file operations is a positive change, aligning with the PR's objectives to enhance artifact store isolation. This ensures that file operations are performed within the context of the active stack's artifact store, which is crucial for maintaining data integrity and security.- 60-61: The change in the
save
method to utilizeClient().active_stack.artifact_store
for saving structured strings (CSV, HTML, Markdown) is consistent with the goal of improving artifact store isolation. This approach ensures that data is saved within the boundaries of the active stack's artifact store, enhancing data integrity and security.src/zenml/materializers/cloudpickle_materializer.py (2)
- 59-65: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [62-76]
The implementation of Python version compatibility checks in the
load
method, along with the use ofClient().active_stack.artifact_store
for file operations, is a thoughtful addition. This ensures that artifacts are loaded within the context of the active stack's artifact store and alerts users to potential issues when loading artifacts materialized under different Python versions. This approach enhances data integrity and user awareness of potential compatibility issues.
- 94-100: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [97-116]
The update in the
save
method to useClient().active_stack.artifact_store
for saving data, along with the warning about the use of the default Pickle materializer, is consistent with the PR's objectives. This change ensures that data is saved within the boundaries of the active stack's artifact store and raises awareness about the limitations of using Pickle for serialization, encouraging users to consider implementing custom materializers for better compatibility and security.src/zenml/materializers/pandas_materializer.py (2)
- 77-86: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [80-94]
The update in the
load
method to useClient().active_stack.artifact_store
for loading data from.parquet
or.csv
files is a significant improvement. This change ensures that data loading operations are performed within the context of the active stack's artifact store, enhancing data integrity and security. The conditional handling based on the existence of.parquet
files and the availability ofpyarrow
is well-implemented, ensuring compatibility and providing clear guidance for users on required dependencies.
- 126-134: The changes in the
save
method to utilizeClient().active_stack.artifact_store
for saving data in.parquet
or.csv
format align with the PR's objectives to improve artifact store isolation. This approach ensures that data is saved within the boundaries of the active stack's artifact store, enhancing data integrity and security. The conditional logic to save data in.parquet
format ifpyarrow
is available, otherwise in.csv
format, is a practical solution that offers flexibility and efficiency in data storage.src/zenml/logging/step_logging.py (2)
- 75-90: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [67-87]
The update to use
Client().active_stack.artifact_store
for preparing log URIs in theprepare_logs_uri
function is a positive change, ensuring that log files are stored within the context of the active stack's artifact store. This enhances the isolation and security of log operations. The handling of log file creation and removal is well-implemented, ensuring that logs are stored in a dedicated directory and that old log files are appropriately managed.
- 139-145: The implementation of the
save_to_file
method in theStepLogsStorage
class to useClient().active_stack.artifact_store
for saving logs is consistent with the PR's objectives to improve artifact store isolation. This ensures that logs are saved within the boundaries of the active stack's artifact store, enhancing data integrity and security. The approach to buffer and store logs, along with the removal of ANSI escape codes, is well-thought-out, ensuring that logs are stored in a clean and readable format.src/zenml/materializers/numpy_materializer.py (4)
- 22-22: The import of the
Client
class fromzenml.client
is correctly added to facilitate the new approach of handling artifact store operations. This change aligns with the PR's objective to improve artifact store isolation.- 85-85: The use of
artifact_store.open
for reading numpy arrays from parquet files is correctly implemented. This change is part of the transition to using theClient
class for artifact store operations. Ensure that error handling is robust, especially for cases wherepyarrow
is not installed, as this is critical for backward compatibility with older versions of ZenML.- 162-164: The method
_save_histogram
correctly uses theClient
class for saving histograms. This is a good example of the improved artifact store isolation. Ensure that the visualization saving process is tested thoroughly, especially in edge cases where matplotlib might not be installed.- 177-178: The method
_save_image
correctly uses theClient
class for saving images. This change is consistent with the PR's objectives. As with the_save_histogram
method, ensure thorough testing of the visualization saving process.src/zenml/materializers/base_materializer.py (1)
- 159-161: The update to use
artifact_store
for file operations in thesave_visualizations
method is correctly implemented. This change enhances integration with the active stack's artifact store, aligning with the PR's objectives. Ensure that all visualization-related file operations are thoroughly tested to confirm that they work as expected with the new artifact store approach.src/zenml/materializers/built_in_materializer.py (3)
- 29-29: The import of the
Client
class fromzenml.client
is correctly added to facilitate the new approach of handling artifact store operations. This change aligns with the PR's objective to improve artifact store isolation.- 287-291: The method for loading materialized built-in container objects correctly uses the
Client
class for checking file existence. Ensure that error handling and logging are robust, especially for cases where expected files do not exist, to provide clear feedback to users.- 358-364: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [361-382]
The method for saving built-in container objects correctly transitions to using
artifact_store
. However, ensure that the process of creating subdirectories (artifact_store.mkdir
) and handling exceptions is thoroughly tested, especially in scenarios where cleanup is necessary due to partial failures.src/zenml/artifact_stores/base_artifact_store.py (5)
- 16-16: The import of
inspect
is appropriate for introspection purposes, especially for the enhancements made to method registration and subclass initialization. This aligns with the PR's objective to enforce better isolation and validation within the artifact store.- 19-19: The import of
Path
frompathlib
is crucial for handling filesystem paths in a platform-independent manner. This is a good practice, especially for a component like an artifact store that deals with file and directory paths extensively.- 46-75: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [49-95]
The
_sanitize_potential_path
function has been modified to handle root path and path type, and now raises aFileNotFoundError
if the input path is outside of the artifact store bounds. This is a significant improvement in enforcing artifact store isolation. However, ensure that all calls to this function correctly handle the potentialFileNotFoundError
to avoid unhandled exceptions.
- 100-184: The addition of the
decorator
function within_sanitize_paths
is a clever way to enforce path sanitization across all relevant methods. This approach ensures that all paths are validated against the artifact store's root path, enhancing security and isolation. It's important to verify that this decorator does not introduce any performance issues, especially in high-throughput scenarios.- 477-489: The addition of the
__init_subclass__
method to wrap abstract method implementations with path sanitizer is a proactive measure to ensure that all subclasses ofBaseArtifactStore
automatically enforce path sanitization. This is a good practice for maintaining consistency and security across all artifact store implementations. However, it's crucial to ensure that this does not interfere with the intended functionality of any subclass methods.
Co-authored-by: Stefan Nica <[email protected]>
…ogs-object-of-a-step-endpoint
* dir traversal issue * Auto-update of Starter template * Auto-update of NLP template * reroute artifacts and logs via AS * reroute materializers via AS * simplify to one deco * fix materializer tests * allow local download * Auto-update of E2E template * fix test issues * rework based on comments * fix bugs * lint * Candidate (zenml-io#2493) Co-authored-by: Stefan Nica <[email protected]> * darglint --------- Co-authored-by: GitHub Actions <[email protected]> Co-authored-by: Stefan Nica <[email protected]>
Describe changes
I fixed some isolation issues while using Artifact Stores.
_sanitize_paths
ofBaseArtifactStore
is extended to check that the requested path is not leaving Artifact Store boundsPotential breaking change:
s3://some_bucket/some_sub_folder
artifact_store.open("s3://some_bucket/some_other_folder/dummy.txt","w")
-> this is not allowed any mores3fs
or similar libraries if you need to execute such operationsPre-requisites
Please ensure you have done the following:
develop
and the open PR is targetingdevelop
. If your branch wasn't based on develop read Contribution guide on rebasing branch to develop.Types of changes
Summary by CodeRabbit
Summary by CodeRabbit
New Features
IOError
for rejected artifact store requests.Refactor
Tests