Improve Artifact Store isolation #2490


@avishniakov avishniakov commented Mar 1, 2024

Describe changes

I fixed some isolation issues while using Artifact Stores.

  • _sanitize_paths of BaseArtifactStore now also checks that the requested path does not leave the Artifact Store bounds
  • various helper methods rerouted to use artifact store methods instead of direct file system access
  • standard materializers rerouted to use artifact store methods instead of direct file system access

Potential breaking change:

  • If user code performs unsafe operations, it must be revisited to make sure that no objects are created or fetched outside the Artifact Store's scope. Example:
    • The Artifact Store is configured as s3://some_bucket/some_sub_folder
    • The code calls artifact_store.open("s3://some_bucket/some_other_folder/dummy.txt","w"): this is no longer allowed
    • Consider using s3fs or similar libraries if you need to execute such operations
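
The boundary rule in the example above can be illustrated with a minimal sketch. It is not ZenML's implementation: the helper names `_normalize` and `is_within_store` are hypothetical, and the check assumes that normalizing both URIs and comparing prefixes is enough to decide whether a path stays inside the store.

```python
import posixpath


def _normalize(uri: str) -> str:
    """Collapse '.' and '..' segments while keeping the scheme (hypothetical helper)."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        return posixpath.normpath(uri)
    return f"{scheme}://{posixpath.normpath(rest)}"


def is_within_store(store_root: str, requested: str) -> bool:
    """Return True if `requested` is the store root or a path below it."""
    root = _normalize(store_root)
    target = _normalize(requested)
    # Appending "/" prevents "some_sub_folder_2" from matching "some_sub_folder".
    return target == root or target.startswith(root + "/")


root = "s3://some_bucket/some_sub_folder"
print(is_within_store(root, "s3://some_bucket/some_sub_folder/dummy.txt"))  # True
print(is_within_store(root, "s3://some_bucket/some_other_folder/dummy.txt"))  # False
# A ".." segment that escapes the configured folder is rejected as well.
print(is_within_store(root, "s3://some_bucket/some_sub_folder/../some_other_folder/dummy.txt"))  # False
```

Normalizing before comparing is a design choice of this sketch; a plain prefix comparison would accept the last path because it literally starts with the store root.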

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the Contribution guide on rebasing your branch to develop.
  • If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

Summary by CodeRabbit

  • New Features

    • Enhanced security with input verification for artifact store operations.
    • Improved error handling by raising IOError for rejected artifact store requests.
  • Refactor

    • Implemented path verification for abstract method implementations in artifact store initialization.
  • Tests

    • Introduced integration tests to confine artifact store operations within specified bounds.

coderabbitai bot commented Mar 1, 2024

Important

Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.

Walkthrough

The recent updates to ZenML's artifact handling introduce enhanced input verification and path validation within artifact stores. These changes ensure that operations are securely contained within the bounds of the artifact stores, enhancing the robustness and security of data management. Additionally, a specific test suite has been added to validate these improvements, ensuring that artifact operations outside the designated bounds are correctly restricted.

Changes

| File Path | Change Summary |
| --- | --- |
| src/zenml/artifact_stores/base_artifact_store.py, src/zenml/artifacts/utils.py | Introduced input and path verification methods. Enhanced error handling for rejected requests in artifact store operations. |
| tests/integration/functional/artifacts/test_base_artifact_store.py | Added tests to verify artifact store boundary operations. |
| src/zenml/logging/step_logging.py, src/zenml/materializers/...materializer.py | Updated to use Client for artifact store operations instead of direct file I/O, improving integration and compatibility. |
| src/zenml/orchestrators/output_utils.py | Transitioned to using Client for artifact store operations, replacing direct fileio calls. |

🐇✨

In the realm of data, where artifacts dwell,

A rabbit hopped, casting a secure spell.

Paths verified, inputs checked with care,

Ensuring that only safe travels are there.

With every hop, a new test case born,

Celebrating security from dusk till morn. 🌟📁

@github-actions github-actions bot added the internal and bug labels Mar 1, 2024
@avishniakov

@coderabbitai review

github-actions bot commented Mar 1, 2024

Quickstart template updates in examples/quickstart have been pushed.

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 2

Configuration used: .coderabbit.yaml

Commits: files that changed from the base of the PR and between dd12148 and b265c32.
Files selected for processing (3)
  • src/zenml/artifact_stores/base_artifact_store.py (4 hunks)
  • src/zenml/artifacts/utils.py (2 hunks)
  • tests/integration/functional/artifacts/test_base_artifact_store.py (1 hunks)
Additional comments: 5
src/zenml/artifact_stores/base_artifact_store.py (4)
  • 15-18: The addition of import statements for inspect and Path modules is appropriate given their usage in the newly introduced _verify and _inner_verify functions. These imports are necessary for the functionality being implemented and follow Python best practices.
  • 113-160: The _verify function is a significant addition aimed at enhancing the security of the artifact store by ensuring that operations are performed within its bounds. A few points to consider:
  • The use of inspect.signature to dynamically inspect function arguments is a clever way to implement this verification generically for any function.
  • The method of determining if an argument is a PathType and then verifying it with _inner_verify is sound. However, it's important to ensure that all relevant methods in subclasses of BaseArtifactStore correctly annotate their path arguments as PathType for this mechanism to work effectively.
  • The distinction between handling self and other arguments is handled correctly, ensuring compatibility with both instance and class methods.
  • 459-470: The modification to the __init_subclass__ method to automatically wrap abstract method implementations with the _verify function is a proactive approach to enforce path verification across all subclasses. This ensures that any subclass implementing these abstract methods will inherit the path verification logic without additional effort from the developer. This is a good example of leveraging Python's dynamic capabilities to enforce a security feature across a class hierarchy.
  • 471-486: The _inner_verify method provides the core functionality for verifying that a given path is within the bounds of the artifact store. This method is crucial for the isolation mechanism being implemented. A few observations:
  • The method correctly sanitizes the input path and checks if it starts with the artifact store's path, raising an IOError if not. This is a straightforward and effective way to enforce the isolation constraint.
  • It's important to ensure that the self.path property always returns an absolute and sanitized path to avoid bypassing this verification due to path traversal or similar issues.
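
The verification mechanism described in these comments can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `verify_paths` and `ToyArtifactStore` are hypothetical names, `PathType` is a local stand-in alias, and only the ideas named in the review (inspecting the signature with `inspect.signature`, checking arguments annotated as `PathType`, raising `IOError` from `_inner_verify`) are taken from the text.

```python
import inspect
from functools import wraps
from typing import Any, Callable, Union

PathType = Union[bytes, str]  # local stand-in for the PathType alias


def verify_paths(func: Callable[..., Any]) -> Callable[..., Any]:
    """Check every argument annotated as PathType before calling `func`."""
    signature = inspect.signature(func)

    @wraps(func)
    def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
        bound = signature.bind(self, *args, **kwargs)
        for name, value in bound.arguments.items():
            # Only parameters explicitly annotated as PathType are verified.
            if signature.parameters[name].annotation == PathType:
                self._inner_verify(value)
        return func(self, *args, **kwargs)

    return wrapper


class ToyArtifactStore:
    """Hypothetical store used only to exercise the decorator."""

    def __init__(self, path: str) -> None:
        self.path = path.rstrip("/")

    def _inner_verify(self, candidate: PathType) -> None:
        text = candidate.decode() if isinstance(candidate, bytes) else candidate
        if text != self.path and not text.startswith(self.path + "/"):
            raise IOError(f"'{text}' is outside the artifact store bounds '{self.path}'")

    @verify_paths
    def exists(self, path: PathType) -> bool:
        return True  # a real store would query its backend here


store = ToyArtifactStore("/data/store")
print(store.exists("/data/store/model.pkl"))  # True
try:
    store.exists("/etc/passwd")
except IOError as error:
    print(f"rejected: {error}")
```

As the review points out, a scheme like this only protects parameters that carry the expected annotation, so an unannotated path argument would pass through unchecked.
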
src/zenml/artifacts/utils.py (1)
  • 762-769: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [755-766]

The update to _load_file_from_artifact_store to explicitly raise an IOError if the artifact store rejects the request is a good practice for robust error handling. This change makes the error more explicit and helps calling functions to handle this specific case more gracefully. However, it's important to ensure that this change is consistently documented in the function's docstring and that any calling functions are updated to handle this exception if necessary.

github-actions bot commented Mar 1, 2024

NLP template updates in examples/e2e_nlp have been pushed.

github-actions bot commented Mar 2, 2024

E2E template updates in examples/e2e have been pushed.

…a-step-endpoint' of https://github.com/zenml-io/zenml into bugfix/OSSK-462-directory-traversal-via-logs-object-of-a-step-endpoint
@avishniakov avishniakov changed the title [WIP] Improve Artifact Store isolation Improve Artifact Store isolation Mar 4, 2024
@avishniakov

@coderabbitai review

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 4

Configuration used: .coderabbit.yaml

Commits: files that changed from the base of the PR and between b265c32 and 18b1eb6.
Files selected for processing (11)
  • src/zenml/artifact_stores/base_artifact_store.py (5 hunks)
  • src/zenml/artifacts/utils.py (5 hunks)
  • src/zenml/logging/step_logging.py (4 hunks)
  • src/zenml/materializers/base_materializer.py (1 hunks)
  • src/zenml/materializers/built_in_materializer.py (7 hunks)
  • src/zenml/materializers/cloudpickle_materializer.py (5 hunks)
  • src/zenml/materializers/numpy_materializer.py (6 hunks)
  • src/zenml/materializers/pandas_materializer.py (5 hunks)
  • src/zenml/materializers/service_materializer.py (3 hunks)
  • src/zenml/materializers/structured_string_materializer.py (3 hunks)
  • src/zenml/orchestrators/output_utils.py (3 hunks)
Files skipped from review as they are similar to previous changes (1)
  • src/zenml/artifacts/utils.py
Additional comments: 25
src/zenml/materializers/service_materializer.py (2)
  • 49-51: The use of Client().active_stack.artifact_store to access the artifact store in the load method is a good practice as it ensures that the operation is performed within the context of the active stack. This change enhances the isolation and security of artifact operations.
  • 64-66: The update in the save method to use Client().active_stack.artifact_store for file operations aligns with the objective of improving artifact store isolation. This approach ensures that the service configuration is saved within the boundaries of the active stack's artifact store, which is crucial for maintaining data integrity and security.
src/zenml/orchestrators/output_utils.py (2)
  • 75-85: The update to use Client().active_stack.artifact_store for artifact URI preparation and directory operations in prepare_output_artifact_uris is a significant improvement. It ensures that artifact operations are performed within the context of the active stack's artifact store, enhancing isolation and security. However, it's important to ensure that error handling is robust, especially for cases where artifact URIs already exist or directory creation fails.
  • 96-99: The use of Client().active_stack.artifact_store for directory removal in remove_artifact_dirs is consistent with the PR's goal of improving artifact store isolation. This change ensures that operations are confined within the active stack's artifact store, which is crucial for maintaining data integrity and security. It's important to verify that error handling is in place for cases where directory removal fails.
src/zenml/materializers/structured_string_materializer.py (2)
  • 50-51: The update in the load method to use Client().active_stack.artifact_store for file operations is a positive change, aligning with the PR's objectives to enhance artifact store isolation. This ensures that file operations are performed within the context of the active stack's artifact store, which is crucial for maintaining data integrity and security.
  • 60-61: The change in the save method to utilize Client().active_stack.artifact_store for saving structured strings (CSV, HTML, Markdown) is consistent with the goal of improving artifact store isolation. This approach ensures that data is saved within the boundaries of the active stack's artifact store, enhancing data integrity and security.
src/zenml/materializers/cloudpickle_materializer.py (2)
  • 59-65: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [62-76]

The implementation of Python version compatibility checks in the load method, along with the use of Client().active_stack.artifact_store for file operations, is a thoughtful addition. This ensures that artifacts are loaded within the context of the active stack's artifact store and alerts users to potential issues when loading artifacts materialized under different Python versions. This approach enhances data integrity and user awareness of potential compatibility issues.

  • 94-100: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [97-116]

The update in the save method to use Client().active_stack.artifact_store for saving data, along with the warning about the use of the default Pickle materializer, is consistent with the PR's objectives. This change ensures that data is saved within the boundaries of the active stack's artifact store and raises awareness about the limitations of using Pickle for serialization, encouraging users to consider implementing custom materializers for better compatibility and security.
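
The version check mentioned for the load method can be pictured with a small, self-contained sketch. It assumes the interpreter version is written next to the pickled data and compared at load time; the file names, the `save`/`load` helpers, and the use of the standard `pickle` module on a local directory (instead of cloudpickle and an artifact store) are all assumptions made for illustration.

```python
import pickle
import platform
import tempfile
import warnings
from pathlib import Path

VERSION_FILE = "python_version.txt"  # hypothetical file name
DATA_FILE = "artifact.pkl"  # hypothetical file name


def save(obj: object, directory: Path) -> None:
    """Persist the object together with the interpreter version used."""
    (directory / VERSION_FILE).write_text(platform.python_version())
    with open(directory / DATA_FILE, "wb") as f:
        pickle.dump(obj, f)


def load(directory: Path) -> object:
    """Load the object, warning if it was written by another Python version."""
    saved_version = (directory / VERSION_FILE).read_text().strip()
    current_version = platform.python_version()
    if saved_version != current_version:
        warnings.warn(
            f"Artifact was saved with Python {saved_version} but is being "
            f"loaded with Python {current_version}; unpickling may fail."
        )
    with open(directory / DATA_FILE, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    save({"accuracy": 0.9}, Path(tmp))
    print(load(Path(tmp)))  # {'accuracy': 0.9}
```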

src/zenml/materializers/pandas_materializer.py (2)
  • 77-86: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [80-94]

The update in the load method to use Client().active_stack.artifact_store for loading data from .parquet or .csv files is a significant improvement. This change ensures that data loading operations are performed within the context of the active stack's artifact store, enhancing data integrity and security. The conditional handling based on the existence of .parquet files and the availability of pyarrow is well-implemented, ensuring compatibility and providing clear guidance for users on required dependencies.

  • 126-134: The changes in the save method to utilize Client().active_stack.artifact_store for saving data in .parquet or .csv format align with the PR's objectives to improve artifact store isolation. This approach ensures that data is saved within the boundaries of the active stack's artifact store, enhancing data integrity and security. The conditional logic to save data in .parquet format if pyarrow is available, otherwise in .csv format, is a practical solution that offers flexibility and efficiency in data storage.
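
The format fallback described above (Parquet when pyarrow is available, CSV otherwise) comes down to a simple availability check. The sketch below shows only that decision; `choose_dataframe_filename` and the base name `df` are hypothetical, and no DataFrame is actually written.

```python
import importlib.util


def choose_dataframe_filename(base_name: str = "df") -> str:
    """Pick the Parquet file name if pyarrow is importable, else the CSV one."""
    if importlib.util.find_spec("pyarrow") is not None:
        return f"{base_name}.parquet"
    return f"{base_name}.csv"


print(choose_dataframe_filename())
```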
src/zenml/logging/step_logging.py (2)
  • 75-90: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [67-87]

The update to use Client().active_stack.artifact_store for preparing log URIs in the prepare_logs_uri function is a positive change, ensuring that log files are stored within the context of the active stack's artifact store. This enhances the isolation and security of log operations. The handling of log file creation and removal is well-implemented, ensuring that logs are stored in a dedicated directory and that old log files are appropriately managed.

  • 139-145: The implementation of the save_to_file method in the StepLogsStorage class to use Client().active_stack.artifact_store for saving logs is consistent with the PR's objectives to improve artifact store isolation. This ensures that logs are saved within the boundaries of the active stack's artifact store, enhancing data integrity and security. The approach to buffer and store logs, along with the removal of ANSI escape codes, is well-thought-out, ensuring that logs are stored in a clean and readable format.
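
Stripping ANSI escape codes before writing logs, as mentioned above, is usually done with a regular expression. The following is a generic sketch rather than the code in step_logging.py; the pattern and the `strip_ansi` helper are assumptions.

```python
import re

# Matches common ANSI escape sequences such as colour codes ("\x1b[31m").
ANSI_ESCAPE = re.compile(r"\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])")


def strip_ansi(text: str) -> str:
    """Return `text` with ANSI escape sequences removed."""
    return ANSI_ESCAPE.sub("", text)


print(strip_ansi("\x1b[1;32mStep finished\x1b[0m"))  # Step finished
```
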
src/zenml/materializers/numpy_materializer.py (4)
  • 22-22: The import of the Client class from zenml.client is correctly added to facilitate the new approach of handling artifact store operations. This change aligns with the PR's objective to improve artifact store isolation.
  • 85-85: The use of artifact_store.open for reading numpy arrays from parquet files is correctly implemented. This change is part of the transition to using the Client class for artifact store operations. Ensure that error handling is robust, especially for cases where pyarrow is not installed, as this is critical for backward compatibility with older versions of ZenML.
  • 162-164: The method _save_histogram correctly uses the Client class for saving histograms. This is a good example of the improved artifact store isolation. Ensure that the visualization saving process is tested thoroughly, especially in edge cases where matplotlib might not be installed.
  • 177-178: The method _save_image correctly uses the Client class for saving images. This change is consistent with the PR's objectives. As with the _save_histogram method, ensure thorough testing of the visualization saving process.
src/zenml/materializers/base_materializer.py (1)
  • 159-161: The update to use artifact_store for file operations in the save_visualizations method is correctly implemented. This change enhances integration with the active stack's artifact store, aligning with the PR's objectives. Ensure that all visualization-related file operations are thoroughly tested to confirm that they work as expected with the new artifact store approach.
src/zenml/materializers/built_in_materializer.py (3)
  • 29-29: The import of the Client class from zenml.client is correctly added to facilitate the new approach of handling artifact store operations. This change aligns with the PR's objective to improve artifact store isolation.
  • 287-291: The method for loading materialized built-in container objects correctly uses the Client class for checking file existence. Ensure that error handling and logging are robust, especially for cases where expected files do not exist, to provide clear feedback to users.
  • 358-364: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [361-382]

The method for saving built-in container objects correctly transitions to using artifact_store. However, ensure that the process of creating subdirectories (artifact_store.mkdir) and handling exceptions is thoroughly tested, especially in scenarios where cleanup is necessary due to partial failures.

src/zenml/artifact_stores/base_artifact_store.py (5)
  • 16-16: The import of inspect is appropriate for introspection purposes, especially for the enhancements made to method registration and subclass initialization. This aligns with the PR's objective to enforce better isolation and validation within the artifact store.
  • 19-19: The import of Path from pathlib is crucial for handling filesystem paths in a platform-independent manner. This is a good practice, especially for a component like an artifact store that deals with file and directory paths extensively.
  • 46-75: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [49-95]

The _sanitize_potential_path function has been modified to handle root path and path type, and now raises a FileNotFoundError if the input path is outside of the artifact store bounds. This is a significant improvement in enforcing artifact store isolation. However, ensure that all calls to this function correctly handle the potential FileNotFoundError to avoid unhandled exceptions.

  • 100-184: The addition of the decorator function within _sanitize_paths is a clever way to enforce path sanitization across all relevant methods. This approach ensures that all paths are validated against the artifact store's root path, enhancing security and isolation. It's important to verify that this decorator does not introduce any performance issues, especially in high-throughput scenarios.
  • 477-489: The addition of the __init_subclass__ method to wrap abstract method implementations with path sanitizer is a proactive measure to ensure that all subclasses of BaseArtifactStore automatically enforce path sanitization. This is a good practice for maintaining consistency and security across all artifact store implementations. However, it's crucial to ensure that this does not interfere with the intended functionality of any subclass methods.
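
The `__init_subclass__` approach described here can be sketched as follows. The class and function names (`BaseToyStore`, `LocalToyStore`, `sanitize_paths`) are hypothetical and the sanitizer is deliberately simplified to a single path argument; the sketch only shows how a base class can wrap a subclass's implementations of its abstract methods so that an out-of-bounds path raises `FileNotFoundError`.

```python
from abc import ABC, abstractmethod
from functools import wraps
from typing import Any, Callable


def sanitize_paths(func: Callable[..., Any]) -> Callable[..., Any]:
    """Reject calls whose path argument lies outside `self.root`."""

    @wraps(func)
    def wrapper(self: Any, path: str, *args: Any, **kwargs: Any) -> Any:
        if path != self.root and not path.startswith(self.root + "/"):
            raise FileNotFoundError(f"'{path}' is outside of '{self.root}'")
        return func(self, path, *args, **kwargs)

    return wrapper


class BaseToyStore(ABC):
    def __init__(self, root: str) -> None:
        self.root = root.rstrip("/")

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Wrap only the abstract methods that this subclass implements itself.
        for name in BaseToyStore.__abstractmethods__:
            implementation = cls.__dict__.get(name)
            if implementation is not None:
                setattr(cls, name, sanitize_paths(implementation))

    @abstractmethod
    def exists(self, path: str) -> bool:
        """Return whether `path` exists in the store."""


class LocalToyStore(BaseToyStore):
    def exists(self, path: str) -> bool:
        return True  # placeholder; no real file system access


store = LocalToyStore("/data/store")
print(store.exists("/data/store/logs/step.log"))  # True
try:
    store.exists("/data/other/step.log")
except FileNotFoundError as error:
    print(f"rejected: {error}")
```

Because the wrapping happens when the subclass is defined, subclass authors do not have to remember to apply the decorator themselves.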

@avishniakov avishniakov merged commit 00e934f into develop Mar 5, 2024
55 checks passed
@avishniakov avishniakov deleted the bugfix/OSSK-462-directory-traversal-via-logs-object-of-a-step-endpoint branch March 5, 2024 08:02
adtygan pushed a commit to adtygan/zenml that referenced this pull request Mar 21, 2024
* dir traversal issue

* Auto-update of Starter template

* Auto-update of NLP template

* reroute artifacts and logs via AS

* reroute materializers via AS

* simplify to one deco

* fix materializer tests

* allow local download

* Auto-update of E2E template

* fix test issues

* rework based on comments

* fix bugs

* lint

* Candidate (zenml-io#2493)

Co-authored-by: Stefan Nica

* darglint

---------

Co-authored-by: GitHub Actions
Co-authored-by: Stefan Nica
Labels: bug, internal, run-slow-ci, security