Merge pull request #187 from cgat-developers/AC-docs
Ac docs
Acribbs authored Nov 14, 2024
2 parents f8ae6b0 + acd057e commit a81b816
Showing 4 changed files with 169 additions and 0 deletions.
File renamed without changes
90 changes: 90 additions & 0 deletions docs/s3_integration/configuring_s3.md
@@ -0,0 +1,90 @@
# Configuring S3 for Pipeline Execution

To integrate AWS S3 into your CGAT pipeline, you need to configure S3 access so that the pipeline can read and write data stored in S3. This document explains how to set up S3 configuration for CGAT pipelines.

## Overview

`configure_s3()` is a utility function provided by the CGATcore pipeline tools to handle authentication and access to AWS S3. This function allows you to provide credentials, specify regions, and set up other configurations that enable seamless integration of S3 into your workflow.

### Basic Configuration

To get started, you will need to import and use the `configure_s3()` function. Here is a basic example:

```python
from cgatcore.pipeline import configure_s3

configure_s3(aws_access_key_id="YOUR_AWS_ACCESS_KEY", aws_secret_access_key="YOUR_AWS_SECRET_KEY")
```

### Configurable Parameters

- **`aws_access_key_id`**: Your AWS access key, used to authenticate and identify the user.
- **`aws_secret_access_key`**: Your secret key, corresponding to your access key.
- **`region_name`** (optional): AWS region where your S3 bucket is located. Defaults to the region set in your environment, if available.
- **`profile_name`** (optional): Name of the AWS profile to use if you have multiple profiles configured locally.
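
The sketch below shows how these parameters fit together; the key and region values are placeholders that you would replace with your own:

```python
from cgatcore.pipeline import configure_s3

# Placeholder credentials and region, for illustration only
configure_s3(
    aws_access_key_id="YOUR_AWS_ACCESS_KEY",
    aws_secret_access_key="YOUR_AWS_SECRET_KEY",
    region_name="eu-west-2"  # region hosting your S3 bucket (illustrative)
)
```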

### Using AWS Profiles

If you have multiple AWS profiles configured locally, you can use the `profile_name` parameter to select the appropriate one without hardcoding the access keys in your code:

```python
configure_s3(profile_name="my-profile")
```

### Configuring Endpoints

To use custom endpoints, such as when working with MinIO or an AWS-compatible service:

```python
configure_s3(
    aws_access_key_id="YOUR_AWS_ACCESS_KEY",
    aws_secret_access_key="YOUR_AWS_SECRET_KEY",
    endpoint_url="https://custom-endpoint.com"
)
```

### Security Recommendations

1. **Environment Variables**: Use environment variables to set credentials securely rather than hardcoding them in your scripts. This avoids potential exposure of credentials:

```bash
export AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_KEY
```

2. **AWS IAM Roles**: If you are running the pipeline on AWS infrastructure (such as EC2 instances), it's recommended to use IAM roles. These roles provide temporary security credentials that are automatically rotated by AWS.
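
When running under an IAM role, you can typically omit explicit keys altogether. The sketch below assumes that `configure_s3()` falls back to boto3's default credential chain, which picks up the role's temporary credentials from the instance metadata, when no keys are supplied:

```python
from cgatcore.pipeline import configure_s3

# Assumption: with no explicit credentials, configure_s3() defers to boto3's
# default credential chain, which resolves the instance's IAM role.
configure_s3(region_name="eu-west-2")  # region value is illustrative
```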

### Example Pipeline Integration

After configuring S3, you can seamlessly use the S3-aware methods within your pipeline. Below is an example:

```python
from cgatcore.pipeline import configure_s3, get_s3_pipeline
from ruffus import suffix

# Configure S3 access
configure_s3(profile_name="my-profile")

# Instantiate the S3 pipeline
s3_pipeline = get_s3_pipeline()

# Use S3-aware methods in the pipeline
@s3_pipeline.s3_transform("s3://my-bucket/input.txt", suffix(".txt"), ".processed")
def process_s3_file(infile, outfile):
    # Read the input, transform it, and write the result
    with open(infile, 'r') as fin:
        data = fin.read()
    processed_data = data.upper()
    with open(outfile, 'w') as fout:
        fout.write(processed_data)
```

### Summary

- Use the `configure_s3()` function to set up AWS credentials and S3 access.
- Options are available to use IAM roles, profiles, or custom endpoints.
- Use the S3-aware decorators to integrate S3 files seamlessly in your pipeline.

## Additional Resources

- [AWS IAM Roles Documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html)
- [AWS CLI Configuration and Credential Files](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
5 changes: 5 additions & 0 deletions docs/s3_integration/s3_decorators.md
@@ -0,0 +1,5 @@
# CGATcore S3 decorators

::: cgatcore.pipeline
    :members:
    :show-inheritance:
74 changes: 74 additions & 0 deletions docs/s3_integration/s3_pipeline.md
@@ -0,0 +1,74 @@
# S3 Pipeline

The `S3Pipeline` class is part of the integration with AWS S3, enabling seamless data handling in CGAT pipelines that use both local files and S3 storage. This is particularly useful when working with large datasets that are better managed in cloud storage or when collaborating across multiple locations.

## Overview

`S3Pipeline` provides the following functionalities:

- Integration of AWS S3 into CGAT pipeline workflows
- Lazy-loading of S3-specific classes and functions to avoid circular dependencies
- Facilitating operations on files that reside on S3, making it possible to apply transformations and merges without copying everything locally

### Example Usage

The `S3Pipeline` class can be accessed through the `get_s3_pipeline()` function, which returns an instance that is lazy-loaded to prevent issues related to circular imports. Below is an example of how to use it:

```python
from cgatcore.pipeline import get_s3_pipeline

# Instantiate the S3 pipeline
s3_pipeline = get_s3_pipeline()

# Use methods from s3_pipeline as needed
s3_pipeline.s3_transform(...)
```

### Building a Function Using `S3Pipeline`

To build a function that utilises `S3Pipeline`, you can follow a few simple steps. Below is a guide on how to create a function that uses the `s3_transform` method to process data from S3:

1. **Import the required modules**: First, import `get_s3_pipeline` from `cgatcore.pipeline`.
2. **Instantiate the pipeline**: Use `get_s3_pipeline()` to create an instance of `S3Pipeline`.
3. **Define your function**: Use the S3-aware methods like `s3_transform()` to perform the desired operations on your S3 files.

#### Example Function

```python
from cgatcore.pipeline import get_s3_pipeline
from ruffus import suffix

# Instantiate the S3 pipeline
s3_pipeline = get_s3_pipeline()

# Define a function that uses s3_transform
def process_s3_data(input_s3_path, output_s3_path):
    @s3_pipeline.s3_transform(input_s3_path, suffix(".txt"), output_s3_path)
    def transform_data(infile, outfile):
        # Add your processing logic here
        with open(infile, 'r') as fin:
            data = fin.read()
        # Example transformation
        processed_data = data.upper()
        with open(outfile, 'w') as fout:
            fout.write(processed_data)

    # Run the transformation
    transform_data()
```

### Methods in `S3Pipeline`

- **`s3_transform(*args, **kwargs)`**: Perform a transformation on data stored in S3, similar to Ruffus `transform()` but adapted for S3 files.
- **`s3_merge(*args, **kwargs)`**: Merge multiple input files into one, allowing the files to be located on S3.
- **`s3_split(*args, **kwargs)`**: Split input data into smaller chunks, enabling parallel processing, even when the input resides on S3.
- **`s3_originate(*args, **kwargs)`**: Create new files directly in S3.
- **`s3_follows(*args, **kwargs)`**: Indicate a dependency on another task, ensuring correct task ordering even for S3 files.

These methods are intended to be directly equivalent to standard Ruffus methods, allowing pipelines to easily mix and match S3-based and local operations.
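
As a sketch of how these decorators can be mixed, the example below merges two per-sample files held on S3 into a single local output. The bucket and file names are placeholders, and `s3_merge()` is assumed to accept the same `(inputs, output)` arguments as the Ruffus `merge()` decorator:

```python
from cgatcore.pipeline import get_s3_pipeline

s3_pipeline = get_s3_pipeline()

# Placeholder S3 inputs merged into a local file; the decorator is assumed
# to stage the S3 objects so they can be opened like local files.
@s3_pipeline.s3_merge(
    ["s3://my-bucket/sample1.counts",
     "s3://my-bucket/sample2.counts"],
    "combined.counts")
def combine_counts(infiles, outfile):
    with open(outfile, "w") as fout:
        for infile in infiles:
            with open(infile, "r") as fin:
                fout.write(fin.read())
```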

## Why Use `S3Pipeline`?

- **Scalable Data Management**: Keeps large datasets in the cloud, reducing local storage requirements.
- **Seamless Integration**: Provides a drop-in replacement for standard decorators, enabling hybrid workflows involving both local and cloud files.
- **Lazy Loading**: Optimised to initialise S3 components only when they are needed, minimising overhead and avoiding unnecessary dependencies.
