Merge pull request #187 from cgat-developers/AC-docs
Ac docs
Acribbs authored Nov 14, 2024
2 parents f8ae6b0 + acd057e commit a81b816
Showing 4 changed files with 169 additions and 0 deletions.
File renamed without changes
90 changes: 90 additions & 0 deletions docs/s3_integration/configuring_s3.md
@@ -0,0 +1,90 @@
# Configuring S3 for Pipeline Execution

To integrate AWS S3 into your CGAT pipeline, you need to configure S3 access so that the pipeline can read and write data stored in S3. This document explains how to set up S3 configuration for CGAT pipelines.

## Overview

`configure_s3()` is a utility function provided by the CGATcore pipeline tools to handle authentication and access to AWS S3. This function allows you to provide credentials, specify regions, and set up other configurations that enable seamless integration of S3 into your workflow.

### Basic Configuration

To get started, you will need to import and use the `configure_s3()` function. Here is a basic example:

```python
from cgatcore.pipeline import configure_s3

configure_s3(aws_access_key_id="YOUR_AWS_ACCESS_KEY", aws_secret_access_key="YOUR_AWS_SECRET_KEY")
```

### Configurable Parameters

- **`aws_access_key_id`**: Your AWS access key, used to authenticate and identify the user.
- **`aws_secret_access_key`**: Your secret key, corresponding to your access key.
- **`region_name`** (optional): AWS region where your S3 bucket is located. Defaults to the region set in your environment, if available.
- **`profile_name`** (optional): Name of the AWS profile to use if you have multiple profiles configured locally.
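
The sketch below shows how these parameters fit together; the key and region values are placeholders that you would replace with your own:

```python
from cgatcore.pipeline import configure_s3

# Placeholder credentials and region, for illustration only
configure_s3(
    aws_access_key_id="YOUR_AWS_ACCESS_KEY",
    aws_secret_access_key="YOUR_AWS_SECRET_KEY",
    region_name="eu-west-2"  # region hosting your S3 bucket (illustrative)
)
```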

### Using AWS Profiles

If you have multiple AWS profiles configured locally, you can use the `profile_name` parameter to select the appropriate one without hardcoding the access keys in your code:

```python
configure_s3(profile_name="my-profile")
```

### Configuring Endpoints

To use custom endpoints, such as when working with MinIO or an AWS-compatible service:

```python
configure_s3(
    aws_access_key_id="YOUR_AWS_ACCESS_KEY",
    aws_secret_access_key="YOUR_AWS_SECRET_KEY",
    endpoint_url="https://custom-endpoint.com"
)
```

### Security Recommendations

1. **Environment Variables**: Use environment variables to set credentials securely rather than hardcoding them in your scripts. This avoids potential exposure of credentials:

```bash
export AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_KEY
```

2. **AWS IAM Roles**: If you are running the pipeline on AWS infrastructure (such as EC2 instances), it's recommended to use IAM roles. These roles provide temporary security credentials that are automatically rotated by AWS.
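
When running under an IAM role, you can typically omit explicit keys altogether. The sketch below assumes that `configure_s3()` falls back to boto3's default credential chain, which picks up the role's temporary credentials from the instance metadata, when no keys are supplied:

```python
from cgatcore.pipeline import configure_s3

# Assumption: with no explicit credentials, configure_s3() defers to boto3's
# default credential chain, which resolves the instance's IAM role.
configure_s3(region_name="eu-west-2")  # region value is illustrative
```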

### Example Pipeline Integration

After configuring S3, you can seamlessly use the S3-aware methods within your pipeline. Below is an example:

```python
from cgatcore.pipeline import configure_s3, get_s3_pipeline
from ruffus import suffix

# Configure S3 access
configure_s3(profile_name="my-profile")

# Instantiate the S3 pipeline
s3_pipeline = get_s3_pipeline()

# Use S3-aware methods in the pipeline
@s3_pipeline.s3_transform("s3://my-bucket/input.txt", suffix(".txt"), ".processed")
def process_s3_file(infile, outfile):
    # Read the input, transform it, and write the result
    with open(infile, 'r') as fin:
        data = fin.read()
    processed_data = data.upper()
    with open(outfile, 'w') as fout:
        fout.write(processed_data)
```

### Summary

- Use the `configure_s3()` function to set up AWS credentials and S3 access.
- Options are available to use IAM roles, profiles, or custom endpoints.
- Use the S3-aware decorators to integrate S3 files seamlessly in your pipeline.

## Additional Resources

- [AWS IAM Roles Documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html)
- [AWS CLI Configuration and Credential Files](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
5 changes: 5 additions & 0 deletions docs/s3_integration/s3_decorators.md
@@ -0,0 +1,5 @@
# CGATcore S3 decorators

::: cgatcore.pipeline
    :members:
    :show-inheritance:
74 changes: 74 additions & 0 deletions docs/s3_integration/s3_pipeline.md
@@ -0,0 +1,74 @@
# S3 Pipeline

The `S3Pipeline` class is part of the integration with AWS S3, enabling seamless data handling in CGAT pipelines that use both local files and S3 storage. This is particularly useful when working with large datasets that are better managed in cloud storage or when collaborating across multiple locations.

## Overview

`S3Pipeline` provides the following functionalities:

- Integration of AWS S3 into CGAT pipeline workflows
- Lazy-loading of S3-specific classes and functions to avoid circular dependencies
- Facilitating operations on files that reside on S3, making it possible to apply transformations and merges without copying everything locally

### Example Usage

The `S3Pipeline` class can be accessed through the `get_s3_pipeline()` function, which returns an instance that is lazy-loaded to prevent issues related to circular imports. Below is an example of how to use it:

```python
from cgatcore.pipeline import get_s3_pipeline

# Instantiate the S3 pipeline
s3_pipeline = get_s3_pipeline()

# Use methods from s3_pipeline as needed
s3_pipeline.s3_transform(...)
```

### Building a Function Using `S3Pipeline`

To build a function that utilises `S3Pipeline`, you can follow a few simple steps. Below is a guide on how to create a function that uses the `s3_transform` method to process data from S3:

1. **Import the required modules**: First, import `get_s3_pipeline` from `cgatcore.pipeline`.
2. **Instantiate the pipeline**: Use `get_s3_pipeline()` to create an instance of `S3Pipeline`.
3. **Define your function**: Use the S3-aware methods like `s3_transform()` to perform the desired operations on your S3 files.

#### Example Function

```python
from cgatcore.pipeline import get_s3_pipeline
from ruffus import suffix

# Instantiate the S3 pipeline
s3_pipeline = get_s3_pipeline()

# Define a function that uses s3_transform
def process_s3_data(input_s3_path, output_s3_path):
    @s3_pipeline.s3_transform(input_s3_path, suffix(".txt"), output_s3_path)
    def transform_data(infile, outfile):
        # Add your processing logic here
        with open(infile, 'r') as fin:
            data = fin.read()
        # Example transformation
        processed_data = data.upper()
        with open(outfile, 'w') as fout:
            fout.write(processed_data)

    # Run the transformation
    transform_data()
```

### Methods in `S3Pipeline`

- **`s3_transform(*args, **kwargs)`**: Perform a transformation on data stored in S3, similar to Ruffus `transform()` but adapted for S3 files.
- **`s3_merge(*args, **kwargs)`**: Merge multiple input files into one, allowing the files to be located on S3.
- **`s3_split(*args, **kwargs)`**: Split input data into smaller chunks, enabling parallel processing, even when the input resides on S3.
- **`s3_originate(*args, **kwargs)`**: Create new files directly in S3.
- **`s3_follows(*args, **kwargs)`**: Indicate a dependency on another task, ensuring correct task ordering even for S3 files.

These methods are intended to be directly equivalent to standard Ruffus methods, allowing pipelines to easily mix and match S3-based and local operations.
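
As a sketch of how these decorators can be mixed, the example below merges two per-sample files held on S3 into a single local output. The bucket and file names are placeholders, and `s3_merge()` is assumed to accept the same `(inputs, output)` arguments as the Ruffus `merge()` decorator:

```python
from cgatcore.pipeline import get_s3_pipeline

s3_pipeline = get_s3_pipeline()

# Placeholder S3 inputs merged into a local file; the decorator is assumed
# to stage the S3 objects so they can be opened like local files.
@s3_pipeline.s3_merge(
    ["s3://my-bucket/sample1.counts",
     "s3://my-bucket/sample2.counts"],
    "combined.counts")
def combine_counts(infiles, outfile):
    with open(outfile, "w") as fout:
        for infile in infiles:
            with open(infile, "r") as fin:
                fout.write(fin.read())
```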

## Why Use `S3Pipeline`?

- **Scalable Data Management**: Keeps large datasets in the cloud, reducing local storage requirements.
- **Seamless Integration**: Provides a drop-in replacement for standard decorators, enabling hybrid workflows involving both local and cloud files.
- **Lazy Loading**: Optimised to initialise S3 components only when they are needed, minimising overhead and avoiding unnecessary dependencies.
