
cli-upload #81

Merged
merged 17 commits into from Jun 16, 2022

Changes from 14 commits
1 change: 1 addition & 0 deletions .github/workflows/tests.yaml
@@ -57,5 +57,6 @@ jobs:
gt4sd-trainer --help
gt4sd-inference --help
gt4sd-saving --help
gt4sd-upload --help
gt4sd-pl-to-hf --help
gt4sd-hf-to-st --help
11 changes: 11 additions & 0 deletions README.md
@@ -268,6 +268,17 @@ Run the algorithm via `gt4sd-inference` (again the model produced in the example
gt4sd-inference --algorithm_name PaccMannGP --algorithm_application PaccMannGPGenerator --algorithm_version fast-example-v0 --number_of_samples 5 --target '{"molwt": {"target": 60.0}}'
```

### Uploading a trained algorithm to a server via the CLI

If you have access to a server (local or cloud), you can easily upload your trained models. The syntax follows that of the saving pipeline, using `gt4sd-upload`:

```sh
gt4sd-upload --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator
```

You have to update `gt4sd_s3_host`, `gt4sd_s3_access_key`, `gt4sd_s3_secret_key`, `gt4sd_s3_secure` and `gt4sd_s3_bucket` appropriately to upload the models.
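For instance, following the conventions of the self-hosting guide linked below, these settings can be exported as environment variables of the same name in upper case (the values here are placeholders, not working credentials):

```sh
export GT4SD_S3_HOST='127.0.0.1:9000'
export GT4SD_S3_ACCESS_KEY='<your-access-key>'
export GT4SD_S3_SECRET_KEY='<your-secret-key>'
export GT4SD_S3_SECURE=False
export GT4SD_S3_BUCKET='gt4sd-cos-algorithms-artifacts'
```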
An example of self-hosting a COS (MinIO) locally, to which you can upload your models, can be found [here](https://gt4sd.github.io/gt4sd-core/source/gt4sd_server_upload_md.md).

### Additional examples

Find more examples in [notebooks](./notebooks)
1 change: 1 addition & 0 deletions docs/index.md
@@ -8,6 +8,7 @@ maxdepth: 2
---
Examples on how to use the GT4SD algorithms <source/gt4sd_inference_usage_md.md>
Examples on how to add an algorithm to GT4SD <source/gt4sd_algorithm_addition_md.md>
Examples on how to upload models to a self-hosted minio service using GT4SD <source/gt4sd_server_upload_md.md>
```

## Python API
128 changes: 128 additions & 0 deletions docs/source/gt4sd_server_upload_md.md
@@ -0,0 +1,128 @@
# GT4SD server upload

Here we report an example of how you can set up a custom MinIO server on localhost to which you can upload your algorithms. Keep in mind that the same procedure can be used with a pre-existing COS by simply setting the environment variables to the appropriate values.

------

## Requirements

* docker
* minio
* gt4sd [requirements](https://github.com/GT4SD/gt4sd-core/blob/main/requirements.txt)
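Assuming you are working from a clone of `gt4sd-core`, a minimal sketch for installing the Python requirements is:

```sh
pip install -r requirements.txt
```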

-----

## Run a local minio server


### 1) Set environment variables

```sh
export GT4SD_S3_SECRET_KEY=''
export GT4SD_S3_ACCESS_KEY=''
export GT4SD_S3_HOST='127.0.0.1:9000'
export GT4SD_S3_SECURE=False
export GT4SD_S3_BUCKET='gt4sd-cos-algorithms-artifacts'
export GT4SD_S3_BUCKET_MODELS='gt4sd-cos-algorithms-models'
export GT4SD_S3_BUCKET_DATA='gt4sd-cos-algorithms-data'
```

Set `GT4SD_S3_SECURE` to `True` if the server uses HTTPS, or to `False` if it uses HTTP.

### 2) Create a docker container with a minio server

```sh
cd ~/
mkdir localhost-server
cd localhost-server
mkdir env/
touch docker-compose.yml
```

Copy this configuration into `docker-compose.yml`:

```yaml
version: '3'
services:
  cos:
    image: minio/minio:RELEASE.2022-06-07T00-33-41Z
    ports:
      - 9000:9000
    env_file:
      - env/.env.dev
    environment:
      MINIO_ACCESS_KEY: "${GT4SD_S3_ACCESS_KEY}"
      MINIO_SECRET_KEY: "${GT4SD_S3_SECRET_KEY}"
    command: server /export
  createbuckets:
    image: minio/mc
    depends_on:
      - cos
    env_file:
      - env/.env.dev
    # ensure there is a file in the artifacts bucket
    entrypoint: >
      /bin/sh -c "
      /usr/bin/mc config host add myminio http://cos:9000 ${GT4SD_S3_ACCESS_KEY} ${GT4SD_S3_SECRET_KEY};
      /usr/bin/mc mb myminio/${GT4SD_S3_BUCKET};
      /usr/bin/mc mb myminio/${GT4SD_S3_BUCKET_DATA};
      /usr/bin/mc mb myminio/${GT4SD_S3_BUCKET_MODELS};
      echo 'this is an artifact' >> a_file.txt;
      /usr/bin/mc cp a_file.txt myminio/${GT4SD_S3_BUCKET}/a_file.txt;
      exit 0;
      "
```

You can store default environment variables in `.env.dev`.
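A minimal sketch of `env/.env.dev`, assuming you substitute your own credentials for the placeholders:

```sh
# env/.env.dev -- example defaults (placeholder values)
GT4SD_S3_ACCESS_KEY=<your-access-key>
GT4SD_S3_SECRET_KEY=<your-secret-key>
GT4SD_S3_BUCKET=gt4sd-cos-algorithms-artifacts
GT4SD_S3_BUCKET_MODELS=gt4sd-cos-algorithms-models
GT4SD_S3_BUCKET_DATA=gt4sd-cos-algorithms-data
```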


### 3) MinIO server configuration

Add the new server to the minio configuration file (`~/.mc/config.json`), replacing the `${...}` placeholders with the corresponding values (`mc` does not expand environment variables in its config):

```json
{
  "version": "10",
  "aliases": {
    "myminio": {
      "url": "http://${GT4SD_S3_HOST}",
      "accessKey": "${GT4SD_S3_ACCESS_KEY}",
      "secretKey": "${GT4SD_S3_SECRET_KEY}",
      "api": "s3v4",
      "path": "auto"
    },
    ...
  }
}
```

Alternatively, you can register `myminio` from the command line:

```sh
mc alias set myminio http://$GT4SD_S3_HOST $GT4SD_S3_ACCESS_KEY $GT4SD_S3_SECRET_KEY
```

### 4) Run docker

Running `docker compose up` inside `localhost-server` creates a local MinIO server and the bucket structure on `myminio`.
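For example (the `-d` flag runs the services detached and is optional):

```sh
cd ~/localhost-server
docker compose up -d
```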
If everything is working, you should be able to see `a_file.txt` by running:

```sh
mc ls myminio/gt4sd-cos-algorithms-artifacts/
```

-------

## Upload models

After setting the environment variables appropriately and following steps 1-4, you can now upload your model to the server:

```sh
gt4sd-upload --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator
```

You should be able to see the model and the uploaded files by running:

```sh
mc ls myminio/gt4sd-cos-algorithms-artifacts/controlled_sampling/PaccMannGP/PaccMannGPGenerator/fast-example-v0/
```
2 changes: 2 additions & 0 deletions requirements.txt
@@ -22,3 +22,5 @@ tensorflow==2.1.0
keras==2.3.1
sacremoses>=0.0.41
protobuf<=3.20.1


1 change: 1 addition & 0 deletions setup.cfg
@@ -44,6 +44,7 @@ console_scripts=
    gt4sd-trainer = gt4sd.cli.trainer:main
    gt4sd-inference = gt4sd.cli.inference:main
    gt4sd-saving = gt4sd.cli.saving:main
    gt4sd-upload = gt4sd.cli.upload:main
    gt4sd-pl-to-hf = gt4sd.cli.pl_to_hf_converter:main
    gt4sd-hf-to-st = gt4sd.cli.hf_to_st_converter:main

121 changes: 121 additions & 0 deletions src/gt4sd/algorithms/core.py
@@ -54,6 +54,7 @@
    get_algorithm_subdirectories_with_s3,
    get_cached_algorithm_path,
    sync_algorithm_with_s3,
    upload_to_s3,
)
from ..exceptions import InvalidItem, S3SyncError, SamplingError
from ..training_pipelines.core import TrainingPipelineArguments
@@ -442,6 +443,33 @@ def list_versions(cls) -> Set[str]:
        versions = versions.union(get_algorithm_subdirectories_in_cache(prefix))
        return versions

    @classmethod
    def list_remote_versions(cls, prefix) -> Set[str]:
        """Get possible algorithm versions on s3.

        Before uploading an artifact to S3, we need to check that
        a particular version is not already present, to avoid overwriting it by mistake.
        If the final set is empty, we can upload the folder artifact.
        If the final set is not empty, we need to check that the specific version
        of interest is not present.

        Only S3 is searched (not the local cache) for matching versions.

        Args:
            prefix: prefix for the version search. If empty, the application
                prefix (without version) is used.

        Returns:
            viable values as :attr:`algorithm_version` for the environment.
        """
        # default: application prefix without version
        if not prefix:
            prefix = cls.get_application_prefix()
        try:
            versions = get_algorithm_subdirectories_with_s3(prefix)
        except (KeyError, S3SyncError) as error:
            logger.info(
                f"Searching S3 raised {error.__class__.__name__}. This means that no versions are available on S3."
            )
            logger.debug(error)
            versions = set()
        return versions
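    # A hypothetical usage sketch (the application class and version string are
    # assumed for illustration, mirroring the README example):
    #
    #   prefix = PaccMannGPGenerator.get_application_prefix()
    #   if "fast-example-v0" not in PaccMannGPGenerator.list_remote_versions(prefix):
    #       ...  # safe to upload this version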

    @classmethod
    def get_filepath_mappings_for_training_pipeline_arguments(
        cls, training_pipeline_arguments: TrainingPipelineArguments
@@ -535,6 +563,99 @@ def save_version_from_training_pipeline_arguments(

logger.info(f"Artifacts saving completed into {target_path}")

    @classmethod
    def upload_version_from_training_pipeline_arguments_postprocess(
        cls,
        training_pipeline_arguments: TrainingPipelineArguments,
    ):
        """Postprocess after uploading. Not implemented yet.

        Args:
            training_pipeline_arguments: training pipeline arguments.
        """
        pass

    @classmethod
    def upload_version_from_training_pipeline_arguments(
        cls,
        training_pipeline_arguments: TrainingPipelineArguments,
        target_version: str,
        source_version: Optional[str] = None,
    ) -> None:
        """Upload a version using training pipeline arguments.

        Args:
            training_pipeline_arguments: training pipeline arguments.
            target_version: target version used to save the model in s3.
            source_version: source version to use for missing artifacts.
                Defaults to None, a.k.a., use the default version.
        """
        filepaths_mapping: Dict[str, str] = {}

        try:
            filepaths_mapping = (
                cls.get_filepath_mappings_for_training_pipeline_arguments(
                    training_pipeline_arguments=training_pipeline_arguments
                )
            )
        except ValueError:
            logger.info(
                f"{cls.__name__} can not upload a version based on {training_pipeline_arguments}"
            )

        if len(filepaths_mapping) > 0:
            # probably redundant
            if source_version is None:
                source_version = cls.algorithm_version
            source_missing_path = cls.ensure_artifacts_for_version(source_version)

            # prefix for a run
            prefix = cls.get_application_prefix()
            # versions in s3 with that prefix
            versions = cls.list_remote_versions(prefix)

            # check if the target version is already in s3. If yes, don't upload.
            if target_version not in versions:
                logger.info(
                    f"There is no version {target_version} in S3, starting upload..."
                )
            else:
                logger.info(
                    f"Version {target_version} already exists in S3, skipping upload..."
                )
                return

            # mapping between filenames and paths for a version.
            filepaths_mapping = {
                filename: source_filepath
                if os.path.exists(source_filepath)
                else os.path.join(source_missing_path, filename)
                for filename, source_filepath in filepaths_mapping.items()
            }

            logger.info(
                f"Uploading artifacts into {os.path.join(prefix, target_version)}..."
            )
            try:
                for target_filename, source_filepath in filepaths_mapping.items():
                    # algorithm_type/algorithm_name/algorithm_application/version/filename
                    # for the moment we assume that the prefix exists in s3.
                    target_filepath = os.path.join(
                        prefix, target_version, target_filename
                    )
                    upload_to_s3(target_filepath, source_filepath)
                    logger.info(
                        f"Uploaded artifact {source_filepath} into {target_filepath}..."
                    )
            except S3SyncError:
                logger.warning("Problem with upload...")
                return

            logger.info(
                f"Artifacts uploading completed into {os.path.join(prefix, target_version)}"
            )
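    # Hypothetical programmatic usage (a concrete configuration class and its
    # pipeline arguments are assumed; the gt4sd-upload CLI in gt4sd.cli.upload
    # presumably wires command-line flags into the same call):
    #
    #   MyGeneratorConfiguration.upload_version_from_training_pipeline_arguments(
    #       training_pipeline_arguments=pipeline_args,
    #       target_version="fast-example-v0",
    #   )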

    @classmethod
    def ensure_artifacts_for_version(cls, algorithm_version: str) -> str:
        """The artifacts matching the path defined by class attributes and the given version are downloaded.