
cli-upload #81

Merged
merged 17 commits into from Jun 16, 2022

Changes from 14 commits
1 change: 1 addition & 0 deletions .github/workflows/tests.yaml
@@ -57,5 +57,6 @@ jobs:
gt4sd-trainer --help
gt4sd-inference --help
gt4sd-saving --help
gt4sd-upload --help
gt4sd-pl-to-hf --help
gt4sd-hf-to-st --help
11 changes: 11 additions & 0 deletions README.md
@@ -268,6 +268,17 @@ Run the algorithm via `gt4sd-inference` (again the model produced in the example
gt4sd-inference --algorithm_name PaccMannGP --algorithm_application PaccMannGPGenerator --algorithm_version fast-example-v0 --number_of_samples 5 --target '{"molwt": {"target": 60.0}}'
```

### Uploading a trained algorithm to a server via the CLI

If you have access to a server (local or cloud), you can easily upload your trained models. The syntax follows that of the saving pipeline, using `gt4sd-upload`:

```sh
gt4sd-upload --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator
```

You have to update `gt4sd_s3_host`, `gt4sd_s3_access_key`, `gt4sd_s3_secret_key`, `gt4sd_s3_secure` and `gt4sd_s3_bucket` appropriately to upload the models.
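For instance, following the conventions of the self-hosting guide linked below, these settings can be exported as environment variables of the same name in upper case (the values here are placeholders, not working credentials):

```sh
export GT4SD_S3_HOST='127.0.0.1:9000'
export GT4SD_S3_ACCESS_KEY='<your-access-key>'
export GT4SD_S3_SECRET_KEY='<your-secret-key>'
export GT4SD_S3_SECURE=False
export GT4SD_S3_BUCKET='gt4sd-cos-algorithms-artifacts'
```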
An example of self-hosting a COS (MinIO) locally, to which you can upload your models, can be found [here](https://gt4sd.github.io/gt4sd-core/source/gt4sd_server_upload_md.md).

### Additional examples

Find more examples in [notebooks](./notebooks)
1 change: 1 addition & 0 deletions docs/index.md
@@ -8,6 +8,7 @@ maxdepth: 2
---
Examples on how to use the GT4SD algorithms <source/gt4sd_inference_usage_md.md>
Examples on how to add an algorithm to GT4SD <source/gt4sd_algorithm_addition_md.md>
Examples on how to upload models to a self-hosted minio service using GT4SD <source/gt4sd_server_upload_md.md>
```

## Python API
128 changes: 128 additions & 0 deletions docs/source/gt4sd_server_upload_md.md
@@ -0,0 +1,128 @@
# GT4SD server upload

Here we report an example of how you can set up a custom MinIO server on localhost to which you can upload your algorithms. Keep in mind that the same procedure can be used with a pre-existing COS by simply setting the environment variables to the appropriate values.

------

## Requirements

* docker
* minio
* gt4sd [requirements](https://github.com/GT4SD/gt4sd-core/blob/main/requirements.txt)
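Assuming you are working from a clone of `gt4sd-core`, a minimal sketch for installing the Python requirements is:

```sh
pip install -r requirements.txt
```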

-----

## Run a local minio server


### 1) Set environment variables

```sh
export GT4SD_S3_SECRET_KEY=''
export GT4SD_S3_ACCESS_KEY=''
export GT4SD_S3_HOST='127.0.0.1:9000'
export GT4SD_S3_SECURE=False
export GT4SD_S3_BUCKET='gt4sd-cos-algorithms-artifacts'
export GT4SD_S3_BUCKET_MODELS='gt4sd-cos-algorithms-models'
export GT4SD_S3_BUCKET_DATA='gt4sd-cos-algorithms-data'
```

Set `GT4SD_S3_SECURE` to `True` if the server uses HTTPS, or to `False` if it uses HTTP.

### 2) Create a docker container with a minio server

```sh
cd ~/
mkdir localhost-server
cd localhost-server
mkdir env/
touch docker-compose.yml
```

Copy this configuration into `docker-compose.yml`:

```yaml
version: '3'
services:
  cos:
    image: minio/minio:RELEASE.2022-06-07T00-33-41Z
    ports:
      - 9000:9000
    env_file:
      - env/.env.dev
    environment:
      MINIO_ACCESS_KEY: "${GT4SD_S3_ACCESS_KEY}"
      MINIO_SECRET_KEY: "${GT4SD_S3_SECRET_KEY}"
    command: server /export
  createbuckets:
    image: minio/mc
    depends_on:
      - cos
    env_file:
      - env/.env.dev
    # ensure there is a file in the artifacts bucket
    entrypoint: >
      /bin/sh -c "
      /usr/bin/mc config host add myminio http://cos:9000 ${GT4SD_S3_ACCESS_KEY} ${GT4SD_S3_SECRET_KEY};
      /usr/bin/mc mb myminio/${GT4SD_S3_BUCKET};
      /usr/bin/mc mb myminio/${GT4SD_S3_BUCKET_DATA};
      /usr/bin/mc mb myminio/${GT4SD_S3_BUCKET_MODELS};
      echo 'this is an artifact' >> a_file.txt;
      /usr/bin/mc cp a_file.txt myminio/${GT4SD_S3_BUCKET}/a_file.txt;
      exit 0;
      "
```

You can store default environment variables in `.env.dev`.
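A minimal sketch of `env/.env.dev`, assuming you substitute your own credentials for the placeholders:

```sh
# env/.env.dev -- example defaults (placeholder values)
GT4SD_S3_ACCESS_KEY=<your-access-key>
GT4SD_S3_SECRET_KEY=<your-secret-key>
GT4SD_S3_BUCKET=gt4sd-cos-algorithms-artifacts
GT4SD_S3_BUCKET_MODELS=gt4sd-cos-algorithms-models
GT4SD_S3_BUCKET_DATA=gt4sd-cos-algorithms-data
```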


### 3) MinIO server configuration

Add the new server to the minio configuration file (`~/.mc/config.json`), replacing the `${...}` placeholders with the corresponding values (`mc` does not expand environment variables in its config):

```json
{
  "version": "10",
  "aliases": {
    "myminio": {
      "url": "http://${GT4SD_S3_HOST}",
      "accessKey": "${GT4SD_S3_ACCESS_KEY}",
      "secretKey": "${GT4SD_S3_SECRET_KEY}",
      "api": "s3v4",
      "path": "auto"
    },
    ...
  }
}
```

Alternatively, you can register `myminio` from the command line:

```sh
mc alias set myminio http://$GT4SD_S3_HOST $GT4SD_S3_ACCESS_KEY $GT4SD_S3_SECRET_KEY
```

### 4) Run docker

Running `docker compose up` inside `localhost-server` creates a local MinIO server and the bucket structure on `myminio`.
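For example (the `-d` flag runs the services detached and is optional):

```sh
cd ~/localhost-server
docker compose up -d
```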
If everything is working, you should be able to see `a_file.txt` by running:

```sh
mc ls myminio/gt4sd-cos-algorithms-artifacts/
```

-------

## Upload models

After setting the environment variables appropriately and following steps 1-4, you can now upload your model to the server:

```sh
gt4sd-upload --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator
```

You should be able to see the model and the uploaded files by running:

```sh
mc ls myminio/gt4sd-cos-algorithms-artifacts/controlled_sampling/PaccMannGP/PaccMannGPGenerator/fast-example-v0/
```
2 changes: 2 additions & 0 deletions requirements.txt
@@ -22,3 +22,5 @@ tensorflow==2.1.0
keras==2.3.1
sacremoses>=0.0.41
protobuf<=3.20.1


1 change: 1 addition & 0 deletions setup.cfg
@@ -44,6 +44,7 @@ console_scripts=
    gt4sd-trainer = gt4sd.cli.trainer:main
    gt4sd-inference = gt4sd.cli.inference:main
    gt4sd-saving = gt4sd.cli.saving:main
    gt4sd-upload = gt4sd.cli.upload:main
    gt4sd-pl-to-hf = gt4sd.cli.pl_to_hf_converter:main
    gt4sd-hf-to-st = gt4sd.cli.hf_to_st_converter:main

121 changes: 121 additions & 0 deletions src/gt4sd/algorithms/core.py
@@ -54,6 +54,7 @@
    get_algorithm_subdirectories_with_s3,
    get_cached_algorithm_path,
    sync_algorithm_with_s3,
    upload_to_s3,
)
from ..exceptions import InvalidItem, S3SyncError, SamplingError
from ..training_pipelines.core import TrainingPipelineArguments
@@ -442,6 +443,33 @@ def list_versions(cls) -> Set[str]:
        versions = versions.union(get_algorithm_subdirectories_in_cache(prefix))
        return versions

    @classmethod
    def list_remote_versions(cls, prefix) -> Set[str]:
        """Get possible algorithm versions on s3.

        Before uploading an artifact to S3, we need to check that
        a particular version is not already present, to avoid overwriting it by mistake.
        If the final set is empty, we can upload the folder artifact.
        If the final set is not empty, we need to check that the specific version
        of interest is not present.

        Only S3 is searched (not the local cache) for matching versions.

        Args:
            prefix: prefix for the version search. If empty, the application
                prefix (without version) is used.

        Returns:
            viable values as :attr:`algorithm_version` for the environment.
        """
        # default: application prefix without version
        if not prefix:
            prefix = cls.get_application_prefix()
        try:
            versions = get_algorithm_subdirectories_with_s3(prefix)
        except (KeyError, S3SyncError) as error:
            logger.info(
                f"Searching S3 raised {error.__class__.__name__}. This means that no versions are available on S3."
            )
            logger.debug(error)
            versions = set()
        return versions
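    # A hypothetical usage sketch (the application class and version string are
    # assumed for illustration, mirroring the README example):
    #
    #   prefix = PaccMannGPGenerator.get_application_prefix()
    #   if "fast-example-v0" not in PaccMannGPGenerator.list_remote_versions(prefix):
    #       ...  # safe to upload this version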

    @classmethod
    def get_filepath_mappings_for_training_pipeline_arguments(
        cls, training_pipeline_arguments: TrainingPipelineArguments
@@ -535,6 +563,99 @@ def save_version_from_training_pipeline_arguments(

logger.info(f"Artifacts saving completed into {target_path}")

    @classmethod
    def upload_version_from_training_pipeline_arguments_postprocess(
        cls,
        training_pipeline_arguments: TrainingPipelineArguments,
    ):
        """Postprocess after uploading. Not implemented yet.

        Args:
            training_pipeline_arguments: training pipeline arguments.
        """
        pass

    @classmethod
    def upload_version_from_training_pipeline_arguments(
        cls,
        training_pipeline_arguments: TrainingPipelineArguments,
        target_version: str,
        source_version: Optional[str] = None,
    ) -> None:
        """Upload a version using training pipeline arguments.

        Args:
            training_pipeline_arguments: training pipeline arguments.
            target_version: target version used to save the model in s3.
            source_version: source version to use for missing artifacts.
                Defaults to None, a.k.a., use the default version.
        """
        filepaths_mapping: Dict[str, str] = {}

        try:
            filepaths_mapping = (
                cls.get_filepath_mappings_for_training_pipeline_arguments(
                    training_pipeline_arguments=training_pipeline_arguments
                )
            )
        except ValueError:
            logger.info(
                f"{cls.__name__} can not upload a version based on {training_pipeline_arguments}"
            )

        if len(filepaths_mapping) > 0:
            # probably redundant
            if source_version is None:
                source_version = cls.algorithm_version
            source_missing_path = cls.ensure_artifacts_for_version(source_version)

            # prefix for a run
            prefix = cls.get_application_prefix()
            # versions in s3 with that prefix
            versions = cls.list_remote_versions(prefix)

            # check if the target version is already in s3. If yes, don't upload.
            if target_version not in versions:
                logger.info(
                    f"There is no version {target_version} in S3, starting upload..."
                )
            else:
                logger.info(
                    f"Version {target_version} already exists in S3, skipping upload..."
                )
                return

            # mapping between filenames and paths for a version.
            filepaths_mapping = {
                filename: source_filepath
                if os.path.exists(source_filepath)
                else os.path.join(source_missing_path, filename)
                for filename, source_filepath in filepaths_mapping.items()
            }

            logger.info(
                f"Uploading artifacts into {os.path.join(prefix, target_version)}..."
            )
            try:
                for target_filename, source_filepath in filepaths_mapping.items():
                    # algorithm_type/algorithm_name/algorithm_application/version/filename
                    # for the moment we assume that the prefix exists in s3.
                    target_filepath = os.path.join(
                        prefix, target_version, target_filename
                    )
                    upload_to_s3(target_filepath, source_filepath)
                    logger.info(
                        f"Uploaded artifact {source_filepath} into {target_filepath}..."
                    )
            except S3SyncError:
                logger.warning("Problem with upload...")
                return

            logger.info(
                f"Artifacts uploading completed into {os.path.join(prefix, target_version)}"
            )
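    # Hypothetical programmatic usage (a concrete configuration class and its
    # pipeline arguments are assumed; the gt4sd-upload CLI in gt4sd.cli.upload
    # presumably wires command-line flags into the same call):
    #
    #   MyGeneratorConfiguration.upload_version_from_training_pipeline_arguments(
    #       training_pipeline_arguments=pipeline_args,
    #       target_version="fast-example-v0",
    #   )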

    @classmethod
    def ensure_artifacts_for_version(cls, algorithm_version: str) -> str:
        """The artifacts matching the path defined by class attributes and the given version are downloaded.