
Support cloud storage #2620

Closed · wants to merge 38 commits into develop from mk/support_cloud_storage

Conversation

@Marishka17 (Contributor) commented Dec 25, 2020

Motivation and context

Support working with remote cloud storage without copying the data into CVAT.
Related issue: #863

Example: how to generate temporary credentials (S3):

import boto3

# Generate temporary credentials via AWS STS
sts_client = boto3.client('sts', aws_access_key_id="...", aws_secret_access_key="...")
tokens = sts_client.get_session_token()
credentials = tokens['Credentials']
aws_access_key_id = credentials['AccessKeyId']
aws_secret_access_key = credentials['SecretAccessKey']
aws_session_token = credentials['SessionToken']

# Test the credentials' validity
s3 = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    aws_session_token=aws_session_token,
)
s3.list_buckets()

How has this been tested?

Manually, with Swagger

REST API

  • GET /api/v1/cloudstorages
  • POST /api/v1/cloudstorages
  • GET /api/v1/cloudstorages/{id}
  • GET /api/v1/cloudstorages/{id}/content
  • PATCH /api/v1/cloudstorages/{id}

Supported cloud providers:

  • Azure Blob container (implemented, not tested)
  • AWS S3 bucket (implemented, tested)
  • Google Drive

Iterations:

  1. Support images
  2. Support video
  3. Support archive

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@coveralls commented Dec 25, 2020

Coverage Status

Coverage decreased (-0.8%) to 73.245% when pulling 91c0e42 on mk/support_cloud_storage into be9e00f on develop.

@iraadit commented Feb 22, 2021

I'm really interested in this feature. Do you have an idea of when it will be integrated?

By the way, I'm using MinIO, which is an S3-compatible storage; I imagine it can be used the same way as AWS S3, then?

@Marishka17 (Contributor, Author) replied:

> I'm really interested by this feature, do you have an idea of when it will be integrated?

I will continue to develop this functionality in the near future.

@Marishka17 changed the title from "[WIP] Support cloud storage" to "Support cloud storage" on Apr 22, 2021
Review threads on files (resolved):
  • cvat/apps/engine/cache.py
  • cvat/apps/engine/cloud_provider.py (several threads)
  • cvat/apps/engine/serializers.py
  • cvat/apps/engine/views.py
Comment on lines 1135 to 1140
tmp_manifest = NamedTemporaryFile(mode='w+b', suffix='cvat', prefix='manifest')
storage.download_file(manifest_path, tmp_manifest.name)
manifest = ImageManifestManager(tmp_manifest.name)
manifest.init_index()
manifest_files = manifest.data
tmp_manifest.close()
A contributor commented:

please use context manager here
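A sketch of the context-manager pattern the reviewer is asking for. `storage.download_file` and `ImageManifestManager` are the PR's own objects, abstracted here as plain callables:

```python
from tempfile import NamedTemporaryFile

def parse_downloaded(download, parse):
    # The context manager removes the temporary file even if
    # download() or parse() raises, unlike a manual close().
    with NamedTemporaryFile(mode='w+b', suffix='cvat', prefix='manifest') as tmp:
        download(tmp.name)
        return parse(tmp.name)

# In the PR's snippet this would roughly become:
# manifest_files = parse_downloaded(
#     lambda path: storage.download_file(manifest_path, path),
#     lambda path: ...  # build ImageManifestManager, init_index(), return .data
# )
```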

Review thread on cvat/requirements/base.txt (resolved).

def is_exist(self):
    try:
        self._container_client.create_container()
@nmanovic (Contributor) commented May 18, 2021:

@Marishka17 , is it the only way to check that? Looks unusual.

@Marishka17 (Contributor, Author) replied:

ContainerClient in version 12.6.0 did not have an exists method, and I had seen this practice in other implementations. But it appeared in version 12.8.1, and I have changed this.

@nmanovic (Contributor) left a review comment:

@Marishka17 , in general it looks very good. We need to check that we don't copy all the data from a cloud drive to the upload directory. That is the only major question I have; other comments are minor.

@nmanovic (Contributor) commented:

@Marishka17 , could you please add documentation on how to add cloud storages? There are also some problems with error messages; they provide zero information from my perspective.

{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "nmanovic-data",
  "display_name": "mydata",
  "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
  "session_token": "ZZZZZZZZZZ",
  "account_name": "test",
  "key": "XXXXXXXXXXXXXXXXXXX",
  "secret_key": "YYYYYYYYYYYYYYYYYYYY",
  "specific_attributes": "",
  "description": "test"
}

The request works, but

{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "nmanovic-data",
  "display_name": "mydata1",
  "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
  "session_token": "ZZZZZZZZZZ",
  "account_name": "test",
  "key": "XXXXXXXXXXXXXXXXXXX",
  "secret_key": "YYYYYYYYYYYYYYYYYYYY",
  "specific_attributes": "string",
  "description": "test"
}

produces `list index out of range`

{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "nmanovic-data",
  "display_name": "mydata1",
  "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
  "session_token": "ZZZZZZZZZZ",
  "account_name": "test",
  "key": "XXXXXXXXXXXXXXXXXXX",
  "secret_key": "YYYYYYYYYYYYYYYYYYYY",
  "specific_attributes": "",
  "description": ""
}

The request produces:

{
  "description": [
    {
      "message": "This field may not be blank.",
      "code": "blank"
    }
  ]
}

@nmanovic (Contributor) commented:

If I try to run http://localhost:8080/api/v1/cloudstorages/1/content without any extra arguments, it gives me `list index out of range`. That is probably not enough for the user to understand that they didn't provide the manifest file.

Also, I believe that a manifest file should be specified when we create a cloud storage. What do you think? I don't think it is the right idea to specify it when you want to list the content of the attached cloud drive.

@nmanovic (Contributor) commented:

@Marishka17 , the get_specific_attributes function works incorrectly: ''.split('&') returns [''], which has length 1.
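The failure mode is easy to reproduce in isolation. A guarded sketch of the parsing (the function name stands in for the PR's helper; its real signature may differ):

```python
def get_specific_attributes(raw):
    # ''.split('&') returns [''] (length 1), so empty fragments must be
    # filtered out before splitting each pair on '='.
    return {
        key: value
        for key, value in (
            pair.split('=', 1) for pair in raw.split('&') if pair
        )
    }
```

With the `if pair` guard, an empty specific_attributes string yields an empty dict instead of an IndexError.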

@Marishka17 (Contributor, Author) replied:

> @Marishka17 , could you please add documentation how to add cloud storages? There are some problems with error messages. They provide zero information from my perspective.

In order to create an S3 storage, you need to specify:

  1. for access with credentials
{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "nmanovic-data",
  "display_name": "mydata1",
  "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
  "session_token": "ZZZZZZZZZZ",
  "key": "XXXXXXXXXXXXXXXXXXX",
  "secret_key": "YYYYYYYYYYYYYYYYYYYY",
  "specific_attributes": "",
  "description": ""
}
  2. for access without credentials
{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "nmanovic-data",
  "display_name": "mydata1",
  "credentials_type": "ANONYMOUS_ACCESS",
  "specific_attributes": "",
  "description": ""
}
> {
>   "provider_type": "AWS_S3_BUCKET",
>   "resource": "nmanovic-data",
>   "display_name": "mydata",
>   "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
>   "session_token": "ZZZZZZZZZZ",
>   "account_name": "test",
>   "key": "XXXXXXXXXXXXXXXXXXX",
>   "secret_key": "YYYYYYYYYYYYYYYYYYYY",
>   "specific_attributes": "",
>   "description": "test"
> }
>
> The request works, but
>
> {
>   "provider_type": "AWS_S3_BUCKET",
>   "resource": "nmanovic-data",
>   "display_name": "mydata1",
>   "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
>   "session_token": "ZZZZZZZZZZ",
>   "account_name": "test",
>   "key": "XXXXXXXXXXXXXXXXXXX",
>   "secret_key": "YYYYYYYYYYYYYYYYYYYY",
>   "specific_attributes": "string",
>   "description": "test"
> }
>
> produces `list index out of range`

specific_attributes should contain a structure like key1=value1&key2=value2.
In a real case, e.g. "specific_attributes": "range=eu-west-2".
OK, I will add a validator.
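Such a validator could be sketched as a plain regex check; the function name, pattern, and error wording below are illustrative, not the PR's actual code:

```python
import re

# Accepts key=value pairs joined by '&'; keys and values limited to
# word characters, dots, and hyphens (an assumption for this sketch).
SPECIFIC_ATTRIBUTES_RE = re.compile(r'^([\w-]+=[\w.-]+)(&[\w-]+=[\w.-]+)*$')

def validate_specific_attributes(value):
    # An empty string is allowed; anything else must match the pattern.
    if value and not SPECIFIC_ATTRIBUTES_RE.match(value):
        raise ValueError(
            'specific_attributes must look like key1=value1&key2=value2'
        )
    return value
```

In DRF this check would typically live in the serializer's validate_specific_attributes method so that a bad value returns a 400 with a readable message instead of an IndexError.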

@Marishka17 (Contributor, Author) replied:

> If I try to run http://localhost:8080/api/v1/cloudstorages/1/content without any extra arguments, it gives me `list index out of range`. Probably it isn't enough to understand that the user didn't provide the manifest file.

I checked this case; my request succeeds and the content is returned.
Screenshot from 2021-05-20 12-01-24

> Also I believe that a manifest file should be specified when we create a cloud storage. What do you think? I don't think that it is the right idea to specify it when you want to list content of the attached cloud drive.

I believe that it is wrong to specify the manifest path when creating a storage, because in the provided implementation it would then not be possible to create many identical storages. It is assumed that the customer has a valid S3 bucket and connects it. A customer can have several datasets in the bucket, and they can simply specify the desired manifest instead of creating the same storage with different manifest locations.

When creating a task, on expanding a specific existing cloud storage, a GET cloudstorage/{id}/content request will be sent and we will get the storage content mapped to the content of the manifest (by default, the manifest is assumed to be at the root, but it will be possible to specify the location of the manifest and update the tree of available files).

@Marishka17 (Contributor, Author) commented:

> {
>   "description": [
>     {
>       "message": "This field may not be blank.",
>       "code": "blank"
>     }
>   ]
> }

@nmanovic , is that better?

{
  "provider_type": "",
  "resource": "",
  "display_name": "existing_display_name",
  "credentials_type": "",
  "specific_attributes": "",
  "description": ""
}

Screenshot from 2021-05-21 10-42-16

{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "XXX",
  "display_name": "",
  "credentials_type": "ANONYMOUS_ACCESS",
  "specific_attributes": "",
  "description": ""
}

Screenshot from 2021-05-21 10-41-12

@nmanovic (Contributor) replied:

> > If I try to run http://localhost:8080/api/v1/cloudstorages/1/content without any extra arguments, it gives me `list index out of range`. Probably it isn't enough to understand that the user didn't provide the manifest file.
>
> I checked this case and my request is fulfilled, the content is returned.
> Screenshot from 2021-05-20 12-01-24
>
> > Also I believe that a manifest file should be specified when we create a cloud storage. What do you think? I don't think that it is the right idea to specify it when you want to list content of the attached cloud drive.
>
> I believe that it is wrong to specify the manifest path when creating a storage, because in the provided implementation it would then not be possible to create many identical storages. It is assumed that the customer has a valid S3 bucket and connects it. A customer can have several datasets in the bucket, and they can simply specify the desired manifest instead of creating the same storage with different manifest locations.
>
> When creating a task, on expanding a specific existing cloud storage, a GET cloudstorage/{id}/content request will be sent and we will get the storage content mapped to the content of the manifest (by default, the manifest is assumed to be at the root, but it will be possible to specify the location of the manifest and update the tree of available files).

@Marishka17 , let's discuss how it is going to look for users. There are multiple ways to implement the feature. If you can provide good UX in the UI, I'm totally fine with any approach. In my proposed approach we can clone a connection and change something (e.g. the manifest file) during the cloning process. Thus you can name the connection Cityscapes and it will correspond to one Cityscapes dataset.

@nmanovic (Contributor) replied:

> {
>   "description": [
>     {
>       "message": "This field may not be blank.",
>       "code": "blank"
>     }
>   ]
> }
>
> @nmanovic , is that better?
>
> {
>   "provider_type": "",
>   "resource": "",
>   "display_name": "existing_display_name",
>   "credentials_type": "",
>   "specific_attributes": "",
>   "description": ""
> }
>
> Screenshot from 2021-05-21 10-42-16
>
> {
>   "provider_type": "AWS_S3_BUCKET",
>   "resource": "XXX",
>   "display_name": "",
>   "credentials_type": "ANONYMOUS_ACCESS",
>   "specific_attributes": "",
>   "description": ""
> }
>
> Screenshot from 2021-05-21 10-41-12

It is much better.

@azhavoro (Contributor) commented:

> specific_attributes should contain structure like key1=value1&key2=value2.
> In real case e.g "specific_attributes": "range=eu-west-2"

@Marishka17 could you please add a note about that into the swagger docs?

# The typical token size is less than 4096 bytes, but that can vary.
provider_type = models.CharField(max_length=20, choices=CloudProviderChoice.choices())
resource = models.CharField(max_length=63)
display_name = models.CharField(max_length=63, unique=True)
A contributor commented:

@Marishka17 @nmanovic why make this field unique? I think it should be unique within a user's space, but not globally unique. For example, on GitHub I can create a repo named cvat in my user space, but on cvat.org I won't be able to create a cloud storage named aws if someone has already created it. Can cloud resources be shared between users?

@Marishka17 (Contributor, Author) replied:

> Can cloud resources be shared between users?

The created storage will be available to the user who created it and to the administrator.

> specific_attributes should contain structure like key1=value1&key2=value2.
> In real case e.g "specific_attributes": "range=eu-west-2"

> @Marishka17 could you please add a note about that into swagger docs?

Possible solutions:

  1. Use
swagger_schema_fields = {
    'type': openapi.TYPE_OBJECT,
    'title': 'Cloud Storage',
    'properties': {
        'specific_attributes': openapi.Schema(
            title='Specific attributes',
            type=openapi.TYPE_STRING,
            description='structure like key1=value1&key2=value2\n'
                'supported: range=aws_range',
        ),
    },
    # "required": [...],
}

but then, in the CloudStorageSerializer Meta class, all the descriptions that should be generated automatically would need to be redefined.

  2. Move specific_attributes into a separate class:
class SpecificAttributes(serializers.Field):
    def to_representation(self, value):
        pass

    def to_internal_value(self, value):
        pass

    class Meta:
        swagger_schema_fields = {
            'type': openapi.TYPE_STRING,
            'title': 'Specific attributes',
            'description': 'structure like key1=value1&key2=value2\n'
                           'supported: range=aws_range',
            'maxLength': '50',
        }

IMHO, it should not be moved into a separate class just for the sake of the documentation.

  3. Use a FieldInspector; I chose this solution.

@nmanovic mentioned this pull request Jun 15, 2021
@nmanovic (Contributor) commented:

New PR without the multiple dummy changes from the prettifier: #3326

@nmanovic nmanovic closed this Jun 15, 2021
@nmanovic nmanovic deleted the mk/support_cloud_storage branch June 16, 2021 12:38