
Support cloud storage #2620

Closed · wants to merge 38 commits into develop from mk/support_cloud_storage

Conversation

@Marishka17 (Contributor) commented Dec 25, 2020

Motivation and context

Support working with remote cloud storage without copying the data into CVAT.
Related issue: #863

Example: how to generate temporary credentials (S3):

import boto3

# Generate temporary credentials via AWS STS
sts_client = boto3.client('sts', aws_access_key_id="...", aws_secret_access_key="...")
tokens = sts_client.get_session_token()
credentials = tokens['Credentials']
aws_access_key_id = credentials['AccessKeyId']
aws_secret_access_key = credentials['SecretAccessKey']
aws_session_token = credentials['SessionToken']

# Test the credentials' validity
s3 = boto3.client(
    's3',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    aws_session_token=aws_session_token,
)
s3.list_buckets()

How has this been tested?

Manually, with Swagger

REST API

  • GET /api/v1/cloudstorages
  • POST /api/v1/cloudstorages
  • GET /api/v1/cloudstorages/{id}
  • GET /api/v1/cloudstorages/{id}/content
  • PATCH /api/v1/cloudstorages/{id}

Supported cloud providers:

  • Azure Blob container (implemented, not tested)
  • AWS S3 bucket (implemented, tested)
  • Google Drive

Iterations:

  1. Support images
  2. Support video
  3. Support archive

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@coveralls commented Dec 25, 2020

Coverage Status

Coverage decreased (-0.8%) to 73.245% when pulling 91c0e42 on mk/support_cloud_storage into be9e00f on develop.

@iraadit commented Feb 22, 2021

I'm really interested in this feature. Do you have an idea of when it will be integrated?

By the way, I'm using MinIO, which is an S3-compatible storage; I imagine it can be used the same way as AWS S3, then?

@Marishka17 (Contributor, Author) replied:

> I'm really interested by this feature, do you have an idea of when it will be integrated?

I will continue to develop this functionality in the near future.

@Marishka17 changed the title from "[WIP] Support cloud storage" to "Support cloud storage" on Apr 22, 2021
Review threads on files (resolved):
  • cvat/apps/engine/cache.py
  • cvat/apps/engine/cloud_provider.py (several threads)
  • cvat/apps/engine/serializers.py
  • cvat/apps/engine/views.py
Comment on lines 1135 to 1140
tmp_manifest = NamedTemporaryFile(mode='w+b', suffix='cvat', prefix='manifest')
storage.download_file(manifest_path, tmp_manifest.name)
manifest = ImageManifestManager(tmp_manifest.name)
manifest.init_index()
manifest_files = manifest.data
tmp_manifest.close()
A contributor commented:

please use context manager here
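A sketch of the context-manager pattern the reviewer is asking for. `storage.download_file` and `ImageManifestManager` are the PR's own objects, abstracted here as plain callables:

```python
from tempfile import NamedTemporaryFile

def parse_downloaded(download, parse):
    # The context manager removes the temporary file even if
    # download() or parse() raises, unlike a manual close().
    with NamedTemporaryFile(mode='w+b', suffix='cvat', prefix='manifest') as tmp:
        download(tmp.name)
        return parse(tmp.name)

# In the PR's snippet this would roughly become:
# manifest_files = parse_downloaded(
#     lambda path: storage.download_file(manifest_path, path),
#     lambda path: ...  # build ImageManifestManager, init_index(), return .data
# )
```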

Review thread on cvat/requirements/base.txt (resolved).

def is_exist(self):
    try:
        self._container_client.create_container()
@nmanovic (Contributor) commented May 18, 2021:

@Marishka17 , is it the only way to check that? Looks unusual.

@Marishka17 (Contributor, Author) replied:

ContainerClient in version 12.6.0 did not have an exists method, and I had seen this practice in other implementations. But it appeared in version 12.8.1, and I have changed this.

@nmanovic (Contributor) left a review comment:

@Marishka17 , in general it looks very good. We need to check that we don't copy all the data from a cloud drive to the upload directory. That is the only major question I have; other comments are minor.

@nmanovic (Contributor) commented:

@Marishka17 , could you please add documentation on how to add cloud storages? There are also some problems with error messages; they provide zero information from my perspective.

{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "nmanovic-data",
  "display_name": "mydata",
  "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
  "session_token": "ZZZZZZZZZZ",
  "account_name": "test",
  "key": "XXXXXXXXXXXXXXXXXXX",
  "secret_key": "YYYYYYYYYYYYYYYYYYYY",
  "specific_attributes": "",
  "description": "test"
}

The request works, but

{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "nmanovic-data",
  "display_name": "mydata1",
  "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
  "session_token": "ZZZZZZZZZZ",
  "account_name": "test",
  "key": "XXXXXXXXXXXXXXXXXXX",
  "secret_key": "YYYYYYYYYYYYYYYYYYYY",
  "specific_attributes": "string",
  "description": "test"
}

produces `list index out of range`

{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "nmanovic-data",
  "display_name": "mydata1",
  "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
  "session_token": "ZZZZZZZZZZ",
  "account_name": "test",
  "key": "XXXXXXXXXXXXXXXXXXX",
  "secret_key": "YYYYYYYYYYYYYYYYYYYY",
  "specific_attributes": "",
  "description": ""
}

The request produces:

{
  "description": [
    {
      "message": "This field may not be blank.",
      "code": "blank"
    }
  ]
}

@nmanovic (Contributor) commented:

If I try to run http://localhost:8080/api/v1/cloudstorages/1/content without any extra arguments, it gives me `list index out of range`. That is probably not enough for the user to understand that they didn't provide the manifest file.

Also, I believe that a manifest file should be specified when we create a cloud storage. What do you think? I don't think it is the right idea to specify it when you want to list the content of the attached cloud drive.

@nmanovic (Contributor) commented:

@Marishka17 , the get_specific_attributes function works incorrectly: ''.split('&') returns [''], which has length 1.
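The failure mode is easy to reproduce in isolation. A guarded sketch of the parsing (the function name stands in for the PR's helper; its real signature may differ):

```python
def get_specific_attributes(raw):
    # ''.split('&') returns [''] (length 1), so empty fragments must be
    # filtered out before splitting each pair on '='.
    return {
        key: value
        for key, value in (
            pair.split('=', 1) for pair in raw.split('&') if pair
        )
    }
```

With the `if pair` guard, an empty specific_attributes string yields an empty dict instead of an IndexError.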

@Marishka17 (Contributor, Author) replied:

> @Marishka17 , could you please add documentation how to add cloud storages? There are some problems with error messages. They provide zero information from my perspective.

In order to create an S3 storage, you need to specify:

  1. for access with credentials
{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "nmanovic-data",
  "display_name": "mydata1",
  "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
  "session_token": "ZZZZZZZZZZ",
  "key": "XXXXXXXXXXXXXXXXXXX",
  "secret_key": "YYYYYYYYYYYYYYYYYYYY",
  "specific_attributes": "",
  "description": ""
}
  2. for access without credentials
{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "nmanovic-data",
  "display_name": "mydata1",
  "credentials_type": "ANONYMOUS_ACCESS",
  "specific_attributes": "",
  "description": ""
}
> {
>   "provider_type": "AWS_S3_BUCKET",
>   "resource": "nmanovic-data",
>   "display_name": "mydata",
>   "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
>   "session_token": "ZZZZZZZZZZ",
>   "account_name": "test",
>   "key": "XXXXXXXXXXXXXXXXXXX",
>   "secret_key": "YYYYYYYYYYYYYYYYYYYY",
>   "specific_attributes": "",
>   "description": "test"
> }
>
> The request works, but
>
> {
>   "provider_type": "AWS_S3_BUCKET",
>   "resource": "nmanovic-data",
>   "display_name": "mydata1",
>   "credentials_type": "TEMP_KEY_SECRET_KEY_TOKEN_SET",
>   "session_token": "ZZZZZZZZZZ",
>   "account_name": "test",
>   "key": "XXXXXXXXXXXXXXXXXXX",
>   "secret_key": "YYYYYYYYYYYYYYYYYYYY",
>   "specific_attributes": "string",
>   "description": "test"
> }
>
> produces `list index out of range`

specific_attributes should contain a structure like key1=value1&key2=value2.
In a real case, e.g. "specific_attributes": "range=eu-west-2".
OK, I will add a validator.
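Such a validator could be sketched as a plain regex check; the function name, pattern, and error wording below are illustrative, not the PR's actual code:

```python
import re

# Accepts key=value pairs joined by '&'; keys and values limited to
# word characters, dots, and hyphens (an assumption for this sketch).
SPECIFIC_ATTRIBUTES_RE = re.compile(r'^([\w-]+=[\w.-]+)(&[\w-]+=[\w.-]+)*$')

def validate_specific_attributes(value):
    # An empty string is allowed; anything else must match the pattern.
    if value and not SPECIFIC_ATTRIBUTES_RE.match(value):
        raise ValueError(
            'specific_attributes must look like key1=value1&key2=value2'
        )
    return value
```

In DRF this check would typically live in the serializer's validate_specific_attributes method so that a bad value returns a 400 with a readable message instead of an IndexError.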

@Marishka17 (Contributor, Author) replied:

> If I try to run http://localhost:8080/api/v1/cloudstorages/1/content without any extra arguments, it gives me `list index out of range`. Probably it isn't enough to understand that the user didn't provide the manifest file.

I checked this case; my request succeeds and the content is returned.
Screenshot from 2021-05-20 12-01-24

> Also I believe that a manifest file should be specified when we create a cloud storage. What do you think? I don't think that it is the right idea to specify it when you want to list content of the attached cloud drive.

I believe that it is wrong to specify the manifest path when creating a storage, because in the provided implementation it would then not be possible to create many identical storages. It is assumed that the customer has a valid S3 bucket and connects it. A customer can have several datasets in the bucket, and they can simply specify the desired manifest instead of creating the same storage with different manifest locations.

When creating a task, on expanding a specific existing cloud storage, a GET cloudstorage/{id}/content request will be sent and we will get the storage content mapped to the content of the manifest (by default, the manifest is assumed to be at the root, but it will be possible to specify the location of the manifest and update the tree of available files).

@Marishka17 (Contributor, Author) commented:

> {
>   "description": [
>     {
>       "message": "This field may not be blank.",
>       "code": "blank"
>     }
>   ]
> }

@nmanovic , is that better?

{
  "provider_type": "",
  "resource": "",
  "display_name": "existing_display_name",
  "credentials_type": "",
  "specific_attributes": "",
  "description": ""
}

Screenshot from 2021-05-21 10-42-16

{
  "provider_type": "AWS_S3_BUCKET",
  "resource": "XXX",
  "display_name": "",
  "credentials_type": "ANONYMOUS_ACCESS",
  "specific_attributes": "",
  "description": ""
}

Screenshot from 2021-05-21 10-41-12

@nmanovic (Contributor) replied:

> > If I try to run http://localhost:8080/api/v1/cloudstorages/1/content without any extra arguments, it gives me `list index out of range`. Probably it isn't enough to understand that the user didn't provide the manifest file.
>
> I checked this case and my request is fulfilled, the content is returned.
> Screenshot from 2021-05-20 12-01-24
>
> > Also I believe that a manifest file should be specified when we create a cloud storage. What do you think? I don't think that it is the right idea to specify it when you want to list content of the attached cloud drive.
>
> I believe that it is wrong to specify the manifest path when creating a storage, because in the provided implementation it would then not be possible to create many identical storages. It is assumed that the customer has a valid S3 bucket and connects it. A customer can have several datasets in the bucket, and they can simply specify the desired manifest instead of creating the same storage with different manifest locations.
>
> When creating a task, on expanding a specific existing cloud storage, a GET cloudstorage/{id}/content request will be sent and we will get the storage content mapped to the content of the manifest (by default, the manifest is assumed to be at the root, but it will be possible to specify the location of the manifest and update the tree of available files).

@Marishka17 , let's discuss how it is going to look for users. There are multiple ways to implement the feature. If you can provide good UX in the UI, I'm totally fine with any approach. In my proposed approach we can clone a connection and change something (e.g. the manifest file) during the cloning process. Thus you can name the connection Cityscapes and it will correspond to one Cityscapes dataset.

@nmanovic (Contributor) replied:

> {
>   "description": [
>     {
>       "message": "This field may not be blank.",
>       "code": "blank"
>     }
>   ]
> }
>
> @nmanovic , is that better?
>
> {
>   "provider_type": "",
>   "resource": "",
>   "display_name": "existing_display_name",
>   "credentials_type": "",
>   "specific_attributes": "",
>   "description": ""
> }
>
> Screenshot from 2021-05-21 10-42-16
>
> {
>   "provider_type": "AWS_S3_BUCKET",
>   "resource": "XXX",
>   "display_name": "",
>   "credentials_type": "ANONYMOUS_ACCESS",
>   "specific_attributes": "",
>   "description": ""
> }
>
> Screenshot from 2021-05-21 10-41-12

It is much better.

@azhavoro (Contributor) commented:

> specific_attributes should contain structure like key1=value1&key2=value2.
> In real case e.g "specific_attributes": "range=eu-west-2"

@Marishka17 could you please add a note about that into the swagger docs?

# The typical token size is less than 4096 bytes, but that can vary.
provider_type = models.CharField(max_length=20, choices=CloudProviderChoice.choices())
resource = models.CharField(max_length=63)
display_name = models.CharField(max_length=63, unique=True)
A contributor commented:

@Marishka17 @nmanovic why make this field unique? I think it should be unique within a user's space, but not globally unique. For example, on GitHub I can create a repo named cvat in my user space, but on cvat.org I won't be able to create a cloud storage named aws if someone has already created it. Can cloud resources be shared between users?

@Marishka17 (Contributor, Author) replied:

> Can cloud resources be shared between users?

The created storage will be available to the user who created it and to the administrator.

> specific_attributes should contain structure like key1=value1&key2=value2.
> In real case e.g "specific_attributes": "range=eu-west-2"

> @Marishka17 could you please add a note about that into swagger docs?

Possible solutions:

  1. Use
swagger_schema_fields = {
    'type': openapi.TYPE_OBJECT,
    'title': 'Cloud Storage',
    'properties': {
        'specific_attributes': openapi.Schema(
            title='Specific attributes',
            type=openapi.TYPE_STRING,
            description='structure like key1=value1&key2=value2\n'
                'supported: range=aws_range',
        ),
    },
    # "required": [...],
}

but then, in the CloudStorageSerializer Meta class, all the descriptions that should be generated automatically would need to be redefined.

  2. Move specific_attributes into a separate class:
class SpecificAttributes(serializers.Field):
    def to_representation(self, value):
        pass

    def to_internal_value(self, value):
        pass

    class Meta:
        swagger_schema_fields = {
            'type': openapi.TYPE_STRING,
            'title': 'Specific attributes',
            'description': 'structure like key1=value1&key2=value2\n'
                           'supported: range=aws_range',
            'maxLength': '50',
        }

IMHO, it should not be moved into a separate class just for the sake of the documentation.

  3. Use a FieldInspector; I chose this solution.

@nmanovic mentioned this pull request Jun 15, 2021
@nmanovic (Contributor) commented:

New PR without the multiple dummy changes from the prettifier: #3326

@nmanovic nmanovic closed this Jun 15, 2021
@nmanovic nmanovic deleted the mk/support_cloud_storage branch June 16, 2021 12:38