[Feature] AWS S3 obtainer support #1888

EnableAsync · 2023-05-02T06:42:26Z

Motivation

Same as #1885
While adding the HierText dataset to mmocr, I found that it can only be downloaded through the AWS CLI¹², so I submitted this PR.
I'm new to mmocr and grateful for any advice or insights.

Modification

Added an AWS S3 obtainer implemented through the AWS SDK.

Use cases (Optional)

data_root = 'data/hiertext'
cache_path = 'data/cache'
test_preparer = dict(
    obtainer=dict(
        type='AWSS3Obtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='s3://open-images-dataset/ocr/test.tgz',
                save_name='hiertext_textdet_test_img.tgz',
                md5='656ab5da42986408a57eb678f9540e0a',
                content=['image'],
                mapping=[['hiertext_textdet_test_img', 'textdet_imgs/test']]),
        ]), )
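
For orientation, here is a minimal sketch of how an `s3://` URL like the one above might be split into a bucket and a key and then fetched anonymously with boto3. The helper names `split_s3_url` and `download_public_object` are hypothetical and not taken from this PR. The unsigned client configuration is an assumption based on the `UNSIGNED` import in the diff, and boto3 is imported lazily so the parsing helper works without it.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(url: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'Not a valid s3:// URL: {url}')
    return parsed.netloc, parsed.path.lstrip('/')


def download_public_object(url: str, dst_path: str) -> None:
    """Download a publicly readable object without AWS credentials."""
    # Imported here so boto3 is only required when a download happens.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, key = split_s3_url(url)
    # An unsigned client sends anonymous requests (assumes a public bucket).
    client = boto3.client('s3', config=Config(signature_version=UNSIGNED))
    client.download_file(bucket, key, dst_path)
```

With the config above, `split_s3_url` would yield the bucket `open-images-dataset` and the key `ocr/test.tgz`.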

Checklist

Before PR:

I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
Pre-commit or linting tools indicated in CONTRIBUTING.md are used to fix the potential lint issues.
Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
New functionalities are covered by complete unit tests. If not, please add more unit test to ensure the correctness.
The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects.
CLA has been signed and all committers have signed the CLA in this PR.

feat: add aws s3 obtainer fix: format fix: format

CLAassistant · 2023-05-02T06:42:31Z

All committers have signed the CLA.

OpenMMLab-Assistant-004 · 2023-05-08T01:25:04Z

Hi @EnableAsync,

We'd like to express our appreciation for your valuable contributions to the mmocr. Your efforts have significantly aided in enhancing the project's quality.
It is our pleasure to invite you to join our community through the Discord Special Interest Group (SIG) channel. This is a great place to share your experiences, discuss ideas, and connect with other like-minded people. To become a part of the SIG channel, send a message to the moderator, OpenMMLab, briefly introduce yourself, and mention your open-source contributions in the #introductions channel. Our team will gladly facilitate your entry. We eagerly await your presence. Please follow this link to join us: https://discord.gg/UjgXkPWNqA.

If you're on WeChat, we'd also love for you to join our community there. Just add our assistant using the WeChat ID: openmmlabwx. When sending the friend request, remember to include the remark "mmsig + Github ID".

Thanks again for your awesome contribution, and we're excited to have you as part of our community!

gaotongxiao

Thanks for your PR! It looks overall great and we can merge it as soon as you have some minor issues fixed.

gaotongxiao · 2023-06-02T09:12:46Z

mmocr/datasets/preparers/obtainers/aws_s3_obtainer.py

+import boto3
+from botocore import UNSIGNED
+from botocore.config import Config


Since not everyone needs the S3 client, we can make this requirement optional by moving the imports into AWSS3Obtainer.__init__, where these packages are actually needed. Also, please raise a proper error message when a package is missing, like:

mmocr/mmocr/datasets/recog_lmdb_dataset.py

Lines 158 to 162 in 37c5d37

try:
    import lmdb
except ImportError:
    raise ImportError(
        'Please install lmdb to enable RecogLMDBDataset.')
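
The same pattern could be applied to boto3 roughly as follows. This is only a sketch of the suggestion: `require_package` and `LazyS3Client` are hypothetical names used for illustration, not code from this PR or from mmocr.

```python
import importlib
from types import ModuleType


def require_package(name: str, feature: str) -> ModuleType:
    """Import `name` lazily, or fail with an actionable message."""
    try:
        return importlib.import_module(name)
    except ImportError as e:
        raise ImportError(
            f'Please install {name} to enable {feature}.') from e


class LazyS3Client:
    """Hypothetical class; not the AWSS3Obtainer from this PR."""

    def __init__(self) -> None:
        # The dependency is resolved at construction time, so importing
        # the enclosing module never fails for users without boto3.
        self.boto3 = require_package('boto3', 'LazyS3Client')
```

Users who never construct the class never need boto3 installed, which is what allows the package to move out of the runtime requirements.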

gaotongxiao · 2023-06-02T09:17:14Z

mmocr/datasets/preparers/obtainers/aws_s3_obtainer.py

+    def extract(self,
+                src_path: str,
+                dst_path: str,
+                delete: bool = False) -> None:
+        """Extract zip/tar.gz files.
+
+        Args:
+            src_path (str): Path to the zip file.
+            dst_path (str): Path to the destination folder.
+            delete (bool, optional): Whether to delete the zip file. Defaults
+                to False.
+        """
+        if not is_archive(src_path):
+            # Copy the file to the destination folder if it is not a zip
+            if osp.isfile(src_path):
+                shutil.copy(src_path, dst_path)
+            else:
+                shutil.copytree(src_path, dst_path)
+            return
+
+        zip_name = osp.basename(src_path).split('.')[0]
+        if dst_path is None:
+            dst_path = osp.join(osp.dirname(src_path), zip_name)
+        else:
+            dst_path = osp.join(dst_path, zip_name)
+
+        extracted = False
+        if osp.exists(dst_path):
+            name = set(os.listdir(dst_path))
+            if '.finish' in name:
+                extracted = True
+            elif '.finish' not in name and len(name) > 0:
+                while True:
+                    c = input(f'{dst_path} already exists when extracting '
+                              f'{zip_name}, unzip again? (y/N) ') or 'N'
+                    if c.lower() in ['y', 'n']:
+                        extracted = c.lower() == 'n'
+                        break
+        if extracted:
+            open(osp.join(dst_path, '.finish'), 'w').close()
+            print(f'{zip_name} has been extracted. Skip')
+            return
+        mkdir_or_exist(dst_path)
+        print(f'Extracting: {osp.basename(src_path)}')
+        if src_path.endswith('.zip'):
+            try:
+                import zipfile
+            except ImportError:
+                raise ImportError(
+                    'Please install zipfile by running "pip install zipfile".')
+            with zipfile.ZipFile(src_path, 'r') as zip_ref:
+                zip_ref.extractall(dst_path)
+        elif src_path.endswith('.tar.gz') or src_path.endswith(
+                '.tar') or src_path.endswith('.tgz'):
+            if src_path.endswith('.tar.gz'):
+                mode = 'r:gz'
+            elif src_path.endswith('.tar'):
+                mode = 'r:'
+            elif src_path.endswith('tgz'):
+                mode = 'r:gz'
+            try:
+                import tarfile
+            except ImportError:
+                raise ImportError(
+                    'Please install tarfile by running "pip install tarfile".')
+            with tarfile.open(src_path, mode) as tar_ref:
+                tar_ref.extractall(dst_path)
+
+        open(osp.join(dst_path, '.finish'), 'w').close()
+        if delete:
+            os.remove(src_path)
+
+    def move(self, mapping: List[Tuple[str, str]]) -> None:
+        """Rename and move dataset files one by one.
+
+        Args:
+            mapping (List[Tuple[str, str]]): A list of tuples, each
+            tuple contains the source file name and the destination file name.
+        """
+        for src, dst in mapping:
+            src = osp.join(self.data_root, src)
+            dst = osp.join(self.data_root, dst)
+
+            if '*' in src:
+                mkdir_or_exist(dst)
+                for f in glob.glob(src):
+                    if not osp.exists(
+                            osp.join(dst, osp.relpath(f, self.data_root))):
+                        shutil.move(f, dst)
+
+            elif osp.exists(src) and not osp.exists(dst):
+                mkdir_or_exist(osp.dirname(dst))
+                shutil.move(src, dst)
+
+    def clean(self) -> None:
+        """Remove empty dirs."""
+        for root, dirs, files in os.walk(self.data_root, topdown=False):
+            if not files and not dirs:
+                os.rmdir(root)


The duplicated code can be avoided by having this class inherit from NaiveDataObtainer.
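
A rough sketch of the idea, using a simplified stand-in base class rather than the real NaiveDataObtainer: the subclass overrides only the transport step and inherits everything else. `BaseObtainer` and `S3StyleObtainer` are hypothetical names, and the `download` body is a placeholder rather than an actual S3 transfer.

```python
import os


class BaseObtainer:
    """Simplified stand-in for a base obtainer; not the real mmocr class."""

    def __init__(self, data_root: str) -> None:
        self.data_root = data_root

    def download(self, url: str, dst_path: str) -> None:
        raise NotImplementedError

    def clean(self) -> None:
        """Remove empty dirs under data_root (same logic as the diff)."""
        for root, dirs, files in os.walk(self.data_root, topdown=False):
            if not files and not dirs:
                os.rmdir(root)


class S3StyleObtainer(BaseObtainer):
    """Overrides only the transport step; clean() is inherited."""

    def download(self, url: str, dst_path: str) -> None:
        # Placeholder for an S3 transfer; it only creates an empty file.
        open(dst_path, 'w').close()
```

With this layout, helpers such as `extract`, `move`, and `clean` live in one place, and the S3 variant only carries the logic that actually differs.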

gaotongxiao · 2023-06-02T09:17:33Z

requirements/runtime.txt

@@ -1,3 +1,4 @@
+boto3


This can be moved to "optional.txt"

fix: code format

EnableAsync · 2023-06-08T05:38:44Z

@gaotongxiao Hi, thank you for reviewing my code. I greatly appreciate your insights and suggestions. I have made the changes you requested and would be grateful if you could review the updated version at your convenience.

gaotongxiao · 2023-06-23T16:10:40Z

LGTM, thanks for your contribution!

* feat: add aws s3 obtainer
  feat: add aws s3 obtainer
  fix: format
  fix: format
* fix: avoid duplicated code
  fix: code format
* fix: runtime.txt
* fix: remove duplicated code

feat: add aws s3 obtainer

f021ab2

feat: add aws s3 obtainer fix: format fix: format

mm-assistant bot assigned gaotongxiao May 2, 2023

gaotongxiao reviewed Jun 2, 2023

EnableAsync and others added 2 commits June 8, 2023 13:26

fix: avoid duplicated code

3f98cb7

fix: code format

fix: runtime.txt

2f80286

EnableAsync added 2 commits June 8, 2023 13:49

fix: remove duplicated code

5e720bd

Merge remote-tracking branch 'refs/remotes/origin/enableasync/add_aws…

54ac04f

…_obtainer' into enableasync/add_aws_obtainer

gaotongxiao approved these changes Jun 23, 2023

gaotongxiao merged commit e991cc9 into open-mmlab:main Jun 23, 2023
