Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] AWS S3 obtainer support #1888

Merged
merged 5 commits into from
Jun 23, 2023

Conversation

EnableAsync
Copy link
Contributor

@EnableAsync EnableAsync commented May 2, 2023

Motivation

Same as #1885
When I was adding the hiertext dataset to mmocr, I found out the hiertext dataset could only be downloaded through AWS CLI12. Therefore, I submitted this pr.
I'm new to mmocr and grateful for any advice or insights.

Modification

Added an AWS S3 obtainer implemented through the AWS SDK.

Use cases (Optional)

data_root = 'data/hiertext'
cache_path = 'data/cache'
test_preparer = dict(
    obtainer=dict(
        type='AWSS3Obtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='s3://open-images-dataset/ocr/test.tgz',
                save_name='hiertext_textdet_test_img.tgz',
                md5='656ab5da42986408a57eb678f9540e0a',
                content=['image'],
                mapping=[['hiertext_textdet_test_img', 'textdet_imgs/test']]),
        ]), )

Checklist

Before PR:

  • I have read and followed the workflow indicated in the CONTRIBUTING.md to create this PR.
  • Pre-commit or linting tools indicated in CONTRIBUTING.md are used to fix the potential lint issues.
  • Bug fixes are covered by unit tests, the case that causes the bug should be added in the unit tests.
  • New functionalities are covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • The documentation has been modified accordingly, including docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with some of those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

Footnotes

  1. https://github.com/google-research-datasets/hiertext

  2. https://docs.aws.amazon.com/cli/index.html

feat: add aws s3 obtainer

fix: format

fix: format
@CLAassistant
Copy link

CLAassistant commented May 2, 2023

CLA assistant check
All committers have signed the CLA.

@OpenMMLab-Assistant-004
Copy link

Hi @EnableAsync,

We'd like to express our appreciation for your valuable contributions to the mmocr. Your efforts have significantly aided in enhancing the project's quality.
It is our pleasure to invite you to join our community thorugh Discord_Special Interest Group (SIG) channel. This is a great place to share your experiences, discuss ideas, and connect with other like-minded people. To become a part of the SIG channel, send a message to the moderator, OpenMMLab, briefly introduce yourself and mention your open-source contributions in the #introductions channel. Our team will gladly facilitate your entry. We eagerly await your presence. Please follow this link to join us: ​https://discord.gg/UjgXkPWNqA.

If you're on WeChat, we'd also love for you to join our community there. Just add our assistant using the WeChat ID: openmmlabwx. When sending the friend request, remember to include the remark "mmsig + Github ID".

Thanks again for your awesome contribution, and we're excited to have you as part of our community!

Copy link
Collaborator

@gaotongxiao gaotongxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR! It looks overall great and we can merge it as soon as you have some minor issues fixed.

Comment on lines 9 to 11
import boto3
from botocore import UNSIGNED
from botocore.config import Config
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since not everyone needs the s3 client, we can turn this requirement as optional by moving the import util into AWSS3Obtainer.__init__ where these packages are actually needed. Also, try to give a proper error msg in case of missing package like:

try:
import lmdb
except ImportError:
raise ImportError(
'Please install lmdb to enable RecogLMDBDataset.')

Comment on lines 139 to 237
def extract(self,
src_path: str,
dst_path: str,
delete: bool = False) -> None:
"""Extract zip/tar.gz files.

Args:
src_path (str): Path to the zip file.
dst_path (str): Path to the destination folder.
delete (bool, optional): Whether to delete the zip file. Defaults
to False.
"""
if not is_archive(src_path):
# Copy the file to the destination folder if it is not a zip
if osp.isfile(src_path):
shutil.copy(src_path, dst_path)
else:
shutil.copytree(src_path, dst_path)
return

zip_name = osp.basename(src_path).split('.')[0]
if dst_path is None:
dst_path = osp.join(osp.dirname(src_path), zip_name)
else:
dst_path = osp.join(dst_path, zip_name)

extracted = False
if osp.exists(dst_path):
name = set(os.listdir(dst_path))
if '.finish' in name:
extracted = True
elif '.finish' not in name and len(name) > 0:
while True:
c = input(f'{dst_path} already exists when extracting '
'{zip_name}, unzip again? (y/N) ') or 'N'
if c.lower() in ['y', 'n']:
extracted = c == 'n'
break
if extracted:
open(osp.join(dst_path, '.finish'), 'w').close()
print(f'{zip_name} has been extracted. Skip')
return
mkdir_or_exist(dst_path)
print(f'Extracting: {osp.basename(src_path)}')
if src_path.endswith('.zip'):
try:
import zipfile
except ImportError:
raise ImportError(
'Please install zipfile by running "pip install zipfile".')
with zipfile.ZipFile(src_path, 'r') as zip_ref:
zip_ref.extractall(dst_path)
elif src_path.endswith('.tar.gz') or src_path.endswith(
'.tar') or src_path.endswith('.tgz'):
if src_path.endswith('.tar.gz'):
mode = 'r:gz'
elif src_path.endswith('.tar'):
mode = 'r:'
elif src_path.endswith('tgz'):
mode = 'r:gz'
try:
import tarfile
except ImportError:
raise ImportError(
'Please install tarfile by running "pip install tarfile".')
with tarfile.open(src_path, mode) as tar_ref:
tar_ref.extractall(dst_path)

open(osp.join(dst_path, '.finish'), 'w').close()
if delete:
os.remove(src_path)

def move(self, mapping: List[Tuple[str, str]]) -> None:
"""Rename and move dataset files one by one.

Args:
mapping (List[Tuple[str, str]]): A list of tuples, each
tuple contains the source file name and the destination file name.
"""
for src, dst in mapping:
src = osp.join(self.data_root, src)
dst = osp.join(self.data_root, dst)

if '*' in src:
mkdir_or_exist(dst)
for f in glob.glob(src):
if not osp.exists(
osp.join(dst, osp.relpath(f, self.data_root))):
shutil.move(f, dst)

elif osp.exists(src) and not osp.exists(dst):
mkdir_or_exist(osp.dirname(dst))
shutil.move(src, dst)

def clean(self) -> None:
"""Remove empty dirs."""
for root, dirs, files in os.walk(self.data_root, topdown=False):
if not files and not dirs:
os.rmdir(root)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated code can be avoided by inheriting this class from NaiveDataObtainer

@@ -1,3 +1,4 @@
boto3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be moved to "optional.txt"

Copy link
Collaborator

@gaotongxiao gaotongxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR! It looks overall great and we can merge it as soon as you have some minor issues fixed.

@EnableAsync
Copy link
Contributor Author

@gaotongxiao Hi, Thank you for your review of my code. I greatly appreciate your insights and suggestions. I have made the necessary changes based on your feedback and would be grateful if you could review the updated version at your convenience.

@gaotongxiao
Copy link
Collaborator

LGTM, thanks for you contribution!

@gaotongxiao gaotongxiao merged commit e991cc9 into open-mmlab:main Jun 23, 2023
gaotongxiao pushed a commit to gaotongxiao/mmocr that referenced this pull request Jun 23, 2023
* feat: add aws s3 obtainer

feat: add aws s3 obtainer

fix: format

fix: format

* fix: avoid duplicated code

fix: code format

* fix: runtime.txt

* fix: remove duplicated code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants