-
Notifications
You must be signed in to change notification settings - Fork 754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] AWS S3 obtainer support #1888
[Feature] AWS S3 obtainer support #1888
Conversation
feat: add aws s3 obtainer fix: format fix: format
Hi @EnableAsync, We'd like to express our appreciation for your valuable contributions to the mmocr. Your efforts have significantly aided in enhancing the project's quality. If you're on WeChat, we'd also love for you to join our community there. Just add our assistant using the WeChat ID: openmmlabwx. When sending the friend request, remember to include the remark "mmsig + Github ID". Thanks again for your awesome contribution, and we're excited to have you as part of our community! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR! It looks overall great and we can merge it as soon as you have some minor issues fixed.
import boto3 | ||
from botocore import UNSIGNED | ||
from botocore.config import Config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since not everyone needs the s3 client, we can turn this requirement as optional by moving the import util into AWSS3Obtainer.__init__
where these packages are actually needed. Also, try to give a proper error msg in case of missing package like:
mmocr/mmocr/datasets/recog_lmdb_dataset.py
Lines 158 to 162 in 37c5d37
try: | |
import lmdb | |
except ImportError: | |
raise ImportError( | |
'Please install lmdb to enable RecogLMDBDataset.') |
def extract(self, | ||
src_path: str, | ||
dst_path: str, | ||
delete: bool = False) -> None: | ||
"""Extract zip/tar.gz files. | ||
|
||
Args: | ||
src_path (str): Path to the zip file. | ||
dst_path (str): Path to the destination folder. | ||
delete (bool, optional): Whether to delete the zip file. Defaults | ||
to False. | ||
""" | ||
if not is_archive(src_path): | ||
# Copy the file to the destination folder if it is not a zip | ||
if osp.isfile(src_path): | ||
shutil.copy(src_path, dst_path) | ||
else: | ||
shutil.copytree(src_path, dst_path) | ||
return | ||
|
||
zip_name = osp.basename(src_path).split('.')[0] | ||
if dst_path is None: | ||
dst_path = osp.join(osp.dirname(src_path), zip_name) | ||
else: | ||
dst_path = osp.join(dst_path, zip_name) | ||
|
||
extracted = False | ||
if osp.exists(dst_path): | ||
name = set(os.listdir(dst_path)) | ||
if '.finish' in name: | ||
extracted = True | ||
elif '.finish' not in name and len(name) > 0: | ||
while True: | ||
c = input(f'{dst_path} already exists when extracting ' | ||
'{zip_name}, unzip again? (y/N) ') or 'N' | ||
if c.lower() in ['y', 'n']: | ||
extracted = c == 'n' | ||
break | ||
if extracted: | ||
open(osp.join(dst_path, '.finish'), 'w').close() | ||
print(f'{zip_name} has been extracted. Skip') | ||
return | ||
mkdir_or_exist(dst_path) | ||
print(f'Extracting: {osp.basename(src_path)}') | ||
if src_path.endswith('.zip'): | ||
try: | ||
import zipfile | ||
except ImportError: | ||
raise ImportError( | ||
'Please install zipfile by running "pip install zipfile".') | ||
with zipfile.ZipFile(src_path, 'r') as zip_ref: | ||
zip_ref.extractall(dst_path) | ||
elif src_path.endswith('.tar.gz') or src_path.endswith( | ||
'.tar') or src_path.endswith('.tgz'): | ||
if src_path.endswith('.tar.gz'): | ||
mode = 'r:gz' | ||
elif src_path.endswith('.tar'): | ||
mode = 'r:' | ||
elif src_path.endswith('tgz'): | ||
mode = 'r:gz' | ||
try: | ||
import tarfile | ||
except ImportError: | ||
raise ImportError( | ||
'Please install tarfile by running "pip install tarfile".') | ||
with tarfile.open(src_path, mode) as tar_ref: | ||
tar_ref.extractall(dst_path) | ||
|
||
open(osp.join(dst_path, '.finish'), 'w').close() | ||
if delete: | ||
os.remove(src_path) | ||
|
||
def move(self, mapping: List[Tuple[str, str]]) -> None: | ||
"""Rename and move dataset files one by one. | ||
|
||
Args: | ||
mapping (List[Tuple[str, str]]): A list of tuples, each | ||
tuple contains the source file name and the destination file name. | ||
""" | ||
for src, dst in mapping: | ||
src = osp.join(self.data_root, src) | ||
dst = osp.join(self.data_root, dst) | ||
|
||
if '*' in src: | ||
mkdir_or_exist(dst) | ||
for f in glob.glob(src): | ||
if not osp.exists( | ||
osp.join(dst, osp.relpath(f, self.data_root))): | ||
shutil.move(f, dst) | ||
|
||
elif osp.exists(src) and not osp.exists(dst): | ||
mkdir_or_exist(osp.dirname(dst)) | ||
shutil.move(src, dst) | ||
|
||
def clean(self) -> None: | ||
"""Remove empty dirs.""" | ||
for root, dirs, files in os.walk(self.data_root, topdown=False): | ||
if not files and not dirs: | ||
os.rmdir(root) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicated code can be avoided by inheriting this class from NaiveDataObtainer
requirements/runtime.txt
Outdated
@@ -1,3 +1,4 @@ | |||
boto3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be moved to "optional.txt"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR! It looks overall great and we can merge it as soon as you have some minor issues fixed.
fix: code format
@gaotongxiao Hi, Thank you for your review of my code. I greatly appreciate your insights and suggestions. I have made the necessary changes based on your feedback and would be grateful if you could review the updated version at your convenience. |
…_obtainer' into enableasync/add_aws_obtainer
LGTM, thanks for you contribution! |
* feat: add aws s3 obtainer feat: add aws s3 obtainer fix: format fix: format * fix: avoid duplicated code fix: code format * fix: runtime.txt * fix: remove duplicated code
Motivation
Same as #1885
When I was adding the hiertext dataset to mmocr, I found out the hiertext dataset could only be downloaded through AWS CLI12. Therefore, I submitted this pr.
I'm new to mmocr and grateful for any advice or insights.
Modification
Added an AWS S3 obtainer implemented through the AWS SDK.
Use cases (Optional)
Checklist
Before PR:
After PR:
Footnotes
https://github.com/google-research-datasets/hiertext ↩
https://docs.aws.amazon.com/cli/index.html ↩