Download pre-generated offset file from data corpus repository, if present #542

Merged: 1 commit, May 27, 2024

@gkamat (Collaborator) commented May 26, 2024

Description

Generation of the offset table that maps line numbers to file offsets for large data corpora can take a considerable amount of time. Downloading pre-generated offset files is much faster. This change implements this capability and downloads associated offset files when available.
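As a rough illustration of why an offset table is useful, here is a minimal sketch of building one and using it to jump to a given line. The function names, the sampling interval `every`, and the in-memory dictionary layout are assumptions made for this example; the actual on-disk format of the `.offset` file is not shown in this PR.

```python
import io


def build_offset_table(stream, every=2):
    """Map line numbers to the byte offset at which that line starts.

    Only every `every`-th line is recorded, to keep the table small.
    Building the table requires a full pass over the corpus, which is
    the expensive step for large files.
    """
    table = {0: 0}
    line_no = 0
    offset = 0
    for line in stream:
        offset += len(line)
        line_no += 1
        if line_no % every == 0:
            table[line_no] = offset
    return table


def seek_to_line(stream, table, target):
    """Position `stream` at the start of line `target` (0-based)."""
    # Jump to the nearest recorded line at or before the target ...
    base = max(n for n in table if n <= target)
    stream.seek(table[base])
    # ... then skip the few remaining lines one by one.
    for _ in range(target - base):
        stream.readline()
    return stream


# Hypothetical five-document corpus, one document per line.
corpus = io.BytesIO(b"doc0\ndoc1\ndoc2\ndoc3\ndoc4\n")
table = build_offset_table(corpus, every=2)
line = seek_to_line(corpus, table, 3).readline()
```

With a pre-generated table, the full pass in `build_offset_table` can be skipped and only the cheap seek remains.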

Issues Resolved

#519

Testing

  • Ran unit and integration tests.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@gkamat gkamat force-pushed the offset-file branch 3 times, most recently from 8aac123 to 6512064 Compare May 26, 2024 06:33
try:
    self.downloader.download(document_set.base_url, None, doc_path + '.offset', None)
except exceptions.DataError as e:
    if isinstance(e.cause, urllib.error.HTTPError) and (e.cause.code == 403 or e.cause.code == 404):
Collaborator

For file not found, shouldn't the error code be just 404?

Collaborator Author

If there is no read permission enabled on the directory, there will be a 403 error code for non-existent files or objects.

Collaborator

Got it. My comment was more about the log message: since it only said "file not found", I was wondering why 403 is also included here. Thanks for clarifying.
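The handling discussed in this thread can be sketched as follows, assuming that both 403 and 404 should be treated as "no pre-generated offset file is available" and that any other failure should still propagate. `DataError` here is a hypothetical stand-in for the project's `exceptions.DataError`, and `try_download_offset_file` and its `download` callable are illustrative names, not the code in this PR.

```python
import urllib.error


class DataError(Exception):
    """Hypothetical stand-in for the project's exceptions.DataError."""

    def __init__(self, message, cause=None):
        super().__init__(message)
        self.cause = cause


# A server without read/list permission on the directory may answer 403
# instead of 404 for objects that do not exist, so both mean "absent".
MISSING_CODES = (403, 404)


def try_download_offset_file(download, base_url, doc_path):
    """Return True if the offset file was downloaded, False if it is absent.

    Any error other than a 403/404 HTTP response is re-raised.
    """
    try:
        download(base_url, doc_path + ".offset")
        return True
    except DataError as e:
        if isinstance(e.cause, urllib.error.HTTPError) and e.cause.code in MISSING_CODES:
            return False
        raise


def fake_download(code):
    """Build a download callable that always fails with the given HTTP code."""
    def _download(base_url, target):
        err = urllib.error.HTTPError(base_url + "/" + target, code, "error", None, None)
        raise DataError("download failed", cause=err)
    return _download


base = "https://example.org/corpus"  # placeholder URL for the example
missing_403 = try_download_offset_file(fake_download(403), base, "docs.json")
missing_404 = try_download_offset_file(fake_download(404), base, "docs.json")
downloaded = try_download_offset_file(lambda base_url, target: None, base, "docs.json")
```

When the function returns `False`, a caller would presumably fall back to generating the offset table locally. A log message for that branch could mention both "not found" and "not accessible" to match the two status codes.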

@gkamat gkamat merged commit 3d9ca6e into opensearch-project:main May 27, 2024
14 checks passed
@IanHoang
Collaborator

This will expedite the downloading process. Thanks for doing this 👍🏻
