-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: The hash_cache fails with filenames including surrogates #492
Conversation
755112c
to
a4c1a41
Compare
Cases encountered in practice show that filenames can include unicode surrogates. This changes the hash_cache functionality to preserve them by using the "surrogatepass" handler for them. Another issue encountered is that S3 metadata only allows ASCII, but the code was placing unicode paths in it. * Modify the hash_cache sqlite schema to use a BLOB instead of TEXT for the file_path, and then explicitly encode/decode it using the 'utf-8' encoding and 'surrogatepass' error handler. * In the canonical json code, use the 'surrogatepass' error handler in the canonical file path sorting. Add the new test case data to the test manifest file. * In the error logging of the exception that revealed this bug, include the exception context to improve debugging of future similar issues. * Add a test case for the hash_cache based on an example that was failing in practice. * Add the "asset-root-json" metadata containing the path encoded to ASCII via a JSON string. If the file path is not ASCII, it adds this metadata instead. Signed-off-by: Mark Wiebe <[email protected]>
a4c1a41
to
7e8626d
Compare
Quality Gate passedIssues Measures |
return path.path.encode("utf-16_be") | ||
# Use the "surrogatepass" error handler because filenames encountered in the wild | ||
# include surrogates. | ||
return path.path.encode("utf-16_be", errors="surrogatepass") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I thought this was some magic keyword :)
https://docs.python.org/3/library/codecs.html
@@ -527,7 +528,16 @@ def _upload_output_manifest_to_s3( | |||
full_output_prefix, | |||
f"{manifest_name_prefix}_output", | |||
) | |||
metadata = {"Metadata": {"asset-root": root_path}} | |||
metadata = {"Metadata": {"asset-root": json.dumps(root_path, ensure_ascii=True)}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will need to amend this to the manifest upload
CLI since it has similar mechanisms.
@@ -670,7 +671,10 @@ def _get_asset_root_from_s3( | |||
except Exception as e: | |||
raise AssetSyncError(e) from e | |||
|
|||
return head["Metadata"].get("asset-root", None) | |||
if "asset-root-json" in head["Metadata"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, I need to port this to manifest download
. (Will do after this PR is merged)
Fixes: #481
What was the problem/requirement? (What/Why)
Cases encountered in practice show that filenames can include unicode surrogates. This changes the hash_cache functionality to preserve them by using the "surrogatepass" handler for them.
What was the solution? (How)
What is the impact of this change?
Jobs that could not be submitted, because their filenames include surrogates, will work now.
How was this change tested?
See DEVELOPMENT.md for information on running tests.
Yes
No
download
orasset_sync
modules? If so, then it is highly recommendedthat you ensure that the docker-based unit tests pass.
Yes, but have not run these tests.
Was this change documented?
N/A
Does this PR introduce new dependencies?
Is this a breaking change?
No. In increases the table version of the sqlite hash cache, but that does not break compatibility.
Does this change impact security?
No
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.