Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: The hash_cache fails with filenames including surrogates #492

Merged
merged 1 commit into from
Nov 5, 2024

Conversation

mwiebe
Copy link
Contributor

@mwiebe mwiebe commented Nov 4, 2024

Fixes: #481

What was the problem/requirement? (What/Why)

Cases encountered in practice show that filenames can include unicode surrogates. This changes the hash_cache functionality to preserve them by using the "surrogatepass" handler for them.

What was the solution? (How)

  • Modify the hash_cache sqlite schema to use a BLOB instead of TEXT for the file_path, and then explicitly encode/decode it using the 'utf-8' encoding and 'surrogatepass' error handler.
  • In the canonical json code, use the 'surrogatepass' error handler in the canonical file path sorting.
  • In the error logging of the exception that revealed this bug, include the exception context to improve debugging of future similar issues.
  • Add a test case for the hash_cache based on an example that was failing in practice.

What is the impact of this change?

Jobs that could not be submitted, because their filenames include surrogates, will work now.

How was this change tested?

See DEVELOPMENT.md for information on running tests.

  • Have you run the unit tests?
    Yes
  • Have you run the integration tests?
    No
  • Have you made changes to the download or asset_sync modules? If so, then it is highly recommended
    that you ensure that the docker-based unit tests pass.
    Yes, but have not run these tests.

Was this change documented?

N/A

Does this PR introduce new dependencies?

  • This PR adds one or more new dependency Python packages. I acknowledge I have reviewed the considerations for adding dependencies in DEVELOPMENT.md.
  • This PR does not add any new dependencies.

Is this a breaking change?

No. In increases the table version of the sqlite hash cache, but that does not break compatibility.

Does this change impact security?

No


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@mwiebe mwiebe force-pushed the hash-cache-surrogates branch 3 times, most recently from 755112c to a4c1a41 Compare November 4, 2024 19:55
Cases encountered in practice show that filenames can include unicode
surrogates. This changes the hash_cache functionality to preserve them
by using the "surrogatepass" handler for them.

Another issue encountered is that S3 metadata only allows ASCII, but the
code was placing unicode paths in it.

* Modify the hash_cache sqlite schema to use a BLOB instead of TEXT for
  the file_path, and then explicitly encode/decode it using the 'utf-8'
  encoding and 'surrogatepass' error handler.
* In the canonical json code, use the 'surrogatepass' error handler in
  the canonical file path sorting. Add the new test case data to the
  test manifest file.
* In the error logging of the exception that revealed this bug, include
  the exception context to improve debugging of future similar issues.
* Add a test case for the hash_cache based on an example that was
  failing in practice.
* Add the "asset-root-json" metadata containing the path encoded to
  ASCII via a JSON string. If the file path is not ASCII, it adds this
  metadata instead.

Signed-off-by: Mark Wiebe <[email protected]>
@mwiebe mwiebe force-pushed the hash-cache-surrogates branch from a4c1a41 to 7e8626d Compare November 4, 2024 19:57
Copy link

sonarqubecloud bot commented Nov 4, 2024

@mwiebe mwiebe marked this pull request as ready for review November 4, 2024 19:58
@mwiebe mwiebe requested a review from a team as a code owner November 4, 2024 19:58
return path.path.encode("utf-16_be")
# Use the "surrogatepass" error handler because filenames encountered in the wild
# include surrogates.
return path.path.encode("utf-16_be", errors="surrogatepass")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I thought this was some magic keyword :)
https://docs.python.org/3/library/codecs.html

@@ -527,7 +528,16 @@ def _upload_output_manifest_to_s3(
full_output_prefix,
f"{manifest_name_prefix}_output",
)
metadata = {"Metadata": {"asset-root": root_path}}
metadata = {"Metadata": {"asset-root": json.dumps(root_path, ensure_ascii=True)}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to amend this to the manifest upload CLI since it has similar mechanisms.

@@ -670,7 +671,10 @@ def _get_asset_root_from_s3(
except Exception as e:
raise AssetSyncError(e) from e

return head["Metadata"].get("asset-root", None)
if "asset-root-json" in head["Metadata"]:
Copy link
Contributor

@leongdl leongdl Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, I need to port this to manifest download. (Will do after this PR is merged)

@mwiebe mwiebe merged commit 39c285d into aws-deadline:mainline Nov 5, 2024
18 checks passed
@mwiebe mwiebe deleted the hash-cache-surrogates branch November 5, 2024 01:30
This was referenced Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: Special Characters cannot be included in job bundles
3 participants