Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle UTF BOM when decoding text #10130

Merged
merged 53 commits into from
Jun 4, 2024
Merged

Handle UTF BOM when decoding text #10130

merged 53 commits into from
Jun 4, 2024

Conversation

radeusgd
Copy link
Member

@radeusgd radeusgd commented May 29, 2024

Pull Request Description

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    TypeScript,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • Unit tests have been written where possible.

@radeusgd radeusgd self-assigned this May 29, 2024
@radeusgd radeusgd changed the title Wip/radeusgd/9849 auto utf Handle UTF BOM when decoding text May 29, 2024
@radeusgd radeusgd force-pushed the wip/radeusgd/9849-auto-utf branch 2 times, most recently from 5448bbc to d35a444 Compare May 31, 2024 11:49
*
* <p>Used for reporting warnings.
* <p>It will never return 0 - it will either return a positive number of characters read, or -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems to contradict line 120

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops! I forgot about this edge case, I will update the doc 'will never return 0 unless user requested 0 bytes'. As the user setting len == 0 should be the only case where 0 would be returned - as guaranteed by asserts below.

radeusgd added 2 commits June 3, 2024 10:49
Encoding.Default is treated as UTF-8 for new files
when appending, Encoding.Default will try to detect the encoding
@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Jun 3, 2024
Copy link
Member

@jdunkerley jdunkerley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some little things for next PR...

distribution/lib/Standard/Base/0.0.0-dev/src/Data.enso Outdated Show resolved Hide resolved

Arguments:
- character_set: java.nio.charset name.
Value (character_set:Text)
Value (internal_character_set:Text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Value (internal_character_set:Text)
private Value (internal_character_set:Text)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot be done easily because we cannot have both private and public constructors mixed, and Encoding.Default is public.

We could make it private and create a 'factory' method Encoding.default I guess. Will amend in the next PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amended in this PR after all, for clarity.

Comment on lines 76 to 78
to_java_charset self -> Charset ! Illegal_Argument =
self.to_java_charset_or_null.if_nothing <|
Error.throw (Illegal_Argument.Error "The Default encoding cannot be converted to a Java Charset. Please select a specific encoding for this operation.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to use UTF-8 and attach a warning?

Be good to get a feel where this is as we should probably think around how we workaround longer term.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, I did this deliberately.

My idea is that for Write operations we should use the Encoding.utf_8 as the default explicitly OR in places where read/write is both possible and Encoding.Default is used, our write logic should have logic to consciously replace Encoding.Default with something else in Write mode.

This is nicely demonstrated in commit 5656d6a

If I had this method select UTF-8 automatically, I would not have noticed that I'm lacking special logic for Delimited Write - it would just write as UTF-8. But that would be incorrect - e.g. when appending to a file that upon read is detected to be UTF-16 because it has a BOM - in such cases we want to also append in UTF-16, so we needed special logic changes.

Thus IMO it is better to keep as is - because whenever this error occurs it should be one of 3 scenarios:

  1. we mistakenly put Encoding.Default as a default value for a write-only operation,
  2. we forgot to have special handling for Encoding.Default in a read/write operation,
  3. the user explicitly selected Encoding.Default for a write operation.

In case of (3) the error makes sense because the user should just select an encoding explicitly.

In cases (1) and (2) it also means that we actually forgot to implement something and we should fix the code - otherwise we could get invalid behaviour and corrupted files - e.g. with mixed UTF-8 and UTF-16.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I think I somehow missed the 'attach a warning' part.

I still think the error will be more noticeable. But OTOH the warning will allow the user to proceed so maybe that is the right call indeed.

Amended.

@@ -739,7 +739,7 @@ Text.is_whitespace self =
"Hello".bytes (Encoding.ascii)
@encoding Encoding.default_widget
Text.bytes : Encoding -> Problem_Behavior -> Vector Integer
Text.bytes self encoding on_problems=Problem_Behavior.Report_Warning =
Text.bytes self encoding (on_problems : Problem_Behavior = Problem_Behavior.Report_Warning) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Text.bytes self encoding (on_problems : Problem_Behavior = Problem_Behavior.Report_Warning) =
Text.bytes self encoding:Encoding (on_problems : Problem_Behavior = Problem_Behavior.Report_Warning) =

We should probably default here as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the default be? UTF-8?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added UTF-8 default. Now this method by default has the same behaviour as utf_8. But I think it makes sense to keep both, to still have the utf_8 shorthand if we want UTF-8 explicitly. The default for bytes is more oriented at the GUI/Component Browser usage where we want some default and it also happens to be the same.

@radeusgd radeusgd removed the CI: Ready to merge This PR is eligible for automatic merge label Jun 3, 2024
@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Jun 3, 2024
@mergify mergify bot merged commit 7cf80f3 into develop Jun 4, 2024
35 of 36 checks passed
@mergify mergify bot deleted the wip/radeusgd/9849-auto-utf branch June 4, 2024 13:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON parser breaks in presence of UTF BOM
4 participants