-
Notifications
You must be signed in to change notification settings - Fork 326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle UTF BOM when decoding text #10130
Conversation
5448bbc
to
d35a444
Compare
…it as default/fallback
* | ||
* <p>Used for reporting warnings. | ||
* <p>It will never return 0 - it will either return a positive number of characters read, or -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment seems to contradict line 120
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooops! I forgot about this edge case, I will update the doc 'will never return 0 unless user requested 0 bytes'. As the user setting len == 0
should be the only case where 0
would be returned - as guaranteed by asserts below.
Encoding.Default is treated as UTF-8 for new files when appending, Encoding.Default will try to detect the encoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some little things for next PR...
|
||
Arguments: | ||
- character_set: java.nio.charset name. | ||
Value (character_set:Text) | ||
Value (internal_character_set:Text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Value (internal_character_set:Text) | |
private Value (internal_character_set:Text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cannot be done easily because we cannot have both private and public constructors mixed, and Encoding.Default
is public.
We could make it private and create a 'factory' method Encoding.default
I guess. Will amend in the next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amended in this PR after all, for clarity.
to_java_charset self -> Charset ! Illegal_Argument = | ||
self.to_java_charset_or_null.if_nothing <| | ||
Error.throw (Illegal_Argument.Error "The Default encoding cannot be converted to a Java Charset. Please select a specific encoding for this operation.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to use UTF-8 and attach a warning?
Be good to get a feel where this is as we should probably think around how we workaround longer term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, I did this deliberately.
My idea is that for Write operations we should use the Encoding.utf_8
as the default explicitly OR in places where read/write is both possible and Encoding.Default
is used, our write logic should have logic to consciously replace Encoding.Default
with something else in Write mode.
This is nicely demonstrated in commit 5656d6a
If I had this method select UTF-8 automatically, I would not have noticed that I'm lacking special logic for Delimited Write - it would just write as UTF-8. But that would be incorrect - e.g. when appending to a file that upon read is detected to be UTF-16 because it has a BOM - in such cases we want to also append in UTF-16, so we needed special logic changes.
Thus IMO it is better to keep as is - because whenever this error occurs it should be one of 3 scenarios:
- we mistakenly put
Encoding.Default
as a default value for a write-only operation, - we forgot to have special handling for
Encoding.Default
in a read/write operation, - the user explicitly selected
Encoding.Default
for a write operation.
In case of (3) the error makes sense because the user should just select an encoding explicitly.
In cases (1) and (2) it also means that we actually forgot to implement something and we should fix the code - otherwise we could get invalid behaviour and corrupted files - e.g. with mixed UTF-8 and UTF-16.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I think I somehow missed the 'attach a warning' part.
I still think the error will be more noticeable. But OTOH the warning will allow the user to proceed so maybe that is the right call indeed.
Amended.
@@ -739,7 +739,7 @@ Text.is_whitespace self = | |||
"Hello".bytes (Encoding.ascii) | |||
@encoding Encoding.default_widget | |||
Text.bytes : Encoding -> Problem_Behavior -> Vector Integer | |||
Text.bytes self encoding on_problems=Problem_Behavior.Report_Warning = | |||
Text.bytes self encoding (on_problems : Problem_Behavior = Problem_Behavior.Report_Warning) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Text.bytes self encoding (on_problems : Problem_Behavior = Problem_Behavior.Report_Warning) = | |
Text.bytes self encoding:Encoding (on_problems : Problem_Behavior = Problem_Behavior.Report_Warning) = |
We should probably default here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would the default be? UTF-8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added UTF-8
default. Now this method by default has the same behaviour as utf_8
. But I think it makes sense to keep both, to still have the utf_8
shorthand if we want UTF-8 explicitly. The default for bytes
is more oriented at the GUI/Component Browser usage where we want some default and it also happens to be the same.
Pull Request Description
Encoding.Default
if invalid UTF-8 characters are encountered #10148.Important Notes
Checklist
Please ensure that the following checklist has been satisfied before submitting the PR:
Scala,
Java,
TypeScript,
and
Rust
style guides. In case you are using a language not listed above, follow the Rust style guide.