Handle UTF BOM when decoding text #10130

radeusgd · 2024-05-29T15:45:11Z

Pull Request Description

Improves BOM handling: detects and skips the BOM character, adds a Default encoding that detects the encoding based on the BOM if present, and reports warnings if an unexpected BOM is encountered.
Closes #9849 (JSON parser breaks in presence of UTF BOM).
The Windows-1252 fallback will be done in a separate PR because it has additional complexity. It is tracked in #10148 (Fallback to Windows-1252 encoding with Encoding.Default if invalid UTF-8 characters are encountered).
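
For readers unfamiliar with BOM-based detection, the idea can be sketched in Java roughly as follows. This is a minimal illustration, not the code from this PR: the class `BomSniffer` and its methods are hypothetical names, only the UTF-8 and UTF-16 BOMs are covered, and falling back to UTF-8 when no BOM is present is an assumption of the sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not the PR's code): picks a charset from a leading BOM and skips it. */
final class BomSniffer {
  private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
  private static final byte[] UTF16_BE_BOM = {(byte) 0xFE, (byte) 0xFF};
  private static final byte[] UTF16_LE_BOM = {(byte) 0xFF, (byte) 0xFE};

  private static boolean startsWith(byte[] data, byte[] prefix) {
    return data.length >= prefix.length
        && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
  }

  /** Returns the charset indicated by the BOM, or UTF-8 if there is none (assumption). */
  static Charset detectCharset(byte[] data) {
    if (startsWith(data, UTF16_BE_BOM)) return StandardCharsets.UTF_16BE;
    if (startsWith(data, UTF16_LE_BOM)) return StandardCharsets.UTF_16LE;
    return StandardCharsets.UTF_8;
  }

  /** Returns the number of leading BOM bytes that should be skipped before decoding. */
  static int bomLength(byte[] data) {
    if (startsWith(data, UTF8_BOM)) return UTF8_BOM.length;
    if (startsWith(data, UTF16_BE_BOM) || startsWith(data, UTF16_LE_BOM)) return 2;
    return 0;
  }

  /** Decodes the data with the detected charset, leaving the BOM out of the result. */
  static String decode(byte[] data) {
    int skip = bomLength(data);
    return new String(data, skip, data.length - skip, detectCharset(data));
  }
}
```

With this sketch, a UTF-8 payload that starts with `EF BB BF` decodes without the BOM character at the front, which is the kind of input that tripped the JSON parser in #9849.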

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

The documentation has been updated, if necessary.
Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
All code follows the Scala, Java, TypeScript, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
Unit tests have been written where possible.

AdRiley · 2024-05-31T15:55:46Z

std-bits/base/src/main/java/org/enso/base/encoding/ReportingStreamDecoder.java

   *
-   * <p>Used for reporting warnings.
+   * <p>It will never return 0 - it will either return a positive number of characters read, or -1


This comment seems to contradict line 120

Oops! I forgot about this edge case. I will update the doc to say 'will never return 0 unless user requested 0 bytes', since the user setting len == 0 should be the only case where 0 is returned, as guaranteed by the asserts below.
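
The contract discussed here (0 only when `len == 0`, otherwise a positive count or -1) can be illustrated with a small `java.io.Reader` wrapper. This is a sketch under assumptions, not the actual `ReportingStreamDecoder`: the class name `NonZeroReader` is hypothetical, and the retry loop is just one simple way to uphold the guarantee.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical wrapper illustrating the contract; not the PR's ReportingStreamDecoder. */
final class NonZeroReader extends Reader {
  private final Reader inner;

  NonZeroReader(Reader inner) {
    this.inner = inner;
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    // The only case in which 0 may be returned: the caller asked for 0 characters.
    if (len == 0) return 0;
    int n;
    do {
      // Keep reading until there is progress or the end of the stream is reached.
      n = inner.read(cbuf, off, len);
    } while (n == 0);
    assert n > 0 || n == -1 : "read must return a positive count or -1";
    return n;
  }

  @Override
  public void close() throws IOException {
    inner.close();
  }
}
```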

jdunkerley

Some little things for next PR...

jdunkerley · 2024-06-03T08:23:33Z

distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Encoding.enso


       Arguments:
       - character_set: java.nio.charset name.
-    Value (character_set:Text)
+    Value (internal_character_set:Text)


Suggested change

-    Value (internal_character_set:Text)
+    private Value (internal_character_set:Text)

This cannot be done easily because we cannot have both private and public constructors mixed, and Encoding.Default is public.

We could make it private and create a 'factory' method Encoding.default I guess. Will amend in the next PR.

Amended in this PR after all, for clarity.

jdunkerley · 2024-06-03T08:24:33Z

distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Encoding.enso

+    to_java_charset self -> Charset ! Illegal_Argument =
+        self.to_java_charset_or_null.if_nothing <|
+            Error.throw (Illegal_Argument.Error "The Default encoding cannot be converted to a Java Charset. Please select a specific encoding for this operation.")


Would it be better to use UTF-8 and attach a warning?

It would be good to get a feel for where this occurs, as we should probably think about how to work around it in the longer term.

I don't think so, I did this deliberately.

My idea is that for Write operations we should explicitly use Encoding.utf_8 as the default, OR, in places where both read and write are possible and Encoding.Default is used, our write logic should consciously replace Encoding.Default with something else in Write mode.

This is nicely demonstrated in commit 5656d6a

If I had this method select UTF-8 automatically, I would not have noticed that I'm lacking special logic for Delimited Write - it would just write as UTF-8. But that would be incorrect - e.g. when appending to a file that upon read is detected to be UTF-16 because it has a BOM - in such cases we want to also append in UTF-16, so we needed special logic changes.

Thus IMO it is better to keep as is - because whenever this error occurs it should be one of 3 scenarios:

1. we mistakenly put Encoding.Default as a default value for a write-only operation,

2. we forgot to have special handling for Encoding.Default in a read/write operation,

3. the user explicitly selected Encoding.Default for a write operation.

In case of (3) the error makes sense because the user should just select an encoding explicitly.

In cases (1) and (2) it also means that we actually forgot to implement something and we should fix the code - otherwise we could get invalid behaviour and corrupted files - e.g. with mixed UTF-8 and UTF-16.

Ah I think I somehow missed the 'attach a warning' part.

I still think the error will be more noticeable. But OTOH the warning will allow the user to proceed so maybe that is the right call indeed.

Amended.
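
The append scenario described in this thread (a new file gets UTF-8, while an existing file with a UTF-16 BOM keeps being written as UTF-16) can be sketched in Java as below. This is only an illustration of the idea, not the Delimited Writer logic from this PR: `AppendEncodingResolver` and `resolveForAppend` are hypothetical names, and using `null` to stand in for `Encoding.Default` is an assumption of the sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not the PR's code): chooses the charset to use when appending. */
final class AppendEncodingResolver {
  /**
   * @param requested the explicitly selected charset, or null as a stand-in for
   *     Encoding.Default (an assumption of this sketch)
   */
  static Charset resolveForAppend(Path file, Charset requested) throws IOException {
    // An explicitly selected encoding is always respected.
    if (requested != null) return requested;
    // New or empty files: nothing to stay consistent with, so use UTF-8.
    if (!Files.exists(file) || Files.size(file) == 0) return StandardCharsets.UTF_8;

    // Existing file: look at the first two bytes for a UTF-16 BOM.
    byte[] head = new byte[2];
    int n;
    try (InputStream in = Files.newInputStream(file)) {
      n = in.readNBytes(head, 0, head.length);
    }
    if (n == 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
      return StandardCharsets.UTF_16BE;
    }
    if (n == 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
      return StandardCharsets.UTF_16LE;
    }
    // No UTF-16 BOM: keep UTF-8, so appended data matches what a read would assume.
    return StandardCharsets.UTF_8;
  }
}
```

Resolving the encoding this way avoids the corrupted-file case mentioned above, where UTF-8 bytes would be appended to a file that reads back as UTF-16.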

jdunkerley · 2024-06-03T08:25:59Z

distribution/lib/Standard/Base/0.0.0-dev/src/Data/Text/Extensions.enso

@@ -739,7 +739,7 @@ Text.is_whitespace self =
         "Hello".bytes (Encoding.ascii)
 @encoding Encoding.default_widget
 Text.bytes : Encoding -> Problem_Behavior -> Vector Integer
-Text.bytes self encoding on_problems=Problem_Behavior.Report_Warning =
+Text.bytes self encoding (on_problems : Problem_Behavior = Problem_Behavior.Report_Warning) =


Suggested change

-Text.bytes self encoding (on_problems : Problem_Behavior = Problem_Behavior.Report_Warning) =
+Text.bytes self encoding:Encoding (on_problems : Problem_Behavior = Problem_Behavior.Report_Warning) =

We should probably default here as well.

What would the default be? UTF-8?

Added UTF-8 default. Now this method by default has the same behaviour as utf_8. But I think it makes sense to keep both, to still have the utf_8 shorthand if we want UTF-8 explicitly. The default for bytes is more oriented at the GUI/Component Browser usage where we want some default and it also happens to be the same.

radeusgd self-assigned this May 29, 2024

radeusgd changed the title ~~Wip/radeusgd/9849 auto utf~~ Handle UTF BOM when decoding text May 29, 2024

radeusgd force-pushed the wip/radeusgd/9849-auto-utf branch 2 times, most recently from 5448bbc to d35a444 Compare May 31, 2024 11:49

radeusgd added 26 commits May 31, 2024 17:14

Introduced Encoding.Default and switched read-like operations to use …

ef627e4

…it as default/fallback

tests for default encoding and UTF BOM

691688f

tests for File read/write, what happens if invalid UTF but BOM encoun…

69c7322

…tered

JSON test

c3d5dc4

Delimited tests

6858a83

fix typos

931f544

moving Encoding Utils

3f917ad

updating Decoder

f6551b4

update test

9572d67

DRY

a2f9068

fix test

570ab83

more tests

78bd465

encoding detection based on BOM

923bb1b

mark fallback tests as pending

b738ba9

skip valid BOM in known UTF encodings

f473e13

refactor to have problem aggregator pattern

ddd4d19

javafmt

e271fc8

detected BOM context in errors

3a5d376

note

fadb6c3

report unexpected BOMs

e01cfe6

fixing tests

728ca75

check preconditions

80aa6cf

add some asserts, fix missing last partial character edge case

50bdd70

fix test

e587343

test edge case - currently parser has div by 0 ...

acb9de7

fix most Delimited tests by allowing Default encoding in stream decoder

f08666a

radeusgd requested review from GregoryTravis, AdRiley and marthasharkey as code owners May 31, 2024 15:16

radeusgd mentioned this pull request May 31, 2024

Fallback to Windows-1252 encoding with Encoding.Default if invalid UTF-8 characters are encountered #10148

Closed

fix tests

6becd7f

AdRiley reviewed May 31, 2024

AdRiley approved these changes May 31, 2024

radeusgd added 5 commits May 31, 2024 18:08

fix tests (2)

08da99a

ensure that we don't read the file all at once, but as needed

12c60d0

fix comment

b07a9c5

fmt

b9706a2

revert to Infer as it has different semantics from Default - it also …

bdbd3ff

…considers metadata

GregoryTravis approved these changes May 31, 2024

radeusgd added 2 commits June 3, 2024 10:49

Merge branch 'develop' into wip/radeusgd/9849-auto-utf

a5f53b9

fix Delimited Writer:

5656d6a

Encoding.Default is treated as UTF-8 for new files when appending, Encoding.Default will try to detect the encoding

radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Jun 3, 2024

jdunkerley approved these changes Jun 3, 2024

radeusgd removed the CI: Ready to merge This PR is eligible for automatic merge label Jun 3, 2024

radeusgd added 6 commits June 3, 2024 13:24

add tests that write uses the encoding like read would

0ed6f7d

make Encoding constructor private

d5be424

CR: adding type annotations

e8982fd

CR: change error to warning

477123e

CR: add defaults

14952fc

Merge branch 'develop' into wip/radeusgd/9849-auto-utf

84434c8

radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Jun 3, 2024

radeusgd added 3 commits June 4, 2024 11:50

fixing tests

1f9f8ae

Merge branch 'refs/heads/develop' into wip/radeusgd/9849-auto-utf

0aed534

fix NI test

4413c72

mergify bot merged commit 7cf80f3 into develop Jun 4, 2024
35 of 36 checks passed

mergify bot deleted the wip/radeusgd/9849-auto-utf branch June 4, 2024 13:22
