doc: add documentation for invalid byte sequences #28249

Closed
32 changes: 32 additions & 0 deletions doc/api/buffer.md
@@ -197,6 +197,38 @@ the WHATWG specification it is possible that the server actually returned
`'win-1252'`-encoded data, and using `'latin1'` encoding may incorrectly decode
the characters.

### Evaluating legal code points for `'utf-8'` encoding

Byte sequences that do not have corresponding UTF-16 encodings and non-legal
Unicode values, along with their UTF-8 counterparts must be treated as
invalid byte sequences.

For cases regarding operations other than employing backward compatibility
for 7-bit (and [extended 8-bit](https://en.wikipedia.org/wiki/UTF-8#Description)
in rare cases) `'ascii'` data, and the valid [`UTF-8` code units](https://en.wikipedia.org/wiki/UTF-8#Codepage_layout),
the replacement character (`�`) is returned,
and no exception will be thrown.

A `U+FFFD` replacement value
(representing the aforementioned replacement character) will be returned
in case of decoding errors (invalid Unicode scalar values).

To be honest, I don’t understand most of the text or its relevance here… the text basically says that invalid UTF-8 byte sequences will be decoded into U+FFFD replacement characters and that no error will be thrown in those cases, right?

How do UTF-16 and ASCII relate to that? What does “non-legal Unicode value” mean? (I would guess that this refers to characters that would be beyond U+10FFFF – if that’s correct, can you clarify that in the text?)


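```js
// Assuming an invalid byte sequence: a UTF-8-style encoding of the
// surrogate U+D9A4, which is not a Unicode scalar value
const buf = Buffer.from([237, 166, 164]);

const buf_str = buf.toString('utf-8');

console.log(buf_str);
// Prints one or more '�' characters; the exact count may depend on the
// Node.js version

// byteLength() is a static method on Buffer, not an instance method
console.log(Buffer.byteLength(buf_str));
// Prints a multiple of 3: each U+FFFD takes 3 bytes in UTF-8

// codePointAt() is a String method, so call it on the decoded string
console.log(buf_str.codePointAt(0).toString(16));
// Prints: 'fffd'
```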
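
As a rough illustration of the behavior described above, the sketch below decodes a few kinds of invalid input and checks that each one decodes to U+FFFD without throwing. The `decodeLossy` helper and the `samples` object are hypothetical names used only for this example. The contrast at the end assumes a Node.js build that exposes the global `TextDecoder` with support for the `fatal` option.

```js
// Hypothetical helper for this sketch; not part of the Buffer API.
function decodeLossy(bytes) {
  return Buffer.from(bytes).toString('utf8');
}

// Illustrative invalid inputs (assumed for this example).
const samples = {
  // UTF-8-style encoding of the surrogate U+D9A4
  surrogate: [0xed, 0xa6, 0xa4],
  // Would encode U+110000, which is beyond U+10FFFF
  beyondMax: [0xf4, 0x90, 0x80, 0x80],
  // First two bytes of U+20AC with the final byte missing
  truncated: [0xe2, 0x82],
};

for (const [name, bytes] of Object.entries(samples)) {
  const str = decodeLossy(bytes);
  // No exception is thrown; the result consists only of U+FFFD.
  const onlyReplacement =
    str.length > 0 && [...str].every((ch) => ch === '\ufffd');
  console.log(name, onlyReplacement);
}

// By contrast, a TextDecoder created with `fatal: true` rejects the
// same input instead of substituting U+FFFD.
const strict = new TextDecoder('utf-8', { fatal: true });
let threw = false;
try {
  strict.decode(Buffer.from(samples.surrogate));
} catch (err) {
  threw = err instanceof TypeError;
}
console.log(threw);
```

The number of replacement characters produced for each input is deliberately not checked here, since it can differ between decoder implementations.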

## Buffers and TypedArray
<!-- YAML
changes: