UTF8 bytes count support as columnKind? #466

haya14busa · 2020-07-20T03:49:18Z

Can or should SARIF support UTF-8 byte counts as a columnKind?

utf16CodeUnits is handy for programming languages that use UTF-16 as their default string representation, but many other languages use UTF-8 by default, and UTF-8 is the de facto standard encoding for text files these days.

SARIF supports unicodeCodePoints as an alternative, but I think it is handy neither for tools that output SARIF nor for tools that consume it. To support unicodeCodePoints, tools need extra encode/decode steps.

For example, consider a consumer that supports replacements. If columns are UTF-8 byte counts, applying a replacement is simple: the tool can just replace the bytes in the specified range.
With unicodeCodePoints, the tool has to read and decode the content before the specified range just to find the correct position.
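
A rough sketch of that difference, assuming 0-based offsets within a single line with an exclusive end. The helper names `replaceBytes` and `codePointToByteOffset` are hypothetical and only for illustration; they are not part of any SARIF tooling.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// replaceBytes applies a replacement given byte offsets directly.
// start and end are 0-based byte offsets into line; end is exclusive.
func replaceBytes(line string, start, end int, repl string) string {
	return line[:start] + repl + line[end:]
}

// codePointToByteOffset converts a 0-based code point index into a byte
// offset. It has to decode every rune that precedes the index, which is
// the extra step a unicodeCodePoints consumer cannot avoid.
func codePointToByteOffset(line string, cp int) int {
	off := 0
	for i := 0; i < cp && off < len(line); i++ {
		_, size := utf8.DecodeRuneInString(line[off:])
		off += size
	}
	return off
}

func main() {
	line := "s := \"日本語\" + old"

	// Byte-based range: usable as-is.
	fmt.Println(replaceBytes(line, 19, 22, "new"))

	// Code-point-based range: must be converted first.
	start := codePointToByteOffset(line, 13)
	end := codePointToByteOffset(line, 16)
	fmt.Println(replaceBytes(line, start, end, "new"))
}
```

Both calls print the same line with `old` replaced by `new`, but only the second one has to decode the text in front of the range.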

Actual Examples

Go

Go uses a byte count for an AST node's position. https://golang.org/pkg/go/token/#Position
The Go standard analysis package (https://pkg.go.dev/golang.org/x/tools/go/analysis) uses this as well.

type Position struct {
    Filename string // filename, if any
    Offset   int    // offset, starting at 0
    Line     int    // line number, starting at 1
    Column   int    // column number, starting at 1 (byte count)
}

Vim

Vim uses the byte index as a column position (See :help col() for example).

:help col()
col({expr})	The result is a Number, which is the byte index of the column
		position given with {expr}.

I haven't researched other tools, but I believe many others also use the UTF-8 byte count as the column.

Proposal

I propose adding utf8CodeUnits or bytes (bytes in UTF-8, or bytes in the file's text encoding) as a columnKind value.

ghost · 2020-07-27T14:03:49Z

Thank you for the suggestion! @michaelcfanning FYI

michaelcfanning · 2020-07-27T15:12:12Z

+1, good suggestion.

michaelcfanning · 2023-07-13T16:17:30Z

Next steps: investigate Xcode + VS Code behavior for managing column data.
We tend to think a bytes value might be unnecessary, as regions already have an expression for bytes.

KalleOlaviNiemitalo · 2023-07-14T03:44:05Z

In a SARIF-2.1.0 region object, the byteOffset property counts from the start of the artifact, not from the start of the line indicated by the startLine property. So it can be used for UTF-8 bytes in principle but I suspect it is not convenient for text-oriented tools.
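
As a rough illustration of the gap, a tool that tracks a 1-based line and a 1-based byte column would need something like the following to produce an artifact-relative byteOffset. The function name `byteOffsetOf` is hypothetical, and the sketch assumes lines end in `\n` (which also covers `\r\n`); other newline sequences are not handled.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// byteOffsetOf converts a 1-based line number and a 1-based byte column
// within that line into a 0-based byte offset from the start of the
// content. It has to scan every preceding line to find where the
// requested line begins.
func byteOffsetOf(content []byte, line, col int) (int, error) {
	if line < 1 || col < 1 {
		return 0, errors.New("line and column are 1-based")
	}
	start := 0
	for l := 1; l < line; l++ {
		i := bytes.IndexByte(content[start:], '\n')
		if i < 0 {
			return 0, fmt.Errorf("line %d is past the end of the content", line)
		}
		start += i + 1
	}
	off := start + col - 1
	if off > len(content) {
		return 0, fmt.Errorf("column %d is past the end of the content", col)
	}
	return off, nil
}

func main() {
	content := []byte("package p\n\nvar s = \"日本語\"; var t = 1\n")
	off, err := byteOffsetOf(content, 3, 26)
	if err != nil {
		panic(err)
	}
	fmt.Println(off, string(content[off:off+1])) // prints: 36 t
}
```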

schlaman-ms · 2023-08-09T21:48:19Z

Document location for issue:

§3.14 run object
§3.14.27 columnKind property

haya14busa:

"I propose to add utf8CodeUnits or bytes (in UTF8 or bytes in the text encoding of the file) as columnKind"

Propose adding utf8CodeUnits as a possible value of the `columnKind` property

michaelcfanning added 2.2.0 discussion-ongoing labels Aug 12, 2021

michaelcfanning added to-be-discussed and removed discussion-ongoing labels Jul 13, 2023
