UTF8 bytes count support as columnKind? #466

Open
haya14busa opened this issue Jul 20, 2020 · 5 comments
Comments

@haya14busa

Can or should SARIF support a (UTF-8) byte count as a columnKind?

utf16CodeUnits is handy for programming languages that use UTF-16 as their default string representation, but many other languages use UTF-8 by default, and UTF-8 is the de facto standard encoding for text files these days.

SARIF supports unicodeCodePoints as an alternative, but I think it is handy neither for tools that output SARIF nor for tools that consume it. To support unicodeCodePoints, tools need extra encode/decode steps.
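To make the difference between the units concrete, here is a minimal Go sketch that computes the 1-based column of the same position under three counting schemes. The function name `columns` and the sample line are hypothetical and only for illustration; they are not part of SARIF or of any tool.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// columns returns the 1-based column of the character that starts at byte
// offset `offset` within line, counted in UTF-8 bytes, UTF-16 code units,
// and Unicode code points. (Hypothetical helper for illustration.)
func columns(line string, offset int) (utf8Col, utf16Col, codePointCol int) {
	prefix := line[:offset]
	utf8Col = len(prefix) + 1                        // bytes before the position
	utf16Col = len(utf16.Encode([]rune(prefix))) + 1 // UTF-16 code units before it
	codePointCol = utf8.RuneCountInString(prefix) + 1
	return
}

func main() {
	line := "a😀b" // the emoji is 4 bytes in UTF-8 and 2 code units in UTF-16
	b, u, c := columns(line, 5) // byte offset 5 is where 'b' starts
	fmt.Println(b, u, c)
}
```

Only the byte-based column can be read off without decoding the text before the position; the other two require walking the prefix.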

For example, consider a consumer that supports replacements. If columns are (UTF-8) byte counts, supporting replacements is simple: the consumer can just replace the bytes in the specified range.
With unicodeCodePoints, the consumer has to read and decode the content that precedes the specified range in order to find the correct position.
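A rough Go sketch of the two cases, assuming 0-based, end-exclusive offsets within a single line for simplicity (SARIF columns themselves are 1-based). The helper names `replaceBytes` and `codePointToByteOffset` are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// replaceBytes replaces the byte range [start, end) of line with repl.
// With byte-based columns this is all a consumer needs to do.
func replaceBytes(line string, start, end int, repl string) string {
	return line[:start] + repl + line[end:]
}

// codePointToByteOffset converts a 0-based code point index into a byte
// offset by decoding every rune that precedes it.
func codePointToByteOffset(line string, cp int) int {
	off := 0
	for i := 0; i < cp && off < len(line); i++ {
		_, size := utf8.DecodeRuneInString(line[off:])
		off += size
	}
	return off
}

func main() {
	line := `msg := "日本語" + nmae`

	// Byte-based range: usable directly.
	fmt.Println(replaceBytes(line, 21, 25, "name"))

	// Code-point-based range: must be converted first.
	start := codePointToByteOffset(line, 15)
	end := codePointToByteOffset(line, 19)
	fmt.Println(replaceBytes(line, start, end, "name"))
}
```

Both calls produce the same fixed line; the second one just has to scan the text before the range to get there.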

Actual Examples

Go

Go uses a byte count for the column of an AST node's position. https://golang.org/pkg/go/token/#Position
The same representation is used by the Go standard analysis package (https://pkg.go.dev/golang.org/x/tools/go/analysis).

type Position struct {
    Filename string // filename, if any
    Offset   int    // offset, starting at 0
    Line     int    // line number, starting at 1
    Column   int    // column number, starting at 1 (byte count)
}
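As a small illustration of the byte-based Column field, the following sketch parses a snippet with go/parser and reports the column of an identifier that follows non-ASCII text on the same line. The helper `identColumn` and the sample source are hypothetical; the go/token, go/parser, and go/ast calls are from the standard library.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// identColumn parses src and returns the Column reported by go/token for
// the first identifier called name. (Hypothetical helper for illustration.)
func identColumn(src, name string) (int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return 0, err
	}
	column := 0
	ast.Inspect(file, func(n ast.Node) bool {
		if id, ok := n.(*ast.Ident); ok && id.Name == name && column == 0 {
			column = fset.Position(id.Pos()).Column
		}
		return column == 0 // stop descending once the identifier is found
	})
	if column == 0 {
		return 0, fmt.Errorf("identifier %q not found", name)
	}
	return column, nil
}

func main() {
	// The string literal holds three characters that take 3 bytes each in UTF-8.
	src := "package p\n\nvar s = \"日本語\"; var x = 1\n"
	col, err := identColumn(src, "x")
	if err != nil {
		panic(err)
	}
	fmt.Println(col)
}
```

Here `x` is the 20th character on its line but starts at the 26th byte, and the reported column is the byte-based one.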

Vim

Vim uses the byte index as a column position (See :help col() for example).

:help col()
col({expr})	The result is a Number, which is the byte index of the column
		position given with {expr}.

I haven't researched other tools, but I believe many of them also use a UTF-8 byte count as the column.

Proposal

I propose adding utf8CodeUnits or bytes (UTF-8 bytes, or bytes in the text encoding of the file) as a columnKind value.

@ghost

ghost commented Jul 27, 2020

Thank you for the suggestion! @michaelcfanning FYI

@michaelcfanning
Contributor

+1, good suggestion.

@michaelcfanning
Contributor

Next steps: investigate how Xcode and VS Code manage column data.
We tend to think a bytes value might be unnecessary, as regions already have a way to express byte positions.

@KalleOlaviNiemitalo

In a SARIF-2.1.0 region object, the byteOffset property counts from the start of the artifact, not from the start of the line indicated by the startLine property. So it can be used for UTF-8 bytes in principle but I suspect it is not convenient for text-oriented tools.
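To illustrate the extra work this implies, here is a minimal Go sketch that turns an artifact-relative byte offset into a line number and a line-relative byte column. It assumes a 0-based byteOffset, 1-based lines and columns, and "\n" as the only line terminator; the function name `lineAndByteColumn` is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// lineAndByteColumn converts a 0-based byte offset from the start of the
// artifact into a 1-based line number and a 1-based byte column in that line.
// Assumption: lines are terminated by "\n" only.
func lineAndByteColumn(content []byte, byteOffset int) (line, column int) {
	prefix := content[:byteOffset]
	line = bytes.Count(prefix, []byte{'\n'}) + 1
	lineStart := bytes.LastIndexByte(prefix, '\n') + 1 // 0 when no newline precedes
	column = byteOffset - lineStart + 1
	return
}

func main() {
	content := []byte("ab\ncé d\n")
	line, column := lineAndByteColumn(content, 7) // offset 7 is the 'd'
	fmt.Println(line, column)
}
```

A text-oriented tool that already tracks lines would have to do this conversion (or its inverse) for every region, whereas a byte-based columnKind would let it reuse startLine as is.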

@schlaman-ms

schlaman-ms commented Aug 9, 2023

Document location for issue:

§3.14 run object
§3.14.27 columnKind property

haya14busa:

"I propose to add utf8CodeUnits or bytes (in UTF8 or bytes in the text encoding of the file) as columnKind"

Proposal: add utf8CodeUnits as a possible value of the columnKind property.
