Add option to allow column position report in characters #45

xpol · 2016-01-05T06:55:57Z

Currently column position are reported in bytes.

Eg:

local hand = {['♥']={}}
print(hand['♥'][i])

luacheck --ranges test.lua will reports follow warning:

spec/Hand_spec.lua:2:19-19: accessing undefined variable i

In atom with linter-luacheck displays:

The problems is ♥ has 3 bytes in utf8 which caused 2 character offset in Atom.

Is it possible to include an option like --characters in luacheck to report column position in characters rather than bytes? Maybe a related options such as --encoding is needed (or --characters default treat input as utf8 is enough).

The text was updated successfully, but these errors were encountered:

mpeterv · 2016-01-05T09:49:11Z

I don't really want to implement any encodings other than utf8, so I prefer --characters (maybe call it --utf8 then).

xpol · 2016-01-05T10:14:57Z

I personally don't use Lua file encoded other than utf8.

And I prefers --characters or --count-in-characters rather than --utf8 😄 .

mpeterv · 2016-02-17T10:28:59Z

I think it's OK to just always use UTF8, without any options. Then this feature won't have strange interactions with caching, and can be implemented right away.

mpeterv · 2017-08-27T12:30:51Z

There is also another use case: there is value in being able to check that input files are purely ASCII. E.g. it ensures that there are no comments in some other alphabet in codebase that's supposed to use purely English comments. So there should be an option to use ascii encoding. Then it makes sense to have --encoding option with possible values being:

latin: current behaviour, one byte = one character.
ascii: current behaviour for bytes in 0..127 range, error if there is a byte in 128..255 range
utf8: decode input as UTF8, report everything in characters, error on invalid byte sequence

An option to use UTF8 but ignore decoding errors does not seem to be very useful.

johnd0e · 2017-08-30T11:20:52Z

An option to use UTF8 but ignore decoding errors does not seem to be very useful.

May be it would be usefull to get other option: try to use UTF8 but fall back to latin in case of errors.

Or another option: detect UTF8 by BOM.

It come in handy if we have to check mixed set of files.

mpeterv · 2017-08-30T17:08:24Z

It will be possible to set encoding per file in config. Autodetection can be implemented separately as --encoding=auto, which may be then become default if it works well enough.

Alloyed · 2017-09-04T02:22:10Z

I've got a particularly fun job with respect to this one.

Visual Studio Code, and then by extension anybody that wants to implement an LSP server, uses the word "character" to mean utf16 code units. I don't realistically expect you to implement this, but it means to match spec I need to go from bytes -> unicode code points -> utf16 code units.

I am doing this using the lua API, so I prefer working in bytes so the built-in string.sub() works. I don't mind an opt-in choice to use codepoints, but there are enough ambiguous definitions of the word "character" that I'd rather use the word code point in the docs.

mpeterv · 2017-09-04T10:03:55Z

@Alloyed weird cases like that could be implemented as a separate encoding in luacheck, or maybe luacheck could load custom encodings from external modules, like with formatters.

And yes, ideally columns should be reported in grapheme clusters and not just code points. Grapheme clusters seem to be pretty clearly defined in the unicode spec.

mpeterv · 2018-09-06T15:31:28Z

Just a bit late but on unicode branch columns/line length warnings are now in UTF8 codepoints (with fallback to old behaviour on decoding error).

mpeterv · 2018-09-06T17:24:32Z

@Alloyed could you clarify the LSP issue? I'm not sure what Visual Studio Code ... uses the word "character" to mean utf16 code units means. UTF-16 encoded text can't be syntactically correct Lua.

Alloyed · 2018-09-07T00:13:48Z

VSCode stores its text buffers internally as utf-16, even if the file that gets saved and loaded is saved as utf-8.

This means (by intention or by accident), that when the LSP protocol wants to communicate a position in a text file, it sends it as {line number, utf16 code unit}, even when the actual encoding of the text being sent back and forth is utf-8.

See
microsoft/language-server-protocol#376
for more info, and
https://github.com/Alloyed/lua-lsp/blob/master/lua-lsp/methods.lua#L292
for what this means in code.

mpeterv · 2018-09-08T09:50:23Z

Unicode support merged into master, closing this, new release coming soon. @Alloyed if there are any issues with this please open a new one.

xpol changed the title ~~Should column position report in characters or bytes?~~ Add option to allow column position report in characters Jan 5, 2016

mpeterv added the enhancement label Jan 5, 2016

mpeterv mentioned this issue Aug 27, 2017

max line length count is in bytes, should be in characters #115

Closed

mpeterv mentioned this issue Nov 10, 2017

The check for max column in a line not calculate the unicode correctly. #136

Closed

mpeterv mentioned this issue Sep 6, 2018

Line length counts in bytes #170

Closed

mpeterv closed this as completed Sep 8, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to allow column position report in characters #45

Add option to allow column position report in characters #45

xpol commented Jan 5, 2016

mpeterv commented Jan 5, 2016

xpol commented Jan 5, 2016

mpeterv commented Feb 17, 2016

mpeterv commented Aug 27, 2017

johnd0e commented Aug 30, 2017 •

edited

Loading

mpeterv commented Aug 30, 2017

Alloyed commented Sep 4, 2017 •

edited

Loading

mpeterv commented Sep 4, 2017

mpeterv commented Sep 6, 2018 •

edited

Loading

mpeterv commented Sep 6, 2018

Alloyed commented Sep 7, 2018

mpeterv commented Sep 8, 2018

Add option to allow column position report in characters #45

Add option to allow column position report in characters #45

Comments

xpol commented Jan 5, 2016

mpeterv commented Jan 5, 2016

xpol commented Jan 5, 2016

mpeterv commented Feb 17, 2016

mpeterv commented Aug 27, 2017

johnd0e commented Aug 30, 2017 • edited Loading

mpeterv commented Aug 30, 2017

Alloyed commented Sep 4, 2017 • edited Loading

mpeterv commented Sep 4, 2017

mpeterv commented Sep 6, 2018 • edited Loading

mpeterv commented Sep 6, 2018

Alloyed commented Sep 7, 2018

mpeterv commented Sep 8, 2018

johnd0e commented Aug 30, 2017 •

edited

Loading

Alloyed commented Sep 4, 2017 •

edited

Loading

mpeterv commented Sep 6, 2018 •

edited

Loading