The basic-bytestring wrapper does not work correctly with left contexts when provided with characters which are encoded as multiple bytes in UTF-8.
The following program produces True,False while I expect it to produce True,True.
{
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.ByteString.Lazy.Char8 as B
}
%wrapper "basic-bytestring"
tokens :-
a^b {const True}
a {const True}
∃^∀ {const True}
∃ {const True}
. {const False}
{
main::IO ()
main = do
print . and . alexScanTokens $ "ab"
print . and . alexScanTokens $ "∃∀"
}
I think this is due to alexGetByte for this wrapper remembering the last byte rather than the last character.
Since converting input bytes to characters would impose an unnecessary cost on users of this wrapper, maybe we should just not implement left contexts in this case?
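The suspected cause can be illustrated outside Alex. The following is only a sketch of the hypothesis, not the wrapper's actual code: `lastByteAsChar` is a hypothetical helper that models a scanner which keeps just the last byte consumed and treats it as the previous character.

```haskell
module Main where

import qualified Data.ByteString.Lazy as BL
import Data.Char (chr)
import qualified Data.Text.Lazy as T
import Data.Text.Lazy.Encoding (encodeUtf8)

-- Hypothetical helper (not part of Alex): UTF-8 encode a single character,
-- keep only its final byte, and reinterpret that byte as a Char.
lastByteAsChar :: Char -> Char
lastByteAsChar = chr . fromIntegral . BL.last . encodeUtf8 . T.singleton

main :: IO ()
main = do
  -- 'a' is a single byte in UTF-8, so the last byte is the character itself.
  print (lastByteAsChar 'a' == 'a')
  -- '∃' (U+2203) encodes as E2 88 83; the last byte 0x83 is not '∃'.
  print (lastByteAsChar '∃' == '∃')
```

Under this model a left context on an ASCII character still matches, while a left context on a multi-byte character cannot, which is consistent with the True,False output reported above.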
@jmoy Your program does not perform UTF-8 encoding correctly. The fromString instance of Data.ByteString.Lazy.ByteString does not encode anything: it truncates each code point c to its low 8 bits, that is, c `mod` 256. To perform UTF-8 encoding correctly, you can write the literal as a Data.Text.Lazy.Text and pass it through Data.Text.Lazy.Encoding.encodeUtf8. The modified program is as follows.
{
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.Text.Lazy as T
import Data.Text.Lazy.Encoding
}
%wrapper "basic-bytestring"
tokens :-
a^b {const True}
a {const True}
∃^∀ {const True}
∃ {const True}
. {const False}
{
main::IO ()
main = do
print . and . alexScanTokens . encodeUtf8 $ "ab"
print . and . alexScanTokens . encodeUtf8 $ "∃∀"
}
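To see the difference between the two conversions, the bytes can be inspected directly. This is a minimal sketch, assuming that the OverloadedStrings literal for a lazy ByteString behaves like Data.ByteString.Lazy.Char8.pack (8-bit truncation); `truncatedBytes` and `utf8Bytes` are names introduced here for illustration only.

```haskell
module Main where

import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BC
import qualified Data.Text.Lazy as T
import Data.Text.Lazy.Encoding (encodeUtf8)
import Data.Word (Word8)

-- Char8 packing: every Char is cut down to its low 8 bits.
truncatedBytes :: String -> [Word8]
truncatedBytes = BL.unpack . BC.pack

-- Proper UTF-8 encoding via lazy Text.
utf8Bytes :: String -> [Word8]
utf8Bytes = BL.unpack . encodeUtf8 . T.pack

main :: IO ()
main = do
  -- U+2203 and U+2200 lose their high bits: [3,0]
  print (truncatedBytes "∃∀")
  -- Three bytes per character: [226,136,131,226,136,128]
  print (utf8Bytes "∃∀")
```

With the truncating conversion the lexer never sees the bytes of ∃ or ∀ at all, so the rules for them cannot match regardless of how left contexts are handled.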