Cabal MUST ignore and override user locale variables #6076

Closed
fare opened this issue Jun 12, 2019 · 8 comments
@fare

fare commented Jun 12, 2019

While building source code, compilers and build tools should ALWAYS process each and every source file using the encoding with which the file was written by its authors and released by its maintainers, and NEVER process any of those files with the locale inherited from the end-user when that introduces any discrepancy whatsoever. The proper thing to do is thus to NEVER, EVER heed the user-inherited locale variables LANG and LC_*; the very idea flies in the face of the determinism that cabal aims for. If some interactive flag allows these variables to be inherited explicitly, any discrepancy in encoding should still lead to a prominent warning unless explicitly hushed.

The only imaginable defaults that make any sense for the locale are POSIX and en_US.UTF-8. The POSIX default would impose needless pain for no gain whatsoever now that UTF-8 is a widely accepted and supported standard, so the only sensible and useful default is en_US.UTF-8 (I would have proposed the more neutral C.UTF-8 but it doesn't work on Darwin).

I was faced with this bug while building with stack a Haskell program that depended on language-javascript, and had a painful debug session until I found how to configure a suitable shell.nix for stack.yaml. Drilling to root causes led me to find that it's a fundamental bug in all of Nix, Cabal, Hackage and Stack. Remarkably, I fixed the very same issue in Common Lisp, where the build system ASDF now assumes that all source code is UTF-8 by default, unless overridden by the library maintainers, and never ever heeding user locale. The switch was slightly painful, hounding maintainers of tens of libraries and actually pulling the switch only a year after warning everyone. The switch should be simpler for Cabal, as I suspect no one uses latin1, latin2, euc-jp or koi8-r anymore in any Haskell package.

See also:
https://www.snoyman.com/blog/2016/12/beware-of-readfile
agda/agda#2922
input-output-hk/cardano-sl@ed8c892

NB: I filed the same issue against nixpkgs and stack:
NixOS/nixpkgs#63014
commercialhaskell/stack#4859

Steps to reproduce

(unset LANG LC_ALL LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION;
stack build language-javascript )

Expected

It should compile successfully, as if I had built with LC_ALL=en_US.UTF-8

Actual

--  While building package language-javascript-0.6.0.12 using:
      /home/fare/.stack/setup-exe-cache/x86_64-linux-nix/Cabal-simple_mPHDZzAJ_2.4.0.1_ghc-8.6.5 --builddir=.stack-work/dist/x86_64-linux-nix/Cabal-2.4.0.1 build --ghc-options " -ddump-hi -ddump-to-file -fdiagnostics-color=always"
    Process exited with code: ExitFailure 1
    Logs have been written to: /home/fare/.stack/global-project/.stack-work/logs/language-javascript-0.6.0.12.log

    Configuring language-javascript-0.6.0.12...
    Preprocessing library for language-javascript-0.6.0.12..
    happy: src/Language/JavaScript/Parser/Grammar7.y: hGetContents: invalid argument (invalid byte sequence)

Using Cabal 2.4.0.1.

@23Skidoo
Member

Makes sense, yes.

@phadej
Collaborator

phadej commented Jun 12, 2019

This particular Unicode issue is with happy, not Cabal.

[polinukli] /code/mess % file Grammar7.y 
Grammar7.y: UTF-8 Unicode text
[polinukli] /code/mess % happy Grammar7.y 
unused rules: 3
shift/reduce conflicts:  246
reduce/reduce conflicts: 375
[polinukli] /code/mess % unset LANG LC_ALL LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION;
[polinukli] /code/mess % happy Grammar7.y
happy: Grammar7.y: hGetContents: invalid argument (invalid byte sequence)

@phadej phadej closed this as completed Jun 12, 2019
@fare
Author

fare commented Jun 12, 2019

It's a problem with anything that is built through Cabal, not just with happy.

@geraldus

I faced the exact same error message when building my website in a Docker container (Debian-based).

happy: src/Language/JavaScript/Parser/Grammar7.y: hGetContents: invalid argument (invalid byte sequence)

I solved the issue by setting up a locale:

  • Edit /etc/locale.gen (uncomment at least one locale)
  • Run locale-gen

Upgrading stack from 1.19 to 2.1.3 was not helpful.

Hope this will help somebody.

@2piix

2piix commented Jun 9, 2020

@phadej: This isn't a happy bug.

It's totally okay if happy fails when there isn't a locale set. It's not okay for Cabal to forget to set a locale when it calls happy to parse things out. Especially in light of the issues @fare brought up.
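For illustration only, here is a minimal sketch of what "setting a locale when calling a tool" could look like from Haskell. It is not Cabal's actual behavior: `pinLocale` is a hypothetical helper, the choice of en_US.UTF-8 is taken from the proposal in this thread, and `sh` stands in for a preprocessor such as happy.

```haskell
import System.Environment (getEnvironment)
import System.Process (CreateProcess (..), proc, readCreateProcess)

-- | Hypothetical helper: drop LANG and every LC_* variable from an
-- environment and pin a single LC_ALL value instead.
pinLocale :: String -> [(String, String)] -> [(String, String)]
pinLocale locale vars =
  ("LC_ALL", locale) : filter (not . isLocaleVar . fst) vars
  where
    isLocaleVar name = name == "LANG" || take 3 name == "LC_"

main :: IO ()
main = do
  inherited <- getEnvironment
  -- "sh" is a stand-in for the real child process (e.g. happy).
  let child = (proc "sh" ["-c", "printf %s \"$LC_ALL\""])
        { env = Just (pinLocale "en_US.UTF-8" inherited) }
  out <- readCreateProcess child ""
  putStrLn out  -- the child sees the pinned value, not the user's locale
```

The pinned locale still has to exist on the machine for the child to decode UTF-8 correctly, which is part of why the right default is disputed in this thread.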

@phadej
Collaborator

phadej commented Jun 9, 2020

What should Cabal set the locale to? How can Cabal know which locale to set?

Users have to configure their systems. In fact, just today we checked the Haskell Report (and the GHC manual), and there isn't any specific wording requiring Haskell source files to be in some specific encoding!

@fare
Author

fare commented Jun 9, 2020

The only locale both portable and useful is en_US.UTF-8.

C.POSIX is even more portable, but even less useful: it forces all code to be in ASCII.

@phadej
Collaborator

phadej commented Jun 9, 2020

It's still a bug in happy. Why should Cabal work around a bug in happy? You yourself write:

compilers and build tools should ALWAYS process each and every source file using the encoding with which the file was written by its authors and released by its maintainers,

Happy is a compiler. Make happy do hSetEncoding on the files it reads. Cabal specifies that .cabal files should be UTF-8 encoded; I cannot find anything similar for happy.
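As a rough sketch of the hSetEncoding approach (not happy's actual code; `readFileUtf8` and the demo file name are made up for illustration), a tool can fix the encoding on the handle so that decoding no longer depends on LANG or LC_*:

```haskell
import System.IO

-- | Hypothetical helper: read a whole file as UTF-8, whatever the
-- locale variables say.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8               -- override the locale-derived default
  contents <- hGetContents h
  length contents `seq` hClose h    -- force the lazy read, then close
  return contents

main :: IO ()
main = do
  let path = "utf8-demo.txt"        -- illustrative file name
  -- Write a non-ASCII character with an explicit encoding as well.
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h "caf\233"
  s <- readFileUtf8 path
  print (s == "caf\233")            -- prints True
```

Without the hSetEncoding call, the handle uses the encoding derived from the locale, which is how the "invalid byte sequence" error above arises when no UTF-8 locale is set.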

We won't work around bugs which are easily fixable in the tools themselves. Adding workarounds is not sustainable.

Yes, it might mean that you need to fix almost every tool, but that's the right approach.

Or just set LANG=en_US.UTF-8. How is that a problem?
