Cabal MUST ignore and override user locale variables #6076

fare · 2019-06-12T15:40:43Z

While building source code, compilers and build tools should ALWAYS process each and every source file using the encoding with which the file was written by its authors and released by its maintainers, and NEVER process any of those files with the locale inherited from the end-user whenever that locale introduces any discrepancy whatsoever. The proper thing to do is thus to NEVER, EVER heed the user-inherited locale variables LANG and LC_*: the very idea flies in the face of the determinism that cabal aims for. If some interactive flag allows these variables to be inherited explicitly, any discrepancy in encoding should still lead to a prominent warning unless explicitly hushed.

The only imaginable defaults that make any sense for the locale are POSIX and en_US.UTF-8. The POSIX default would impose needless pain for no gain whatsoever at a time when UTF-8 is a widely accepted and supported standard, so the only sensible and useful default is en_US.UTF-8 (I would have proposed the more neutral C.UTF-8, but it doesn't work on Darwin).

I ran into this bug while using stack to build a Haskell program that depended on language-javascript, and had a painful debugging session until I found how to configure a suitable shell.nix for stack.yaml. Drilling down to root causes led me to conclude that it's a fundamental bug in all of Nix, Cabal, Hackage and Stack. Remarkably, I fixed the very same issue in Common Lisp, where the build system ASDF now assumes that all source code is UTF-8 by default, unless overridden by the library maintainers, and never heeds the user locale. The switch was slightly painful: it meant hounding the maintainers of tens of libraries, and actually pulling the switch only a year after warning everyone. The switch should be simpler for Cabal, as I suspect no one uses latin1, latin2, euc-jp or koi8-r anymore in any Haskell package.

See also:
https://www.snoyman.com/blog/2016/12/beware-of-readfile
agda/agda#2922
input-output-hk/cardano-sl@ed8c892

NB: I filed the same issue against nixpkgs and stack:
NixOS/nixpkgs#63014
commercialhaskell/stack#4859

Steps to reproduce

(unset LANG LC_ALL LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION;
stack build language-javascript )

Expected

It should compile successfully, as if I had built with LC_ALL=en_US.UTF-8

Actual

--  While building package language-javascript-0.6.0.12 using:
      /home/fare/.stack/setup-exe-cache/x86_64-linux-nix/Cabal-simple_mPHDZzAJ_2.4.0.1_ghc-8.6.5 --builddir=.stack-work/dist/x86_64-linux-nix/Cabal-2.4.0.1 build --ghc-options " -ddump-hi -ddump-to-file -fdiagnostics-color=always"
    Process exited with code: ExitFailure 1
    Logs have been written to: /home/fare/.stack/global-project/.stack-work/logs/language-javascript-0.6.0.12.log

    Configuring language-javascript-0.6.0.12...
    Preprocessing library for language-javascript-0.6.0.12..
    happy: src/Language/JavaScript/Parser/Grammar7.y: hGetContents: invalid argument (invalid byte sequence)

Using Cabal 2.4.0.1.

23Skidoo · 2019-06-12T16:46:41Z

Makes sense, yes.

phadej · 2019-06-12T18:26:51Z

This particular Unicode issue is with happy, not Cabal.

[polinukli] /code/mess % file Grammar7.y 
Grammar7.y: UTF-8 Unicode text
[polinukli] /code/mess % happy Grammar7.y 
unused rules: 3
shift/reduce conflicts:  246
reduce/reduce conflicts: 375
[polinukli] /code/mess % unset LANG LC_ALL LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION;
[polinukli] /code/mess % happy Grammar7.y                                                                                                                                           
happy: Grammar7.y: hGetContents: invalid argument (invalid byte sequence)
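
The transcript above is consistent with how GHC's runtime picks a default text encoding for file handles: it derives it from the process locale, so the same binary decodes the same bytes differently depending on LANG and LC_*. A minimal sketch for inspecting this, using getLocaleEncoding and setLocaleEncoding from GHC.IO.Encoding in base (this is an illustration only, not code taken from happy or Cabal):

```haskell
import GHC.IO.Encoding (getLocaleEncoding, setLocaleEncoding, utf8)

main :: IO ()
main = do
  -- The encoding that newly opened text handles use by default;
  -- the runtime derives it from the locale environment.
  before <- getLocaleEncoding
  putStrLn ("default encoding from locale: " ++ show before)

  -- Override the default for handles opened from now on,
  -- independently of LANG / LC_*.
  setLocaleEncoding utf8
  after <- getLocaleEncoding
  putStrLn ("default encoding after override: " ++ show after)
```

Running it once normally and once with the locale variables unset shows whether the first line changes on a given system.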

fare · 2019-06-12T18:46:30Z

It's a problem with anything that is built through Cabal, not just with happy.

geraldus · 2019-08-16T20:00:57Z

I faced the exact same error message when building my website in a Docker container (Debian-based).

happy: src/Language/JavaScript/Parser/Grammar7.y: hGetContents: invalid argument (invalid byte sequence)

I solved the issue by setting a locale:

Edit /etc/locale.gen (uncomment at least one locale)
Run locale-gen

Upgrading stack from 1.19 to 2.1.3 did not help.

Hope this will help somebody.

2piix · 2020-06-09T20:25:08Z

@phadej: This isn't a happy bug.

It's totally okay if happy fails when there isn't a locale set. It's not okay for Cabal to forget to set a locale when it calls happy to parse things out. Especially in light of the issues @fare brought up.
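
What "setting a locale when calling happy" could look like, as a rough sketch only: the helper names withFixedLocale and runWithLocale are hypothetical, this is not how Cabal invokes preprocessors, and the choice of en_US.UTF-8 is simply the value proposed in this thread. It uses getEnvironment from base and the env field of CreateProcess from the process package.

```haskell
import System.Environment (getEnvironment)
import System.Process (CreateProcess (..), proc, readCreateProcess)

-- | Hypothetical helper: drop LANG and every LC_* variable from an
-- environment and pin both LC_ALL and LANG to the given locale.
withFixedLocale :: String -> [(String, String)] -> [(String, String)]
withFixedLocale locale vars =
  ("LC_ALL", locale) : ("LANG", locale) : filter (not . isLocaleVar . fst) vars
  where
    isLocaleVar name = name == "LANG" || take 3 name == "LC_"

-- | Hypothetical helper: run a command with the pinned locale and
-- return its standard output.
runWithLocale :: String -> FilePath -> [String] -> IO String
runWithLocale locale cmd args = do
  parent <- getEnvironment
  let child = (proc cmd args) { env = Just (withFixedLocale locale parent) }
  readCreateProcess child ""

main :: IO ()
main = do
  -- Assumes a POSIX "sh" on the PATH; the child just reports what it sees.
  out <- runWithLocale "en_US.UTF-8" "sh" ["-c", "echo \"LC_ALL=$LC_ALL\""]
  putStr out
```

Note that this only helps if the pinned locale actually exists on the machine, which is part of the objection raised in the replies.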

phadej · 2020-06-09T20:38:28Z

What should Cabal set the locale to? How can Cabal know which locale to set?

Users have to configure their systems. In fact, just today we checked the Haskell Report (and the GHC manual), and there isn't any specific wording saying that Haskell source files have to be in some specific encoding!

fare · 2020-06-09T21:29:57Z

The only locale both portable and useful is en_US.UTF-8.

C.POSIX is even more portable, but even less useful: it forces all code to be in ASCII.

phadej · 2020-06-09T21:43:16Z

It's still a bug in happy. Why should Cabal work around a bug in happy? You yourself wrote:

compilers and build tools should ALWAYS process each and every source file using the encoding with which the file was written by its authors and released by its maintainers,

Happy is a compiler. Make happy do hSetEncoding on files it reads. Cabal specifies that .cabal files should be UTF-8 encoded, I cannot find anything in happy.
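
For reference, the hSetEncoding approach is small. The following is a minimal sketch under the assumption that the tool's inputs are UTF-8; the function names readFileUtf8 and writeFileUtf8 and the demo file name are made up for illustration, and this is not happy's actual code.

```haskell
import Control.Exception (evaluate)
import System.IO

-- | Read a whole file as UTF-8, whatever LANG / LC_* say.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8
  contents <- hGetContents h
  -- hGetContents is lazy: force the whole string before withFile
  -- closes the handle.
  _ <- evaluate (length contents)
  return contents

-- | Write a file as UTF-8, whatever LANG / LC_* say.
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path s = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h s

main :: IO ()
main = do
  let path = "utf8-demo.txt"          -- hypothetical scratch file
      sample = "\955 \8594 \8704"     -- non-ASCII text, written as escapes
  writeFileUtf8 path sample
  roundTripped <- readFileUtf8 path
  putStrLn (if roundTripped == sample then "round trip ok" else "round trip mismatch")
```

Because the encoding is set on each handle, the result does not depend on the locale of the process that runs the tool.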

We won't work around bugs which are easily fixable in the tools themselves. Adding workarounds is not sustainable.

Yes it might mean that you need to fix almost every tool, but that's the right approach.

Or just set LANG=en_US.UTF-8. How is that a problem?

fare mentioned this issue Jun 12, 2019

Stack MUST ignore and override user locale variables commercialhaskell/stack#4859

Closed

phadej closed this as completed Jun 12, 2019
