compile time quadratic in number of routes, with -O2 #986

Open
bergey opened this issue Jun 21, 2018 · 43 comments

@bergey commented Jun 21, 2018

I've been digging a bit to understand where time is spent while building my $WORK codebase, and found this interesting behavior. I'm hoping others interested in investigating the root cause will discuss here. I'm open to changes in application code, servant, or GHC, if we can narrow the cause enough.

The small examples I'm using for benchmarking are at https://github.com/bergey/studious-telegram/tree/trunk/servant.
I'm timing with bench, invoked as bench 'stack clean && stack build'. I'm running on a GCP VM with nothing else running on it.
With stack build --fast (which I believe passes -O0 to GHC), the build time doesn't change dramatically as the number of HTTP routes increases. Without --fast:

| Number of routes | Time (seconds) |
|---|---|
| 1 | 4 |
| 8 | 4.8 |
| 16 | 7 |
| 32 | 16.3 |
| 64 | 74 |

The only mitigation I've found so far is testing as much as possible in GHCi or with --fast. This may be related to #870; I have not yet tested with anything except stackage-11.14, which has GHC-8.2.2 and servant-0.13.0.1.

@phadej added the bug label Jun 21, 2018
@alpmestan (Contributor) commented Jun 21, 2018

I think we've seen a couple of similar issues. A rather effective workaround is to split up the API and its servers into several modules (which most people tend to do naturally anyway once web apps get that large); things suddenly become a lot less overwhelming for GHC.
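To make the workaround concrete, here is a minimal sketch (module and route names are invented for the example; in a real project each commented section would live in its own module, so GHC compiles and optimises them separately):

```haskell
{-# LANGUAGE DataKinds, TypeOperators #-}
module Api where

import Data.Proxy (Proxy (..))
import Servant

-- Would live in e.g. Api.Users
type UsersAPI = "users" :> Get '[JSON] [String]

usersServer :: Server UsersAPI
usersServer = return ["alice", "bob"]

-- Would live in e.g. Api.Items
type ItemsAPI = "items" :> Get '[JSON] [String]

itemsServer :: Server ItemsAPI
itemsServer = return ["widget"]

-- Only the top-level module mentions the combined API.
type API = UsersAPI :<|> ItemsAPI

app = serve (Proxy :: Proxy API) (usersServer :<|> itemsServer)
```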

But yes, any information would be useful. If you can try other servant & GHC versions, that would be lovely.

This is, however, definitely annoying, and I would personally be delighted to see any improvement there, whether it comes from servant's code putting less pressure on GHC or from GHC dealing with this type of code more efficiently. For anyone interested in investigating this, the first step in figuring out what's going on and where all that time is spent would be to run a profiling-enabled GHC over your module.

@bergey (Author) commented Jun 21, 2018

I tested with a few other versions:

- LTS-9.21: servant-0.11, GHC-8.0.1
- LTS-6.35: servant-0.7.1, GHC-7.10.3

I'd like to try other servant / GHC combinations, but these were the fastest to test.

| Number of routes | LTS-6.35 | LTS-9.21 | LTS-11.14 |
|---|---|---|---|
| 1 | 4.1 | 5.1 | 4.04 |
| 8 | 4.4 | 5.1 | 4.86 |
| 16 | 4.6 | 5.6 | 7.02 |
| 32 | 6.7 | 6.7 | 16.34 |
| 64 | 13.5 | 11.3 | 74 |
| 128 | 60 | 44.4 | |

So it's been quadratic for some time, but it seems as though it got substantially worse either in GHC 8.2 or since servant-0.13. I'll take a look at which next time I work on this, and also try the profiling-enabled GHC.

My production, non-benchmark code base does indeed split the routes across modules. I tried to measure some difference from that earlier today, but perhaps I didn't have enough routes to see an effect. It's good to hear that splitting is helpful.

@phadej (Contributor) commented Jun 21, 2018

cabal new-build servant-bench8  6,25s user 0,80s system 100% cpu 6,999 total
cabal new-build servant-bench16  7,23s user 0,88s system 100% cpu 8,089 total
cabal new-build servant-bench32  12,14s user 0,77s system 100% cpu 12,870 total:1
cabal new-build servant-bench64  43,82s user 1,22s system 99% cpu 45,050 total

Building with -ddump-to-file -ddump-simpl-stats

16

1770 UnfoldingDone
  287 Servant.Server.Internal.$fHasServerTYPE:<|>context_$choistServerWithContext
  263 $dHasServer_ahv6

32

5282 UnfoldingDone
  1335 Servant.Server.Internal.$fHasServerTYPE:<|>context_$choistServerWithContext
  783 $dHasServer_ahxq

64

17682 UnfoldingDone
  5735 Servant.Server.Internal.$fHasServerTYPE:<|>context_$choistServerWithContext
  2591 $dHasServer_ahC4

hoist?

As we inline HasServer, the hoistServerWithContext implementations are generated as well, even though we don't use them.
GHC is silly here.

This reminds me of https://ghc.haskell.org/trac/ghc/ticket/13253,
particularly something like: https://ghc.haskell.org/trac/ghc/ticket/13253#comment:22

There's also a patch in https://ghc.haskell.org/trac/ghc/ticket/13253#comment:24;
could @alpmestan try to apply it on top of 8.4.3 (which also shows the same behaviour)
and see if it helps here?

@rdnetto commented Jan 15, 2019

This is still a problem in GHC 8.6.3.

This example takes me 60 sec to build with only 20 routes. Running client against the routes individually reduces that to 37 sec (since we still pass the entire API to serve).
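A rough sketch of what "running client against the routes individually" can look like (the endpoint names and result types here are invented; client and ClientM are servant-client's real API):

```haskell
{-# LANGUAGE DataKinds, TypeOperators #-}
module PerRouteClients where

import Data.Proxy (Proxy (..))
import Servant.API
import Servant.Client (ClientM, client)

-- Instead of deriving every function at once from the whole API, e.g.
--   getUsers :<|> getItems = client (Proxy :: Proxy WholeAPI)
-- derive each client function from its own single-endpoint type:
type GetUsers = "users" :> Get '[JSON] [String]
type GetItems = "items" :> Get '[JSON] [String]

getUsers :: ClientM [String]
getUsers = client (Proxy :: Proxy GetUsers)

getItems :: ClientM [String]
getItems = client (Proxy :: Proxy GetItems)
```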

rdnetto added a commit to rdnetto/swagger-codegen-generators that referenced this issue Jan 15, 2019
In testing, this reduced compile time from 1 min to 37 sec for the sample project.

See haskell-servant/servant#986 for details.
roberth added a commit to cachix/cachix that referenced this issue Jun 10, 2019
@alpmestan (Contributor) commented

I looked a bit more into this. I generated a few modules that look like:

{-# LANGUAGE DataKinds, TypeOperators, TypeFamilies #-}
module API_2 where
import Data.Proxy
import Servant
type API = "1" :> Get '[JSON] String :<|> "2" :> Get '[JSON] String
api = Proxy :: Proxy API
server :: Server API
server = return "1" :<|> return "2"
app = serve api server

where I'm varying the number of endpoints from one module to another. I have APIs with 1, 2, 3, 4, 5, 10, 20, 30, 40, and 50 endpoints. And I run this little script:

#! /usr/bin/env sh

rm -f API_*

for i in 1 2 3 4 5 10 20 30 40 50; do
    runghc gen_api.hs $i
    time ghc -c --make API_$i.hs -o API_$i.o \
	-freduction-depth=0 -O -ddump-timings -v2 \
	-ddump-simpl -ddump-simpl-stats -ddump-to-file \
	-dsuppress-all \
	+RTS -s -RTS |& tee API_$i.log.txt
done

| # of endpoints | Compile time (s) |
|---|---|
| 1 | 2.101 |
| 2 | 2.321 |
| 3 | 3.301 |
| 4 | 3.591 |
| 5 | 4.371 |
| 10 | 7.961 |
| 20 | 28.101 |
| 30 | 58.111 |
| 40 | 98.181 |
| 50 | 139.021 |

I can see some interesting things in the log. For example, if we look at the Core optimisation stats for the 50-endpoint API...

We see (see the bottom of this comment) that parsing/renaming/typechecking/desugaring don't take very long, and that after desugaring, the code is not too big yet:

Result size of Desugar (before optimization)
  = {terms: 942, types: 39,926, coercions: 18,615, joins: 0/161}
Result size of Desugar (after optimization)
  = {terms: 828, types: 39,801, coercions: 18,815, joins: 0/154}

The simplifier runs on that, and the result size doesn't change all that much. But then, boom, we specialise:

Result size of Simplifier
  = {terms: 1,060, types: 29,382, coercions: 38,442, joins: 0/1}
!!! Simplifier [API_50]: finished in 498.86 milliseconds, allocated 46.616 megabytes
*** Specialise [API_50]:
Result size of Specialise
  = {terms: 15,554, types: 810,695, coercions: 85,830, joins: 0/3}
!!! Specialise [API_50]: finished in 333.94 milliseconds, allocated 43.001 megabytes

We float out:

*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [API_50]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 29,526, types: 1,215,461, coercions: 85,830, joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [API_50]: finished in 1163.50 milliseconds, allocated 133.012 megabytes

and try to simplify:

*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 29,138, types: 1,324,321, coercions: 111,537, joins: 0/0}
Result size of Simplifier iteration=2
  = {terms: 144,992,
     types: 5,290,053,
     coercions: 1,377,705,
     joins: 0/0}
Result size of Simplifier iteration=3
  = {terms: 144,792,
     types: 5,286,303,
     coercions: 1,370,505,
     joins: 0/0}
Result size of Simplifier
  = {terms: 144,792,
     types: 5,286,303,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 31122.77 milliseconds, allocated 4832.778 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
Result size of Simplifier
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 14560.75 milliseconds, allocated 2147.932 megabytes
*** Simplifier [API_50]:
Result size of Simplifier
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 6388.17 milliseconds, allocated 1072.456 megabytes

and from that point onwards, we're working with thousands and thousands of terms, millions of types and coercions. Many passes take whole seconds and gigabytes of allocations, until...

*** Demand analysis [API_50]:
Result size of Demand analysis
  = {terms: 21,880,
     types: 2,231,518,
     coercions: 1,294,470,
     joins: 0/0}
!!! Demand analysis [API_50]: finished in 5560.72 milliseconds, allocated 2786.071 megabytes
*** CoreTidy [API_50]:
Result size of Tidy Core
  = {terms: 3,507, types: 134,928, coercions: 39,021, joins: 0/0}
!!! CoreTidy [API_50]: finished in 1105.27 milliseconds, allocated 349.239 megabytes
Created temporary directory: /run/user/1001/ghc11842_0
*** CorePrep [API_50]:
Result size of CorePrep
  = {terms: 3,732, types: 134,408, coercions: 39,021, joins: 0/102}
!!! CorePrep [API_50]: finished in 39.19 milliseconds, allocated 6.130 megabytes
*** Stg2Stg:
*** CodeGen [API_50]:
!!! CodeGen [API_50]: finished in 760.67 milliseconds, allocated 146.051 megabytes

Core Tidy really deserves its name. Here are the timings & allocs for all the passes:

| Pass | Time (ms) | Alloc (bytes) |
|---|---|---|
| Parser | 7.068 | 2554992 |
| Renamer/typechecker | 731.063 | 109437544 |
| Desugar | 91.337 | 12701224 |
| Simplifier | 498.861 | 48880936 |
| Specialise | 333.945 | 45090256 |
| Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) | 1163.496 | 139472808 |
| Simplifier | 31122.767 | 5067534720 |
| Simplifier | 14560.745 | 2252270040 |
| Simplifier | 6388.172 | 1124551928 |
| Float inwards | 660.780 | 145896080 |
| Called arity analysis | 617.060 | 97280936 |
| Simplifier | 7205.292 | 1124552056 |
| Demand analysis | 7360.820 | 3396416952 |
| Worker Wrapper binds | 623.287 | 46714320 |
| Simplifier | 24290.701 | 3739472320 |
| Exitification transformation | 612.117 | 25169256 |
| Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) | 10644.614 | 1489902464 |
| Common sub-expression | 3329.560 | 682305136 |
| Float inwards | 372.326 | 26302928 |
| Simplifier | 10672.911 | 1871560408 |
| Demand analysis | 5560.725 | 2921407264 |
| CoreTidy | 1105.268 | 366203432 |
| CorePrep | 39.191 | 6427320 |
| CodeGen | 760.668 | 153145192 |

I'm not going to look at the Core for API_50...

$ wc -l API_*.dump-simpl
    944 API_10.dump-simpl
    219 API_1.dump-simpl
   2134 API_20.dump-simpl
    281 API_2.dump-simpl
   3724 API_30.dump-simpl
    349 API_3.dump-simpl
   5714 API_40.dump-simpl
    422 API_4.dump-simpl
   8104 API_50.dump-simpl
    499 API_5.dump-simpl

but let's see what we get for API_2.

==================== Tidy Core ====================

Result size of Tidy Core
  = {terms: 243, types: 840, coercions: 573, joins: 0/0}

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute2
$s$fHasServerTYPE:<|>context_$croute2 = "2"#

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute1
$s$fHasServerTYPE:<|>context_$croute1
  = unpackCString# $s$fHasServerTYPE:<|>context_$croute2

-- RHS size: {terms: 2, types: 3, coercions: 0, joins: 0/0}
lvl_r9QC
lvl_r9QC = Right $s$fHasServerTYPE:<|>context_$croute1

-- RHS size: {terms: 4, types: 12, coercions: 0, joins: 0/0}
app6
app6 = \ s_a6U0 -> (# s_a6U0, lvl_r9QC #)

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute5
$s$fHasServerTYPE:<|>context_$croute5 = "1"#

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute4
$s$fHasServerTYPE:<|>context_$croute4
  = unpackCString# $s$fHasServerTYPE:<|>context_$croute5

-- RHS size: {terms: 2, types: 3, coercions: 0, joins: 0/0}
lvl1_r9QD
lvl1_r9QD = Right $s$fHasServerTYPE:<|>context_$croute4

-- RHS size: {terms: 4, types: 12, coercions: 0, joins: 0/0}
app7
app7 = \ s_a6U0 -> (# s_a6U0, lvl1_r9QD #)

-- RHS size: {terms: 3, types: 6, coercions: 36, joins: 0/0}
app5
app5 = :<|> (app7 `cast` <Co:18>) (app6 `cast` <Co:18>)

-- RHS size: {terms: 1, types: 0, coercions: 98, joins: 0/0}
server
server = app5 `cast` <Co:98>

-- RHS size: {terms: 2, types: 5, coercions: 98, joins: 0/0}
app_result
app_result = Route (app5 `cast` <Co:98>)

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$trModule2
$trModule2 = "API_2"#

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$trModule1
$trModule1 = TrNameS $trModule2

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$trModule4
$trModule4 = "main"#

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$trModule3
$trModule3 = TrNameS $trModule4

-- RHS size: {terms: 3, types: 0, coercions: 0, joins: 0/0}
$trModule
$trModule = Module $trModule3 $trModule1

-- RHS size: {terms: 1, types: 3, coercions: 0, joins: 0/0}
api
api = Proxy

-- RHS size: {terms: 3, types: 1, coercions: 0, joins: 0/0}
lvl2_r9QE
lvl2_r9QE = : $fAcceptTYPEJSON4 $fAcceptTYPEJSON2

-- RHS size: {terms: 2, types: 11, coercions: 0, joins: 0/0}
lvl3_r9QF
lvl3_r9QF = \ _ -> lvl2_r9QE

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute3
$s$fHasServerTYPE:<|>context_$croute3 = 200

-- RHS size: {terms: 12, types: 16, coercions: 0, joins: 0/0}
lvl4_r9QG
lvl4_r9QG
  = \ s1_a8on ->
      case newByteArray# 10# s1_a8on of { (# ipv_a8or, ipv1_a8os #) ->
      $wouter
        ipv1_a8os 4# $s$fHasServerTYPE:<|>context_$croute1 0# ipv_a8or
      }

-- RHS size: {terms: 12, types: 16, coercions: 0, joins: 0/0}
lvl5_r9QH
lvl5_r9QH
  = \ s1_a8on ->
      case newByteArray# 10# s1_a8on of { (# ipv_a8or, ipv1_a8os #) ->
      $wouter
        ipv1_a8os 4# $s$fHasServerTYPE:<|>context_$croute4 0# ipv_a8or
      }

-- RHS size: {terms: 4, types: 2, coercions: 0, joins: 0/0}
lvl7_r9QJ
lvl7_r9QJ = ++_$s++ [] $fAcceptTYPEJSON4 $fAcceptTYPEJSON2

-- RHS size: {terms: 2, types: 11, coercions: 0, joins: 0/0}
$s$fAllCTRender:a_$callMime
$s$fAllCTRender:a_$callMime = \ _ -> lvl7_r9QJ

-- RHS size: {terms: 4, types: 2, coercions: 15, joins: 0/0}
$sencode
$sencode
  = \ eta_a9of ->
      toLazyByteString ((string1 eta_a9of) `cast` <Co:15>)

-- RHS size: {terms: 2, types: 4, coercions: 0, joins: 0/0}
lvl8_r9QK
lvl8_r9QK = \ _ -> $sencode

-- RHS size: {terms: 3, types: 5, coercions: 0, joins: 0/0}
$s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa
$s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa
  = C:MimeRender $fAcceptTYPEJSON lvl8_r9QK

-- RHS size: {terms: 5, types: 16, coercions: 0, joins: 0/0}
$s$fAllMimeRender:a0_$callMimeRender
$s$fAllMimeRender:a0_$callMimeRender
  = \ _ w2_a9nu ->
      $w$callMimeRender1
        $s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa w2_a9nu

-- RHS size: {terms: 3, types: 9, coercions: 9, joins: 0/0}
$s$fAllCTRender:a_$s$fAllMimeRender:a0
$s$fAllCTRender:a_$s$fAllMimeRender:a0
  = C:AllMimeRender
      (lvl3_r9QF `cast` <Co:9>) $s$fAllMimeRender:a0_$callMimeRender

-- RHS size: {terms: 21, types: 30, coercions: 0, joins: 0/0}
$wlvl_r9QL
$wlvl_r9QL
  = \ ww_s9J5 ww1_s9J6 ww2_s9J7 ww3_s9J8 w_s9J2 ->
      case $w$sparseQuality2 ww_s9J5 ww1_s9J6 ww2_s9J7 ww3_s9J8 of {
        Nothing -> Nothing;
        Just x1_a9nm ->
          mapQuality_$smapQuality2
            (map
               $fAllCTRender:a1
               ($w$callMimeRender1
                  $s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa w_s9J2))
            x1_a9nm
      }

-- RHS size: {terms: 12, types: 19, coercions: 1, joins: 0/0}
lvl9_r9QM
lvl9_r9QM
  = \ _ w1_s9J1 w2_s9J2 ->
      case w1_s9J1 `cast` <Co:1> of
      { PS ww1_s9J5 ww2_s9J6 ww3_s9J7 ww4_s9J8 ->
      $wlvl_r9QL ww1_s9J5 ww2_s9J6 ww3_s9J7 ww4_s9J8 w2_s9J2
      }

-- RHS size: {terms: 3, types: 9, coercions: 9, joins: 0/0}
$s$fAllCTRender:a
$s$fAllCTRender:a
  = C:AllCTRender
      ($s$fAllCTRender:a_$callMime `cast` <Co:9>) lvl9_r9QM

-- RHS size: {terms: 6, types: 33, coercions: 11, joins: 0/0}
lvl11_r9QO
lvl11_r9QO
  = \ @ env_a8nX ->
      $w$croute13
        $s$fAllCTRender:a
        ($fReflectMethodStdMethodGET_$creflectMethod `cast` <Co:4>)
        ($s$fHasServerTYPE:<|>context_$croute3 `cast` <Co:7>)
        Proxy

-- RHS size: {terms: 23, types: 113, coercions: 18, joins: 0/0}
$s$fHasServerTYPE:>context4_$croute
$s$fHasServerTYPE:>context4_$croute
  = \ @ env_a8nX w2_a8nY _ w4_a8o0 ->
      case w2_a8nY of { Proxy ->
      StaticRouter
        (case runRW# lvl4_r9QG of { (# ipv_a8ow, ipv1_a8ox #) ->
         case ipv1_a8ox of dt_a8oB { Text ipv2_a8Hq ipv3_a8Hr ipv4_a8Hs ->
         Bin 1# dt_a8oB (lvl11_r9QO (w4_a8o0 `cast` <Co:18>)) Tip Tip
         }
         })
        []
      }

-- RHS size: {terms: 23, types: 113, coercions: 18, joins: 0/0}
$s$fHasServerTYPE:>context4_$croute1
$s$fHasServerTYPE:>context4_$croute1
  = \ @ env_a8nX w2_a8nY _ w4_a8o0 ->
      case w2_a8nY of { Proxy ->
      StaticRouter
        (case runRW# lvl5_r9QH of { (# ipv_a8ow, ipv1_a8ox #) ->
         case ipv1_a8ox of dt_a8oB { Text ipv2_a8Hq ipv3_a8Hr ipv4_a8Hs ->
         Bin 1# dt_a8oB (lvl11_r9QO (w4_a8o0 `cast` <Co:18>)) Tip Tip
         }
         })
        []
      }

-- RHS size: {terms: 7, types: 6, coercions: 0, joins: 0/0}
app4
app4 = \ _ _ _ _ _ _ -> app_result

-- RHS size: {terms: 10, types: 13, coercions: 260, joins: 0/0}
app3
app3
  = Delayed
      (emptyDelayed4 `cast` <Co:34>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed1 `cast` <Co:34>)
      app4

-- RHS size: {terms: 6, types: 31, coercions: 0, joins: 0/0}
app2
app2
  = $w$croute
      $s$fHasServerTYPE:>context4_$croute1
      $s$fHasServerTYPE:>context4_$croute
      Proxy
      $WEmptyContext
      app3

-- RHS size: {terms: 3, types: 1, coercions: 0, joins: 0/0}
app1
app1 = runRouterEnv app2 ()

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
app
app = toApplication app1


------ Local rules for imported ids --------
"SPEC/API_2 $fHasServerTYPE:>context4_$croute @ "2" @ (Verb
                                                        'GET 200 '[JSON] String) @ '[]" [2]
    forall w1_s8Hz w_s8Hy.
      $fHasServerTYPE:>context4_$croute w_s8Hy w1_s8Hz
      = $s$fHasServerTYPE:>context4_$croute
"SPEC/API_2 $fHasServerTYPE:>context4_$croute @ "1" @ (Verb
                                                        'GET 200 '[JSON] String) @ '[]" [2]
    forall w1_s8Hw w_s8Hv.
      $fHasServerTYPE:>context4_$croute w_s8Hv w1_s8Hw
      = $s$fHasServerTYPE:>context4_$croute1
"SPEC/API_2 $fAllCTRender:a_$callMime @ JSON @ '[]" [2]
    forall w1_s9j3 w_s9j2.
      $fAllCTRender:a_$callMime w_s9j2 w1_s9j3
      = $s$fAllCTRender:a_$callMime
"SPEC/API_2 $fAllCTRender:a @ JSON @ '[] @ [Char]"
    forall v2_s90i v1_s90h v_s90g.
      $fAllCTRender:a v_s90g v1_s90h v2_s90i
      = $s$fAllCTRender:a
"SPEC/API_2 $fAllMimeRender:a0_$callMimeRender @ JSON @ [Char]" [2]
    forall w_s9nT.
      $fAllMimeRender:a0_$callMimeRender w_s9nT
      = $s$fAllMimeRender:a0_$callMimeRender
"SPEC/API_2 $fAllMimeRender:a0 @ JSON @ [Char]"
    forall v_s9np.
      $fAllMimeRender:a0 v_s9np
      = $s$fAllCTRender:a_$s$fAllMimeRender:a0
"SPEC/API_2 encode @ [Char]"
    forall $dToJSON_s9om. encode $dToJSON_s9om = $sencode
"SPEC/API_2 $fMimeRenderTYPEJSONa @ [Char]"
    forall v_s9oc.
      $fMimeRenderTYPEJSONa v_s9oc
      = $s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa

I'm not sure GHC does anything particularly wrong here, but if anyone sees something, let me know. Perhaps the servant libraries could give better hints to make sure users don't end up with gigantic Core. It'd be interesting to see which kinds of Core terms the blowup happens on.


Whole core stats:

*** Parser [API_50]:
!!! Parser [API_50]: finished in 7.07 milliseconds, allocated 2.437 megabytes
*** Renamer/typechecker [API_50]:
!!! Renamer/typechecker [API_50]: finished in 731.06 milliseconds, allocated 104.368 megabytes
*** Desugar [API_50]:
Result size of Desugar (before optimization)
  = {terms: 942, types: 39,926, coercions: 18,615, joins: 0/161}
Result size of Desugar (after optimization)
  = {terms: 828, types: 39,801, coercions: 18,815, joins: 0/154}
!!! Desugar [API_50]: finished in 91.34 milliseconds, allocated 12.113 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 1,352, types: 45,874, coercions: 38,442, joins: 0/1}
Result size of Simplifier iteration=2
  = {terms: 1,060, types: 29,382, coercions: 38,442, joins: 0/1}
Result size of Simplifier
  = {terms: 1,060, types: 29,382, coercions: 38,442, joins: 0/1}
!!! Simplifier [API_50]: finished in 498.86 milliseconds, allocated 46.616 megabytes
*** Specialise [API_50]:
Result size of Specialise
  = {terms: 15,554, types: 810,695, coercions: 85,830, joins: 0/3}
!!! Specialise [API_50]: finished in 333.94 milliseconds, allocated 43.001 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [API_50]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 29,526, types: 1,215,461, coercions: 85,830, joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [API_50]: finished in 1163.50 milliseconds, allocated 133.012 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 29,138, types: 1,324,321, coercions: 111,537, joins: 0/0}
Result size of Simplifier iteration=2
  = {terms: 144,992,
     types: 5,290,053,
     coercions: 1,377,705,
     joins: 0/0}
Result size of Simplifier iteration=3
  = {terms: 144,792,
     types: 5,286,303,
     coercions: 1,370,505,
     joins: 0/0}
Result size of Simplifier
  = {terms: 144,792,
     types: 5,286,303,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 31122.77 milliseconds, allocated 4832.778 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
Result size of Simplifier
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 14560.75 milliseconds, allocated 2147.932 megabytes
*** Simplifier [API_50]:
Result size of Simplifier
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 6388.17 milliseconds, allocated 1072.456 megabytes
*** Float inwards [API_50]:
Result size of Float inwards
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Float inwards [API_50]: finished in 660.78 milliseconds, allocated 139.137 megabytes
*** Called arity analysis [API_50]:
Result size of Called arity analysis
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Called arity analysis [API_50]: finished in 617.06 milliseconds, allocated 92.774 megabytes
*** Simplifier [API_50]:
Result size of Simplifier
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 7205.29 milliseconds, allocated 1072.456 megabytes
*** Demand analysis [API_50]:
Result size of Demand analysis
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Demand analysis [API_50]: finished in 7360.82 milliseconds, allocated 3239.076 megabytes
*** Worker Wrapper binds [API_50]:
Result size of Worker Wrapper binds
  = {terms: 149,098,
     types: 5,686,426,
     coercions: 1,434,172,
     joins: 0/671}
!!! Worker Wrapper binds [API_50]: finished in 623.29 milliseconds, allocated 44.550 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 146,708,
     types: 5,488,159,
     coercions: 1,370,905,
     joins: 0/144}
Result size of Simplifier iteration=2
  = {terms: 146,397,
     types: 5,471,833,
     coercions: 1,370,505,
     joins: 0/0}
Result size of Simplifier
  = {terms: 146,397,
     types: 5,471,833,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 24290.70 milliseconds, allocated 3566.239 megabytes
*** Exitification transformation [API_50]:
Result size of Exitification transformation
  = {terms: 146,397,
     types: 5,471,833,
     coercions: 1,370,505,
     joins: 0/0}
!!! Exitification transformation [API_50]: finished in 612.12 milliseconds, allocated 24.003 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [API_50]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = True})
  = {terms: 169,649,
     types: 6,797,454,
     coercions: 1,370,505,
     joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [API_50]: finished in 10644.61 milliseconds, allocated 1420.882 megabytes
*** Common sub-expression [API_50]:
Result size of Common sub-expression
  = {terms: 43,422,
     types: 3,637,259,
     coercions: 1,276,012,
     joins: 0/0}
!!! Common sub-expression [API_50]: finished in 3329.56 milliseconds, allocated 650.697 megabytes
*** Float inwards [API_50]:
Result size of Float inwards
  = {terms: 43,422,
     types: 3,637,259,
     coercions: 1,276,012,
     joins: 0/0}
!!! Float inwards [API_50]: finished in 372.33 milliseconds, allocated 25.084 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 23,508,
     types: 2,336,132,
     coercions: 1,295,370,
     joins: 0/50}
Result size of Simplifier iteration=2
  = {terms: 21,880,
     types: 2,231,518,
     coercions: 1,294,470,
     joins: 0/0}
Result size of Simplifier
  = {terms: 21,880,
     types: 2,231,518,
     coercions: 1,294,470,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 10672.91 milliseconds, allocated 1784.859 megabytes
*** Demand analysis [API_50]:
Result size of Demand analysis
  = {terms: 21,880,
     types: 2,231,518,
     coercions: 1,294,470,
     joins: 0/0}
!!! Demand analysis [API_50]: finished in 5560.72 milliseconds, allocated 2786.071 megabytes
*** CoreTidy [API_50]:
Result size of Tidy Core
  = {terms: 3,507, types: 134,928, coercions: 39,021, joins: 0/0}
!!! CoreTidy [API_50]: finished in 1105.27 milliseconds, allocated 349.239 megabytes
Created temporary directory: /run/user/1001/ghc11842_0
*** CorePrep [API_50]:
Result size of CorePrep
  = {terms: 3,732, types: 134,408, coercions: 39,021, joins: 0/102}
!!! CorePrep [API_50]: finished in 39.19 milliseconds, allocated 6.130 megabytes
*** Stg2Stg:
*** CodeGen [API_50]:
!!! CodeGen [API_50]: finished in 760.67 milliseconds, allocated 146.051 megabytes

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.6.5
$ ghc-pkg list servant*
/nix/store/j86yn21gz4zly8w31hjd9vd5gw9yd2h1-ghc-8.6.5-with-packages/lib/ghc-8.6.5/package.conf.d
    servant-0.15
    servant-server-0.15

@domenkozar (Contributor) commented

It would be interesting to take another look at this one once the GHC profiling story has improved.

@codygman commented Jul 23, 2021

This issue also happens with -O1. Using -fno-specialise made my 50-route example compile in under 1 s instead of 10 s.

I believe the core issue could be related to the seven-year-old "TypeFamilies painfully slow" issue.

If I'm reading Trac correctly, a fix related to exponential inlining should be in GHC 9.0.1. I haven't tested this, since it appears servant isn't building with 9.0.1 yet and I don't have time to build servant from source right now.

It might be interesting to see if -fomit-interface-pragmas, as suggested in this comment, helps anything. (No meaningful change for me.)
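If you want to try the -fno-specialise workaround on just the offending module rather than the whole project, an OPTIONS_GHC pragma works; a tiny sketch with a made-up module:

```haskell
{-# OPTIONS_GHC -fno-specialise #-}   -- applies only to this module
{-# LANGUAGE DataKinds, TypeOperators #-}
module Api where

import Data.Proxy (Proxy (..))
import Servant

type API = "ping" :> Get '[JSON] String

app = serve (Proxy :: Proxy API) (return "pong")
```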

@alpmestan (Contributor) commented

I don't think the issue is just type families; they're used quite a bit in servant, but not everywhere. Quite likely the problem is similar to what we see with https://gitlab.haskell.org/ghc/ghc/-/issues/5642 and many other tickets, where we have a bunch of values that have different types, and where each sub-expression that combines those values ends up having its own type too. Simply generating the Core to describe all those things takes quite a bit of time and space, and all of it has to be processed by every single step of the optimisation pipeline. I suspect -ddump-timings or -dshow-passes would help illustrate this.

@codygman commented Jul 23, 2021

> I don't think the issue is just type families; they're used quite a bit in servant, but not everywhere. Quite likely the problem is similar to what we see with https://gitlab.haskell.org/ghc/ghc/-/issues/5642 and many other tickets, where we have a bunch of values that have different types, and where each sub-expression that combines those values ends up having its own type too. Simply generating the Core to describe all those things takes quite a bit of time and space, and all of it has to be processed by every single step of the optimisation pipeline. I suspect -ddump-timings or -dshow-passes would help illustrate this.

I took your example from above and expanded it to 37 routes (random copy pasting gave an odd number):

{-# LANGUAGE DataKinds, TypeOperators, TypeFamilies #-}
module API_2 where
import Data.Proxy
import Servant
type API = "1" :> Get '[JSON] String :<|> "2" :> Get '[JSON] String
api = Proxy :: Proxy API
server :: Server API
server = return "1" :<|> return "2"
app = serve api server

Compiling this with -O1 took 10 s, using:

ghc -O1 -fforce-recomp 37.hs +RTS -s

Turning off pre-inlining took it from 10 s to 1 s, with the command:

ghc -O1 -fforce-recomp -fno-pre-inlining 37.hs +RTS -s -RTS

Before I knew to try -fno-pre-inlining, I ran -dshow-passes like this:

ghc -O1 -fforce-recomp -ddump-timings -dshow-passes 37.hs +RTS -s -RTS

Resulting in:

*** Simplifier [API_2]:
  Result size of Simplifier iteration=1
    = {terms: 13,936, types: 974,580, coercions: 58,364, joins: 0/41}
  Result size of Simplifier iteration=2
    = {terms: 50,614,
       types: 1,907,589,
       coercions: 571,691,
       joins: 0/699}
  Result size of Simplifier iteration=3
    = {terms: 50,654,
       types: 1,909,075,
       coercions: 572,239,
       joins: 0/76}
  Result size of Simplifier iteration=4
    = {terms: 50,578,
       types: 1,908,885,
       coercions: 572,239,
       joins: 0/38}
  Result size of Simplifier
    = {terms: 50,578,
       types: 1,908,885,
       coercions: 572,239,
       joins: 0/38}
  !!! Simplifier [API_2]: finished in 2286.12 milliseconds, allocated 2415.010 megabytes
  Simplifier [API_2]: alloc=2532321584 time=2286.117

I think that -fno-pre-inlining would be disabling this initial simplifier pass though:

 *** Simplifier [API_2]:
  Result size of Simplifier iteration=1
    = {terms: 712, types: 24,452, coercions: 22,028, joins: 0/2}
  Result size of Simplifier iteration=2
    = {terms: 640, types: 16,091, coercions: 22,028, joins: 0/1}
  Result size of Simplifier
    = {terms: 640, types: 16,091, coercions: 22,028, joins: 0/1}
  !!! Simplifier [API_2]: finished in 77.03 milliseconds, allocated 39.149 megabytes
  Simplifier [API_2]: alloc=41050960 time=77.035

Where things grow, but definitely don't blow up. Types blow up in the specialization pass right after:

 *** Specialise [API_2]:
  Result size of Specialise
    = {terms: 3,928, types: 374,371, coercions: 39,797, joins: 0/3}
  !!! Specialise [API_2]: finished in 73.37 milliseconds, allocated 108.150 megabytes
  Specialise [API_2]: alloc=113403720 time=73.374

Maybe if I do -dshow-passes with -fno-pre-inlining it will demonstrate this...

Edit: With -fno-pre-inlining there is no blow up in specialization for some reason.

*** Simplifier [API_2]:
Result size of Simplifier iteration=1
  = {terms: 716, types: 24,946, coercions: 22,028, joins: 0/2}
Result size of Simplifier iteration=2
  = {terms: 714, types: 24,934, coercions: 22,028, joins: 0/1}
Result size of Simplifier
  = {terms: 714, types: 24,934, coercions: 22,028, joins: 0/1}
!!! Simplifier [API_2]: finished in 77.12 milliseconds, allocated 41.014 megabytes
Simplifier [API_2]: alloc=43006656 time=77.116
*** Specialise [API_2]:
Result size of Specialise
  = {terms: 960, types: 32,776, coercions: 22,457, joins: 0/3}
!!! Specialise [API_2]: finished in 16.61 milliseconds, allocated 15.842 megabytes

Note that types were at 374,371 without -fno-pre-inlining and here they are 32,776.

I'm guessing that if someone looked at the Core after pre-inlining and after specialisation, they might be able to say which optimisation causes the types blow-up?

@codygman commented Jul 23, 2021

GHC 9 takes that example from 10 s to 16 s, so things are going in the wrong direction...

Something interesting: with 9.0 the blow-up seems to happen in the third "Simplifier" chunk instead of in the inliner. Hopefully GHC 9.2 helps, but servant definitely doesn't build with it right now, and I'm not sure how to make this problem more minimal yet.

@codygman commented

I think you are right, @alpmestan; I tried replicating the compile slowdown with a mini servant and could not.

I noticed one thing the minimal version is missing: content types. Is there a way to isolate and test the compile time of just the content-type pieces?

I'm currently trying to figure out whether that's possible and to understand how the content-type machinery works with the serve function, but that will take me some time.

I also wonder if GHC is just slow with type-level lists and the operations that seem to be behind a lot of the content-type machinery.

@alpmestan (Contributor) commented

> I noticed one thing the minimal version is missing: content types. Is there a way to isolate and test the compile time of just the content-type pieces?

Probably it's not about the content types alone, but about adding them to everything else. However, if you want to add that to your testing, you can grab the Accept/MimeRender/MimeUnrender classes and define a few dummy content types to instantiate those classes for. That is really all there is to content types and the serve function: the HasServer instances for verbs and request bodies end up requiring suitable MimeRender/MimeUnrender instances for all the content types in the list, as you can see e.g. here: https://github.com/haskell-servant/servant/blob/master/servant-server/src/Servant/Server/Internal.hs#L294 (the AllCTRender ctypes a constraint).
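For example, a throwaway content type for such an experiment could look like this (a sketch: the PlainDummy type and the Show-based rendering are made up, while Accept/MimeRender and Network.HTTP.Media are the actual APIs involved):

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, OverloadedStrings #-}
module DummyContentType where

import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Network.HTTP.Media as M
import Servant.API.ContentTypes (Accept (..), MimeRender (..))

-- A dummy content type, just to exercise the content-type machinery.
data PlainDummy

instance Accept PlainDummy where
  contentType _ = "text" M.// "plain"

-- Render anything showable; enough for compile-time experiments.
instance Show a => MimeRender PlainDummy a where
  mimeRender _ = BL.pack . show
```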

@Vlix commented Mar 13, 2022

I just ran the same benchmark as @bergey in the first post on my machine (a laptop from around 2014-2015), and then changed all benchmark implementations to use NamedRoutes.

I built everything with lts-18.27, because I was getting panics from GHC 8.2.2.
And I ran bench --before 'stack clean' -L 120 'stack build --ghc-options="-O2"'

It seems to fix the quadratic compilation times:

| Number of routes | Time (sec), old | Time (sec), NamedRoutes |
|---|---|---|
| 1 | 7.5 | 8.1 |
| 8 | 9.0 | 9.2 |
| 16 | 13.2 | 10.6 |
| 32 | 30.7 | 13.4 |
| 64 | 141.2 | 19.3 |

@Vlix commented Mar 13, 2022

Here are examples of how the modules look, plus the complete output of the bench:

Example of the anonymous routes
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Semigroup
import Data.Text (Text)
import           Servant hiding (NotSecure)
import           Servant.Server
import           Network.Wai.Handler.Warp   (defaultSettings, runSettings,
                                             setLogger, setPort)

import qualified Data.Text as T

type MainAPI =
    "static" :> "1" :> Get '[JSON] Text
    :<|> "static" :> "2" :> Get '[JSON] Text
    :<|> "static" :> "3" :> Get '[JSON] Text
    :<|> "static" :> "4" :> Get '[JSON] Text
    :<|> "static" :> "5" :> Get '[JSON] Text
    :<|> "static" :> "6" :> Get '[JSON] Text
    :<|> "static" :> "7" :> Get '[JSON] Text
    :<|> "static" :> "8" :> Get '[JSON] Text

mainServer :: Server MainAPI
mainServer = foo :<|> foo :<|> foo :<|> foo
    :<|> foo :<|> foo :<|> foo :<|> foo

foo = return "foo"

main :: IO ()
main = runSettings defaultSettings $
    serveWithContext (Proxy :: Proxy MainAPI) EmptyContext mainServer
Benchmark output of anonymous routes
/flat-1

benchmarking stack build --ghc-options="-O2"
time                 7.427 s    (7.102 s .. 7.745 s)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 7.458 s    (7.395 s .. 7.551 s)
std dev              98.88 ms   (29.73 ms .. 142.7 ms)
variance introduced by outliers: 16% (moderately inflated)
/flat-8

benchmarking stack build --ghc-options="-O2"
time                 9.024 s    (8.667 s .. 9.838 s)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 9.027 s    (8.806 s .. 9.188 s)
std dev              226.5 ms   (119.0 ms .. 315.0 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16

benchmarking stack build --ghc-options="-O2"
time                 13.36 s    (12.86 s .. 14.10 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 13.21 s    (12.81 s .. 13.36 s)
std dev              284.8 ms   (53.22 ms .. 372.0 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32

benchmarking stack build --ghc-options="-O2"
time                 30.73 s    (30.19 s .. 30.98 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 30.72 s    (30.62 s .. 30.80 s)
std dev              108.2 ms   (48.80 ms .. 142.1 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64

benchmarking stack build --ghc-options="-O2"
time                 144.4 s    (134.1 s .. 159.8 s)
                     0.998 R²   (0.998 R² .. 1.000 R²)
mean                 141.2 s    (138.7 s .. 143.6 s)
std dev              2.836 s    (2.113 s .. 3.384 s)
variance introduced by outliers: 19% (moderately inflated)
Example of NamedRoutes
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeOperators #-}
module Main where

import Data.Semigroup
import Data.Text (Text)
import GHC.Generics (Generic)
import           Servant hiding (NotSecure)
import           Servant.API.Generic
import           Servant.Server
import           Network.Wai.Handler.Warp   (defaultSettings, runSettings,
                                             setLogger, setPort)

import qualified Data.Text as T

type MainAPI = NamedRoutes API

data API mode = API
    { static1 :: mode :- "static" :> "1" :> Get '[JSON] Text
    , static2 :: mode :- "static" :> "2" :> Get '[JSON] Text
    , static3 :: mode :- "static" :> "3" :> Get '[JSON] Text
    , static4 :: mode :- "static" :> "4" :> Get '[JSON] Text
    , static5 :: mode :- "static" :> "5" :> Get '[JSON] Text
    , static6 :: mode :- "static" :> "6" :> Get '[JSON] Text
    , static7 :: mode :- "static" :> "7" :> Get '[JSON] Text
    , static8 :: mode :- "static" :> "8" :> Get '[JSON] Text
    } deriving Generic

mainServer :: Server MainAPI
mainServer = API
    { static1 = foo
    , static2 = foo
    , static3 = foo
    , static4 = foo
    , static5 = foo
    , static6 = foo
    , static7 = foo
    , static8 = foo
    }

foo = return "foo"

main :: IO ()
main = runSettings defaultSettings $
    serveWithContext (Proxy :: Proxy MainAPI) EmptyContext mainServer
Benchmark output of NamedRoutes
/flat-1

benchmarking stack build --ghc-options="-O2"
time                 7.882 s    (7.136 s .. 8.371 s)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 8.137 s    (7.986 s .. 8.224 s)
std dev              144.6 ms   (40.75 ms .. 192.8 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-8

benchmarking stack build --ghc-options="-O2"
time                 9.142 s    (8.178 s .. 9.885 s)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 9.275 s    (9.155 s .. 9.473 s)
std dev              194.6 ms   (66.72 ms .. 264.1 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16

benchmarking stack build --ghc-options="-O2"
time                 10.66 s    (9.664 s .. 11.77 s)
                     0.999 R²   (0.995 R² .. 1.000 R²)
mean                 10.56 s    (10.40 s .. 10.76 s)
std dev              195.4 ms   (84.90 ms .. 271.9 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32

benchmarking stack build --ghc-options="-O2"
time                 13.59 s    (12.54 s .. 14.52 s)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 13.35 s    (13.13 s .. 13.53 s)
std dev              248.5 ms   (82.63 ms .. 338.9 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64

benchmarking stack build --ghc-options="-O2"
time                 19.27 s    (18.93 s .. 19.82 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 19.33 s    (19.22 s .. 19.40 s)
std dev              114.1 ms   (66.46 ms .. 160.4 ms)
variance introduced by outliers: 19% (moderately inflated)

If anyone else wants to bench the NamedRoutes: Here's my fork
You can bench the anonymous routes by checking out this commit: de54845850449acd222cd98aa70fa7003a1a60b3

@gdeest (Contributor) commented Mar 13, 2022

It is extremely surprising to me that NamedRoutes would bring performance gains on flat APIs (some gains on nested routes would have been less surprising IMHO).

I am guessing NamedRoutes gets in the way of inlining, which somehow is a good thing here?

@Vlix commented Mar 13, 2022

Is it maybe because the flat API isn't actually flat with the anonymous routes, but nested with every extra :<|>?

Route1 :<|> (Route2 :<|> (Route3 :<|> ...))

@gdeest (Contributor) commented Mar 13, 2022

The HasServer instance for NamedRoutes still relies on :<|> under the hood; see here:

https://hackage.haskell.org/package/servant-server-0.19/docs/src/Servant.Server.Internal.html#line-929

We piggy-back on the instance for ToServantApi api, which is a vanilla servant API.
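A small sketch of that flattening (the Routes record below is a made-up example, while ToServantApi is the real servant-generic type family):

```haskell
{-# LANGUAGE DataKinds, DeriveGeneric, TypeOperators #-}
module NamedVsFlat where

import GHC.Generics (Generic)
import Servant
import Servant.API.Generic

-- A made-up two-endpoint record API.
data Routes mode = Routes
  { one :: mode :- "1" :> Get '[JSON] Int
  , two :: mode :- "2" :> Get '[JSON] Int
  } deriving Generic

-- ToServantApi Routes expands to the vanilla nested form:
--   ("1" :> Get '[JSON] Int) :<|> ("2" :> Get '[JSON] Int)
type Flat = ToServantApi Routes
```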

@ParetoOptimalDev commented Mar 14, 2022

> I am guessing NamedRoutes gets in the way of inlining, which somehow is a good thing here?

2 ideas then:

  • generate the core and diff each version or analyze like done here
  • Run each of the 64 route benchmarks with -dshow-passes and find the largest differences

Maybe there's a better way to say "print all fired optimizations" in a way you could diff though.

@tfausak (Contributor) commented Mar 14, 2022

For what it's worth, I was not able to reproduce those benchmark results. For me, named routes were slower than the old way of doing things by a fair bit. You can see the entire benchmark script and results here: https://gist.github.com/tfausak/8019733fb5c703994d1665143c60ad0f

Click to expand example "old" module.
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeOperators #-}

module M ( main ) where

import qualified Network.Wai as Wai
import qualified Network.Wai.Handler.Warp as Warp
import qualified Servant

main :: IO ()
main = Warp.run 3000 application

application :: Wai.Application
application = Servant.serve proxy server

proxy :: Servant.Proxy Api
proxy = Servant.Proxy

server :: Servant.Server Api
server
    = pure 1
    Servant.:<|> pure 2

type Api
    = "1" Servant.:> Servant.Get '[Servant.JSON] Int
    Servant.:<|> "2" Servant.:> Servant.Get '[Servant.JSON] Int
Click to expand example "new" module.
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE TypeOperators #-}

module M ( main ) where

import qualified Network.Wai as Wai
import qualified Network.Wai.Handler.Warp as Warp
import qualified Servant
import qualified Servant.API.Generic as Servant
import qualified Servant.Server.Generic as Servant
import qualified GHC.Generics as Generics

main :: IO ()
main = Warp.run 3000 application

application :: Wai.Application
application = Servant.genericServe routes

routes :: Routes Servant.AsServer
routes = Routes
    { _1 = pure 1
    , _2 = pure 2
    }

data Routes routes = Routes
    { _1 :: routes Servant.:- "1" Servant.:> Servant.Get '[Servant.JSON] Int
    , _2 :: routes Servant.:- "2" Servant.:> Servant.Get '[Servant.JSON] Int
    } deriving (Generics.Generic)

With 64 routes, the old module compiles in 4.4 seconds compared to 16.2 seconds for the new module.

I know that benchmarking is hard, so perhaps I've done something wrong here. Or maybe my server definitions are subtly different than @Vlix's in some way. But I wasn't able to get any module with named routes compiling faster than an equivalent module with plain old (:<|>).

@gdeest (Contributor) commented Mar 14, 2022

@tfausak The only obvious difference between your two benchmarks is that in @Vlix's, all handlers are the same (foo). But does that actually make a difference?

@gdeest (Contributor) commented Mar 14, 2022

Also, I notice that compilation times for the :<|> routes do not seem quadratic at all in the number of routes in your benchmarks.

@Vlix commented Mar 14, 2022

@tfausak I'm not completely sure, but it looks like you're building with -O1, whereas we're all talking about compiling with -O2 (at least, that's what the title of this issue says).

EDIT: Their try gave me an idea, though, brb

EDIT2: Ok, it's not done benching, but I think I'm already seeing what happened. I didn't build the "old timing" benchmark with servant-0.19 (just servant-0.18.3, because that's what's in lts-18.27), since it didn't need it (you don't need NamedRoutes to bench the old stuff). I was fine with the results, because they confirmed bergey's initial issue. Then I tried with NamedRoutes and servant-0.19 and attributed the improvement to NamedRoutes.
I then tried the "old timing" benchmark with /servant-(server-)?0.19/ in the extra-deps and, yeah, the quadratic build times disappear O_o
So SOMETHING happened between /servant-(server-)?0.18.3/ and /servant-(server-)?0.19/!

tl;dr MY BAD! It's not NamedRoutes! It's the latest release of servant-(server-)0.19!

@Vlix commented Mar 15, 2022

Ok, rebuilt again with just lts-18.27 and then with extra-deps: [servant-0.19, servant-server-0.19] (added the NamedRoutes times again just for reference, did not rebench those)

So yeah... if anyone knows the diff between 0.18.3 and 0.19, please explain 😅
(it has to be in either servant-0.19 or servant-server-0.19, because those are the only ones different between these two benches)

| Number of routes | Time (sec), anonymous routes (0.18.3) | Time (sec), anonymous routes (0.19) | Time (sec), NamedRoutes (0.19), for reference |
|---|---|---|---|
| 1 | 7.6 | 7.6 | 8.1 |
| 8 | 9.2 | 8.3 | 9.2 |
| 16 | 13.3 | 8.8 | 10.6 |
| 32 | 30.9 | 10.6 | 13.4 |
| 64 | 137.8 | 15.1 | 19.3 |
Regular (flat) anonymous routes with servant(-server)-0.18.3
/flat-1

benchmarking stack build --ghc-options="-O2"
time                 7.568 s    (7.233 s .. 7.800 s)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 7.603 s    (7.491 s .. 7.649 s)
std dev              88.08 ms   (16.25 ms .. 120.2 ms)
variance introduced by outliers: 16% (moderately inflated)
/flat-8

benchmarking stack build --ghc-options="-O2"
time                 8.852 s    (8.185 s .. 9.748 s)
                     0.999 R²   (0.996 R² .. NaN R²)
mean                 9.192 s    (9.027 s .. 9.422 s)
std dev              219.0 ms   (86.83 ms .. 280.4 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16

benchmarking stack build --ghc-options="-O2"
time                 13.17 s    (12.31 s .. 13.69 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 13.26 s    (13.17 s .. 13.35 s)
std dev              117.4 ms   (38.85 ms .. 160.3 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32

benchmarking stack build --ghc-options="-O2"
time                 30.62 s    (30.06 s .. 31.17 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 30.87 s    (30.73 s .. 30.98 s)
std dev              162.3 ms   (107.5 ms .. 193.3 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64

benchmarking stack build --ghc-options="-O2"
time                 140.1 s    (131.9 s .. 155.9 s)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 137.8 s    (135.7 s .. 139.5 s)
std dev              2.217 s    (959.9 ms .. 2.979 s)
variance introduced by outliers: 19% (moderately inflated)
Regular (flat) anonymous routes with servant(-server)-0.19
/flat-1

benchmarking stack build --ghc-options="-O2"
time                 7.424 s    (7.307 s .. 7.547 s)
                     1.000 R²   (0.998 R² .. 1.000 R²)
mean                 7.630 s    (7.522 s .. 7.754 s)
std dev              140.9 ms   (104.1 ms .. 175.0 ms)
variance introduced by outliers: 16% (moderately inflated)
/flat-8

benchmarking stack build --ghc-options="-O2"
time                 8.232 s    (7.737 s .. 9.237 s)
                     0.998 R²   (0.997 R² .. 1.000 R²)
mean                 8.264 s    (8.000 s .. 8.464 s)
std dev              272.4 ms   (141.9 ms .. 375.2 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16

benchmarking stack build --ghc-options="-O2"
time                 8.760 s    (7.966 s .. 9.571 s)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 8.797 s    (8.600 s .. 8.876 s)
std dev              136.9 ms   (40.51 ms .. 183.0 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32

benchmarking stack build --ghc-options="-O2"
time                 10.65 s    (9.727 s .. 11.22 s)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 10.59 s    (10.34 s .. 10.73 s)
std dev              240.8 ms   (119.2 ms .. 306.2 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64

benchmarking stack build --ghc-options="-O2"
time                 14.81 s    (14.36 s .. 15.46 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 15.11 s    (14.93 s .. 15.40 s)
std dev              269.4 ms   (1.918 ms .. 330.8 ms)
variance introduced by outliers: 19% (moderately inflated)

@codygman commented

If there are a lot of commits, you could narrow things down by running git bisect with a script that builds the 32-route example and fails if it takes more than 20 seconds.

@gdeest (Contributor) commented Mar 15, 2022

> tl;dr MY BAD! It's not NamedRoutes! It's the latest release of servant-(server-)0.19!

The plot thickens :-) Off the top of my head, I can't think of any change in 0.19 that could be responsible for such improvements. I am going to try bisecting on my end.

@gdeest (Contributor) commented Mar 15, 2022

@Vlix I am not able to reproduce your results either. Compilation times for a module of 64 anonymous routes seem pretty much the same on HEAD and on 0.18.3:

de923fc887f9e0197c8795fcb0e6967b196e0e17 (HEAD):

benchmarking cabal build flat-64
time                 5.153 s    (5.121 s .. NaN s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.137 s    (5.131 s .. 5.147 s)
std dev              9.297 ms   (417.9 μs .. 11.61 ms)
variance introduced by outliers: 19% (moderately inflated)

e2b897d3c0655a84551532944f9cffaa9daa8168 (Prepare servant-0.18.3 release):

benchmarking cabal build flat-64
time                 5.047 s    (5.027 s .. 5.076 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.036 s    (5.032 s .. 5.041 s)
std dev              6.011 ms   (1.940 ms .. 8.131 ms)
variance introduced by outliers: 19% (moderately inflated)

This is using GHC 8.10.4 though.

@tfausak (Contributor) commented Mar 15, 2022

I am able to reproduce the performance difference between Servant 0.18.3 and Servant 0.19. I'm using the same setup as before, which is with Stackage LTS 18.28, which uses GHC 8.10.7.

| Number of routes | Servant 0.18 build time (seconds) | Servant 0.19 build time (seconds) |
|---|---|---|
| 8 | 1.09 | 0.92 |
| 16 | 2.73 | 1.20 |
| 32 | 10.83 | 1.90 |
| 64 | 63.87 | 4.44 |

Comparing the output of stack ls dependencies, this is all that shows up in the diff:

46a47
> constraints 0.13.3
114,115c115,116
< servant 0.18.3
< servant-server 0.18.3
---
> servant 0.19
> servant-server 0.19
142a144
> type-equality 1

@gdeest (Contributor) commented Mar 21, 2022

Just tried with GHC 8.10.7 in a slightly modified servant repository — still can't reproduce the huge gains witnessed by some, so obviously I am not able to git bisect. If anything, those gains seem quite brittle, but it would be good to get some understanding of what is going on here.

@ParetoOptimalDev It looks like you intended to go down that route; did you by any chance figure it out?

@Vlix commented Mar 21, 2022

> Just tried with GHC 8.10.7 in a slightly modified servant repository — still can't reproduce the huge gains witnessed by some, so obviously I am not able to git bisect. If anything, those gains seem quite brittle, but it would be good to get some understanding of what is going on here.
>
> @ParetoOptimalDev It looks like you intended to go down that route; did you by any chance figure it out?

Are you using a Stack LTS? Or are you using cabal? 🤔
(Could you maybe literally share the commands you run to do the benchmarking?)

@gdeest (Contributor) commented Mar 21, 2022

> Are you using a Stack LTS? Or are you using cabal?

This is using cabal, indeed, and not stack. We're gonna try and see what happens with stack.

I am benching with something along those lines:

cabal build --only-dependencies flat-64
bench -o results/(git rev-parse HEAD).html  --before 'rm -rf ./dist-newstyle/build/x86_64-linux/ghc-8.10.7/flat64-0.1.0.0' 'cabal build flat-64'

where flat-64 is as in the benchmarks linked above (but with -O2 forcefully added to ghc-options in the Cabal file).

My changes are in this commit: 142681c

If you use nix, you can run the tests in the same environment as I did, using nix-shell nix/shell2.nix.


@ParetoOptimalDev commented

> @ParetoOptimalDev It looks like you intended to go down that route; did you by any chance figure it out?

Bisect can't accurately be done with cabal because -fforce-recomp didn't seem to work (WAT) and cabal clean doesn't allow a target.

Running cabal clean would then mean benchmarking a recompile of all of servant, which is potentially too much noise to guarantee a useful bisect.

> If you use nix, you can run the tests in the same environment as I did, using nix-shell nix/shell2.nix.

I'll try this out later and see if I can reproduce with your code.

I am now suspicious of compile-time differences measured with cabal, though. I guess we have to cabal clean, compile, make a change in the module, and then time that compile.

@gdeest (Contributor) commented Mar 21, 2022

> Bisect can't accurately be done with cabal because -fforce-recomp didn't seem to work (WAT) and cabal clean doesn't allow a target.

Yeah, I noticed that as well, which is why I am removing the output manually before each run:

bench -o results/(git rev-parse HEAD).html  --before 'rm -rf ./dist-newstyle/build/x86_64-linux/ghc-8.10.7/flat64-0.1.0.0' 'cabal build flat-64'

It does seem to effectively do the same thing as cabal clean flat-64.

@ParetoOptimalDev commented

> which is why I am removing the output manually before each run

Does this cause the recompile to actually happen? I had inconsistent results removing stuff from dist and assumed there might be another cache layer or something.

@gdeest (Contributor) commented Mar 22, 2022

> Does this cause the recompile to actually happen? I had inconsistent results removing stuff from dist and assumed there might be another cache layer or something.

It did work, AFAICT.

@Vlix commented Mar 22, 2022

Ok, git bisect is a bit difficult, but I've just kept building in steps of 10 commits until I found where it degraded again, and (at least for me) this is the turning point:

git diff 51c8edb74d790eb1e914890a64ac4d92d9f04311 b7c6a9592967da8d8df5804339a755e4125da83f servant servant-server

That seems to be the addition of serveWithContextT and the change to serveWithContext now being defined as serveWithContextT with id, so might it be the switch to using hoistServerWithContext? 🤔

Can anyone else verify the same behaviour? (i.e. slow with 51c8edb74d790eb1e914890a64ac4d92d9f04311 but quick with b7c6a9592967da8d8df5804339a755e4125da83f)


It's also the commit that removed support for GHC < 8.6.5 and removed all the superfluous Semigroup and Monoid imports, but I doubt it's that, right?

@thomasjm (Contributor) commented

@Vlix I can repro the performance difference between 51c8edb7 and b7c6a959. I did bench with 64 routes a couple times for each commit. The slow one is about 70s and the fast one 16s.

@PPKFS commented Mar 24, 2022

Can also repro the difference between the two commits - 6.6s on b7c6, and 70s on 51c8.

However, I'm not able to reproduce any speedup from updating the API at work from .18 to .19.

@PPKFS commented Mar 24, 2022

> Ok, git bisect is a bit difficult, but I've just kept building in steps of 10 commits until I found where it degraded again, and (at least for me) this is the turning point:
>
> git diff 51c8edb74d790eb1e914890a64ac4d92d9f04311 b7c6a9592967da8d8df5804339a755e4125da83f servant servant-server
>
> That seems to be the addition of serveWithContextT and the change to serveWithContext now being defined as serveWithContextT with id, so might it be the switch to using hoistServerWithContext? 🤔
>
> Can anyone else verify the same behaviour? (i.e. slow with 51c8edb74d790eb1e914890a64ac4d92d9f04311 but quick with b7c6a9592967da8d8df5804339a755e4125da83f)
>
> It's also the commit that removed support for GHC < 8.6.5 and removed all the superfluous Semigroup and Monoid imports, but I doubt it's that, right?

I have managed to isolate it to the hoistServerWithContext change: https://github.com/PPKFS/servant/commit/4c380fbd4bba1feea907947ec7fa4d3218a3fcbc
This gives me a benchmark of about 6.5 s.

I've then run it through time-ghc-modules:

  • slow takes 36.5s in simplification, whereas fast takes 2.3s;
  • slow takes 22.5s in demand analysis, whereas fast takes 0.8s;
  • slow takes 4s in Float out, whereas fast takes 0.1s.

I can't quite work out why that is, though.

@ParetoOptimalDev commented

> I can't quite work out why that is, though.

Perhaps generating the Core and posting it here and somewhere like haskell-cafe could help us find answers?

@ParetoOptimalDev commented Mar 26, 2022

> I have managed to isolate it to the hoistServerWithContext change: PPKFS@4c380fb

Inlining the diff here for convenience:

- serveWithContext :: ( HasServer api context
+ serveWithContext :: forall api context. ( HasServer api context
                    , HasContextEntry (context .++ DefaultErrorFormatters) ErrorFormatters )
    => Proxy api -> Context context -> Server api -> Application
- serveWithContext p context server =
-   toApplication (runRouter format404 (route p context (emptyDelayed (Route server))))
+ serveWithContext p context server = toApplication (runRouter format404 (route p context (emptyDelayed router)))
  where
+    (router :: RouteResult (ServerT api Handler)) = Route $ (hoistServerWithContext p (Proxy :: Proxy context) (id :: Handler a -> Handler a) server :: ServerT api Handler)
    format404 = notFoundErrorFormatter . getContextEntry . mkContextWithErrorFormatter $ context

The thing that stands out for me here is the explicit quantification and explicit type signature.

I'm unsure how we could reduce this to a more minimal reproduction.

@PPKFS commented Mar 28, 2022

I've not gotten to generating the core properly and posting it up somewhere, but I did find the following:

  • -fno-specialise causes the slow version to be just as fast as the fast version.
  • I don't think it's related to the exponential-inlining bug, if I'm reading the tracker right (the fix was backported to 9.0.1), because naively swapping to lts-19.0 (and thus GHC 9.0.2) makes the fast version slow again. -fno-specialise solves this.

This is where the issue happens:

*** Simplifier [Main]:
Result size of Simplifier iteration=1
  = {terms: 1,447, types: 87,641, coercions: 77,791, joins: 0/257}
Result size of Simplifier iteration=2
  = {terms: 1,437, types: 86,593, coercions: 77,777, joins: 0/255}
Result size of Simplifier
  = {terms: 1,437, types: 86,593, coercions: 77,777, joins: 0/255}
!!! Simplifier [Main]: finished in 166.21 milliseconds, allocated 54.271 megabytes
*** Specialise [Main]:
Result size of Specialise
  = {terms: 3,876, types: 129,694, coercions: 87,147, joins: 0/3}
!!! Specialise [Main]: finished in 25.90 milliseconds, allocated 42.645 megabytes

And in the slow version, you can see the explosion:

*** Simplifier [Main]:
Result size of Simplifier iteration=1
  = {terms: 925, types: 48,142, coercions: 75,740, joins: 0/2}
Result size of Simplifier iteration=2
  = {terms: 919, types: 48,126, coercions: 75,726, joins: 0/1}
Result size of Simplifier
  = {terms: 919, types: 48,126, coercions: 75,726, joins: 0/1}
!!! Simplifier [Main]: finished in 69.16 milliseconds, allocated 50.923 megabytes
*** Specialise [Main]:
Result size of Specialise
  = {terms: 28,104, types: 1,762,291, coercions: 183,854, joins: 0/3}
!!! Specialise [Main]: finished in 177.59 milliseconds, allocated 74.029 megabytes

I also ran -ddump-simpl-stats (annoyingly, -ddump-simpl generates 300 MB files) and noticed, e.g., that the slow version has 65 RuleFired steps for each SPEC/main $fHasServerTYPE rule, whereas the fast one has 1 for each. Some general stuff I picked out:

Fast:

  • 3033 UnfoldingDone (588 Servant.Server.Internal.$w$croute6)
  • 1526 RuleFired (711 Class op hoistServerWithContext and class op route, and 1 for each SPEC)

Slow:

  • 54981 UnfoldingDone (13050 Servant.Server.Internal.$w$croute6)
  • 31615 RuleFired (13051 Class op route and 12900 hoistserver, and 60-65 for each SPEC)

I know this isn't the most useful thing, but I forgot to save the splices...

@ParetoOptimalDev commented Mar 28, 2022

Note that many of these types of issues are fixed in GHC 9.2, so this might be fixed there.

Edit: not sure if they all made it into 9.2

@PPKFS commented Mar 28, 2022

Did I misread? I thought it was linked to this issue: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/3497
Which was backported to 9.0.1
