Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fmt: Println(i) is 3x slower than in C #3463

Closed
krasin opened this issue Apr 3, 2012 · 25 comments
Closed

fmt: Println(i) is 3x slower than in C #3463

krasin opened this issue Apr 3, 2012 · 25 comments

Comments

@krasin
Copy link

krasin commented Apr 3, 2012

Go 1, Linux amd64.

What steps will reproduce the problem?

1. Consider the following Go program: (http://play.golang.org/p/ScKzejb3Oc )

package main

import "fmt"

func main() {
    for i := 0; i < 1<<25; i++ {
        fmt.Println(i)
    }
}

2. Consider the following Python program:

for i in xrange(1 << 25): print i

3. And the following C program:

#include <stdio.h>

int main(void) {
  for (int i = 0; i < 1 << 25; i++) {
    printf("%d\n", i);
  }
} 

4. Build Go program with:
$ go build print.go

and C program with:
$ gcc print.c -o print_c -c99

5. Run programs:

$ time ./print > /dev/null
real    0m18.336s
user    0m15.497s
sys 0m2.824s

$ time python print.py > /dev/null # I use python 2.7.2
real    0m9.484s
user    0m9.473s
sys 0m0.004s

$ time ./print_c > /dev/null
real    0m2.580s
user    0m2.560s
sys 0m0.020s


What is the expected output?
It's expected that Go is as fast a C and faster than Python. 

What do you see instead?
Go is painfully slow on this microbenchmark.

Which compiler are you using (5g, 6g, 8g, gccgo)?
6g (via go build)

Which operating system are you using?
Linux krasin-tupak 3.0.0-16-generic #29-Ubuntu SMP Tue Feb 14 12:48:51 UTC 2012 x86_64
x86_64 x86_64 GNU/Linux

Which version are you using?  (run 'go version')
go version go1

Please provide any additional information below.

At first, I have ran the above programs under strace and realized that C and Python
versions use buffered stdout, but Go invokes write(2) on every line. I have modified
print.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
  for (int i = 0; i < 1 << 25; i++) {
    char buf[30];
    snprintf(buf, 30, "%d\n", i);
    write(1, buf, strlen(buf)+1);
  }
}

compiled with:
$ gcc print.c -o print_c -std=c99

$ time ./print_c > /dev/null
real    0m6.184s
user    0m4.104s
sys 0m2.072s

As we can see, this is not the only reason for Go to be so slow. It's not even the main
reason. C is 6x faster if stdout is buffered and 3x faster if non-buffered.

I feel like something can be optimized.
@krasin
Copy link
Author

krasin commented Apr 3, 2012

Comment 1:

Btw, where is goprof nowadays? http://blog.golang.org/2011/06/profiling-go-programs.html
is still not updated to Go 1.

@robpike
Copy link
Contributor

robpike commented Apr 3, 2012

Comment 2:

There's plenty of room for performance improvement. Not to contradict your findings, but
interlanguage microbenchmarks like these don't often correspond to real-world uses and
can put undue attention on internal details.
When I turned buffering on in Go rather than off in C and did the comparison, Go was
again 3X off compared to C. For a new language with all-native libraries (unlike Python,
whose I/O is done with C code), that's not too "painful". It will improve.
package main
import "bufio"
import "fmt"
import "os"
func main() {
    b := bufio.NewWriter(os.Stdout)
    for i := 0; i < 1<<25; i++ {
        fmt.Fprintln(b, i)
    }
}

@robpike
Copy link
Contributor

robpike commented Apr 3, 2012

Comment 3:

You want
go tool pprof
or 
go tool prof

@robpike
Copy link
Contributor

robpike commented Apr 3, 2012

Comment 4:

Changing the bug title because it's not true. By my experiments, buffered Go vs buffered
Python shows 13s for Go vs. 18s for Python.

Status changed to Accepted.

@robpike
Copy link
Contributor

robpike commented Apr 3, 2012

Comment 5:

Labels changed: added priority-later, performance, removed priority-triage.

@rsc
Copy link
Contributor

rsc commented Nov 27, 2013

Comment 6:

Labels changed: added go1.3maybe.

@dsymonds
Copy link
Contributor

dsymonds commented Dec 2, 2013

Comment 7:

Labels changed: added packagechange.

Owner changed to @robpike.

@rsc
Copy link
Contributor

rsc commented Dec 4, 2013

Comment 8:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Copy link
Contributor

rsc commented Dec 4, 2013

Comment 9:

Labels changed: added repo-main.

@gopherbot
Copy link
Contributor

Comment 10 by leonhard.holz:

Hello everybody,
I investigated the original and the buffered-io version with go tools prof and here are
there results (go dev branch / i386):
#1
    User time (seconds): 27.06
    System time (seconds): 16.98
    Percent of CPU this job got: 132%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.19
(pprof) top20   
Total: 5896 samples
    2331  39.5%  39.5%     2331  39.5% ExternalCode
     349   5.9%  45.5%      874  14.8% fmt.(*fmt).integer
     283   4.8%  50.3%      283   4.8% runtime.uint64mod
     266   4.5%  54.8%      333   5.6% runtime.mallocgc
     254   4.3%  59.1%      254   4.3% scanblock
     153   2.6%  61.7%      153   2.6% runtime.uint64div
     142   2.4%  64.1%      142   2.4% runtime.memmove
     141   2.4%  66.5%      141   2.4% runtime.cas
     110   1.9%  68.3%     1108  18.8% fmt.(*pp).printArg
      97   1.6%  70.0%      241   4.1% runtime.entersyscall
      95   1.6%  71.6%      445   7.5% runtime.convT2I
      93   1.6%  73.2%     3189  54.1% main.main
      93   1.6%  74.7%      234   4.0% runtime.exitsyscall
      88   1.5%  76.2%       88   1.5% runtime.writebarrierptr
      84   1.4%  77.7%     1223  20.7% fmt.(*pp).doPrint
      72   1.2%  78.9%      253   4.3% fmt.newPrinter
      64   1.1%  80.0%      114   1.9% sync.(*Pool).pin
      60   1.0%  81.0%       89   1.5% fmt.(*fmt).pad
      56   0.9%  81.9%     2865  48.6% fmt.Fprintln
      56   0.9%  82.9%      823  14.0% os.(*File).write
(pprof) top20 -cum
Total: 5896 samples
      93   1.6%   1.6%     3189  54.1% main.main
       0   0.0%   1.6%     3189  54.1% runtime.gosched_m
       0   0.0%   1.6%     3189  54.1% runtime.main
      35   0.6%   2.2%     2900  49.2% fmt.Println
      56   0.9%   3.1%     2865  48.6% fmt.Fprintln
       0   0.0%   3.1%     2345  39.8% System
    2331  39.5%  42.7%     2331  39.5% ExternalCode
      84   1.4%  44.1%     1223  20.7% fmt.(*pp).doPrint
      43   0.7%  44.8%     1159  19.7% os.(*File).Write
     110   1.9%  46.7%     1108  18.8% fmt.(*pp).printArg
      37   0.6%  47.3%      911  15.5% fmt.(*pp).fmtInt64
     349   5.9%  53.2%      874  14.8% fmt.(*fmt).integer
      56   0.9%  54.2%      823  14.0% os.(*File).write
      27   0.5%  54.6%      533   9.0% syscall.Write
      20   0.3%  55.0%      506   8.6% syscall.write
      11   0.2%  55.2%      486   8.2% syscall.Syscall
      95   1.6%  56.8%      445   7.5% runtime.convT2I
      38   0.6%  57.4%      371   6.3% runtime.newobject
       0   0.0%  57.4%      362   6.1% GC
     266   4.5%  61.9%      333   5.6% runtime.mallocgc
#2
    User time (seconds): 26.69
    System time (seconds): 5.31
    Percent of CPU this job got: 138%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.04
(pprof) top20
Total: 2592 samples
     663  25.6%  25.6%      663  25.6% ExternalCode
     282  10.9%  36.5%      717  27.7% fmt.(*fmt).integer
     255   9.8%  46.3%      255   9.8% runtime.uint64mod
     127   4.9%  51.2%      127   4.9% runtime.memmove
     120   4.6%  55.8%      120   4.6% runtime.uint64div
     107   4.1%  60.0%      924  35.6% fmt.(*pp).printArg
      88   3.4%  63.3%       88   3.4% scanblock
      69   2.7%  66.0%       92   3.5% runtime.mallocgc
      65   2.5%  68.5%     1029  39.7% fmt.(*pp).doPrint
      58   2.2%  70.8%       58   2.2% runtime.writebarrierptr
      56   2.2%  72.9%       56   2.2% sync.runtime_procPin
      55   2.1%  75.0%      222   8.6% fmt.newPrinter
      53   2.0%  77.1%      212   8.2% fmt.(*pp).free
      52   2.0%  79.1%      121   4.7% sync.(*Pool).pin
      49   1.9%  81.0%     1796  69.3% main.main
      44   1.7%  82.7%       44   1.7% fmt.(*fmt).clearflags
      41   1.6%  84.3%       60   2.3% fmt.(*fmt).pad
      40   1.5%  85.8%      184   7.1% runtime.convT2E
      36   1.4%  87.2%     1558  60.1% fmt.Fprintln
      34   1.3%  88.5%       96   3.7% sync.(*Pool).Get
(pprof) top20 -cum
Total: 2592 samples
      49   1.9%   1.9%     1796  69.3% main.main
       0   0.0%   1.9%     1796  69.3% runtime.gosched_m
       0   0.0%   1.9%     1796  69.3% runtime.main
      36   1.4%   3.3%     1558  60.1% fmt.Fprintln
      65   2.5%   5.8%     1029  39.7% fmt.(*pp).doPrint
     107   4.1%   9.9%      924  35.6% fmt.(*pp).printArg
      17   0.7%  10.6%      734  28.3% fmt.(*pp).fmtInt64
     282  10.9%  21.5%      717  27.7% fmt.(*fmt).integer
     663  25.6%  47.0%      663  25.6% ExternalCode
       0   0.0%  47.0%      631  24.3% System
     255   9.8%  56.9%      255   9.8% runtime.uint64mod
      55   2.1%  59.0%      222   8.6% fmt.newPrinter
      53   2.0%  61.0%      212   8.2% fmt.(*pp).free
      40   1.5%  62.6%      184   7.1% runtime.convT2E
       0   0.0%  62.6%      165   6.4% GC
     127   4.9%  67.5%      127   4.9% runtime.memmove
      52   2.0%  69.5%      121   4.7% sync.(*Pool).pin
     120   4.6%  74.1%      120   4.6% runtime.uint64div
      33   1.3%  75.4%      110   4.2% sync.(*Pool).Put
      16   0.6%  76.0%      108   4.2% runtime.newobject
So in the current version 40% of the time is used for the write syscall and another 20%
in os.(*File).Write. Using buffered io avoids these calls so system time drops from 17s
to 5s. The integer to string conversion now is more significant. I wonder if there'd be
a way to exploit the base10 special casing of strconv.Itoa()?

@gopherbot
Copy link
Contributor

Comment 11:

CL https://golang.org/cl/145020043 mentions this issue.

@gopherbot
Copy link
Contributor

Comment 12:

CL https://golang.org/cl/144250043 mentions this issue.

@robpike
Copy link
Contributor

robpike commented Sep 22, 2014

Comment 13:

This issue was updated by revision 78b5321.

LGTM=josharian, rsc
R=golang-codereviews, josharian, rsc
CC=golang-codereviews
https://golang.org/cl/144250043

@robpike
Copy link
Contributor

robpike commented Sep 22, 2014

Comment 14:

After the most recent CL listed, I compared these two programs. They avoid buffering
issues and other unnecessary overheads so are equivalent; it is a fair comparison.
C:
#include <stdio.h>
int main(void) {
int i;
char buf[256];
  for (i = 0; i < 1 << 25; i++) {
    sprintf(buf, "%d\n", i);
  }
} 
Go:
package main
import "bytes"
import "fmt"
func main() {
    var b bytes.Buffer
    for i := 0; i < 1<<25; i++ {
        fmt.Fprintln(&b, i)
        b.Reset()
    }
}
Times:
% go build x.go && time x
real    0m12.041s
user    0m12.315s
sys 0m1.018s
% gcc -O2 x.c && time a.out
real    0m6.829s
user    0m6.819s
sys 0m0.006s
% 
That's well within a factor of two, which is reasonable considering the unavoidable
overheads of Go (reflection, interfaces, type safety).
According to an earlier report by me, it used to be 3X, so we've gained somewhat. More
could be done, of course.

@robpike
Copy link
Contributor

robpike commented Jan 6, 2015

As of 1.4, there is allocation in Fprintln(i) that didn't used to be there. Rerunning:

Go: 11.9s
C: 3.9s

So we're back to 3X. For comparison I changed the Go program to print &i instead of i, so there's no allocation. It went down to 10.2s, still 2.5X.

Overheads still too large.

@randall77
Copy link
Contributor

#8618

I don't think anyone is working on it yet.

On Mon, Jan 5, 2015 at 6:05 PM, Rob Pike [email protected] wrote:

As of 1.4, there is allocation in Fprintln(i) that didn't used to be
there. Rerunning:

Go: 11.9s
C: 3.9s

So we're back to 3X. For comparison I changed the Go program to print &i
instead of i, so there's no allocation. It went down to 10.2s, still 2.5X.

Overheads still too large.


Reply to this email directly or view it on GitHub
#3463 (comment).

@bradfitz
Copy link
Contributor

bradfitz commented Jan 6, 2015

In magic Github syntax, that is #8618 (so the other bug is then associated with this one)

@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@rsc rsc removed the repo-main label Apr 14, 2015
@robpike robpike removed their assignment Sep 21, 2015
@martisch
Copy link
Contributor

benchmarking the c and go program from robpike on my FreeBSD Intel G3220:

gcc 4.8.5 -O2: 
real    0m5.796s
user    0m5.796s
sys 0m0.000s

go tip:
real    0m5.147s
user    0m5.228s
sys 0m0.008s

which i guess could mean we can close the issue unless others still have radically different numbers.

5.01s of 5.07s total (98.82%)
Dropped 24 nodes (cum <= 0.03s)
      flat  flat%   sum%        cum   cum%
     0.75s 14.79% 14.79%      1.15s 22.68%  fmt.(*fmt).fmt_integer
     0.69s 13.61% 28.40%      0.96s 18.93%  runtime.mallocgc
     0.44s  8.68% 37.08%      0.44s  8.68%  runtime.memmove
     0.27s  5.33% 42.41%      0.48s  9.47%  sync.(*Pool).pin
     0.24s  4.73% 47.14%      0.40s  7.89%  fmt.(*fmt).pad
     0.24s  4.73% 51.87%      0.52s 10.26%  sync.(*Pool).Get
     0.22s  4.34% 56.21%      0.34s  6.71%  runtime.typedmemmove
     0.22s  4.34% 60.55%      0.45s  8.88%  sync.(*Pool).Put
     0.17s  3.35% 63.91%      0.69s 13.61%  fmt.newPrinter
     0.16s  3.16% 67.06%      0.45s  8.88%  bytes.(*Buffer).Write
     0.16s  3.16% 70.22%      1.58s 31.16%  fmt.(*pp).printArg
     0.16s  3.16% 73.37%      0.16s  3.16%  sync.runtime_procPin
     0.14s  2.76% 76.13%      0.14s  2.76%  runtime.(*mspan).sweep.func1
     0.13s  2.56% 78.70%      0.13s  2.56%  bytes.(*Buffer).grow
     0.13s  2.56% 81.26%      4.96s 97.83%  main.main
     0.13s  2.56% 83.83%      1.27s 25.05%  runtime.convT2E
     0.12s  2.37% 86.19%      3.47s 68.44%  fmt.Fprintln
     0.11s  2.17% 88.36%      1.26s 24.85%  fmt.(*pp).fmtInteger
     0.09s  1.78% 90.14%      1.67s 32.94%  fmt.(*pp).doPrintln
     0.09s  1.78% 91.91%      0.54s 10.65%  fmt.(*pp).free
     0.06s  1.18% 93.10%      0.20s  3.94%  runtime.heapBitsSweepSpan
     0.05s  0.99% 94.08%      0.09s  1.78%  bytes.(*Buffer).Reset
     0.05s  0.99% 95.07%      0.16s  3.16%  runtime.assertE2T2
     0.04s  0.79% 95.86%      0.04s  0.79%  bytes.(*Buffer).Truncate
     0.04s  0.79% 96.65%         1s 19.72%  runtime.newobject
     0.04s  0.79% 97.44%      0.04s  0.79%  runtime.prefetchnta
     0.04s  0.79% 98.22%      0.04s  0.79%  sync/atomic.LoadPointer
     0.03s  0.59% 98.82%      0.03s  0.59%  sync.runtime_procUnpin

@robpike
Copy link
Contributor

robpike commented Mar 28, 2016

Wow, nice work everyone!

@nussjustin
Copy link
Contributor

Arch: amd64
OS: Linux 4.4.5-1-ARCH #1 SMP PREEMPT Thu Mar 10 07:38:19 CET 2016 x86_64 GNU/Linux

On a i7-4720HQ with GCC 5.3.0 and Go 1.6 and Go devel +aa482f9 I get these results for the code in comment 14:

Compiled with: gcc -C0
time: 2.34s user 0.00s system 99% cpu 2.345 total

Compiled with: gcc -C1
time: 2.27s user 0.00s system 99% cpu 2.275 total

Compiled with: gcc -C2
time: 2.33s user 0.00s system 99% cpu 2.333 total

Compiled with: gcc -C3
time: 2.33s user 0.00s system 99% cpu 2.331 total

Compiled with: go build (1.6)
time: 5.12s user 0.02s system 103% cpu 4.981 total

Compiled with: go build (devel +aa482f9)
time: 4.76s user 0.02s system 103% cpu 4.634 total

So it's still 2x slower for me with Go tip.

CPU Profile for Go tip is:

4.40s of 4.45s total (98.88%)
Dropped 22 nodes (cum <= 0.02s)
      flat  flat%   sum%        cum   cum%
     0.57s 12.81% 12.81%      0.79s 17.75%  fmt.(*fmt).fmt_integer
     0.51s 11.46% 24.27%      0.70s 15.73%  runtime.mallocgc
     0.47s 10.56% 34.83%      0.47s 10.56%  runtime.memmove
     0.36s  8.09% 42.92%      0.55s 12.36%  sync.(*Pool).pin
     0.22s  4.94% 47.87%      0.56s 12.58%  bytes.(*Buffer).Write
     0.20s  4.49% 52.36%      0.20s  4.49%  bytes.(*Buffer).grow
     0.18s  4.04% 56.40%      1.44s 32.36%  fmt.(*pp).doPrintln
     0.18s  4.04% 60.45%      0.18s  4.04%  sync.runtime_procPin
     0.17s  3.82% 64.27%      1.26s 28.31%  fmt.(*pp).printArg
     0.17s  3.82% 68.09%      0.41s  9.21%  runtime.typedmemmove
     0.15s  3.37% 71.46%      3.13s 70.34%  fmt.Fprintln
     0.13s  2.92% 74.38%      0.13s  2.92%  bytes.(*Buffer).Truncate
     0.13s  2.92% 77.30%      0.22s  4.94%  fmt.(*fmt).pad
     0.13s  2.92% 80.22%      0.50s 11.24%  fmt.newPrinter
     0.12s  2.70% 82.92%      0.12s  2.70%  runtime.(*mspan).sweep.func1
     0.12s  2.70% 85.62%      0.37s  8.31%  sync.(*Pool).Get
     0.09s  2.02% 87.64%      0.44s  9.89%  sync.(*Pool).Put
     0.08s  1.80% 89.44%      0.23s  5.17%  runtime.assertE2T2
     0.07s  1.57% 91.01%      0.86s 19.33%  fmt.(*pp).fmtInteger
     0.06s  1.35% 92.36%      0.96s 21.57%  runtime.convT2E
     0.06s  1.35% 93.71%      0.18s  4.04%  runtime.heapBitsSweepSpan
     0.06s  1.35% 95.06%      0.76s 17.08%  runtime.newobject
     0.05s  1.12% 96.18%      4.30s 96.63%  main.main
     0.05s  1.12% 97.30%      0.05s  1.12%  sync.runtime_procUnpin
     0.04s   0.9% 98.20%      0.48s 10.79%  fmt.(*pp).free
     0.03s  0.67% 98.88%      0.16s  3.60%  bytes.(*Buffer).Reset
         0     0% 98.88%      0.18s  4.04%  runtime.(*mspan).sweep
         0     0% 98.88%      0.12s  2.70%  runtime.gcBgMarkWorker
         0     0% 98.88%      4.44s 99.78%  runtime.goexit
         0     0% 98.88%      0.18s  4.04%  runtime.gosweepone.func1
         0     0% 98.88%      4.30s 96.63%  runtime.main
         0     0% 98.88%      0.18s  4.04%  runtime.sweepone
         0     0% 98.88%      0.21s  4.72%  runtime.systemstack

Graph:
cpu

@martisch
Copy link
Contributor

another data point around 1,25x on OS X macbook pro:

OS X 10.11.4 Intel i7 3520M

Apple LLVM version 7.3.0 (clang-703.0.29) -O2
real 0m4.202s
user 0m4.174s
sys 0m0.014s

gcc-mp-5 (MacPorts gcc5 5.3.0_0) 5.3.0 -O2
real 0m4.032s
user 0m4.010s
sys 0m0.012s

go version devel +aa482f9
real 0m5.309s
user 0m5.438s
sys 0m0.036s

Ubuntu 12.04.5 LTS:

Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz

go version devel +aa482f9
real 0m3.920s
user 0m4.047s
sys 0m0.004s

gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
real 0m2.149s
user 0m2.145s
sys 0m0.000s

Im going to install linux on the G3220 to see if its related to OS.

@martisch
Copy link
Contributor

Same Intel G3220 without GUI on Fedora 23:

gcc (GCC) 5.3.1 20151207 (Red Hat 5.3.1-2) -O2
real 0m2.534s
user 0m2.532s
sys 0m0.002s

go tip version devel +7e88826
real 0m5.039s
user 0m5.124s
sys 0m0.003s

readelf -s print_c | grep print
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND sprintf@GLIBC_2.2.5 (2)
    36: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS print.c
    61: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND sprintf@@GLIBC_2.2.5

which could mean go is still 2x slower than glibc but not gcc on any platform.

Sorry i missed this was a linux/glibc specific thing.

@robpike
Copy link
Contributor

robpike commented Mar 29, 2016

So it's still up to 2x slower sometimes. As said a while back, Go has to do a lot more under the covers so this seems understandable.

I say we should leave the bug closed.

@rsc
Copy link
Contributor

rsc commented Mar 29, 2016

If someone wants to look into why glibc is so much faster than OS X, it's probably some trick that we could adopt for Go too. But no need to reopen this issue for it.

@martisch
Copy link
Contributor

FYI: I had a quick look at glibc source yesterday but did not see anything super special besides avoiding some overhead (i did not profile it). I then optimized fmt_integer to avoid some divisions, special cased printing one argument to avoid the calculation if a space is needed between arguments, special cased printing with the v verb with an extra direct switch and avoided some overhead if width is not present and removed some defensive cleaning of format flags which should be ok after the latest patches. This brought runtime down around 20-25% on my macbook. Avoiding the allocation/interface overhead can bring it down a good chunk further at which point it seems fmt should reach glibc level.

I still plan to submit a last CL for fmt_integer (5-15%) soon. For 1.7 fmt has gone through a lot of change already and in general i am not sure it is worth extending the code with more fast paths just to optimize this micro benchmark.

@golang golang locked and limited conversation to collaborators Mar 29, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants