-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fmt: Println(i) is 3x slower than in C #3463
Comments
Btw, where is goprof nowadays? http://blog.golang.org/2011/06/profiling-go-programs.html is still not updated to Go 1. |
There's plenty of room for performance improvement. Not to contradict your findings, but interlanguage microbenchmarks like these don't often correspond to real-world uses and can put undue attention on internal details. When I turned buffering on in Go rather than off in C and did the comparison, Go was again 3X off compared to C. For a new language with all-native libraries (unlike Python, whose I/O is done with C code), that's not too "painful". It will improve. package main import "bufio" import "fmt" import "os" func main() { b := bufio.NewWriter(os.Stdout) for i := 0; i < 1<<25; i++ { fmt.Fprintln(b, i) } } |
Labels changed: added packagechange. Owner changed to @robpike. |
Hello everybody, I investigated the original and the buffered-io version with go tools prof and here are there results (go dev branch / i386): #1 User time (seconds): 27.06 System time (seconds): 16.98 Percent of CPU this job got: 132% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.19 (pprof) top20 Total: 5896 samples 2331 39.5% 39.5% 2331 39.5% ExternalCode 349 5.9% 45.5% 874 14.8% fmt.(*fmt).integer 283 4.8% 50.3% 283 4.8% runtime.uint64mod 266 4.5% 54.8% 333 5.6% runtime.mallocgc 254 4.3% 59.1% 254 4.3% scanblock 153 2.6% 61.7% 153 2.6% runtime.uint64div 142 2.4% 64.1% 142 2.4% runtime.memmove 141 2.4% 66.5% 141 2.4% runtime.cas 110 1.9% 68.3% 1108 18.8% fmt.(*pp).printArg 97 1.6% 70.0% 241 4.1% runtime.entersyscall 95 1.6% 71.6% 445 7.5% runtime.convT2I 93 1.6% 73.2% 3189 54.1% main.main 93 1.6% 74.7% 234 4.0% runtime.exitsyscall 88 1.5% 76.2% 88 1.5% runtime.writebarrierptr 84 1.4% 77.7% 1223 20.7% fmt.(*pp).doPrint 72 1.2% 78.9% 253 4.3% fmt.newPrinter 64 1.1% 80.0% 114 1.9% sync.(*Pool).pin 60 1.0% 81.0% 89 1.5% fmt.(*fmt).pad 56 0.9% 81.9% 2865 48.6% fmt.Fprintln 56 0.9% 82.9% 823 14.0% os.(*File).write (pprof) top20 -cum Total: 5896 samples 93 1.6% 1.6% 3189 54.1% main.main 0 0.0% 1.6% 3189 54.1% runtime.gosched_m 0 0.0% 1.6% 3189 54.1% runtime.main 35 0.6% 2.2% 2900 49.2% fmt.Println 56 0.9% 3.1% 2865 48.6% fmt.Fprintln 0 0.0% 3.1% 2345 39.8% System 2331 39.5% 42.7% 2331 39.5% ExternalCode 84 1.4% 44.1% 1223 20.7% fmt.(*pp).doPrint 43 0.7% 44.8% 1159 19.7% os.(*File).Write 110 1.9% 46.7% 1108 18.8% fmt.(*pp).printArg 37 0.6% 47.3% 911 15.5% fmt.(*pp).fmtInt64 349 5.9% 53.2% 874 14.8% fmt.(*fmt).integer 56 0.9% 54.2% 823 14.0% os.(*File).write 27 0.5% 54.6% 533 9.0% syscall.Write 20 0.3% 55.0% 506 8.6% syscall.write 11 0.2% 55.2% 486 8.2% syscall.Syscall 95 1.6% 56.8% 445 7.5% runtime.convT2I 38 0.6% 57.4% 371 6.3% runtime.newobject 0 0.0% 57.4% 362 6.1% GC 266 4.5% 61.9% 333 5.6% runtime.mallocgc #2 User time (seconds): 26.69 System time (seconds): 5.31 Percent of CPU this job got: 138% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.04 (pprof) top20 Total: 2592 samples 663 25.6% 25.6% 663 25.6% ExternalCode 282 10.9% 36.5% 717 27.7% fmt.(*fmt).integer 255 9.8% 46.3% 255 9.8% runtime.uint64mod 127 4.9% 51.2% 127 4.9% runtime.memmove 120 4.6% 55.8% 120 4.6% runtime.uint64div 107 4.1% 60.0% 924 35.6% fmt.(*pp).printArg 88 3.4% 63.3% 88 3.4% scanblock 69 2.7% 66.0% 92 3.5% runtime.mallocgc 65 2.5% 68.5% 1029 39.7% fmt.(*pp).doPrint 58 2.2% 70.8% 58 2.2% runtime.writebarrierptr 56 2.2% 72.9% 56 2.2% sync.runtime_procPin 55 2.1% 75.0% 222 8.6% fmt.newPrinter 53 2.0% 77.1% 212 8.2% fmt.(*pp).free 52 2.0% 79.1% 121 4.7% sync.(*Pool).pin 49 1.9% 81.0% 1796 69.3% main.main 44 1.7% 82.7% 44 1.7% fmt.(*fmt).clearflags 41 1.6% 84.3% 60 2.3% fmt.(*fmt).pad 40 1.5% 85.8% 184 7.1% runtime.convT2E 36 1.4% 87.2% 1558 60.1% fmt.Fprintln 34 1.3% 88.5% 96 3.7% sync.(*Pool).Get (pprof) top20 -cum Total: 2592 samples 49 1.9% 1.9% 1796 69.3% main.main 0 0.0% 1.9% 1796 69.3% runtime.gosched_m 0 0.0% 1.9% 1796 69.3% runtime.main 36 1.4% 3.3% 1558 60.1% fmt.Fprintln 65 2.5% 5.8% 1029 39.7% fmt.(*pp).doPrint 107 4.1% 9.9% 924 35.6% fmt.(*pp).printArg 17 0.7% 10.6% 734 28.3% fmt.(*pp).fmtInt64 282 10.9% 21.5% 717 27.7% fmt.(*fmt).integer 663 25.6% 47.0% 663 25.6% ExternalCode 0 0.0% 47.0% 631 24.3% System 255 9.8% 56.9% 255 9.8% runtime.uint64mod 55 2.1% 59.0% 222 8.6% fmt.newPrinter 53 2.0% 61.0% 212 8.2% fmt.(*pp).free 40 1.5% 62.6% 184 7.1% runtime.convT2E 0 0.0% 62.6% 165 6.4% GC 127 4.9% 67.5% 127 4.9% runtime.memmove 52 2.0% 69.5% 121 4.7% sync.(*Pool).pin 120 4.6% 74.1% 120 4.6% runtime.uint64div 33 1.3% 75.4% 110 4.2% sync.(*Pool).Put 16 0.6% 76.0% 108 4.2% runtime.newobject So in the current version 40% of the time is used for the write syscall and another 20% in os.(*File).Write. Using buffered io avoids these calls so system time drops from 17s to 5s. The integer to string conversion now is more significant. I wonder if there'd be a way to exploit the base10 special casing of strconv.Itoa()? |
CL https://golang.org/cl/145020043 mentions this issue. |
CL https://golang.org/cl/144250043 mentions this issue. |
This issue was updated by revision 78b5321. LGTM=josharian, rsc R=golang-codereviews, josharian, rsc CC=golang-codereviews https://golang.org/cl/144250043 |
After the most recent CL listed, I compared these two programs. They avoid buffering issues and other unnecessary overheads so are equivalent; it is a fair comparison. C: #include <stdio.h> int main(void) { int i; char buf[256]; for (i = 0; i < 1 << 25; i++) { sprintf(buf, "%d\n", i); } } Go: package main import "bytes" import "fmt" func main() { var b bytes.Buffer for i := 0; i < 1<<25; i++ { fmt.Fprintln(&b, i) b.Reset() } } Times: % go build x.go && time x real 0m12.041s user 0m12.315s sys 0m1.018s % gcc -O2 x.c && time a.out real 0m6.829s user 0m6.819s sys 0m0.006s % That's well within a factor of two, which is reasonable considering the unavoidable overheads of Go (reflection, interfaces, type safety). According to an earlier report by me, it used to be 3X, so we've gained somewhat. More could be done, of course. |
As of 1.4, there is allocation in Fprintln(i) that didn't used to be there. Rerunning: Go: 11.9s So we're back to 3X. For comparison I changed the Go program to print &i instead of i, so there's no allocation. It went down to 10.2s, still 2.5X. Overheads still too large. |
I don't think anyone is working on it yet. On Mon, Jan 5, 2015 at 6:05 PM, Rob Pike [email protected] wrote:
|
In magic Github syntax, that is #8618 (so the other bug is then associated with this one) |
benchmarking the c and go program from robpike on my FreeBSD Intel G3220:
which i guess could mean we can close the issue unless others still have radically different numbers.
|
Wow, nice work everyone! |
Arch: On a i7-4720HQ with GCC 5.3.0 and Go 1.6 and Go devel +aa482f9 I get these results for the code in comment 14:
So it's still 2x slower for me with Go tip. CPU Profile for Go tip is:
|
another data point around 1,25x on OS X macbook pro: OS X 10.11.4 Intel i7 3520M Apple LLVM version 7.3.0 (clang-703.0.29) -O2 gcc-mp-5 (MacPorts gcc5 5.3.0_0) 5.3.0 -O2 go version devel +aa482f9 Ubuntu 12.04.5 LTS: Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz go version devel +aa482f9 gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 Im going to install linux on the G3220 to see if its related to OS. |
Same Intel G3220 without GUI on Fedora 23: gcc (GCC) 5.3.1 20151207 (Red Hat 5.3.1-2) -O2 go tip version devel +7e88826
which could mean go is still 2x slower than glibc but not gcc on any platform. Sorry i missed this was a linux/glibc specific thing. |
So it's still up to 2x slower sometimes. As said a while back, Go has to do a lot more under the covers so this seems understandable. I say we should leave the bug closed. |
If someone wants to look into why glibc is so much faster than OS X, it's probably some trick that we could adopt for Go too. But no need to reopen this issue for it. |
FYI: I had a quick look at glibc source yesterday but did not see anything super special besides avoiding some overhead (i did not profile it). I then optimized fmt_integer to avoid some divisions, special cased printing one argument to avoid the calculation if a space is needed between arguments, special cased printing with the v verb with an extra direct switch and avoided some overhead if width is not present and removed some defensive cleaning of format flags which should be ok after the latest patches. This brought runtime down around 20-25% on my macbook. Avoiding the allocation/interface overhead can bring it down a good chunk further at which point it seems fmt should reach glibc level. I still plan to submit a last CL for fmt_integer (5-15%) soon. For 1.7 fmt has gone through a lot of change already and in general i am not sure it is worth extending the code with more fast paths just to optimize this micro benchmark. |
The text was updated successfully, but these errors were encountered: