runtime: FreeBSD nanotime performance regression with different CPU setups #50947

TriAnMan · 2022-02-01T13:57:02Z

The setup

$ go version
go version go1.17.6 freebsd/amd64
$ freebsd-version
12.2-RELEASE

1.17.6 is currently the latest release.

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="freebsd"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1526234500=/tmp/go-build -gno-record-gcc-switches"

The case

I've found that the performance of the scheduler on FreeBSD highly depends on the CPU setup I have. While it is OK for a one-CPU VirtualBox instance running on my Mac, it decreases a lot either for more-than-one-CPU VirtualBox or for any-number-of-CPUs DigitalOcean instances (as far as I know DigitalOcean uses KVM as a hypervisor for their instances).

This benchmark can be used as a minimal reproducer

$ cat scheduler_test.go

package whatever

import (
	"testing"
)

func BenchmarkScheduler(b *testing.B) {
	ch := make(chan struct{})

	go func() {
		for range ch {}
	}()

	for i := 0; i < b.N; i++ {
		ch <- struct{}{}
	}
}

go test -bench=. scheduler_test.go

While this benchmark runs on a single-CPU VirtualBox instance it performs quite convenient.

$ sysctl hw.ncpu
hw.ncpu: 1

$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkScheduler 	 7658449	       152.8 ns/op
PASS
ok  	command-line-arguments	1.331s

However if the benchmark was run on more-than-one-CPU VirtualBox or on a DigitalOcean instance with any number of CPUs the significant decrease in performance emerges.

$ sysctl hw.ncpu
hw.ncpu: 2

$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkScheduler-2   	 1108549	      1163 ns/op
PASS
ok  	command-line-arguments	2.213s

$ sysctl hw.ncpu
hw.ncpu: 1

$ go test -bench=. scheduler_test.go

goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkScheduler 	  706951	      1949 ns/op
PASS
ok  	command-line-arguments	1.405s

Profiling shows that the scheduler code for FreeBSD relies highly on time machinery. And runtime.nanotime() performance highly affects the scheduler.

Execution flow for 1CPU VirtualBox

Execution flow for 2CPUs VirtualBox

Execution flow for 1CPU DigitalOcean

I've also made benchmarks for the time.Now() function in the same setups. The results of them correlates with scheduler ones.

The text was updated successfully, but these errors were encountered:

TriAnMan · 2022-02-01T14:00:13Z

#17712 may be somehow related to this.

ianlancetaylor · 2022-02-01T20:13:15Z

CC @aclements @mknyszek @prattmic

prattmic · 2022-02-01T20:30:46Z

Your profiles show the three different cases of time source:

1CPU VirtualBox -> TSC
2CPU VirtualBox -> HPET (VMM emulated? I'm not too familiar with HPET virtualization)
1CPU DigitalOcean -> Neither, fall back to system call

I am not very familiar with FreeBSD timekeeping, but from my experience with Linux timekeeping your results match my expectations: using the TSC is most efficient, followed by userspace-accessible HPET, following by actually making a system call.

Regarding how to handle this, for the HPET case the resolution should be in figuring out why this system is in HPET mode. There shouldn't be much reason for a hypervisor on a CPU as modern as you have to not support use of the TSC on SMP VMs. Either something in the hypervisor is disabling access to the TSC, or something in the FreeBSD kernel is deciding the TSC isn't reliable.

For the DigitalOcean case, the first step is figuring out which clock mode the kernel is putting the VDSO in (perhaps print from the function linked above). I see that FreeBSD supports a few more modes including a KVM paravirtualized clock. If it is in one of those modes, perhaps our VDSO code needs to add support.

prattmic · 2022-02-01T20:38:26Z

Based on freebsd/freebsd-src@8cc15b0, it looks like FreeBSD doesn't trust the TSC on VirtualBox for some reason (N.B. test_tsc only runs on SMP systems).

TriAnMan · 2022-02-01T21:22:46Z

I've got sysctl kern.timecounter outputs for my setups.

1CPU VirtualBox

kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: ACPI-fast(900) i8254(0) TSC-low(1000) dummy(-1000000)
kern.timecounter.hardware: TSC-low
kern.timecounter.alloweddeviation: 5
kern.timecounter.timehands_count: 2
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 295440892
kern.timecounter.tc.ACPI-fast.mask: 4294967295
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 50397
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.TSC-low.quality: 1000
kern.timecounter.tc.TSC-low.frequency: 1296029345
kern.timecounter.tc.TSC-low.counter: 3889432018
kern.timecounter.tc.TSC-low.mask: 4294967295

2CPU VirtualBox

kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: ACPI-fast(900) i8254(0) TSC-low(-100) dummy(-1000000)
kern.timecounter.hardware: ACPI-fast
kern.timecounter.alloweddeviation: 5
kern.timecounter.timehands_count: 2
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 3480961343
kern.timecounter.tc.ACPI-fast.mask: 4294967295
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 35839
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.TSC-low.quality: -100
kern.timecounter.tc.TSC-low.frequency: 1296127369
kern.timecounter.tc.TSC-low.counter: 2127372050
kern.timecounter.tc.TSC-low.mask: 4294967295

1CPU DigitalOcean

kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 0
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: i8254(0) ACPI-fast(900) HPET(950) TSC-low(800) dummy(-1000000)
kern.timecounter.hardware: HPET
kern.timecounter.alloweddeviation: 5
kern.timecounter.timehands_count: 2
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 43479
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 7730510
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.HPET.frequency: 100000000
kern.timecounter.tc.HPET.counter: 2395020808
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.TSC-low.quality: 800
kern.timecounter.tc.TSC-low.frequency: 1247085954
kern.timecounter.tc.TSC-low.counter: 3806080953
kern.timecounter.tc.TSC-low.mask: 4294967295

prattmic · 2022-02-01T21:29:08Z

Oops, this makes me realized I swapped VirtualBox and DigitalOcean. The VB profile shows fallback to syscall, while DigitalOcean shows use of HPET.

prattmic · 2022-02-01T22:32:26Z

Summary of what I see from that:

1CPU VirtualBox: TSC is fine.
2CPU VirtualBox: FreeBSD doesn't trust TSC on VirtualBox, and there is no HPET -> Syscall fallback is only option.
1CPU DigitalOcean: TSC-low(800) implies this path. Perhaps DigitalOcean is mucking with the vendor name in CPUID and breaking FreeBSD's detection -> HPET.

I don't think there is a lot for Go to do here.

For VirtualBox, FreeBSD should consider trusting their TSC, or perhaps more likely, convince them to fix their TSC.

For DigitalOcean, FreeBSD should look at how DigitalOcean reports CPU vendor so they can detect it as Intel and use TSC if appropriate.

The one thing we may want to do is drop HPET entirely if system calls are faster, as your benchmark implies. Of course, your benchmarks are on completely different machines, so we'd want benchmarks of syscall vs HPET on the same machine to see if that is actually true. In theory it shouldn't be, as the system call probably results in HPET use in the kernel.

TriAnMan · 2022-02-02T00:14:40Z

Got benchmarks and profiles for different kern.timecounter.hardware on 1CPU DigitalOcean

HPET

$ sysctl kern.timecounter.hardware=HPET
kern.timecounter.hardware: HPET -> HPET

$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkScheduler 	  616752	      1954 ns/op
PASS
ok  	command-line-arguments	2.228s

$ go test -bench=. -cpuprofile hpet.out scheduler_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkScheduler 	  161212	      6563 ns/op
PASS
ok  	command-line-arguments	1.325s

ACPI-fast

$ sysctl kern.timecounter.hardware=ACPI-fast
kern.timecounter.hardware: HPET -> ACPI-fast

$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkScheduler 	  752794	      1663 ns/op
PASS
ok  	command-line-arguments	1.273s

$ go test -bench=. -cpuprofile acpi.out scheduler_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkScheduler 	  190075	      5578 ns/op
PASS
ok  	command-line-arguments	1.317s

TSC-low

$ sysctl kern.timecounter.hardware=TSC-low
kern.timecounter.hardware: ACPI-fast -> TSC-low

$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkScheduler 	 3701788	       313.4 ns/op
PASS
ok  	command-line-arguments	1.494s

$ go test -bench=. -cpuprofile tsc.out scheduler_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkScheduler 	 4263708	       296.1 ns/op
PASS
ok  	command-line-arguments	1.715s

As an interesting fact, a profiling with TSC-low seems not consuming additional resources.

TriAnMan · 2022-02-02T00:41:07Z

I have also benchmarked time.Now() for different kern.timecounter.hardware on 1CPU DigitalOcean

package whatever

import (
	"testing"
	"time"
)

func BenchmarkTime(b *testing.B) {
	for i := 0; i < b.N; i++ {
		time.Now()
	}
}

$ sysctl kern.timecounter.hardware=HPET
kern.timecounter.hardware: HPET -> HPET

$ go test -bench=. time_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkTime 	  130621	      8971 ns/op
PASS
ok  	command-line-arguments	1.276s

$ sysctl kern.timecounter.hardware=ACPI-fast
kern.timecounter.hardware: HPET -> ACPI-fast

$ go test -bench=. time_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkTime 	  160593	      7183 ns/op
PASS
ok  	command-line-arguments	1.239s

$ sysctl kern.timecounter.hardware=TSC-low
kern.timecounter.hardware: ACPI-fast -> TSC-low

$ go test -bench=. time_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkTime 	16510144	        75.84 ns/op
PASS
ok  	command-line-arguments	1.330s

Notice huge advantage of TSC-low.

koobs · 2022-02-02T02:15:02Z

Edit: The reverse is true, see: #50947 (comment)

Original Comment:

FWIW, and hopefully it's helpful... with the following simple benchmark, I found HPET significantly faster than TSC-Low, which was being selected for my FreeBSD Guests (12.x and CURRENT, 2 core, 8gb each) on VirtualBox (Win64, Intel 2600K) prompting setting a HPET value in /etc/sysctl.conf. I had been testing to try to identify performance differences between VirtualBox's Paravirtualization provider settings.

If I also recall correctly, there were some timecounter changes around that time (benchmark result files are dated Apr 18 2021), and/or discussion about TSC/ACPI-Fast performance particularly in FreeBSD's hypervisor detection code (so its worth going over those logs, if one hasn't already).

I seem to recall that detection was 'incorrect' (or not optimal) for a time, and then updated as the HV strings had changed or something.

sysctl.conf and benchmark results

# [koobs@140-CURRENT-amd64-564d:~/timecounter] ministat -s -w80 hpet tsc-low acpi-fast
# x hpet
# + tsc-low
# * acpi-fast
# +--------------------------------------------------------------------------------+
# | x                                 *                                         +  |
# |xx                                **                                         ++ |
# |xx                               ***                                         ++ |
# |xx                               ****                                     + ++++|
# ||A                                                                              |
# |                                                                            |A_||
# |                                  |A|                                           |
# +--------------------------------------------------------------------------------+
#     N           Min           Max        Median           Avg        Stddev
# x  10        139595        143041        142209      141704.5     1204.2239
# +  10        355667        371410      366164.5      366057.4       4360.75
# Difference at 95.0% confidence
#         224353 +/- 3005.7
#         158.324% +/- 2.51149%
#         (Student's t, pooled s = 3198.93)
# *  10        237116        244564      241635.5      241141.3     2735.9878
# Difference at 95.0% confidence
#         99436.8 +/- 1986.06
#         70.1719% +/- 1.60272%
#         (Student's t, pooled s = 2113.74)

kern.timecounter.hardware=HPET # TSC-Low is selected but is much slower

Bench

#include <sys/time.h>
#include <stdio.h>
#include <time.h>

#define COUNT 1000000

int main() {
    struct timespec ts_start, ts_stop, ts_read;
    double time;
    int i;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);
    for(i = 0; i < COUNT; i++) {
        clock_gettime(CLOCK_MONOTONIC, &ts_read);
    }
    clock_gettime(CLOCK_MONOTONIC, &ts_stop);
    time = (ts_stop.tv_sec - ts_start.tv_sec) + (ts_stop.tv_nsec
    - ts_start.tv_nsec) * 1E-9;
    printf("%.0f\n", COUNT / time);
}

TriAnMan · 2022-02-02T13:05:42Z

Enabling HPET for my VirtualBox and using @koobs benchmark along with time.Now() benchmark, got this results for different kern.timecounter.hardware and setups:

1CPU VBox

$ sysctl kern.timecounter.choice
kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(1000) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ ./clock_test
580179
$ go test -bench=. time_test.go
BenchmarkTime 	  312913	      3474 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ ./clock_test
727058
$ go test -bench=. time_test.go
BenchmarkTime 	  427713	      2771 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
$ ./clock_test
30036568
$ go test -bench=. time_test.go
BenchmarkTime 	24549586	        45.58 ns/op

2CPU VBox

$ sysctl kern.timecounter.choice
kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(-100) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ ./clock_test
576028
$ go test -bench=. time_test.go
BenchmarkTime-2   	  333536	      3542 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ ./clock_test
828282
$ go test -bench=. time_test.go
BenchmarkTime-2   	  425350	      2443 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
$ ./clock_test
1138499
$ go test -bench=. time_test.go
BenchmarkTime-2   	  575733	      1812 ns/op

1CPU DO

$ sysctl kern.timecounter.choice
kern.timecounter.choice: i8254(0) ACPI-fast(900) HPET(950) TSC-low(800) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ ./clock_test
232181
$ go test -bench=. time_test.go
BenchmarkTime 	  123722	      9597 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ ./clock_test
262493
$ go test -bench=. time_test.go
BenchmarkTime 	  185848	      7891 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
$ ./clock_test
28790175
$ go test -bench=. time_test.go
BenchmarkTime 	14390761	        80.54 ns/op

2CPU DO

$ sysctl kern.timecounter.choice
kern.timecounter.choice: i8254(0) ACPI-fast(900) HPET(950) TSC-low(-100) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ ./clock_test
251691
$ go test -bench=. time_test.go
BenchmarkTime-2   	  126259	      9750 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ ./clock_test
307210
$ go test -bench=. time_test.go
BenchmarkTime-2   	  125708	      8008 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
kern.timecounter.hardware: ACPI-fast -> TSC-low
$ ./clock_test
29018696
$ go test -bench=. time_test.go
BenchmarkTime-2   	15485086	        80.60 ns/op

TriAnMan · 2022-02-02T15:04:53Z

@koobs looking at your output of minstat I wonder which data was in the hpet tsc-low acpi-fast files?
If the data was results of multiple runs of your benchmark then + tsc-low shows better results than either x hpet or * acpi-fast: see Median or Average columns.

prattmic · 2022-02-02T17:34:18Z

FWIW, and hopefully it's helpful... with the following simple benchmark, I found HPET significantly faster than TSC-Low, which was being selected for my FreeBSD Guests (12.x and CURRENT, 2 core, 8gb each) on VirtualBox (Win64, Intel 2600K)

@koobs Looking into VirtualBox a little more, it seems that they have a variety of conditions to select between hardware TSC (no VM exit) and emulated TSC (VM exit on every RDTSC), most notable of which is that it sounds like (not 100% sure) they require VMX TSC offsetting support to use hardware TSC. Intel 2600K is a older Sandy Bridge core and consumer SKU, so perhaps it doesn't support TSC offsetting, hence a fallback to emulated TSC with poor performance. I think you'd need to read various VMX capability MSRs to verify.

@TriAnMan Very interesting that you consistently find ACPI-fast via a syscall faster than userspace HPET. Though it is still not clear if the syscall would still be faster if the system was still in kern.timecounter.hardware=HPET, but Go made a syscall anyways. I suspect that the kernel would use the same HPET source and still be slow.

I still don't think that there is anything for Go itself to do here. It is clear that FreeBSD's kernel clock selection is not always optimal. Go could second guess the kernel's choice in some cases (HPET vs syscall), but even if that helps, it would only help Go programs. If the kernel itself improves its clock selection then all programs on the system will benefit.

mpx · 2022-02-02T20:46:31Z

Perhaps some documentation could be considered if Go is particularly affected? Eg, platform caveats/notes.

This seems like something people might assume id Go's fault (and is unlikely to be widely understood).

koobs · 2022-02-02T22:26:09Z

@koobs looking at your output of minstat I wonder which data was in the hpet tsc-low acpi-fast files? If the data was results of multiple runs of your benchmark then + tsc-low shows better results than either x hpet or * acpi-fast: see Median or Average columns.

Ugh, thanks to you, I just realised the bench reports syscall counts and not latencies. I must have been drowning in yaks that day. My apologies :|

So contrary to my original: #50947 (comment) TSC-Low is the fastest, and HPET the slowest.

The numbers for acpi-fast, hpet and tsc-low are (with a 1 million rounds per run, on 2-core vbox guest)

ACPI-fast

237301 243384 242953 244564 243750 242997 240318 239218 239812 237116

HPET

140750 142011 142703 142432 139595 140107 143041 142407 142729 141270

TSC-Low

366357 365972 369583 364411 369548 371410 367791 364049 365786 355667

koobs · 2022-02-02T23:13:55Z

This was the relevent commit I referred to earlier re TSC & VirtualBox:

freebsd/freebsd-src@8cc15b0

Note: Things may have changed since, I have not reviewed TSC or other timecounter changes since then. I've not observed issues with TSC myself, but this is what FreeBSD is seeing and doing currently (CURRENT)

Timecounter "HPET" frequency 14318180 Hz quality 950
Timecounter "i8254" frequency 1193182 Hz quality 0
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
Timecounter "TSC-low" frequency 1700022709 Hz quality 0

kern.timecounter.choice: TSC-low(0) ACPI-fast(900) i8254(0) HPET(950) dummy(-1000000)

On 12.3 (the above commit was not merged to stable/12)

Timecounter "HPET" frequency 14318180 Hz quality 950
Timecounter "i8254" frequency 1193182 Hz quality 0
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900

kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(-100) dummy(-1000000)

Additional References:

TriAnMan · 2022-02-03T20:43:35Z

I've made some benchmarks with the commit switching off TSC and HPET paths in Go: TriAnMan@6160d56

1CPU VBox

$ go version
go version devel go1.18-6160d56d11 Thu Feb 3 21:58:18 2022 +0300 freebsd/amd64

$ sysctl kern.timecounter.choice
kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(1000) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ go test -bench=. time_test.go
BenchmarkTime 	  303352	      3896 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ go test -bench=. time_test.go
BenchmarkTime 	  425329	      2832 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
$ go test -bench=. time_test.go
BenchmarkTime 	 3125544	       374.6 ns/op

2CPU VBox

$ go version
go version devel go1.18-6160d56d11 Thu Feb 3 21:58:18 2022 +0300 freebsd/amd64

$ sysctl kern.timecounter.choice
kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(-100) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ go test -bench=. time_test.go
BenchmarkTime-2   	  272592	      3955 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ go test -bench=. time_test.go
BenchmarkTime-2   	  442884	      2477 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
$ go test -bench=. time_test.go
BenchmarkTime-2   	  504488	      2195 ns/op

So it looks like syscall uses a timecounter configured with sysctl. The results for HPET and TSC-low are consistent with previous ones #50947 (comment) except having some overhead on making a syscall.

gopherbot · 2022-03-11T18:01:55Z

Change https://go.dev/cl/391856 mentions this issue: runtime: fast clock_gettime on FreeBSD, split getHPETTimecounter

paulzhol · 2022-03-11T18:19:53Z

I've sent a small improvement above, for the HPET timecounter path to switch to the system stack only once on first call. My system (AMD FX-8300 hardware) shows a small improvement in BenchmarkNow. However it behaves very differently than yours.

kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 1
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: ACPI-safe(850) HPET(950) i8254(0) TSC-low(1000) dummy(-1000000)
kern.timecounter.hardware: TSC-low

For example forcing HPET, then disabling kern.timecounter.fast_gettime (which forces Go to fallback to the regular syscall)

go/src/runtime/vdso_freebsd.go

Lines 52 to 54 in 7b1ba97

    
           if timekeepSharedPage.enabled == 0 { 
        
           	return zeroBintime 
        
           }

go/src/runtime/vdso_freebsd.go

Lines 99 to 114 in 7b1ba97

    
           //go:nosplit 
        
           func nanotime1() int64 { 
        
           	bt := vdsoClockGettime(_CLOCK_MONOTONIC) 
        
           	if bt == zeroBintime { 
        
           		return fallback_nanotime() 
        
           	} 
        
           	return int64((1e9 * uint64(bt.sec)) + ((1e9 * uint64(bt.frac>>32)) >> 32)) 
        
           } 
        
           func walltime() (sec int64, nsec int32) { 
        
           	bt := vdsoClockGettime(_CLOCK_REALTIME) 
        
           	if bt == zeroBintime { 
        
           		return fallback_walltime() 
        
           	} 
        
           	return int64(bt.sec), int32((1e9 * uint64(bt.frac>>32)) >> 32) 
        
           }

There's a noticeable difference:

root@relic:~ # sysctl -w kern.timecounter.hardware=HPET
kern.timecounter.hardware: TSC-low -> HPET

paulzhol@relic:~/go/src/time % ../../bin/go test -run=NONE -bench=BenchmarkNow ./... > old_hpet.txt

root@relic:/tmp # sysctl kern.timecounter.fast_gettime=0
kern.timecounter.fast_gettime: 1 -> 0

paulzhol@relic:~/go/src/time % ../../bin/go test -run=NONE -bench=BenchmarkNow ./... > old_hpet_no_fast.txt

paulzhol@relic:~/go/src/time % ~/gocode/bin/benchcmp old_hpet_no_fast.txt old_hpet.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                   old ns/op     new ns/op     delta
BenchmarkNow-8              1832          1420          -22.49%
BenchmarkNowUnixNano-8      1834          1421          -22.52%
BenchmarkNowUnixMilli-8     1835          1423          -22.45%
BenchmarkNowUnixMicro-8     1833          1423          -22.37%

Specifically ACPI-safe is much slower:

paulzhol@relic:~/go/src/time % ~/gocode/bin/benchcmp baseline_acpi.txt old_hpet.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                   old ns/op     new ns/op     delta
BenchmarkNow-8              6843          1420          -79.25%
BenchmarkNowUnixNano-8      6946          1421          -79.54%
BenchmarkNowUnixMilli-8     6914          1423          -79.42%
BenchmarkNowUnixMicro-8     6923          1423          -79.45%

FreeBSD 12.3 added support for kvmclock, I believe it should be much faster than using HPET (which is essentially a non-paravirtualized emulated hardware device). I'll try to prepare a PR for it when I have the time.

ianlancetaylor · 2022-06-24T19:57:00Z

What is the current status here? This issue is currently in the 1.19 milestone. Should it move to 1.20? To Backlog? Thanks.

paulzhol · 2022-06-24T20:06:14Z

I did a small improvement on the HPET code path, it has a +2 review and wait-release in the PR. I'll do the kvmclock after I've finished with removal of the COMPAT_FREEBSD11. Moving to 1.20.

gopherbot · 2022-08-17T12:57:36Z

Change https://go.dev/cl/422977 mentions this issue: runtime: fast clock_gettime on FreeBSD, use physical/virtual ARM timer

gopherbot · 2022-08-17T12:57:37Z

Change https://go.dev/cl/422978 mentions this issue: runtime: fast clock_gettime on FreeBSD, add support for pvclock(kvmclock) and Hyper-V reference TSC

Call only initHPETTimecounter on the system stack. Use O_CLOEXEC flag when opening the HPET device. FreeBSD 12.3-RELEASE-p2, AMD FX-8300 paulzhol@relic:~/go/src/time % ~/gocode/bin/benchcmp old_hpet.txt new_hpet.txt benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat benchmark old ns/op new ns/op delta BenchmarkNow-8 1420 1397 -1.62% BenchmarkNowUnixNano-8 1421 1404 -1.20% BenchmarkNowUnixMilli-8 1423 1405 -1.26% BenchmarkNowUnixMicro-8 1423 1404 -1.34% Update #50947 Change-Id: I553b5427fb0b86d7e070af4516b36326bc0aaf00 Reviewed-on: https://go-review.googlesource.com/c/go/+/391856 Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Michael Pratt <[email protected]>

as setup by the kernel on GOARCH=arm64. Update #50947 Change-Id: I2f44be9b36e9ce8d264eccc0aa3df10825c5f4f9 Reviewed-on: https://go-review.googlesource.com/c/go/+/422977 Reviewed-by: Heschi Kreinick <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Meng Zhuo <[email protected]> Reviewed-by: Dmitri Goutnik <[email protected]> Run-TryBot: Yuval Pavel Zholkover <[email protected]>

mknyszek · 2023-06-09T18:28:51Z

What else is left to do here? There appears to be one outstanding CL. Since this was already moved forward a few times, I'm moving it to the backlog for now. Thanks.

ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 1, 2022

ianlancetaylor added this to the Go1.19 milestone Feb 1, 2022

ianlancetaylor added the OS-FreeBSD label Feb 1, 2022

prattmic changed the title ~~runtime: FreeBSD scheduler performance regression with different CPU setups~~ runtime: FreeBSD nanotime performance regression with different CPU setups Feb 1, 2022

prattmic added the Performance label Feb 1, 2022

joedian added this to Release Blockers Jun 15, 2022

paulzhol modified the milestones: Go1.19, Go1.20 Jun 24, 2022

paulzhol self-assigned this Jun 24, 2022

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022

prattmic added this to Go Compiler / Runtime Jul 25, 2022

prattmic moved this to Triage Backlog in Go Compiler / Runtime Jul 25, 2022

ferrmin mentioned this issue Sep 5, 2022

runtime: fast clock_gettime on FreeBSD, use physical/virtual ARM timer ferrmin/go#460

Merged

mknyszek moved this from Triage Backlog to Todo in Go Compiler / Runtime Nov 23, 2022

gopherbot modified the milestones: Go1.20, Go1.21 Feb 1, 2023

mknyszek modified the milestones: Go1.21, Backlog Jun 9, 2023

mknyszek moved this from Todo to In Progress in Go Compiler / Runtime Jun 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: FreeBSD nanotime performance regression with different CPU setups #50947

runtime: FreeBSD nanotime performance regression with different CPU setups #50947

TriAnMan commented Feb 1, 2022

TriAnMan commented Feb 1, 2022

ianlancetaylor commented Feb 1, 2022

prattmic commented Feb 1, 2022

prattmic commented Feb 1, 2022

TriAnMan commented Feb 1, 2022

prattmic commented Feb 1, 2022 •

edited

Loading

prattmic commented Feb 1, 2022

TriAnMan commented Feb 2, 2022

TriAnMan commented Feb 2, 2022 •

edited

Loading

koobs commented Feb 2, 2022 •

edited

Loading

TriAnMan commented Feb 2, 2022

TriAnMan commented Feb 2, 2022

prattmic commented Feb 2, 2022

mpx commented Feb 2, 2022 •

edited

Loading

koobs commented Feb 2, 2022

koobs commented Feb 2, 2022

TriAnMan commented Feb 3, 2022 •

edited

Loading

gopherbot commented Mar 11, 2022

paulzhol commented Mar 11, 2022 •

edited

Loading

ianlancetaylor commented Jun 24, 2022

paulzhol commented Jun 24, 2022

gopherbot commented Aug 17, 2022

gopherbot commented Aug 17, 2022

mknyszek commented Jun 9, 2023

runtime: FreeBSD nanotime performance regression with different CPU setups #50947

runtime: FreeBSD nanotime performance regression with different CPU setups #50947

Comments

TriAnMan commented Feb 1, 2022

The setup

The case

TriAnMan commented Feb 1, 2022

ianlancetaylor commented Feb 1, 2022

prattmic commented Feb 1, 2022

prattmic commented Feb 1, 2022

TriAnMan commented Feb 1, 2022

prattmic commented Feb 1, 2022 • edited Loading

prattmic commented Feb 1, 2022

TriAnMan commented Feb 2, 2022

TriAnMan commented Feb 2, 2022 • edited Loading

koobs commented Feb 2, 2022 • edited Loading

sysctl.conf and benchmark results

Bench

TriAnMan commented Feb 2, 2022

TriAnMan commented Feb 2, 2022

prattmic commented Feb 2, 2022

mpx commented Feb 2, 2022 • edited Loading

koobs commented Feb 2, 2022

koobs commented Feb 2, 2022

TriAnMan commented Feb 3, 2022 • edited Loading

gopherbot commented Mar 11, 2022

paulzhol commented Mar 11, 2022 • edited Loading

ianlancetaylor commented Jun 24, 2022

paulzhol commented Jun 24, 2022

gopherbot commented Aug 17, 2022

gopherbot commented Aug 17, 2022

mknyszek commented Jun 9, 2023

prattmic commented Feb 1, 2022 •

edited

Loading

TriAnMan commented Feb 2, 2022 •

edited

Loading

koobs commented Feb 2, 2022 •

edited

Loading

mpx commented Feb 2, 2022 •

edited

Loading

TriAnMan commented Feb 3, 2022 •

edited

Loading

paulzhol commented Mar 11, 2022 •

edited

Loading