Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestJemalloc in pkg/server/status failed on Power Linux (ppc64le) platform #43522

Closed
amitsadaphule opened this issue Dec 24, 2019 · 6 comments · Fixed by #46255
Closed

TestJemalloc in pkg/server/status failed on Power Linux (ppc64le) platform #43522

amitsadaphule opened this issue Dec 24, 2019 · 6 comments · Fixed by #46255
Labels
B-unsupported-arch Non-x86_64 architectures: PPC, MIPS, etc

Comments

@amitsadaphule
Copy link
Contributor

Description

I built the cockroachdb code (commit 319080c) on ppc64le (with a minor change necessary in the rocksdb makefile to suppress -march option usage in compilation) and when I executed the test cases in pkg/server/status, I found a test failure there.

To Reproduce

Here is the command I used to execute this particular test:

make test PKG=./pkg/server/status TESTFLAGS='-v -count=1'

And here is the log:

Running make with -j8
GOPATH set to /root/go
Cleaning old generated files.
GOFLAGS= go test   -tags ' make ppc64le_redhat_linux' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "-ldl" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v20.1.0-alpha.20191118-888-g319080c-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=319080c701cb4e13c347e86c4afb2f1c2af78def" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=ppc64le-redhat-linux"  ' -run "."  -timeout 20m ./pkg/server/status -v -count=1
=== RUN   TestHealthCheckMetricsMap
--- PASS: TestHealthCheckMetricsMap (0.00s)
=== RUN   TestJemalloc
--- FAIL: TestJemalloc (0.00s)
    jemalloc_test.go:37: allocated stat not incremented on allocation: 2474112
    jemalloc_test.go:37: allocated stat not incremented on allocation: 2474112
    jemalloc_test.go:37: allocated stat not incremented on allocation: 2474112
    jemalloc_test.go:37: allocated stat not incremented on allocation: 2474112
    jemalloc_test.go:37: allocated stat not incremented on allocation: 2474112
    jemalloc_test.go:37: allocated stat not incremented on allocation: 2474112
    jemalloc_test.go:37: allocated stat not incremented on allocation: 2474112
    jemalloc_test.go:37: allocated stat not incremented on allocation: 2474112
    jemalloc_test.go:37: allocated stat not incremented on allocation: 2474112
=== RUN   TestMetricsRecorder
I191224 14:11:10.650288 54 server/status/recorder.go:599  available memory from cgroups (8.0 EiB) exceeds system memory 129 GiB, using system memory
--- PASS: TestMetricsRecorder (0.14s)
=== RUN   TestSumDiskCounters
--- PASS: TestSumDiskCounters (0.01s)
=== RUN   TestSumNetCounters
--- PASS: TestSumNetCounters (0.00s)
=== RUN   TestSubtractDiskCounters
--- PASS: TestSubtractDiskCounters (0.00s)
=== RUN   TestSubtractNetCounters
--- PASS: TestSubtractNetCounters (0.00s)
FAIL
FAIL    github.com/cockroachdb/cockroach/pkg/server/status      0.214s
FAIL
make: *** [test] Error 1

However, when I executed the same test on x86_64, the test did PASS.

Additional data

I tried debugging and researching the issue to find the root cause. The reason was the stats were not being updated after allocation by jemalloc package. When I dug deeper and did some research, I found that there were 2 possible solutions to the issue:

  1. Before reading the stats, flush thread cache i.e. modify jemalloc_get_stats function which is a part of test code inside pkg/server/status by adding mallctl("thread.tcache.flush", NULL, NULL, NULL, 0);
  2. Disable thread cache in jemalloc itself by doing a configure with --disable-tcache flag.

Option #1 fixes the test failure, but the other part of the code (if it is not flushing thread cache) will still suffer from the issue.
Option #2 fixes the root cause, but may cause performance overhead as it does introduce syncronization in allocation requests.

Please suggest which of these options to use. OR a different solution if neither of these is the right way to fix the issue.

Environment:

  • CockroachDB [commit 319080c]
  • Architecture and OS: [ppc64le/RHEL7.6]
@jordanlewis jordanlewis added the B-unsupported-arch Non-x86_64 architectures: PPC, MIPS, etc label Dec 26, 2019
@jordanlewis
Copy link
Member

We don't officially support PPC architecture. Not sure which option is correct. Maybe @petermattis has an opinion or can further route.

@petermattis
Copy link
Collaborator

Thanks for the report, @amitsadaphule. Did you debugging and research indicate why this problem exists on PPC, but not on amd64? The allocation and call to jemalloc_get_stats should be happening on the same thread. I don't see where in the jemalloc docs that a call to "thread.tcache.flush" is necessary before retrieving statistics.

Disabling the thread cache sounds like a non-starter for performance.

@amitsadaphule
Copy link
Contributor Author

amitsadaphule commented Dec 27, 2019

Thanks @jordanlewis for routing the question towards @petermattis.

Thanks for your input @petermattis.

  1. About the root cause for this issue in jemalloc on PPC, I haven't been able to figure it out yet. Apparently, the issue is also seen on x86_64 (stats.allocation reports incorrect values jemalloc/jemalloc#757 (comment)). I'll try to debug on PPC and find out the root cause.
  2. Regarding the "thread.tcache.flush" call, I actually got that from one of the issue reported on jemalloc github repo (stats.allocation reports incorrect values jemalloc/jemalloc#757 (comment))

Thanks for confirming that --disable-tcache is not recommended since it'll cause performance issues.

@petermattis
Copy link
Collaborator

@amitsadaphule The test is simply verifying that the jemalloc stats are changing based on allocations. I'd be fine with sprinkling a mallctl("thread.tcache.flush", ...) inside the test loop. Or you can try changing the constant in allocateMemory. That constant seems to be "tuned" in some manner. I can imagine that the 16KB exceeds the thread cache size on x86_64, but not on PPC. Try bumping it to 32KB, 64KB, or 128KB.

@amitsadaphule
Copy link
Contributor Author

@petermattis Thanks for clarifying. I'll try bumping up the constant in allocateMemory to see if thread cache size is causing different behavior. If that does not work out, I'll go with mallctl("thread.tcache.flush", ...).

@amitsadaphule
Copy link
Contributor Author

@petermattis The test passes if I allocate 256KB. I'll use that value for this test on PPC. Thanks for your help on this!

craig bot pushed a commit that referenced this issue Mar 31, 2020
46255: server: fix misleading TestJemalloc result for ppc64le r=petermattis a=amitsadaphule

Fixes #43522

The test failed on ppc64le since the 16KiB constant in allocateMemory
didn't exceed the thread cache on ppc64le. Brief failure log shown below:
```
=== RUN   TestJemalloc
--- FAIL: TestJemalloc (0.00s)
    jemalloc_test.go:37: allocated stat not incremented on allocation: 2322976
```
Configuring jemalloc with `--disable-tcache to disable the thread cache` or
flushing thread cache as `mallctl("thread.tcache.flush", NULL, NULL, NULL, 0);`
showed expected behavior. But had performance and code clutter trade-offs resp.
So, updated the constant to suit ppc64le which also works for x86 and others.

Release justification: Fixes misleading test result for jemalloc stat on PPC
Release note: None

Co-authored-by: Amit Sadaphule <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-unsupported-arch Non-x86_64 architectures: PPC, MIPS, etc
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants