Node Crash during YCSB Multi Region tests #35473
Comments
Node 3 panicked |
There's no panic in 3.cockroach.log. Did you post the right file? Why are nodes 13-18 not running? 2.cockroach.log (which is really node 5) had an OOM (caught by the go allocator, not the kernel oom killer). The immediate cause appears to be memory allocated while marshaling a BatchResponse:
Memory (go alloc) was high prior to the crash:
The goroutine dumps include some very deep recursion (about 75 layers):
|
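To put a number on the recursion depth mentioned above, one option is to count frames per goroutine in the dump. The following is a minimal sketch, not something posted in this thread. It assumes the standard textual Go goroutine dump, where each goroutine begins with a `goroutine N [state]:` header and each frame contributes one tab-indented `file:line` line. The name `stack_depths` is hypothetical.

```python
import re

# Header line of one goroutine in a Go dump, e.g. "goroutine 42 [running]:".
HEADER = re.compile(r"^goroutine (\d+) \[([^\]]*)\]:")


def stack_depths(dump: str) -> dict:
    """Return {goroutine id: number of frames} for a textual goroutine dump.

    Assumption: every frame has exactly one tab-indented location line.
    A trailing "created by ..." entry also has one, so it counts as a frame.
    """
    depths = {}
    current = None
    for line in dump.splitlines():
        match = HEADER.match(line)
        if match:
            current = int(match.group(1))
            depths[current] = 0
        elif current is not None and line.startswith("\t"):
            depths[current] += 1
        elif not line.strip():
            # A blank line ends the current goroutine's stack.
            current = None
    return depths


if __name__ == "__main__":
    import sys

    # Usage: python stack_depths.py < goroutine_dump.txt
    result = stack_depths(sys.stdin.read())
    # Print the ten deepest goroutines first.
    for gid, depth in sorted(result.items(), key=lambda kv: -kv[1])[:10]:
        print(f"goroutine {gid}: {depth} frames")
```

Sorting by depth makes unusually deep stacks, such as the roughly 75-layer recursion noted above, easy to spot.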
@drewdeally please help us out and look for log messages when filing crash reports. There generally is something. Here's what I found. Note that node ids unfortunately don't correspond to roachprod indexes. n5 (roachprod index 2): Can you please upload the heap profiles collected on that node (heap_profiler under the log dir)? A debug.zip would also have them. n? (roachprod index 1):
Neither n3 nor roachprod index 3 (n6) seems dead (the
Nit: I conducted a quick poll and nobody knew what "abend" means :). Google barely knows. |
@drewdeally there should be heap profiles in the logs directory; can you grab those? (or give us a debug zip instead of these logs, but be sure to bring the downed nodes back up) I think node n6 got oom-killed so there's nothing in the logs. |
Also, how was YCSB run? (zipfian or uniform?) |
#35433 should alleviate the tripping of circuit breakers for nodes other than the one that is down. |
I cannot explain drew-demo-0003, but I had to restart it. I then continued testing. Here are all the logs with debug |
@tbg may I retest this scenario? |
@drewdeally could you post reproduction steps? I closed this issue because the investigation had gone cold. |
@nvanbenschoten we're not running YCSB in any multi-region configurations in roachtest. Should we? |
Chatted with @drewdeally directly - the workload was running fine; it's unlikely that YCSB is the problem here. The cluster was used for a large-scale recovery experiment. Drew said he would set this up again and hand off the cluster to us if a reproduction is made. |
I don't think it would show us anything interesting. The way it's run in these demos is to completely partition the load so that there's no interaction between traffic in different regions. So it's no more interesting than running kv in a multi-region config with a table per region. |
@drewdeally did you ever repro this? Closing for now; reopen if you do have a repro to look at. |
./cockroach version
Build Tag: v2.2.0-alpha.20190211
Build Time: 2019/02/07 23:44:57
Distribution: CCL
Platform: linux amd64 (x86_64-unknown-linux-gnu)
Go Version: go1.11.4
C Compiler: gcc 6.3.0
Build SHA-1: 4bdf0ad
Build Type: release
RoachProd
Node 2 abended and a node went suspect. All logs are included:
1.cockroach.log
2.cockroach.log
3.cockroach.log
4.cockroach.log
5.cockroach.log
6.cockroach.log
7.cockroach.log
8.cockroach.log
9.cockroach.log
10.cockroach.log
11.cockroach.log
12.cockroach.log