Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic non nullable column "job:status" during tpc-c import on 60 node 4 cpu cluster #34878

Closed
awoods187 opened this issue Feb 13, 2019 · 21 comments
Assignees
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

@awoods187
Copy link
Contributor

awoods187 commented Feb 13, 2019

Describe the problem

Two dead nodes during tpc-c import on 60 node 4 cpu cluster with this failure in the CLI:

Andrews-MBP-2:cockroach andrewwoods$ roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
Error: importing fixture: importing table stock: read tcp 127.0.0.1:42464->127.0.0.1:26257: read: connection reset by peer
Error:  exit status 1

To Reproduce
export CLUSTER=andy-60
roachprod create $CLUSTER -n 61 --clouds=aws --aws-machine-type-ssd=c5d.xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-60 cockroach
roachprod stage $CLUSTER:61 workload
roachprod start $CLUSTER:1-60 --racks=20 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"

Expected behavior
Restore to complete

Additional data / screenshots
This same panic happened on Node 1 and Node 8

panic: Non-nullable column "jobs:status" with no value! Index scanned was "primary" with the index key columns (id) and the values (426018841420595201)

goroutine 45470 [running]:
github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).finalizeRow(0xc01915f4a0, 0x38e8160, 0xc012e2aff0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/fetcher.go:1258 +0xa36
github.com/cockroachdb/cockroach/pkg/sql/row.(*Fetcher).NextRow(0xc01915f4a0, 0x38e8160, 0xc012e2aff0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/fetcher.go:994 +0x182
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*rowFetcherWrapper).Next(0xc011721900, 0xc007031d40, 0x1, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:160 +0x4a
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*tableReader).Next(0xc01915f000, 0x38e80a0, 0xc012e2aff0, 0xc007031d80, 0x2c36a00)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/tablereader.go:277 +0x4c
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x38e80a0, 0xc007031d80, 0x38f4420, 0xc01915f000, 0x38d9860, 0xc00d64ca80)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:175 +0x35
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc01915f000, 0x38e80a0, 0xc007031d80)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:801 +0x92
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal.func1(0xc00ae70540, 0xc00da24110, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:589 +0x67
created by github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).startInternal
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:588 +0x342

cockroach.log

cockroach.log

Environment:
v2.2.0-alpha.20190211-112-g72edf20

@awoods187
Copy link
Contributor Author

awoods187 commented Feb 13, 2019

A third node died 30 minutes after the first two nodes and well after the import failed due to OOM. From sudo dmesg:

[ 1946.769407] Out of memory: Kill process 5826 (cockroach) score 968 or sacrifice child
[ 1946.772013] Killed process 5826 (cockroach) total-vm:11531636kB, anon-rss:7548560kB, file-rss:1164kB

memprof.fraction_system_memory.000000007273963520_2019-02-13T18_30_15.158.zip
cockroach.log

@nvanbenschoten
Copy link
Member

Here are the top 6 sources of memory allocation in that profile:

(pprof) top
Showing nodes accounting for 1533.01MB, 97.51% of 1572.20MB total
Dropped 266 nodes (cum <= 7.86MB)
Showing top 10 nodes out of 101
      flat  flat%   sum%        cum   cum%
  413.35MB 26.29% 26.29%   413.35MB 26.29%  github.com/cockroachdb/cockroach/pkg/storage.encodeRaftCommand
  328.57MB 20.90% 47.19%   328.57MB 20.90%  github.com/cockroachdb/cockroach/pkg/roachpb.(*AddSSTableRequest).Unmarshal
  304.99MB 19.40% 66.59%   304.99MB 19.40%  github.com/cockroachdb/cockroach/pkg/storage/storagepb.(*ReplicatedEvalResult_AddSSTable).Unmarshal
  166.48MB 10.59% 77.18%   166.48MB 10.59%  github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/raftpb.(*Entry).Unmarshal
  117.75MB  7.49% 84.67%   117.75MB  7.49%  github.com/cockroachdb/cockroach/pkg/storage/engine.gobytes
   96.50MB  6.14% 90.81%    96.50MB  6.14%  bytes.makeSlice

That looks pretty standard for a fixture import. We shouldn't be getting into fights with the OOM killer when only 1500 MB are allocated. The dmesg output indicates that we missed a rapid allocation of memory.

@awoods187
Copy link
Contributor Author

This repros super reliably...hit it 10 minutes into the import on a new cluster killing 4 nodes.

@awoods187
Copy link
Contributor Author

After talking it over with @tbg I think this is in SQL Execution land @jordanlewis

@awoods187 awoods187 changed the title Panic non nullable column "job:status" during tpc-c import on 60 node 8 cpu cluster Panic non nullable column "job:status" during tpc-c import on 60 node 4 cpu cluster Feb 19, 2019
@awoods187
Copy link
Contributor Author

@dt any thoughts on the import OOM? (@jordanlewis will take a look at the Jobs table panic)

@awoods187
Copy link
Contributor Author

FYI this reproduces everytime (most recently with OOMs)

@awoods187 awoods187 added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Feb 26, 2019
@jordanlewis
Copy link
Member

@yuzefovich please try to reproduce this

@danhhz
Copy link
Contributor

danhhz commented Mar 4, 2019

I just repro'd the "Non-nullable column "jobs:status" with no value" too

@awoods187 awoods187 added A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Mar 6, 2019
@yuzefovich
Copy link
Member

yuzefovich commented Mar 6, 2019

When running cockroach-v19.1.0-beta.20190304-160-g5fdb938 on 60 node cluster, I'm constantly hitting OOM errors in a sense that the cockroach process is killed. I tried different number of warehouses (10k, 5k, 1k) and hit OOM every time, on 500 warehouses once a node panicked itself due to OOM. After wiping out the cluster and attempting to put a locally built 02/11 alpha on each node, I got scp: cockroach-v2.2.0-alpha.20190211: No space left on device on the node that panicked on 500 warehouses import.

So far I haven't seen the panic in title of this issue. @dt I wonder whether there have been any changes in between 02/11 alpha and 03/04 beta that could explain increase in memory usage while doing imports.

Will try running 02/11 alpha next.

@yuzefovich
Copy link
Member

On locally built 02/11 alpha on a 60 node cluster (the same cluster has been wiped and reused for all the runs):
1k warehouses - cockroach process killed.
500 warehouses - import succeeded.
1k warehouses - cockroach process killed on two nodes.
1k warehouses - cockroach process killed on two nodes and one node ran out of disk space.
(I think the latter runs lasted longer, i.e. I looked at the cluster later in comparison to the first 1k run.)
I still haven't hit the panic.

@knz
Copy link
Contributor

knz commented Mar 6, 2019

this may be a dup of #35040 which I've been working on recently.

@yuzefovich
Copy link
Member

While looking into this with @dt, we found the reason for two out of disk errors I hit last week. What happens is that / is mounted to a small block device, so /tmp lives on that small device as well by default. Whenever oomkiller kills cockroach process because it is using too much memory, oomkiller takes a core dump and attempts to store it within /tmp. That core dump should be on the order of gigabytes in size, but the small block device itself has only around 6.5 GB total, so depending on the size of the core dump it occasionally couldn't fit onto the disk.

The solution is to mount /tmp onto a data device that has a lot more space. It can be done on already existing cluster as
roachprod run $CLUSTER "sudo mkdir -p /mnt/data1/tmp && sudo chmod 777 /mnt/data1/tmp && sudo mount -o bind /mnt/data1/tmp /tmp"
or added to "no barrier" command like
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1; sudo mkdir -p /mnt/data1/tmp && sudo chmod 777 /mnt/data1/tmp && sudo mount -o bind /mnt/data1/tmp /tmp".

@danhhz
Copy link
Contributor

danhhz commented Mar 15, 2019

Oof. Sorry you had to find this as well. I filed this a couple days ago. #35621

@yuzefovich
Copy link
Member

Finally, the import of 10k warehouses succeeded when I ran SET CLUSTER SETTING kv.import.batch_size='8MB'; on the cluster. The default value is 32MB, so I tried 16MB first but hit OOM described in #35773. Lowering that setting slows down the import, so the whole process took about 2 hours 15 minutes with 8MB. As I didn't encounter the panic described in the issue, I'm guessing that it was fixed by Raphael's patch #35371. In my opinion, we can lower the severity level of this issue and see whether we hit the panic again.

@yuzefovich yuzefovich added S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. and removed S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting labels Mar 18, 2019
@yuzefovich
Copy link
Member

I lowered the priority since I haven't encountered the panic in the title of the issue, but I'm not sure whether it was fixed or not, so let's keep the issue open. OOM discussion will continue in #35773.

@jordanlewis
Copy link
Member

Closing as unactionable - Yahor tried and failed to reproduce the panic.

@awoods187
Copy link
Contributor Author

I just reproduced this right now on andy-60
image
image

@awoods187 awoods187 reopened this Sep 10, 2019
@jordanlewis
Copy link
Member

This is an OOM during import, without the panic you reported. That's why we had the other issue opened, which is correctly assigned to the Bulk IO team. This does not reproduce the problem that this issue was opened for, so I'm going to close again unless you have counterevidence.

@jordanlewis
Copy link
Member

(#35773 was the OOM during import issue)

@awoods187
Copy link
Contributor Author

I doublechecked all of the failed nodes and none of them have the panic on Non-nullable column "jobs:status" with no value! Looks like it has been solved!

@irfansharif
Copy link
Contributor

Looks like it has been solved!

Don't think it has. #36851 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.
Projects
None yet
Development

No branches or pull requests

7 participants