Panic: non-nullable column "jobs:status" during TPC-C import on 60-node 4-CPU cluster #34878
Comments
A third node died 30 minutes after the first two nodes, and well after the import failed due to OOM. From memprof.fraction_system_memory.000000007273963520_2019-02-13T18_30_15.158.zip, here are the top 6 sources of memory allocation in that profile:
That looks pretty standard for a fixture import. We shouldn't be getting into fights with the OOM killer when only 1500 MB are allocated. The dmesg output indicates that we missed a rapid allocation of memory.
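For reference, a profile like the one attached above can be inspected locally along these lines (a sketch, assuming a Go toolchain, the matching cockroach binary, and the unzipped profile in the working directory; the unzipped file name is an assumption):

# Print the top allocation sites recorded in the heap profile.
go tool pprof -top ./cockroach memprof.fraction_system_memory.000000007273963520_2019-02-13T18_30_15.158
# Confirm the kernel OOM killer fired, per the dmesg output mentioned above
# (run on the affected node itself, e.g. via roachprod ssh).
dmesg | grep -i -e 'out of memory' -e 'oom-killer'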
This repros super reliably... hit it 10 minutes into the import on a new cluster, killing 4 nodes.
After talking it over with @tbg, I think this is in SQL Execution land. @jordanlewis
@dt any thoughts on the import OOM? (@jordanlewis will take a look at the Jobs table panic)
FYI this reproduces every time (most recently with OOMs).
@yuzefovich please try to reproduce this |
I just repro'd the 'Non-nullable column "jobs:status" with no value' panic too.
When running the 03/04 beta, I haven't seen the panic in the title of this issue so far. @dt I wonder whether there have been any changes between the 02/11 alpha and the 03/04 beta that could explain an increase in memory usage during imports. I will try running the 02/11 alpha next.
On a locally built 02/11 alpha on a 60-node cluster (the same cluster has been wiped and reused for all the runs):
This may be a dup of #35040, which I've been working on recently.
While looking into this with @dt, we found the reason for two out-of-disk errors I hit last week. What happens is that … The solution is to mount /mnt/data1 with the discard option (see the remount command in the repro steps below).
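As a quick check (a sketch, reusing the $CLUSTER variable from the repro steps below; whether a one-off fstrim is an adequate substitute for the discard mount option is an assumption):

# Verify the store's filesystem is mounted with the discard option.
roachprod run $CLUSTER -- "mount | grep /mnt/data1"
# One-off alternative: trim already-freed blocks on the mounted filesystem.
roachprod run $CLUSTER -- "sudo fstrim -v /mnt/data1"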
Oof. Sorry you had to find this as well. I filed this a couple of days ago: #35621
Finally, the import of 10k warehouses succeeded when I ran …
I lowered the priority since I haven't encountered the panic in the title of the issue, but I'm not sure whether it was fixed or not, so let's keep the issue open. OOM discussion will continue in #35773. |
Closing as unactionable - Yahor tried and failed to reproduce the panic. |
This is an OOM during import, without the panic you reported. That's why we had the other issue opened, which is correctly assigned to the Bulk IO team. This does not reproduce the problem that this issue was opened for, so I'm going to close again unless you have counterevidence. |
(#35773 was the OOM during import issue) |
I double-checked all of the failed nodes, and none of them have the panic on …
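For reference, a check along these lines can be run across the whole cluster (a sketch; the logs/ path is an assumption about roachprod's default log directory):

# Search every node's logs for the panic message from the issue title.
roachprod run $CLUSTER:1-60 -- "grep -Rn 'non-nullable column' logs/ || true"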
I don't think it has. See #36851 (comment).
Describe the problem
Two dead nodes during a TPC-C import on a 60-node, 4-CPU cluster, with this failure in the CLI:
To Reproduce
export CLUSTER=andy-60
roachprod create $CLUSTER -n 61 --clouds=aws --aws-machine-type-ssd=c5d.xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-60 cockroach
roachprod stage $CLUSTER:61 workload
roachprod start $CLUSTER:1-60 --racks=20 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
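While the import in the last step runs, the OOM kills discussed in the comments above can be watched for across the cluster (a sketch; the grep patterns are assumptions about the kernel's OOM log wording):

# Poll each node's kernel log for OOM-killer activity during the import.
roachprod run $CLUSTER:1-60 -- "dmesg | grep -i -e 'oom' -e 'killed process' | tail -n 5"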
Expected behavior
Restore to complete
Additional data / screenshots
This same panic happened on Node 1 and Node 8:
cockroach.log (Node 1)
cockroach.log (Node 8)
Environment:
v2.2.0-alpha.20190211-112-g72edf20