Panic: non-nullable column "jobs:status" during TPC-C import on 60-node 4-CPU cluster #34878
Comments
A third node died 30 minutes after the first two nodes, and well after the import failed due to OOM. From memprof.fraction_system_memory.000000007273963520_2019-02-13T18_30_15.158.zip, here are the top 6 sources of memory allocation in that profile:
That looks pretty standard for a fixture import. We shouldn't be getting into fights with the OOM killer when only 1500 MB are allocated. The dmesg output indicates that we missed a rapid allocation of memory.
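For reference, a profile like the one attached above can be inspected locally along these lines (a sketch, assuming a Go toolchain, the matching cockroach binary, and the unzipped profile in the working directory; the unzipped file name is an assumption):

# Print the top allocation sites recorded in the heap profile.
go tool pprof -top ./cockroach memprof.fraction_system_memory.000000007273963520_2019-02-13T18_30_15.158
# Confirm the kernel OOM killer fired, per the dmesg output mentioned above
# (run on the affected node itself, e.g. via roachprod ssh).
dmesg | grep -i -e 'out of memory' -e 'oom-killer'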
This repros super reliably... hit it 10 minutes into the import on a new cluster, killing 4 nodes.
After talking it over with @tbg, I think this is in SQL Execution land. @jordanlewis
@dt any thoughts on the import OOM? (@jordanlewis will take a look at the Jobs table panic)
FYI this reproduces every time (most recently with OOMs).
@yuzefovich please try to reproduce this |
I just repro'd the 'Non-nullable column "jobs:status" with no value' panic too.
When running the 03/04 beta, I haven't seen the panic in the title of this issue so far. @dt I wonder whether there have been any changes between the 02/11 alpha and the 03/04 beta that could explain an increase in memory usage during imports. I will try running the 02/11 alpha next.
On a locally built 02/11 alpha on a 60-node cluster (the same cluster has been wiped and reused for all the runs):
This may be a dup of #35040, which I've been working on recently.
While looking into this with @dt, we found the reason for two out-of-disk errors I hit last week. What happens is that … The solution is to mount /mnt/data1 with the discard option (see the remount command in the repro steps below).
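As a quick check (a sketch, reusing the $CLUSTER variable from the repro steps below; whether a one-off fstrim is an adequate substitute for the discard mount option is an assumption):

# Verify the store's filesystem is mounted with the discard option.
roachprod run $CLUSTER -- "mount | grep /mnt/data1"
# One-off alternative: trim already-freed blocks on the mounted filesystem.
roachprod run $CLUSTER -- "sudo fstrim -v /mnt/data1"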
Oof. Sorry you had to find this as well. I filed this a couple of days ago: #35621
Finally, the import of 10k warehouses succeeded when I ran …
I lowered the priority since I haven't encountered the panic in the title of the issue, but I'm not sure whether it was fixed or not, so let's keep the issue open. OOM discussion will continue in #35773. |
Closing as unactionable - Yahor tried and failed to reproduce the panic. |
This is an OOM during import, without the panic you reported. That's why we had the other issue opened, which is correctly assigned to the Bulk IO team. This does not reproduce the problem that this issue was opened for, so I'm going to close again unless you have counterevidence. |
(#35773 was the OOM during import issue) |
I double-checked all of the failed nodes, and none of them have the panic on …
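For reference, a check along these lines can be run across the whole cluster (a sketch; the logs/ path is an assumption about roachprod's default log directory):

# Search every node's logs for the panic message from the issue title.
roachprod run $CLUSTER:1-60 -- "grep -Rn 'non-nullable column' logs/ || true"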
I don't think it has. See #36851 (comment).
Describe the problem
Two dead nodes during a TPC-C import on a 60-node, 4-CPU cluster, with this failure in the CLI:
To Reproduce
export CLUSTER=andy-60
roachprod create $CLUSTER -n 61 --clouds=aws --aws-machine-type-ssd=c5d.xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-60 cockroach
roachprod stage $CLUSTER:61 workload
roachprod start $CLUSTER:1-60 --racks=20 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
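While the import in the last step runs, the OOM kills discussed in the comments above can be watched for across the cluster (a sketch; the grep patterns are assumptions about the kernel's OOM log wording):

# Poll each node's kernel log for OOM-killer activity during the import.
roachprod run $CLUSTER:1-60 -- "dmesg | grep -i -e 'oom' -e 'killed process' | tail -n 5"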
Expected behavior
Restore to complete
Additional data / screenshots
This same panic happened on Node 1 and Node 8:
cockroach.log (Node 1)
cockroach.log (Node 8)
Environment:
v2.2.0-alpha.20190211-112-g72edf20