Fatal error during sideloading: IO error: No space left on deviceWhile appending to file #35178

Closed
awoods187 opened this issue Feb 25, 2019 · 18 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. T-disaster-recovery X-stale

Comments

@awoods187
Contributor

awoods187 commented Feb 25, 2019

Describe the problem

A fatal crash killed a node during a TPC-C import on a 6-node, 4-CPU cluster.

Andrews-MBP-2:~ andrewwoods$ roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=2000 --db=tpcc"
Error: importing fixture: importing table order_line: pq: communication error: rpc error: code = Canceled desc = context canceled
Error:  exit status 1

To Reproduce
roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.xlarge
roachprod run $CLUSTER:1-6 -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=2000 --db=tpcc"

Expected behavior
Import completed without a crash.

Additional data / screenshots

F190225 15:33:51.059003 171 storage/store.go:3613  [n3,s3,r2281/2:/Table/59/1/182{1/6/1…-3/5/2…}] during sideloading: during sideloading: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/auxiliary/sideloading/r0XXXX/r2281/i24.t6: No space left on device
goroutine 171 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b00, 0xc000057b60, 0x51fbb00, 0x10)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x596d480, 0xc000000004, 0x51fbbce, 0x10, 0xe1d, 0xc0004ba3c0, 0xec)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3914680, 0xc00739a0c0, 0x4, 0x2, 0x31567fa, 0x6, 0xc010d13eb0, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3914680, 0xc00739a0c0, 0x1, 0x4, 0x31567fa, 0x6, 0xc010d13eb0, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3914680, 0xc00739a0c0, 0x31567fa, 0x6, 0xc010d13eb0, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processReady(0xc000f80600, 0x3914680, 0xc00739a0c0, 0x8e9)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3613 +0x4f4
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc0004e9a80, 0x3914680, 0xc0005c53e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:214 +0x258
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x3914680, 0xc0005c53e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc0002b48c0, 0xc0004ab8c0, 0xc0002b48a0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0xe1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xa8

Environment:
v2.2.0-alpha.20190211-325-geaad50f

Jira issue: CRDB-4597

@awoods187
Contributor Author

cockroach.log

@tbg
Member

tbg commented Feb 25, 2019

Hi Andy, the error message indicates that there is no space left on the device, which means the hard drive filled up. This probably strikes you as weird too, because the UI seems to indicate that you're not running close to capacity. Could you SSH into the node that died and run the following?

roachprod ssh <thenodethatdied>

Then:

df -h

Then:

du -sch /mnt/data1/*

Please post the output. The last command lists the sizes of the subdirectories and files in the cockroach data dir.

@awoods187
Contributor Author

ubuntu@ip-172-31-46-139:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.8G     0  3.8G   0% /dev
tmpfs           764M  8.6M  755M   2% /run
/dev/nvme0n1p1  7.7G  1.2G  6.6G  15% /
tmpfs           3.8G     0  3.8G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/nvme1n1     92G   83G  3.9G  96% /mnt/data1
tmpfs           764M     0  764M   0% /run/user/1000
ubuntu@ip-172-31-46-139:~$ du -sch /mnt/data1/*
83G	/mnt/data1/cockroach
du: cannot read directory '/mnt/data1/lost+found': Permission denied
16K	/mnt/data1/lost+found
83G	total

@tbg
Member

tbg commented Feb 25, 2019

Oh, sorry, I wanted du -sch /mnt/data1/cockroach/*.

How big is the TPCC-2000 dataset (measured as typical disk usage per node post-import)? The data directory contains ~90GB and is close to full. I don't know offhand whether that is expected with the import you're running.

Also paging @mjibson in case this is one of those instances of one node having to sample everything. How would we find out if that were true? I guess we'd see a large temp instance dir, right?

@maddyblue
Contributor

I would like to know the size of the temp instance dir, yes. Also, just a week or so ago @dt implemented a change that makes import recognize when a workload URL is present and skip the sampling phase completely. I'm not sure whether that's been turned on in the workload tool yet.

@dt
Member

dt commented Feb 25, 2019

I made a couple of changes, but they're separate:

The workload URL one does "direct" synthesis of workload datums without going to/from CSV encoding to bytes via an io.Writer. It is off by default, but even without it, workload fixtures import still hooks the CSV reader right up to a workload CSV producer without any disk I/O in between, so that shouldn't have any effect except on speed.

The other change isn't specific to workload URLs at all; it is the one that skips sampling and sorting. It is also off by default, though it can be turned on/off on a per-statement basis (or with a flag on workload).

@awoods187
Contributor Author

@tbg
Member

tbg commented Feb 26, 2019 via email

@dt
Member

dt commented Feb 26, 2019

Can that cluster even fit a tpcc2k import?

My back-of-envelope math is that tpcc 100 is ~22GB, so tpcc 1k is 220GB and tpcc 2k is 440GB?

DistSQL sorted import needs 2x the total import size available, so it has room to buffer everything it will import in temp storage before it starts ingesting it.

Taking that back-of-envelope math a bit further, each node should expect to need to buffer ~75GB and to ingest ~75GB, so you need at least 150GB disks on every node. Sampling isn't perfect, so we expect some unevenness in distribution, so we generally say you need some extra margin on top of those numbers too.
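As a rough sanity check, that arithmetic works out as follows (the 22GB-per-100-warehouses figure is the estimate above, not a measurement):

echo $(( 22 * 20 ))       # ≈ 440 GB total unreplicated import size
echo $(( 440 / 6 ))       # ≈ 73 GB ingested per node
echo $(( 2 * 440 / 6 ))   # ≈ 146 GB needed per node (temp buffer + ingested data)

On these numbers, the ~92GB data volumes shown in the df output above fall well short of the ~150GB per node this estimate calls for.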

@awoods187
Contributor Author

Good question @dt! This was using the c5d.xlarge machine type, which comes with 100GB per disk.

With six nodes this should have been sufficient to handle loading TPC-C 2K, as that is 160GB unreplicated and therefore somewhere around 500GB replicated. Admittedly it's a bit closer than we usually push it, but it should have been able to handle it.

Now, because of the write amplification, it might push well above the limit for the cluster overall. However, I wouldn't expect that to kill the cluster until much later in the import.
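By the same kind of rough arithmetic (using the 160GB unreplicated figure above, the ~92GB usable per data volume from the df output, and assuming the default 3x replication):

echo $(( 160 * 3 ))   # ≈ 480 GB replicated across the cluster
echo $(( 92 * 6 ))    # ≈ 552 GB raw capacity across the six data volumes

which is why this sizing is closer to the limit than usual, even before accounting for import's temporary buffering.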

@dt
Member

dt commented Feb 26, 2019

I don't think there's anything to do here.

We already know and document that the current IMPORT requires 2x available space; the node ran out of space, and the logged error message says as much. In the future, the no-buffer, sortless IMPORT might reduce that requirement, but for now I think this is behaving as documented?

@dt
Member

dt commented Feb 26, 2019

That said, @awoods187, if you want to try the new experimental no-sort import, you could try passing --experimental-direct-ingestion to that workload command and see what it does (my guess is that it'll crash in a new and different way).
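For reference, that would presumably be the import command from the repro steps with the extra flag appended, e.g.:

roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=2000 --db=tpcc --experimental-direct-ingestion"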

@awoods187 awoods187 added the S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. label Feb 26, 2019
@tbg tbg assigned dt and unassigned tbg Feb 26, 2019
@tbg
Member

tbg commented Feb 26, 2019

Thanks @dt! I'll let you close as you see fit.

@awoods187
Contributor Author

I don't think we should close this issue. It's definitely an S-3-ux-surprise even if the write amplification is a "known problem." This is not good UX and we should fix it.

@awoods187 awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 6, 2019
@dt dt removed their assignment Mar 11, 2019
@awoods187
Contributor Author

awoods187 commented Mar 22, 2019

@tbg should we close this now that you've backported the removal of this panic?

@tbg tbg assigned dt Mar 22, 2019
@tbg
Member

tbg commented Mar 22, 2019

@awoods187 are you posting in the right issue? The panic here is that nodes run out of space.

@awoods187
Contributor Author

Good call, I actually meant #34224.

@dt dt removed their assignment Jun 1, 2021
@github-actions

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
