Fatal error during sideloading: IO error: No space left on deviceWhile appending to file #35178

Closed
awoods187 opened this issue Feb 25, 2019 · 18 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. T-disaster-recovery X-stale

Comments

@awoods187
Contributor

awoods187 commented Feb 25, 2019

Describe the problem

A fatal crash killed a node during a TPC-C import on a 6-node, 4-CPU cluster.

Andrews-MBP-2:~ andrewwoods$ roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=2000 --db=tpcc"
Error: importing fixture: importing table order_line: pq: communication error: rpc error: code = Canceled desc = context canceled
Error:  exit status 1

To Reproduce
roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.xlarge
roachprod run $CLUSTER:1-6 -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=2000 --db=tpcc"

Expected behavior
Import completed without a crash.

Additional data / screenshots

F190225 15:33:51.059003 171 storage/store.go:3613  [n3,s3,r2281/2:/Table/59/1/182{1/6/1…-3/5/2…}] during sideloading: during sideloading: IO error: No space left on deviceWhile appending to file: /mnt/data1/cockroach/auxiliary/sideloading/r0XXXX/r2281/i24.t6: No space left on device
goroutine 171 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b00, 0xc000057b60, 0x51fbb00, 0x10)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x596d480, 0xc000000004, 0x51fbbce, 0x10, 0xe1d, 0xc0004ba3c0, 0xec)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3914680, 0xc00739a0c0, 0x4, 0x2, 0x31567fa, 0x6, 0xc010d13eb0, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3914680, 0xc00739a0c0, 0x1, 0x4, 0x31567fa, 0x6, 0xc010d13eb0, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3914680, 0xc00739a0c0, 0x31567fa, 0x6, 0xc010d13eb0, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processReady(0xc000f80600, 0x3914680, 0xc00739a0c0, 0x8e9)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3613 +0x4f4
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc0004e9a80, 0x3914680, 0xc0005c53e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:214 +0x258
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x3914680, 0xc0005c53e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc0002b48c0, 0xc0004ab8c0, 0xc0002b48a0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0xe1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xa8

Environment:
v2.2.0-alpha.20190211-325-geaad50f

Jira issue: CRDB-4597

@awoods187
Contributor Author

cockroach.log

@tbg
Member

tbg commented Feb 25, 2019

Hi Andy, the error message indicates that there is no space left on the device, which means the hard drive filled up. This probably strikes you as weird too, because the UI seems to indicate that you're not running close to capacity. Could you SSH into the node that died and run the following?

roachprod ssh <thenodethatdied>

Then:

df -h

Then:

du -sch /mnt/data1/*

Please post the output. The last command lists the sizes of the subdirectories and files in the cockroach data dir.

@awoods187
Contributor Author

ubuntu@ip-172-31-46-139:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.8G     0  3.8G   0% /dev
tmpfs           764M  8.6M  755M   2% /run
/dev/nvme0n1p1  7.7G  1.2G  6.6G  15% /
tmpfs           3.8G     0  3.8G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/nvme1n1     92G   83G  3.9G  96% /mnt/data1
tmpfs           764M     0  764M   0% /run/user/1000
ubuntu@ip-172-31-46-139:~$ du -sch /mnt/data1/*
83G	/mnt/data1/cockroach
du: cannot read directory '/mnt/data1/lost+found': Permission denied
16K	/mnt/data1/lost+found
83G	total

@tbg
Member

tbg commented Feb 25, 2019

Oh, sorry, I wanted du -sch /mnt/data1/cockroach/*.

How big is the TPCC-2000 dataset (measured as typical disk usage per node post-import)? The data directory contains ~90GB and is close to full. I don't know offhand whether that is expected with the import you're running.

Also paging @mjibson in case this is one of those instances of one node having to sample everything. How would we find out if that were true? I guess we'd see a large temp instance dir, right?

@maddyblue
Contributor

I would like to know the size of the temp instance dir, yes. Also, just a week or so ago @dt implemented a change that makes import recognize when a workload URL is present and skip the sampling phase completely. I'm not sure whether that's been turned on in the workload tool yet.

@dt
Member

dt commented Feb 25, 2019

I made a couple of changes, but they're separate:

The workload URL one does "direct" synthesis of workload datums without going to/from CSV encoding to bytes via an io.Writer. It is off by default, but even without it, workload fixtures import still hooks the CSV reader right up to a workload CSV producer without any disk I/O in between, so that shouldn't have any effect except on speed.

The other change isn't specific to workload URLs at all; it is the one that skips sampling and sorting. It is also off by default, though it can be turned on/off on a per-statement basis (or with a flag on workload).

@awoods187
Contributor Author

@tbg
Member

tbg commented Feb 26, 2019 via email

@dt
Member

dt commented Feb 26, 2019

Can that cluster even fit a tpcc2k import?

My back-of-envelope math is that tpcc 100 is ~22GB, so tpcc 1k is 220GB and tpcc 2k is 440GB?

DistSQL sorted import needs 2x the total import size available, so it has room to buffer everything it will import in temp storage before it starts ingesting it.

Taking that back-of-envelope math a bit further, each node should expect to need to buffer ~75GB and to ingest ~75GB, so you need at least 150GB disks on every node. Sampling isn't perfect, so we expect some unevenness in distribution, so we generally say you need some extra margin on top of those numbers too.
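As a rough sanity check, that arithmetic works out as follows (the 22GB-per-100-warehouses figure is the estimate above, not a measurement):

echo $(( 22 * 20 ))       # ≈ 440 GB total unreplicated import size
echo $(( 440 / 6 ))       # ≈ 73 GB ingested per node
echo $(( 2 * 440 / 6 ))   # ≈ 146 GB needed per node (temp buffer + ingested data)

On these numbers, the ~92GB data volumes shown in the df output above fall well short of the ~150GB per node this estimate calls for.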

@awoods187
Contributor Author

Good question @dt! This was using the c5d.xlarge machine type, which comes with 100GB per disk.

With six nodes this should have been sufficient to handle loading TPC-C 2K, as that is 160GB unreplicated and therefore somewhere around 500GB replicated. Admittedly it's a bit closer than we usually push it, but it should have been able to handle it.

Now, because of the write amplification, it might push well above the limit for the cluster overall. However, I wouldn't expect that to kill the cluster until much later in the import.
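By the same kind of rough arithmetic (using the 160GB unreplicated figure above, the ~92GB usable per data volume from the df output, and assuming the default 3x replication):

echo $(( 160 * 3 ))   # ≈ 480 GB replicated across the cluster
echo $(( 92 * 6 ))    # ≈ 552 GB raw capacity across the six data volumes

which is why this sizing is closer to the limit than usual, even before accounting for import's temporary buffering.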

@dt
Member

dt commented Feb 26, 2019

I don't think there's anything to do here.

We already know and document that the current IMPORT requires 2x available space; the node ran out of space, and the logged error message says as much. In the future, the no-buffer, sortless IMPORT might reduce that requirement, but for now I think this is behaving as documented?

@dt
Member

dt commented Feb 26, 2019

That said, @awoods187, if you want to try the new experimental no-sort import, you could try passing --experimental-direct-ingestion to that workload command and see what it does (my guess is that it'll crash in a new and different way).
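For reference, that would presumably be the import command from the repro steps with the extra flag appended, e.g.:

roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=2000 --db=tpcc --experimental-direct-ingestion"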

@awoods187 awoods187 added the S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. label Feb 26, 2019
@tbg tbg assigned dt and unassigned tbg Feb 26, 2019
@tbg
Member

tbg commented Feb 26, 2019

Thanks @dt! I'll let you close as you see fit.

@awoods187
Contributor Author

I don't think we should close this issue. It's definitely an S-3-ux-surprise even if the write amplification is a "known problem." This is not good UX and we should fix it.

@awoods187 awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 6, 2019
@dt dt removed their assignment Mar 11, 2019
@awoods187
Contributor Author

awoods187 commented Mar 22, 2019

@tbg should we close this now that you've backported the removal of this panic?

@tbg tbg assigned dt Mar 22, 2019
@tbg
Member

tbg commented Mar 22, 2019

@awoods187 are you posting in the right issue? The panic here is that nodes run out of space.

@awoods187
Contributor Author

Good call, I actually meant #34224.

@dt dt removed their assignment Jun 1, 2021
@github-actions

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
