Fatal error during sideloading: IO error: No space left on device While appending to file #35178
Comments
Hi Andy, the error message indicates that there is no space left on the device, which means the hard drive filled up. This probably strikes you as weird too, because the UI seems to indicate that you're not running close to capacity.
Then:
Then:
Please post the output. The last command lists the sizes of the subdirectories and files in the cockroach data dir.
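(A rough stand-in for the requested checks, since the exact commands weren't preserved in this comment; the /mnt/data1/cockroach path is an assumption based on the roachprod defaults used in the repro steps below:)
# overall usage on the store volume
df -h /mnt/data1
# sizes of the subdirectories and files in the cockroach data dir
du -sh /mnt/data1/cockroach/*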
Oh, sorry, I wanted: how big is the TPCC-2000 dataset (as measured by typical disk usage per node post-import)? The data directory contains ~90GB and is close to full. I don't know offhand if that is expected with the import you're running. Also paging @mjibson in case this is one of those instances of one node having to sample everything. How would we find out if that were true? I guess we'd see a large temp instance dir, right?
I would like to know the size of the temp instance dir, yes. Also, just a week or so ago @dt implemented a thing that makes import recognize when a workload URL is present and skip the sampling phase completely. Not sure if that's been turned on in the workload tool yet.
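(One way to check the temp instance dir size, assuming the default roachprod store path and the usual cockroach-temp* naming for temp directories, which may differ in other setups:)
du -sh /mnt/data1/cockroach/cockroach-temp*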
I made a couple of changes, but they're separate: The other change I made isn't specific to workload URLs at all, but that is the one that skips sampling and sorting. It is also off by default, though it can be turned on/off on a per-statement basis (or with a flag on
36gb in temp first, @mjibson. Is that expected?
…On Tue, Feb 26, 2019, 00:16 Andy Woods wrote:
@tbg <https://github.com/tbg> here it is
https://gist.github.com/awoods187/02d708ac2d5017c1e573008e4ade49ad
Can that cluster even fit a tpcc-2k import? My back-of-envelope math is that tpcc-100 is ~22GB, so tpcc-1k is ~220GB and tpcc-2k is ~440GB. DistSQL sorted import needs 2x the total import size available, so it has room to buffer everything it will import in temp before it starts ingesting it. Taking that back-of-envelope math a bit further, each node should expect to need to buffer ~75GB and to ingest ~75GB, so you need at least 150GB disks on every node. Sampling isn't perfect, so we expect some unevenness in distribution, so we generally say you need some extra margin on those numbers too.
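(Sketch of that arithmetic, using the rough sizes quoted in this comment rather than measured values:)
TPCC_100_GB=22                      # rough size of tpcc-100
TPCC_2K_GB=$((TPCC_100_GB * 20))    # scales roughly linearly -> ~440GB
NODES=6
PER_NODE_GB=$((TPCC_2K_GB / NODES)) # ~73GB ingested per node
NEEDED_GB=$((PER_NODE_GB * 2))      # 2x: temp buffer plus ingested data
echo "each node needs roughly ${NEEDED_GB}GB free, plus margin for uneven sampling"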
Good question @dt! This was using the c5d.xlarge machine type, which comes with 100GB per disk. With six nodes this should have been sufficient to handle loading TPC-C 2K, as that is 160GB unreplicated and therefore somewhere around 500GB replicated. Admittedly it's a bit closer than we usually push it, but it should have been able to handle it. Now, because of the write amplification, it might push well above the limit for the cluster overall. However, I wouldn't expect that to kill the cluster until much later in the import.
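(Rough capacity check for those numbers, assuming 3x replication and the ~160GB unreplicated estimate from this comment:)
UNREPLICATED_GB=160
REPLICATED_GB=$((UNREPLICATED_GB * 3))   # ~480GB of replicated data
CLUSTER_GB=$((6 * 100))                  # six nodes at ~100GB each = 600GB raw
echo "~${REPLICATED_GB}GB replicated vs ~${CLUSTER_GB}GB raw capacity"
# little headroom once the 2x IMPORT temp buffer and write amplification are added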
I don't think there's anything to do here. We already know and document that the current IMPORT requires 2x available space; the node ran out of space and the logged error message says as much. In the future, the no-buffer, sortless IMPORT might reduce that requirement, but for now I think this is behaving as documented?
That said, @awoods187, if you want to try the new experimental no-sort import, you could try passing
Thanks @dt! I'll let you close as you see fit.
I don't think we should close this issue. It's definitely an S-3-ux-surprise even if the write amplification is a "known problem." This is not good UX and we should fix it.
@awoods187 are you posting in the right issue? The panic here is that nodes run out of space. |
Good call, I actually meant #34224.
We have marked this issue as stale because it has been inactive for |
Describe the problem
A fatal crash killed a node during a TPC-C import on a 6-node, 4-CPU test.
To Reproduce
roachprod create $CLUSTER -n 7 --clouds=aws --aws-machine-type-ssd=c5d.xlarge
roachprod run $CLUSTER:1-6 -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-6 cockroach
roachprod stage $CLUSTER:7 workload
roachprod start $CLUSTER:1-6
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=2000 --db=tpcc"
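(To watch for this failure mode while the import runs, free space on the store volumes can be spot-checked with something like the following; the path matches the mount used above:)
roachprod run $CLUSTER:1-6 -- "df -h /mnt/data1"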
Expected behavior
Import completed without a crash.
Additional data / screenshots
Environment:
v2.2.0-alpha.20190211-325-geaad50f
Jira issue: CRDB-4597