
storage: Test behavior when disk fills up #19656

Closed · bdarnell opened this issue Oct 30, 2017 · 3 comments
Labels

A-storage: Relating to our storage engine (Pebble) on-disk storage.
A-testing: Testing tools and infrastructure.
C-cleanup: Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.
S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@bdarnell (Contributor) commented Oct 30, 2017

Prior to #19447, certain disk errors (the most likely being ENOSPC) were not being handled correctly, and we suspect that inconsistent reads could have been served once this happened. We need more testing of our behavior after disk writes have failed.

One way to do this would be a process that alternately writes a file to fill up the disk (or maybe just fallocate()), waits a bit, then deletes the file (and restarts the cockroach process if it crashed). Maybe this would make sense as a new Jepsen nemesis.
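For concreteness, here is a minimal Go sketch of such a loop. It is not from this issue: the ballast path, sleep intervals, process check, and systemd restart command are all assumptions, and a real version would likely live inside a Jepsen nemesis or roachtest rather than as a standalone binary.

```go
// Sketch of the proposed disk-filling nemesis (Linux-only; all paths and
// intervals below are assumptions, not taken from this issue).
package main

import (
	"log"
	"os"
	"os/exec"
	"path/filepath"
	"time"

	"golang.org/x/sys/unix"
)

const ballastPath = "/mnt/data1/ballast" // assumption: lives on the store's filesystem

// fillDisk reserves (roughly) all remaining space on the ballast's filesystem
// via fallocate(), without actually writing the bytes.
func fillDisk() error {
	var st unix.Statfs_t
	if err := unix.Statfs(filepath.Dir(ballastPath), &st); err != nil {
		return err
	}
	free := int64(st.Bavail) * st.Bsize
	f, err := os.Create(ballastPath)
	if err != nil {
		return err
	}
	defer f.Close()
	return unix.Fallocate(int(f.Fd()), 0, 0, free)
}

// cockroachAlive reports whether a cockroach process is still running
// (assumption: pgrep is available on the host).
func cockroachAlive() bool {
	return exec.Command("pgrep", "-x", "cockroach").Run() == nil
}

func main() {
	for {
		if err := fillDisk(); err != nil {
			log.Printf("fill: %v", err) // hitting ENOSPC here is part of the exercise
		}
		time.Sleep(30 * time.Second) // leave the disk full for a while

		if err := os.Remove(ballastPath); err != nil && !os.IsNotExist(err) {
			log.Fatalf("remove ballast: %v", err)
		}
		if !cockroachAlive() {
			// assumption: the node is managed by systemd; any restart hook works.
			if err := exec.Command("systemctl", "restart", "cockroach").Run(); err != nil {
				log.Fatalf("restart cockroach: %v", err)
			}
		}
		time.Sleep(30 * time.Second) // let the node recover before the next cycle
	}
}
```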

bdarnell added the C-cleanup, A-storage, and A-testing labels on Apr 26, 2018
@tbg (Member) commented May 21, 2018

Adding this to Jepsen might still make sense for the correctness aspect, but we already have the infrastructure for doing this available in roachtest-land. The test that comes to mind is:

  1. start a cluster with some background workload (one of the scaledata correctness tests comes to mind) and with a ballast file on one node
  2. fill up disk on the node with the ballast file
  3. wait until the process crashes
  4. verify that background workload does not stall (#7882: client hangs after one node hits disk errors)
  5. delete ballast file and restart node
  6. verify that node becomes healthy and participates in cluster again (see the health-check sketch below)
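As a rough illustration of the last two steps, here is a sketch of the kind of health polling the test could do after deleting the ballast and restarting the node. CockroachDB does serve an HTTP /health endpoint, but the address, timeout, and polling interval are assumptions, and a 200 response only shows the process is back up and serving HTTP; a real test would additionally verify that the node rejoins the cluster and that the workload resumed.

```go
// Hypothetical check for steps 5-6: poll the restarted node's /health endpoint
// until it answers again. Address and timeout are illustrative assumptions.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitUntilHealthy polls the node's HTTP /health endpoint until it returns
// 200 OK or the deadline expires.
func waitUntilHealthy(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	client := &http.Client{Timeout: 2 * time.Second}
	for time.Now().Before(deadline) {
		resp, err := client.Get("http://" + addr + "/health")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("node %s did not report healthy within %s", addr, timeout)
}

func main() {
	// assumption: HTTP address of the node that was restarted in the test cluster.
	if err := waitUntilHealthy("localhost:8080", 5*time.Minute); err != nil {
		panic(err)
	}
	fmt.Println("node is healthy again")
}
```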

tbg added the S-1-stability and S-3-ux-surprise labels and removed the S-1-stability label on May 21, 2018
@tbg tbg added this to the 2.1 milestone Jul 22, 2018
@bdarnell bdarnell modified the milestones: 2.1, 2.2 Aug 15, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@tbg (Member) commented Oct 11, 2018

#31187 also added the infrastructure to run on charybdefs and inject these errors, in case we don't want to use fallocate.

@tbg (Member) commented Oct 11, 2018

Folding this into #7882 (the opposite direction from what I originally planned).

@tbg tbg closed this as completed Oct 11, 2018