
Improve the workflow of xfs devices filesystem check and mount #132

Closed
wants to merge 5 commits

Conversation

@27149chen (Member):

This PR improves the method checkAndRepairXfsFilesystem, which is used before mounting an xfs device.
The changes are as follows:

  1. Retry up to three times if xfs_repair fails.
  2. Add a new function "replayXfsDirtyLogs" to replay dirty logs by mounting and immediately unmounting the filesystem.
  3. Fix some test cases and add some more.

A rough Go sketch of the resulting flow is included after the references below.

Background:

  1. If xfs_repair fails to repair the file system successfully, try giving the same xfs_repair command twice more; xfs_repair may be able to make more repairs on successive runs.
  2. Due to the design of the XFS log, a dirty log can only be replayed by the kernel, on a machine having the same CPU architecture as the machine which was writing to the log. xfs_repair cannot replay a dirty log and will exit with a status code of 2 when it detects a dirty log. In this situation, the log can be replayed by mounting and immediately unmounting the filesystem on the same class of machine that crashed.

References:

  1. http://man7.org/linux/man-pages/man8/xfs_repair.8.html
  2. http://fibrevillage.com/storage/666-how-to-repair-a-xfs-filesystem
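
For illustration, here is a minimal, self-contained Go sketch of the intended flow. The function names roughly mirror the PR (checkAndRepairXfsFilesystem / replayXfsDirtyLogs), but this is a sketch under assumptions, not the actual implementation: it shells out to mount/umount and xfs_repair directly, whereas the real code goes through the library's mounter and exec interfaces in mount/mount_linux.go.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// replayXfsDirtyLog replays a dirty XFS log by mounting and immediately
// unmounting the device at a temporary directory. It stands in for the PR's
// replayXfsDirtyLogs helper; names and error handling here are illustrative.
func replayXfsDirtyLog(device string) error {
	tmp, err := os.MkdirTemp("", "xfs-log-replay")
	if err != nil {
		return err
	}
	defer os.RemoveAll(tmp)

	if out, err := exec.Command("mount", device, tmp).CombinedOutput(); err != nil {
		return fmt.Errorf("mounting %s to replay the dirty log failed: %v, output: %s", device, err, out)
	}
	if out, err := exec.Command("umount", tmp).CombinedOutput(); err != nil {
		return fmt.Errorf("unmounting %s after log replay failed: %v, output: %s", tmp, err, out)
	}
	return nil
}

// checkAndRepairXfs retries xfs_repair up to three times, because xfs_repair
// may be able to make more repairs on successive runs. Exit status 2 means a
// dirty log, which only the kernel can replay, so the log is replayed with a
// mount/unmount cycle before retrying.
func checkAndRepairXfs(device string) error {
	for attempt := 1; attempt <= 3; attempt++ {
		out, err := exec.Command("xfs_repair", device).CombinedOutput()
		if err == nil {
			return nil
		}
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) && exitErr.ExitCode() == 2 {
			if rerr := replayXfsDirtyLog(device); rerr != nil {
				return rerr
			}
			continue
		}
		fmt.Printf("xfs_repair attempt %d on %s failed: %v, output: %s\n", attempt, device, err, out)
	}
	return fmt.Errorf("could not repair xfs filesystem on %s after 3 attempts", device)
}

func main() {
	// "/dev/sdX" is a placeholder device path.
	if err := checkAndRepairXfs("/dev/sdX"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```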

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 17, 2020
@27149chen (Member Author):

/assign @jsafrane

@27149chen (Member Author):

/assign @gnufied

@27149chen (Member Author):

@jsafrane @gnufied could anyone help take a look? Thanks.

@27149chen (Member Author):

/assign @jingxu97

@27149chen (Member Author):

/unassign @gnufied

@27149chen (Member Author):

/assign @saad-ali

@27149chen (Member Author):

@saad-ali @jingxu97 could you help take a look at this PR? Thanks.

@27149chen (Member Author):

/unassign @jsafrane

@27149chen (Member Author):

/assign @thockin

@27149chen (Member Author):

/unassign @saad-ali

@27149chen (Member Author):

/assign @dims

@dims (Member) commented Feb 5, 2020:

approve in principle ... please get a LGTM from sig-storage folks

mount/mount_linux.go (outdated review thread):
defer os.RemoveAll(target)

klog.V(4).Infof("Attempting to mount disk %s at %s", source, target)
if err := mounter.Interface.Mount(source, target, "", []string{"defaults"}); err != nil {
@gnufied (Member) commented Feb 5, 2020:
Why not use target field given in the mount function as mount location rather than tempdir? Could a kubelet/driver crash leave the volume mounted in tempdir and never get cleaned up?

@27149chen (Member Author):
I think the target might not exist or might not be available for mounting for some reason. A tempdir is safer for the temporary mount and unmount, and we make sure it is deleted in the end.
But you are right, it is a risk if the caller crashes between the mount and the unmount.

@27149chen (Member Author):

On second thought, it is the responsibility of formatAndMount's caller to ensure that the target exists and is available, so I think we can use the target field here.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: 27149chen
To complete the pull request process, please assign dims
You can assign the PR to them by writing /assign @dims in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@27149chen (Member Author):

/assign @gnufied

@saad-ali (Member) left a comment:

/assign @jsafrane
Jan are you more familiar with XFS?

@gnufied (Member) commented Feb 7, 2020:

mostly lgtm.

@27149chen Do you know why we are seeing filesystem corruption on volumes restored from snapshots? Is that because snapshots were taken in inconsistent state? Is there work tracked somewhere to fix the root cause of the problem too?

@27149chen (Member Author) commented Feb 10, 2020:

> mostly lgtm.
>
> @27149chen Do you know why we are seeing filesystem corruption on volumes restored from snapshots? Is that because snapshots were taken in inconsistent state? Is there work tracked somewhere to fix the root cause of the problem too?

@gnufied yes.
When a snapshot is taken while there is continuous IO in the container, some data may still be cached in the filesystem buffers, so the filesystem metadata (superblock) and the log can end up inconsistent. That is why we need to repair the corruption before mounting the filesystem.
I think the ideal approach is to freeze the pod before taking the snapshot. We do not support that yet, and there are still discussions in progress, for example container-storage-interface/spec#407.

}
}()
klog.V(4).Infof("Attempting to mount disk %s at %s", source, target)
if err := mounter.Interface.Mount(source, target, "", []string{"defaults"}); err != nil {
@gnufied (Member):

From man-page:

> In this situation, the log can be replayed by mounting and immediately unmounting the filesystem on the same class of machine that crashed. Please make sure that the machine's hardware is reliable before replaying to avoid compounding the problems.

Are we being too aggressive in automatically fixing errors here? Can this make problems worse somehow? Should this be configurable? If we merge this PR as it is - it will become the new default even for in-tree Kubernetes drivers, so we have to be careful.

@27149chen (Member Author):

@gnufied do you mean that the node might be unreliable or not the same class of machine, so there is a potential risk?

@27149chen (Member Author):

@gnufied, could you continue your review, please? I asked a question above; could you reply to it? Thanks.

@27149chen 27149chen mentioned this pull request Mar 1, 2020
@gnufied (Member) commented Mar 24, 2020:

@27149chen I am hoping someone who knows XFS better than me has a chance to review this PR before it gets merged. To me it looks good, but my XFS knowledge is lacking, and this is a big enough change that it could affect all of k8s, including in-tree drivers (which are strictly in maintenance mode).

@27149chen (Member Author):

> @27149chen I am hoping someone who knows XFS better than me has a chance to review this PR before it gets merged. To me it looks good, but my XFS knowledge is lacking, and this is a big enough change that it could affect all of k8s, including in-tree drivers (which are strictly in maintenance mode).

@gnufied, please add your lgtm label if it looks good to you.
I will try to find an xfs expert to help review it.

@k8s-ci-robot (Contributor):

@27149chen: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 25, 2020
@sandeen commented Mar 25, 2020:

Hi - I'm the upstream xfsprogs maintainer, and have been an XFS developer for nearly 20 years. The problem you're trying to solve here was recently brought to my attention.

It's my understanding that all of this effort to run xfs_repair stems from the fact that non-quiesced filesystems are getting snapshotted. This inevitably leads to corruption and data loss, which cannot be un-done by xfs_repair. To be honest, the only acceptable solution to the problem that this PR seems to be trying to solve is to go back and fix the snapshotting procedure to ensure that you have properly quiesced the filesystem (by unmounting it, or by issuing the FIFREEZE ioctl to the filesystem either via xfs_freeze, or directly), prior to the snapshot.

Anything else, honestly, is simply not going to work, and will consistently lead to data loss.
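
As an illustration of that suggestion, here is a minimal Go sketch of quiescing an XFS filesystem with xfs_freeze around a snapshot; the takeSnapshot hook and the mount point are hypothetical stand-ins for whatever the storage backend actually does.

```go
package main

import (
	"fmt"
	"os/exec"
)

// freezeAndSnapshot quiesces the XFS filesystem mounted at mountPoint with
// `xfs_freeze -f` (which issues the FIFREEZE ioctl), runs the snapshot, and
// then thaws the filesystem with `xfs_freeze -u`. takeSnapshot is a
// hypothetical hook for whatever storage backend actually takes the snapshot.
func freezeAndSnapshot(mountPoint string, takeSnapshot func() error) error {
	if out, err := exec.Command("xfs_freeze", "-f", mountPoint).CombinedOutput(); err != nil {
		return fmt.Errorf("freezing %s failed: %v, output: %s", mountPoint, err, out)
	}
	// Always thaw, even if the snapshot fails, so the filesystem does not
	// remain frozen and block all writers.
	defer exec.Command("xfs_freeze", "-u", mountPoint).Run()

	return takeSnapshot()
}

func main() {
	// Placeholder mount point and snapshot step.
	err := freezeAndSnapshot("/mnt/data", func() error {
		fmt.Println("take the snapshot here while the filesystem is frozen")
		return nil
	})
	if err != nil {
		fmt.Println(err)
	}
}
```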

@27149chen (Member Author) commented Mar 28, 2020:

> Hi - I'm the upstream xfsprogs maintainer, and have been an XFS developer for nearly 20 years. The problem you're trying to solve here was recently brought to my attention.
>
> It's my understanding that all of this effort to run xfs_repair stems from the fact that non-quiesced filesystems are getting snapshotted. This inevitably leads to corruption and data loss, which cannot be un-done by xfs_repair. To be honest, the only acceptable solution to the problem that this PR seems to be trying to solve is to go back and fix the snapshotting procedure to ensure that you have properly quiesced the filesystem (by unmounting it, or by issuing the FIFREEZE ioctl to the filesystem either via xfs_freeze, or directly), prior to the snapshot.
>
> Anything else, honestly, is simply not going to work, and will consistently lead to data loss.

@sandeen thank you very much for your professional comment, it really helps a lot.
Here are some questions I want to ask:

  1. You can see that we run fsck on every remount, but as you know that does not work for xfs. So I want to know when, and in which situations, we should run xfs_repair in our code.
  2. I understand that the best way is to go back and fix the snapshotting procedure. But what if an issue still occurs after a snapshot (whether or not the procedure is correct)? What should we do? If we cannot repair it by running xfs_repair, is there any other way?
  3. You may notice that always running fsck is not good practice, as I said in [mount] It is expensive to run fsck on every remount #137. Do you know how we can check whether the filesystem is healthy after mounting, so that we can run fsck only after we detect a problem instead of running it every time?

@sandeen commented Mar 28, 2020:

> 1. You can see that we run `fsck` on every remount, but as you know that does not work for xfs. So I want to know when, and in which situations, we should run `xfs_repair` in our code.

There is no reason to fsck/repair any metadata journaling filesystem before every mount. This is what the metadata log is for - it ensures consistency after a crash/power loss/etc.

For journaling filesystems, fsck/repair tools only need to be used after filesystem corruption has been detected, or if for some reason you need to verify filesystem integrity prior to some administrative operation. (For example, ext4 recommends a full e2fsck to validate the filesystem before doing a resize, because resize can be a very invasive, possibly risky operation if corruption is encountered during the operation.)

> 2. I understand that the best way is to go back and fix the snapshotting procedure. But what if an issue still occurs after a snapshot (whether or not the procedure is correct)? What should we do? If we cannot repair it by running `xfs_repair`, is there any other way?

If you are properly snapshotting the filesystem, xfs_repair won't be needed. xfs_repair should be an exceptional activity, rare enough that it is done manually when intervention is required, i.e. something goes wrong (disk bit flip, admin error, code bug, etc.), xfs notices the corruption and shuts down, the administrator notices the error and runs xfs_repair.
Please note fsck tools like xfs_repair are not data recovery tools. All they can do is put the filesystem back in a consistent state. Sometimes this is done by repairing errors, and sometimes it is done by discarding badly corrupted parts of the filesystem.

Please understand the difference: xfs_repair will put a filesystem back into a consistent state. But it will not fully recover the prior filesystem state if you do something like take a non-atomic snapshot of the device.

> 3. You may notice that always running `fsck` is not good practice, as I said in #137. Do you know how we can check whether the filesystem is healthy after mounting, so that we can run `fsck` only after we detect a problem instead of running it every time?

XFS has runtime checks - both CRC verification as well as structure integrity validation - on every metadata read and write from disk. If it finds an error, in most cases it will shut down the filesystem. In general there should be no need to check the filesystem while it's mounted.
Can I ask why you have a need to frequently "check if the filesystem is running well after mounting?"
If you really do want/need to do this, there are a couple of options:

  1. use xfs_scrub - which is currently experimental
  2. properly snapshot the device, and run xfs_repair -n on that snapshot (sketched below)
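
For illustration, a minimal Go sketch of option 2, assuming (per the xfs_repair(8) man page) that `xfs_repair -n` makes no modifications and exits with status 1 when corruption is detected; the device path is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// xfsCheckReadOnly runs `xfs_repair -n` against a device (ideally a snapshot,
// never a mounted filesystem). With -n, no modifications are made; per the
// xfs_repair(8) man page, exit status 1 means corruption was detected and 0
// means the filesystem is clean.
func xfsCheckReadOnly(device string) (clean bool, err error) {
	out, err := exec.Command("xfs_repair", "-n", device).CombinedOutput()
	if err == nil {
		return true, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 1 {
		// The check ran successfully but found problems; report them and let
		// the caller decide whether to repair.
		fmt.Printf("corruption detected on %s:\n%s\n", device, out)
		return false, nil
	}
	return false, fmt.Errorf("xfs_repair -n on %s failed: %v, output: %s", device, err, out)
}

func main() {
	// "/dev/snapshot-of-sdX" is a placeholder for a snapshot device.
	if clean, err := xfsCheckReadOnly("/dev/snapshot-of-sdX"); err == nil && clean {
		fmt.Println("filesystem is clean")
	}
}
```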

@27149chen (Member Author):

@sandeen, thank you for your explanation. My understanding is that we don't need to run fsck/xfs_repair before mounting in our code: we should mount directly, and if the mount succeeds we don't need to do anything, while if it fails we should return an error and ask the user to fix it manually. Am I right?

@dims dims removed their assignment Apr 29, 2020
@sandeen commented May 1, 2020:

> @sandeen, thank you for your explanation. My understanding is that we don't need to run fsck/xfs_repair before mounting in our code: we should mount directly, and if the mount succeeds we don't need to do anything, while if it fails we should return an error and ask the user to fix it manually. Am I right?

Apologies for the late reply, I didn't see the notification about your question.

That would be my suggestion, yes.
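
For illustration, a minimal Go sketch of what that simplified flow could look like (hypothetical names, shelling out to mount rather than using the library's mounter interface): mount directly and, on failure, surface an error asking for manual inspection and repair instead of running fsck/xfs_repair automatically.

```go
package main

import (
	"fmt"
	"os/exec"
)

// mountXfs mounts an xfs device without running fsck/xfs_repair first. If the
// mount fails, it returns an error that asks for manual inspection and repair
// instead of attempting any automatic fix.
func mountXfs(device, target string) error {
	out, err := exec.Command("mount", "-t", "xfs", device, target).CombinedOutput()
	if err != nil {
		return fmt.Errorf("mounting %s at %s failed: %v, output: %s; the filesystem may be corrupted and may require manual xfs_repair", device, target, err, out)
	}
	return nil
}

func main() {
	// Placeholder device and target paths.
	if err := mountXfs("/dev/sdX", "/mnt/data"); err != nil {
		fmt.Println(err)
	}
}
```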

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 30, 2020
@fejta-bot:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 29, 2020
@fejta-bot:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor):

@fejta-bot: Closed this PR.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
