
Improve the workflow of xfs devices filesystem check and mount #132

Closed
wants to merge 5 commits

Conversation

@27149chen (Member):

This PR improves the method checkAndRepairXfsFilesystem, which is used before mounting an xfs device.
The changes are as follows:

  1. Retry up to three times if xfs_repair fails.
  2. Add a new function "replayXfsDirtyLogs" to replay dirty logs by mounting and immediately unmounting the filesystem.
  3. Fix some test cases and add some more.

A rough Go sketch of the resulting flow is included after the references below.

Background:

  1. If xfs_repair fails to repair the file system successfully, try giving the same xfs_repair command twice more; xfs_repair may be able to make more repairs on successive runs.
  2. Due to the design of the XFS log, a dirty log can only be replayed by the kernel, on a machine having the same CPU architecture as the machine which was writing to the log. xfs_repair cannot replay a dirty log and will exit with a status code of 2 when it detects a dirty log. In this situation, the log can be replayed by mounting and immediately unmounting the filesystem on the same class of machine that crashed.

References:

  1. http://man7.org/linux/man-pages/man8/xfs_repair.8.html
  2. http://fibrevillage.com/storage/666-how-to-repair-a-xfs-filesystem
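
For illustration, here is a minimal, self-contained Go sketch of the intended flow. The function names roughly mirror the PR (checkAndRepairXfsFilesystem / replayXfsDirtyLogs), but this is a sketch under assumptions, not the actual implementation: it shells out to mount/umount and xfs_repair directly, whereas the real code goes through the library's mounter and exec interfaces in mount/mount_linux.go.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// replayXfsDirtyLog replays a dirty XFS log by mounting and immediately
// unmounting the device at a temporary directory. It stands in for the PR's
// replayXfsDirtyLogs helper; names and error handling here are illustrative.
func replayXfsDirtyLog(device string) error {
	tmp, err := os.MkdirTemp("", "xfs-log-replay")
	if err != nil {
		return err
	}
	defer os.RemoveAll(tmp)

	if out, err := exec.Command("mount", device, tmp).CombinedOutput(); err != nil {
		return fmt.Errorf("mounting %s to replay the dirty log failed: %v, output: %s", device, err, out)
	}
	if out, err := exec.Command("umount", tmp).CombinedOutput(); err != nil {
		return fmt.Errorf("unmounting %s after log replay failed: %v, output: %s", tmp, err, out)
	}
	return nil
}

// checkAndRepairXfs retries xfs_repair up to three times, because xfs_repair
// may be able to make more repairs on successive runs. Exit status 2 means a
// dirty log, which only the kernel can replay, so the log is replayed with a
// mount/unmount cycle before retrying.
func checkAndRepairXfs(device string) error {
	for attempt := 1; attempt <= 3; attempt++ {
		out, err := exec.Command("xfs_repair", device).CombinedOutput()
		if err == nil {
			return nil
		}
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) && exitErr.ExitCode() == 2 {
			if rerr := replayXfsDirtyLog(device); rerr != nil {
				return rerr
			}
			continue
		}
		fmt.Printf("xfs_repair attempt %d on %s failed: %v, output: %s\n", attempt, device, err, out)
	}
	return fmt.Errorf("could not repair xfs filesystem on %s after 3 attempts", device)
}

func main() {
	// "/dev/sdX" is a placeholder device path.
	if err := checkAndRepairXfs("/dev/sdX"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```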

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 17, 2020
@27149chen (Member Author):

/assign @jsafrane

@27149chen (Member Author):

/assign @gnufied

@27149chen (Member Author):

@jsafrane @gnufied could anyone help take a look? Thanks.

@27149chen (Member Author):

/assign @jingxu97

@27149chen (Member Author):

/unassign @gnufied

@27149chen (Member Author):

/assign @saad-ali

@27149chen (Member Author):

@saad-ali @jingxu97 could you help take a look at this PR? Thanks.

@27149chen (Member Author):

/unassign @jsafrane

@27149chen (Member Author):

/assign @thockin

@27149chen (Member Author):

/unassign @saad-ali

@27149chen (Member Author):

/assign @dims

@dims (Member) commented Feb 5, 2020:

approve in principle ... please get a LGTM from sig-storage folks

mount/mount_linux.go (outdated review thread):
defer os.RemoveAll(target)

klog.V(4).Infof("Attempting to mount disk %s at %s", source, target)
if err := mounter.Interface.Mount(source, target, "", []string{"defaults"}); err != nil {
@gnufied (Member) commented Feb 5, 2020:
Why not use target field given in the mount function as mount location rather than tempdir? Could a kubelet/driver crash leave the volume mounted in tempdir and never get cleaned up?

@27149chen (Member Author):
I think the target might not exist or might not be available for mounting for some reason. A tempdir is safer for the temporary mount and unmount, and we make sure it is deleted in the end.
But you are right, it is a risk if the caller crashes between the mount and the unmount.

@27149chen (Member Author):

On second thought, it is the responsibility of formatAndMount's caller to ensure that the target exists and is available, so I think we can use the target field here.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: 27149chen
To complete the pull request process, please assign dims
You can assign the PR to them by writing /assign @dims in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@27149chen (Member Author):

/assign @gnufied

@saad-ali (Member) left a comment:

/assign @jsafrane
Jan are you more familiar with XFS?

@gnufied (Member) commented Feb 7, 2020:

mostly lgtm.

@27149chen Do you know why we are seeing filesystem corruption on volumes restored from snapshots? Is that because snapshots were taken in inconsistent state? Is there work tracked somewhere to fix the root cause of the problem too?

@27149chen (Member Author) commented Feb 10, 2020:

> mostly lgtm.
>
> @27149chen Do you know why we are seeing filesystem corruption on volumes restored from snapshots? Is that because snapshots were taken in inconsistent state? Is there work tracked somewhere to fix the root cause of the problem too?

@gnufied yes.
When a snapshot is taken while there is continuous IO in the container, some data may still be cached in the filesystem buffers, so the filesystem metadata (superblock) and the log can end up inconsistent. That is why we need to repair the corruption before mounting the filesystem.
I think the ideal approach is to freeze the pod before taking the snapshot. We do not support that yet, and there are still discussions in progress, for example container-storage-interface/spec#407.

}
}()
klog.V(4).Infof("Attempting to mount disk %s at %s", source, target)
if err := mounter.Interface.Mount(source, target, "", []string{"defaults"}); err != nil {
@gnufied (Member):

From man-page:

> In this situation, the log can be replayed by mounting and immediately unmounting the filesystem on the same class of machine that crashed. Please make sure that the machine's hardware is reliable before replaying to avoid compounding the problems.

Are we being too aggressive in automatically fixing errors here? Can this make problems worse somehow? Should this be configurable? If we merge this PR as it is - it will become the new default even for in-tree Kubernetes drivers, so we have to be careful.

@27149chen (Member Author):

@gnufied do you mean that the node might be unreliable or not the same class of machine, so there is a potential risk?

@27149chen (Member Author):

@gnufied, could you continue your review, please? I asked a question above; could you reply to it? Thanks.

@27149chen 27149chen mentioned this pull request Mar 1, 2020
@gnufied (Member) commented Mar 24, 2020:

@27149chen I am hoping someone who knows XFS better than me has a chance to review this PR before it gets merged. To me it looks good, but my XFS knowledge is lacking, and this is a big enough change that it could affect all of k8s, including in-tree drivers (which are strictly in maintenance mode).

@27149chen (Member Author):

> @27149chen I am hoping someone who knows XFS better than me has a chance to review this PR before it gets merged. To me it looks good, but my XFS knowledge is lacking, and this is a big enough change that it could affect all of k8s, including in-tree drivers (which are strictly in maintenance mode).

@gnufied, please add your lgtm label if it looks good to you.
I will try to find an xfs expert to help review it.

@k8s-ci-robot (Contributor):

@27149chen: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 25, 2020
@sandeen commented Mar 25, 2020:

Hi - I'm the upstream xfsprogs maintainer, and have been an XFS developer for nearly 20 years. The problem you're trying to solve here was recently brought to my attention.

It's my understanding that all of this effort to run xfs_repair stems from the fact that non-quiesced filesystems are getting snapshotted. This inevitably leads to corruption and data loss, which cannot be un-done by xfs_repair. To be honest, the only acceptable solution to the problem that this PR seems to be trying to solve is to go back and fix the snapshotting procedure to ensure that you have properly quiesced the filesystem (by unmounting it, or by issuing the FIFREEZE ioctl to the filesystem either via xfs_freeze, or directly), prior to the snapshot.

Anything else, honestly, is simply not going to work, and will consistently lead to data loss.
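
As an illustration of that suggestion, here is a minimal Go sketch of quiescing an XFS filesystem with xfs_freeze around a snapshot; the takeSnapshot hook and the mount point are hypothetical stand-ins for whatever the storage backend actually does.

```go
package main

import (
	"fmt"
	"os/exec"
)

// freezeAndSnapshot quiesces the XFS filesystem mounted at mountPoint with
// `xfs_freeze -f` (which issues the FIFREEZE ioctl), runs the snapshot, and
// then thaws the filesystem with `xfs_freeze -u`. takeSnapshot is a
// hypothetical hook for whatever storage backend actually takes the snapshot.
func freezeAndSnapshot(mountPoint string, takeSnapshot func() error) error {
	if out, err := exec.Command("xfs_freeze", "-f", mountPoint).CombinedOutput(); err != nil {
		return fmt.Errorf("freezing %s failed: %v, output: %s", mountPoint, err, out)
	}
	// Always thaw, even if the snapshot fails, so the filesystem does not
	// remain frozen and block all writers.
	defer exec.Command("xfs_freeze", "-u", mountPoint).Run()

	return takeSnapshot()
}

func main() {
	// Placeholder mount point and snapshot step.
	err := freezeAndSnapshot("/mnt/data", func() error {
		fmt.Println("take the snapshot here while the filesystem is frozen")
		return nil
	})
	if err != nil {
		fmt.Println(err)
	}
}
```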

@27149chen (Member Author) commented Mar 28, 2020:

> Hi - I'm the upstream xfsprogs maintainer, and have been an XFS developer for nearly 20 years. The problem you're trying to solve here was recently brought to my attention.
>
> It's my understanding that all of this effort to run xfs_repair stems from the fact that non-quiesced filesystems are getting snapshotted. This inevitably leads to corruption and data loss, which cannot be un-done by xfs_repair. To be honest, the only acceptable solution to the problem that this PR seems to be trying to solve is to go back and fix the snapshotting procedure to ensure that you have properly quiesced the filesystem (by unmounting it, or by issuing the FIFREEZE ioctl to the filesystem either via xfs_freeze, or directly), prior to the snapshot.
>
> Anything else, honestly, is simply not going to work, and will consistently lead to data loss.

@sandeen thank you very much for your professional comment, it really helps a lot.
Here are some questions I want to ask:

  1. You can see that we run fsck on every remount, but as you know that does not work for xfs. So I want to know when, and in which situations, we should run xfs_repair in our code.
  2. I understand that the best way is to go back and fix the snapshotting procedure. But what if an issue still occurs after a snapshot (whether or not the procedure is correct)? What should we do? If we cannot repair it by running xfs_repair, is there any other way?
  3. You may notice that always running fsck is not good practice, as I said in [mount] It is expensive to run fsck on every remount #137. Do you know how we can check whether the filesystem is healthy after mounting, so that we can run fsck only after we detect a problem instead of running it every time?

@sandeen commented Mar 28, 2020:

> 1. You can see that we run `fsck` on every remount, but as you know that does not work for xfs. So I want to know when, and in which situations, we should run `xfs_repair` in our code.

There is no reason to fsck/repair any metadata journaling filesystem before every mount. This is what the metadata log is for - it ensures consistency after a crash/power loss/etc.

For journaling filesystems, fsck/repair tools only need to be used after filesystem corruption has been detected, or if for some reason you need to verify filesystem integrity prior to some administrative operation. (For example, ext4 recommends a full e2fsck to validate the filesystem before doing a resize, because resize can be a very invasive, possibly risky operation if corruption is encountered during the operation.)

> 2. I understand that the best way is to go back and fix the snapshotting procedure. But what if an issue still occurs after a snapshot (whether or not the procedure is correct)? What should we do? If we cannot repair it by running `xfs_repair`, is there any other way?

If you are properly snapshotting the filesystem, xfs_repair won't be needed. xfs_repair should be an exceptional activity, rare enough that it is done manually when intervention is required, i.e. something goes wrong (disk bit flip, admin error, code bug, etc.), xfs notices the corruption and shuts down, the administrator notices the error and runs xfs_repair.
Please note fsck tools like xfs_repair are not data recovery tools. All they can do is put the filesystem back in a consistent state. Sometimes this is done by repairing errors, and sometimes it is done by discarding badly corrupted parts of the filesystem.

Please understand the difference: xfs_repair will put a filesystem back into a consistent state. But it will not fully recover the prior filesystem state if you do something like take a non-atomic snapshot of the device.

> 3. You may notice that always running `fsck` is not good practice, as I said in #137. Do you know how we can check whether the filesystem is healthy after mounting, so that we can run `fsck` only after we detect a problem instead of running it every time?

XFS has runtime checks - both CRC verification as well as structure integrity validation - on every metadata read and write from disk. If it finds an error, in most cases it will shut down the filesystem. In general there should be no need to check the filesystem while it's mounted.
Can I ask why you have a need to frequently "check if the filesystem is running well after mounting?"
If you really do want/need to do this, there are a couple of options:

  1. use xfs_scrub - which is currently experimental
  2. properly snapshot the device, and run xfs_repair -n on that snapshot (sketched below)
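
For illustration, a minimal Go sketch of option 2, assuming (per the xfs_repair(8) man page) that `xfs_repair -n` makes no modifications and exits with status 1 when corruption is detected; the device path is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// xfsCheckReadOnly runs `xfs_repair -n` against a device (ideally a snapshot,
// never a mounted filesystem). With -n, no modifications are made; per the
// xfs_repair(8) man page, exit status 1 means corruption was detected and 0
// means the filesystem is clean.
func xfsCheckReadOnly(device string) (clean bool, err error) {
	out, err := exec.Command("xfs_repair", "-n", device).CombinedOutput()
	if err == nil {
		return true, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 1 {
		// The check ran successfully but found problems; report them and let
		// the caller decide whether to repair.
		fmt.Printf("corruption detected on %s:\n%s\n", device, out)
		return false, nil
	}
	return false, fmt.Errorf("xfs_repair -n on %s failed: %v, output: %s", device, err, out)
}

func main() {
	// "/dev/snapshot-of-sdX" is a placeholder for a snapshot device.
	if clean, err := xfsCheckReadOnly("/dev/snapshot-of-sdX"); err == nil && clean {
		fmt.Println("filesystem is clean")
	}
}
```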

@27149chen (Member Author):

@sandeen, thank you for your explanation. My understanding is that we don't need to run fsck/xfs_repair before mounting in our code: we should mount directly, and if the mount succeeds we don't need to do anything, while if it fails we should return an error and ask the user to fix it manually. Am I right?

@dims dims removed their assignment Apr 29, 2020
@sandeen commented May 1, 2020:

> @sandeen, thank you for your explanation. My understanding is that we don't need to run fsck/xfs_repair before mounting in our code: we should mount directly, and if the mount succeeds we don't need to do anything, while if it fails we should return an error and ask the user to fix it manually. Am I right?

Apologies for the late reply, I didn't see the notification about your question.

That would be my suggestion, yes.
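
For illustration, a minimal Go sketch of what that simplified flow could look like (hypothetical names, shelling out to mount rather than using the library's mounter interface): mount directly and, on failure, surface an error asking for manual inspection and repair instead of running fsck/xfs_repair automatically.

```go
package main

import (
	"fmt"
	"os/exec"
)

// mountXfs mounts an xfs device without running fsck/xfs_repair first. If the
// mount fails, it returns an error that asks for manual inspection and repair
// instead of attempting any automatic fix.
func mountXfs(device, target string) error {
	out, err := exec.Command("mount", "-t", "xfs", device, target).CombinedOutput()
	if err != nil {
		return fmt.Errorf("mounting %s at %s failed: %v, output: %s; the filesystem may be corrupted and may require manual xfs_repair", device, target, err, out)
	}
	return nil
}

func main() {
	// Placeholder device and target paths.
	if err := mountXfs("/dev/sdX", "/mnt/data"); err != nil {
		fmt.Println(err)
	}
}
```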

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 30, 2020
@fejta-bot:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 29, 2020
@fejta-bot:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor):

@fejta-bot: Closed this PR.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
