
Copying files to a dataset and then verifying shows mismatches #227

Open
EchterAgo opened this issue Jun 9, 2023 · 95 comments

@EchterAgo

I'm using rclone sync to copy ~5TB of data to a ZFS dataset created with these options:

zpool create -o ashift=12 -O atime=off -O compression=lz4 -O recordsize=1MB -O casesensitivity=insensitive -O normalization=formD Data raidz1 PHYSICALDRIVE0 PHYSICALDRIVE1 PHYSICALDRIVE4 PHYSICALDRIVE5 cache PHYSICALDRIVE2 PHYSICALDRIVE3

After the copy is done, I use rclone sync -c to fully verify the contents of the copied files, and it always comes up with differences. So far I've only seen it with larger files, mostly with a 150GB disk image file, but also with some ~30GB image files.

I used HxD to compare the files and it turns out there are sections filled with zeroes instead of the original data.

I'll try to diagnose this further.

@EchterAgo
Author

EchterAgo commented Jun 9, 2023

The hole in the file seems to start and end at a 4 KiB aligned offset and is 32 KiB in size. The data after the hole is identical again. I'm letting it continue scanning to see if there are further differences.
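
A minimal sketch of how such zeroed regions could be located programmatically, instead of comparing files by hand in HxD. The path is hypothetical, and a reported run only means the bytes are zero on disk, so it still needs to be compared against the source file:

CHUNK = 4096  # matches the 4 KiB alignment observed above

def find_zero_runs(path):
    """Return (offset, length) pairs for runs of all-zero 4 KiB chunks."""
    runs, run_start = [], None
    offset = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            if chunk.count(0) == len(chunk):   # chunk is entirely zero bytes
                if run_start is None:
                    run_start = offset
            elif run_start is not None:
                runs.append((run_start, offset - run_start))
                run_start = None
            offset += len(chunk)
    if run_start is not None:
        runs.append((run_start, offset - run_start))
    return runs

for start, length in find_zero_runs('D:\\testfile.img'):  # hypothetical path
    print(f'zero run at offset {start:#x}, length {length}')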

@EchterAgo
Author

The hard drives have

LogicalSectorSize  : 512
PhysicalSectorSize : 4096

and the cache SSDs have

LogicalSectorSize  : 512
PhysicalSectorSize : 512

@EchterAgo
Author

The corruption also remains when I reread the data.

@EchterAgo
Author

I checked another file and the corrupted region's start and end are again 4 KiB aligned, but this time it is larger: 49152 bytes (48 KiB).

@lundman

lundman commented Jun 10, 2023

Thanks for looking into this, I'll raise its priority since we definitely don't want any corruption. If it's sections of nulls, I wonder if it is a seek/hole issue. I will dig deeper. Is there a relatively easy way to make it happen, something close to a one-liner?

@EchterAgo
Author

I haven't found an easy way to reproduce it yet, but it does keep happening when I transfer a large amount of data. I'm currently trying to see if any rclone options help, like --local-no-preallocate and --local-no-sparse. I'll also check different dataset options, maybe disable compression. It takes a while to test due to the large data size and the small amount of data actually zeroed; in ~5TB of data only 80 KiB were actually bad. I will report back here when I have more information.

I also checked the logs for any errors but could not find anything, and the SMART values look good. Both systems I copy to/from have ECC memory, and the source system is a Linux system running ZFS. Previous attempts to copy the same data to either NTFS or ReFS worked without issue. I'll also try to get a test running locally that writes random data from a fixed seed and compares the result.

@EchterAgo
Author

I also tried copying data with Windows Explorer and haven't had any corruption yet. Windows Explorer copies files sequentially in a single thread, while rclone copies multiple files simultaneously and can even use multiple streams for a single file.

@sskras

sskras commented Jun 10, 2023

@EchterAgo commented 2 hours ago:

Both systems I copy to/from have ECC memory, the source system is a Linux system running ZFS. Previous attempts to copy the same data to either NTFS / ReFS worked without issue.

It would be nice to check whether it can be reproduced by the Linux or FreeBSD implementations of OpenZFS. At least the Linux one contains some hard-to-reproduce bugs too, e.g. openzfs#12014.

@EchterAgo
Author

I've been using the Linux ZFS port for a long time and I've never encountered corruption like this. The pool options are almost the same, except for case sensitivity, RAIDZ2 instead of RAIDZ1, and only one cache device. I also use rclone a lot to copy to the Linux pool and I do sanity-check the data; no problems so far.

Yesterday I used Windows Explorer to copy the full 5TB to the Windows pool and it seems everything copied correctly. I also tried rclone with preallocation and sparse files disabled and the corruption remained. I'm not sure I specified the options correctly for the remote though, so I'll retry this now.

@lundman

lundman commented Jun 14, 2023

I've given up on #223 for now, after 5 weeks, I need a break from it.

Sounds like I should try out rclone on Windows and see if I can figure out what goes wrong. Maybe it is something in seek/write/holes that goes wrong.

@sskras

sskras commented Jun 14, 2023

Thanks @lundman. Wow, you're a cool digger! Take a rest and take care. :)

And thanks for referencing the FileSpy tool, which I hadn't heard of before. It's a bit sad that it isn't open source, but it's impressive anyway! Cool. :)

@EchterAgo
Author

I've given up on #223 for now, after 5 weeks, I need a break from it.

Sounds like I should try out rclone on Windows and see if I can figure out what goes wrong. Maybe it is something in seek/write/holes that goes wrong.

I have also done more testing: none of --multi-thread-streams 0, --local-no-sparse, --local-no-preallocate and --transfers 1 makes rclone copy correctly, and I'm sure the options are applied correctly. Interestingly, I've done a few more copies with Windows Explorer and so far all of those worked. I've looked at the WriteFile calls: rclone seems to use 32KB writes, while Windows Explorer starts with 256KB, then increases to 768KB, then 1MB.

I plan to try some different filesystem options next. I'll also look into adding code to log zero data in some places.

@lundman

lundman commented Jun 14, 2023

When rclone runs and you watch it with FileSpy, are all the writes in sequence?

Like 0..32KB, 32KB..64KB, 64KB..96KB, and so on?

Or does it seek around a bit? I wonder if the "offset counter update" is wrong, or if rclone sends the offset each time.

@EchterAgo
Author

With the options I used it should just write sequentially with a single thread, but I'll verify this using FileSpy.

@EchterAgo
Author

FileSpy_20230614_rclone_write.zip

This is a capture of rclone writing the biggest file I have in the dataset. It does look sequential in 32KB chunks. There seems to be a FASTIO_WRITE call that fails in there. I'll also get a capture of Windows Explorer writing the same file.

I also wonder if it is feasible to just capture the whole copy; then I could write a small tool to locate the problematic offsets in the capture.

@lundman

lundman commented Jun 14, 2023

Ignore the FASTIO errors; Windows tries FASTIO first to see if we support it, and if not, it does a regular IRP call.
We have the FASTIO patch in a branch, but it complicates things, so I was going for stable first.

2	17:10:10.318		rclone.exe	IRP_MJ_WRITE	00060A00	FFFFBB06C3C12088	D:\data2.crypt	STATUS_SUCCESS	Offset: 00000003-7B3C7000 ToWrite: 8000 Written: 8000 
4	17:10:10.318		rclone.exe	IRP_MJ_WRITE	00060A00	FFFFBB06C3C12088	D:\data2.crypt	STATUS_SUCCESS	Offset: 00000003-7B3CF000 ToWrite: 8000 Written: 8000 
6	17:10:10.318		rclone.exe	IRP_MJ_WRITE	00060A00	FFFFBB06C3C12088	D:\data2.crypt	STATUS_SUCCESS	Offset: 00000003-7B3D7000 ToWrite: 8000 Written: 8000 
8	17:10:10.318		rclone.exe	IRP_MJ_WRITE	00060A00	FFFFBB06C3C12088	D:\data2.crypt	STATUS_SUCCESS	Offset: 00000003-7B3DF000 ToWrite: 8000 Written: 8000 

Hmm, what does IrpSp->Parameters.Write.ByteOffset.HighPart being 3 mean? Another special hardcoded value?

I am aware of these two:

		/* Sentinel LowPart value: write at the file object's current position. */
		if (IrpSp->Parameters.Write.ByteOffset.LowPart ==
		    FILE_USE_FILE_POINTER_POSITION) {
			byteOffset = fileObject->CurrentByteOffset;
		/* Sentinel LowPart value: append to the end of the file. */
		} else if (IrpSp->Parameters.Write.ByteOffset.LowPart ==
		    FILE_WRITE_TO_END_OF_FILE) { // APPEND

#define FILE_WRITE_TO_END_OF_FILE 0xffffffff
#define FILE_USE_FILE_POINTER_POSITION 0xfffffffe

@lundman

lundman commented Jun 14, 2023

Ah, OK, so this is just a snippet of a very large file? Then the offset is probably up to 0x000000037B3C7000. As long as the HighPart started at 0, then went 1, 2 and is now 3, that is fine. All write offsets should then end in 0x7000 and 0xf000 as long as the writes are 0x8000 bytes, and they certainly do in the snippet I have.
You can tell FileSpy to monitor the ZFS drive letter, then at the end do File->Save.

@EchterAgo
Author

Yes, the file is 150GB, so that is a normal file offset. I checked with smaller files and the high part of the offset is zero.

At the rate FileSpy's memory usage grows while capturing, I doubt I have enough memory to capture a whole rclone transfer. Is there any way to get FileSpy to log directly to disk?

@EchterAgo
Author

Initially I believed this only affected larger files, but I've since seen this corruption in 1GB files as well.

@EchterAgo
Author

I can just copy the biggest file over and over again and capture its events.
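
A rough sketch of the small offset-checking tool mentioned earlier. It assumes the FileSpy capture is saved as a text export with IRP_MJ_WRITE lines shaped like the snippet quoted above, filtered to a single file; the log file name and exact line layout are assumptions. It flags any write whose offset is not contiguous with the previous one, and any short write:

import re

# Matches the "Offset: HIGH-LOW ToWrite: N Written: N" fields (all hex) seen in
# the FileSpy lines quoted above; the export format is an assumption.
WRITE_RE = re.compile(
    r'IRP_MJ_WRITE.*Offset: ([0-9A-Fa-f]+)-([0-9A-Fa-f]+) '
    r'ToWrite: ([0-9A-Fa-f]+) Written: ([0-9A-Fa-f]+)')

expected = None
with open('filespy_capture.log', 'r', errors='replace') as log:  # hypothetical file name
    for line in log:
        m = WRITE_RE.search(line)
        if m is None:
            continue
        high, low, to_write, written = (int(g, 16) for g in m.groups())
        offset = (high << 32) | low
        if expected is not None and offset != expected:
            print(f'non-contiguous write: expected {expected:#x}, got {offset:#x}')
        if written != to_write:
            print(f'short write at {offset:#x}: {written:#x} of {to_write:#x} bytes')
        expected = offset + written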

@lundman

lundman commented Jun 21, 2023

So the I/O tests in MS test.exe highlight a gap we have:

Write to file at 0, PASS
Check position hasn't moved, PASS
Set position, FAIL (STATUS_NOT_IMPLEMENTED)
Check position has now moved, FAIL (CurrentByteOffset was 0, expected 6)
Read from file specifying offset, PASS
Check position, FAIL (CurrentByteOffset was 0, expected 6)
Read from file specifying offset as FILE_USE_FILE_POINTER_POSITION, PASS
Write to file specifying offset as FILE_USE_FILE_POINTER_POSITION, PASS
Set position to 0, FAIL (STATUS_NOT_IMPLEMENTED)
Write to file specifying offset as FILE_WRITE_TO_END_OF_FILE, PASS
Check position, PASS

So let me fix those before we test again.

@lundman

lundman commented Jun 21, 2023

56eca2a

Unlikely to affect things; you would have seen not-implemented errors in FileSpy.

@EchterAgo
Author

I tested with the latest changes and still see corruption. I also tested different sources for the copy, including local NTFS drives, with the same result.

So far, any time I copy enough data using rclone sync <src> <dst> and then verify it using rclone check --differ <differ.log> <src> <dst>, I get differences in differ.log, with different files affected each time even though I create a fresh zpool each time. When I examine the files there is a zeroed section whose start and end offsets are 4k aligned. So far I haven't found a file that has multiple such sections, but I haven't examined all of them.

I haven't found any rclone / zfs / zpool options that make a difference yet and there are no reported checksum or R/W errors.

@EchterAgo
Author

EchterAgo commented Jun 21, 2023

OK, I have a simple tool to reproduce the issue:

import uuid
import random
import math
import os
from pathlib import Path

ROOTPATH = Path('D:\\testdir')
DATA_BLOCK_SIZE = 32 * 1024  # 32 KiB writes, the same size rclone was seen using
DATA_NUM_BLOCKS = math.floor((1 * 1024 * 1024 * 1024) / DATA_BLOCK_SIZE)  # ~1 GiB per file
#DATA_NUM_BLOCKS = 3

def gen_file():
    filename = ROOTPATH / str(uuid.uuid4())
    # Remember the RNG state so the identical byte stream can be regenerated
    # for verification. random.randbytes requires Python 3.9+.
    start_state = random.getstate()

    with open(filename, 'wb') as f:
        for bi in range(0, DATA_NUM_BLOCKS):
            data_block = random.randbytes(DATA_BLOCK_SIZE)
            f.write(data_block)

    random.setstate(start_state)

    # Read the file back and compare each block against the regenerated data.
    with open(filename, 'rb') as f:
        for bi in range(0, DATA_NUM_BLOCKS):
            disk_data_block = f.read(DATA_BLOCK_SIZE)
            expected_data_block = random.randbytes(DATA_BLOCK_SIZE)

            if disk_data_block != expected_data_block:
                print(f'found mismatch in {filename} @ {bi * DATA_BLOCK_SIZE}')


def main():
    random.seed()

    os.makedirs(ROOTPATH, exist_ok=True)

    # Keep generating and verifying files until interrupted.
    while True:
        gen_file()

if __name__ == "__main__":
    main()

It still takes a while to hit the error: on my first try it wrote 888GB before hitting it. Same issue as with rclone, but now isolated to just writing to ZFS.

@EchterAgo
Author

You can set a breakpoint on the mismatch print to stop the program when something goes wrong. I'll try to get a trace at that point tomorrow.
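
As an alternative to a debugger breakpoint, the mismatch branch in the script above could dump both blocks and abort immediately, so the state can be captured right away. A rough sketch, with illustrative dump file names:

            if disk_data_block != expected_data_block:
                print(f'found mismatch in {filename} @ {bi * DATA_BLOCK_SIZE}')
                # Dump what was read back and what was expected for later comparison.
                with open(f'{filename}.block{bi}.disk', 'wb') as dump:
                    dump.write(disk_data_block)
                with open(f'{filename}.block{bi}.expected', 'wb') as dump:
                    dump.write(expected_data_block)
                os.abort()  # stop immediately so driver logs / cbuf can be collected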

@EchterAgo
Author

I also tested with the ZFS default options: same behavior. I'll check without cache devices, with different redundancy options, and with just a single device tomorrow.

@EchterAgo
Author

EchterAgo commented Jun 21, 2023

I also analyzed the file; it is the same zero-filled, 4k-aligned corruption. I'll try this on a clean Windows install in a VM next.

@EchterAgo
Author

On another try I hit corruption after 200GB.

@lundman

lundman commented Aug 24, 2023

Good news: it will compress really well!

Oh, and there is no zfs_disable_wincache=1 registry setting for this version (we know that works anyway); we'll bring it back if it eventually works, just letting you know it's not plumbed into the new code.

@EchterAgo
Author

I don't think there ever was a zfs_disable_wincache registry entry; I used the debugger to change it. But I would like to have that feature, and I can try implementing it myself and submitting a PR if that is something you want added.

@EchterAgo
Author

I still wonder why I didn't see the issue when trying without the L2ARC cache devices. Maybe it was just luck? I'll try this again.

@lundman

lundman commented Aug 24, 2023

Ah yep, I never did add the MODULE_PARAM lines around it.

@lundman

lundman commented Aug 25, 2023

Hah, yeah, literally the first if was wrong, so it skipped all of the new code.

@lundman

lundman commented Aug 25, 2023

0c3c274

@EchterAgo
Author

The first copy passed; I am going to keep testing a few more times.

@EchterAgo
Author

By now I am convinced that this is fixed, so I am going to close this. If I hit it again I will reopen.

@lundman

lundman commented Aug 26, 2023

That is exciting. I will massage it to fit the ZFS style a bit more, then move it into the main branches.

@lundman

lundman commented Aug 26, 2023

The read function is just as complicated, with special cases; even though we don't know of any issues there, it may be worth giving it the same treatment as well.

@EchterAgo
Author

It seems I set up remote access correctly, so I will be able to run some tests while away from home.

What I noticed yesterday is that my system mostly hangs (but does not crash) after running the copy, deleting all the files and restarting the copy. Immediately before this, almost all of the machine's memory is in use. After rebooting, importing the pool crashes in the unlinked drain code. I think I can trigger this quite reliably.

EchterAgo reopened this Aug 27, 2023
@lundman

lundman commented Aug 27, 2023

Let's have a separate issue for the unlinked drain bug; it should be easy (or at least easier) to fix, it just needs a stack trace.

Someone else mentioned we aren't freeing space on delete again; unsure how that happened, since we've not touched that area.

@EchterAgo
Author

This occurred again today. It does seem to occur less frequently. This is on a completely different computer.

EchterAgo reopened this Nov 16, 2023
@lundman

lundman commented Nov 16, 2023

What happened? "Copying files to a dataset and then verifying shows mismatches"?

@EchterAgo
Author

I copied with rclone sync, then rereading with rclone check showed a mismatch. Checking the differences with HxD showed a zeroed hole.

@lundman

lundman commented Nov 16, 2023

Boo, I don't like this news :)

I assume we aren't using the new clone-file feature; upstream is disabling it at the moment for making zero holes in files.

@EchterAgo
Author

I think this is still the same issue as before; it just occurs less frequently. I have now set up a fio job that keeps writing and verifying with the rclone I/O pattern:

[global]
ioengine=windowsaio
direct=0
iodepth=8
time_based
runtime=0
; the colon in the drive letter has to be escaped in fio's directory option
directory=D\:\testdir
filename_format=testfile_$jobname_$jobnum_$filenum.bin
; 32k sequential writes, matching the size rclone was observed using
bs=32k
size=16G
fallocate=native

[sequential_write]
rw=write
numjobs=8
; verify written data with xxhash, stop on the first mismatch and dump the bad blocks
verify=xxhash
verify_fatal=1
verify_dump=1
I'll run that for a bit and see if that produces anything.

@EchterAgo
Author

openzfs#15526 sounds very similar to this.

@lundman

lundman commented Nov 23, 2023

They are certainly panicking upstream at the moment. I don't think we do any clonefile? Although it does sound like it happens outside of clonefile as well, so it is possible.

Got anything for me to work on? I am adding the new CaseSensitive flag, but I'm almost done.

@EchterAgo
Author

They are certainly panicking upstream at the moment. I don't think we do any clonefile? Although it does sound like it happens outside of clonefile as well, so it is possible.

Got anything for me to work on? I am adding the new CaseSensitive flag, but I'm almost done.

Yeah, it does seem to happen even without clonefile. I don't have any issues right now other than the data corruption one. I'm still trying to get an API trace capture and a cbuf from right after it happens, but it is difficult.

@EchterAgo
Author

It seems that after your newwrite branch the frequency of the corruption went way down. So far I can't reproduce it using my test script anymore, nor with fio, but rclone still triggers it somehow, just much less frequently.

@EchterAgo
Author

The upstream issue also seems very timing sensitive. Maybe I should also try some different pool options.

@EchterAgo
Author

I'll revert to before the newwrite branch and try to reproduce with openzfs#15571.

@EchterAgo
Author

So, testing with c3c9364 (zfswin-2.2.0rc3), it is still easy for me to reproduce the problem, which rules out any change in my setup. I have now cherry-picked 76df45c on top of that and am rerunning my tests.

@EchterAgo
Author

Sadly, applying 76df45c did not change anything, so this is a different issue.

@lundman

lundman commented Jan 24, 2024

@EchterAgo You and I worked hard on these corruption issues, and I believe we fixed them all. #318

Can you let me know if this isn't the case?
