Fix duplicated 8+3 filenames for long file name entries #71

Open
rtreffer wants to merge 1 commit into master

Conversation


@rtreffer rtreffer commented Aug 25, 2020

We need two pieces to generate unique 8+3 names from arbitrary long file names: a fast lookup map of existing names to detect collisions, and a way to generate alternative names when a collision occurs.

The lookup map is generated lazily when the first entry is added, so the normal loading logic does not degrade in performance.

Several existing truncation approaches are listed on https://en.wikipedia.org/wiki/8.3_filename#VFAT_and_Computer-generated_8.3_filenames:

  1. Shorten the name to 6 characters, then append ~1 .. ~9
  2. Shorten the name to 5 characters, then append ~10 .. ~99
  3. Shorten the name to 2 characters, then append 4 characters derived from a hash of the original filename

An example of a problematic folder would be the Big Buck Bunny PNG download (see https://media.xiph.org/BBB/BBB-360-png/). All files share the same prefix, and resolving collisions by sequential numbering has quadratic cost.
Hashing is usually used to limit that runtime complexity.

This PR implements a variation of 1. + 3. + random, where random is simple random-number-based file name generation used as a last resort.
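For illustration, a minimal sketch of that tiered fallback in Go could look like the following (the helper name, the taken map and the hash choice are hypothetical, not the actual PR code):

	// pickShortBase picks a free 8.3 base name for a long file name, given the
	// set of short names already used in the directory.
	// needs "fmt", "hash/fnv" and "math/rand"
	func pickShortBase(base string, taken map[string]bool) string {
		prefix := base
		if len(prefix) > 6 {
			prefix = prefix[:6]
		}
		// 1. first six characters plus ~1 .. ~9
		for i := 1; i <= 9; i++ {
			c := fmt.Sprintf("%s~%d", prefix, i)
			if !taken[c] {
				return c
			}
		}
		// 3. two characters plus four hex digits from a hash of the full name
		h := fnv.New32a()
		h.Write([]byte(base))
		for i := 1; i <= 9; i++ {
			c := fmt.Sprintf("%.2s%04X~%d", base, (h.Sum32()+uint32(i))&0xFFFF, i)
			if !taken[c] {
				return c
			}
		}
		// last resort: six random hex digits plus ~N
		for {
			c := fmt.Sprintf("%06X~%d", rand.Int63()%0x1000000, 1+rand.Intn(8))
			if !taken[c] {
				return c
			}
		}
	}

(A real implementation also has to uppercase the name, strip characters that are invalid in 8.3 names, and handle the extension separately.)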

WIP: I am still not truly happy with the collision avoidance method; I would like feedback on a better / easier approach.

After applying this patch and the other open PRs, fsck.vfat finds only one minor issue on a disk filled with complex content: the go-diskfs fat32 implementation does not track the number of free clusters. Everything else looks good from an on-disk layout perspective.

@deitch
Collaborator

deitch commented Aug 27, 2020

To be clear, you are basically addressing this todo?

@rtreffer
Author

Exactly. The issue at hand is that if I write a Raspberry Pi boot folder, there are many dtbs with similar prefixes (bcm27...-). These get written out with the same short name, and fsck.vfat treats entries with the same short name as the same file, thus complaining about a "duplicated directory entry".

The wiki page has a few nice pointers on how to solve this. The ~1 .. ~9 scheme is very human friendly IMHO. The hashing is less friendly but avoids runtime complexity issues. Both fall apart at >10k directory entries, as they eventually run out of possible names, thus random as a last resort....

The basic hashmap of all entries should be less controversial and easy to add. I might extract the collision-avoiding SFN search from the rest of the method body; that should make it well contained.
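For what it's worth, the lazily built lookup itself can stay tiny, something like this (type and field names are made up for the sketch):

	// sfnTaken reports whether an 8.3 name is already used in the directory.
	// The map is only built when the first new entry gets added, so the normal
	// directory-loading path pays no extra cost.
	func (d *directory) sfnTaken(name string) bool {
		if d.shortNames == nil {
			d.shortNames = make(map[string]bool, len(d.entries))
			for _, e := range d.entries {
				d.shortNames[e.shortName] = true
			}
		}
		return d.shortNames[name]
	}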

@deitch
Collaborator

deitch commented Aug 27, 2020

This PR implements a variation of 1. + 3. + random, where random is simple random-number-based file name generation used as a last resort.

I didn't understand what you wrote here. It looks (from the code) like you do the following:

  1. Try to do the first option, i.e. "first 6 chars + tilde + sequential digit", eventually using up 9 slots. This is what the Windows algorithm does. If all of those 9 slots are taken,
  2. Try to do the Windows NT (and later) hashing
  3. Do something with random

WIP: I am still not truly happy with the collision avoidance method; I would like feedback on a better / easier approach.

To what do you refer?

@rtreffer
Author

rtreffer commented Aug 27, 2020

I am referring to the fact that

  1. I would like to know what you think :-)
  2. I am not happy with the implementation yet

The hashing used by NT and onwards is unknown, and the Wikipedia page does not list an avoidance algorithm, but there must be one, as collisions start to become likely for a 16-bit number once you go beyond a few hundred files.
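(Back-of-the-envelope, my own arithmetic: a 16-bit value has 2^16 = 65536 possibilities, and the birthday bound puts the 50% collision point at roughly

	n \approx \sqrt{2 \cdot 2^{16} \cdot \ln 2} \approx 300

files, which is where "a few hundred" comes from.)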

@deitch
Collaborator

deitch commented Aug 27, 2020

The hashing used by NT and onwards is unknown, and the Wikipedia page does not list an avoidance algorithm, but there must be one, as collisions start to become likely for a 16-bit number once you go beyond a few hundred files

What I think? Pretty simple, actually.

  • What we have now doesn't handle even a single conflict, which means anything is better. Even just the 9-file limitation of [0-9A-Z]~[1-9] would be a material improvement.
  • Adding the second stage of hashing is even better, as it gets us somewhere in the range of hundreds or possibly thousands of files before a likely collision.

In all seriousness, this would take us from a guaranteed collision with >1 file, to a guaranteed collision with >9 files (an order of magnitude improvement), to a possible collision only with hundreds of files (a 2+ order of magnitude improvement). It isn't perfect? So what.

Now that I understand it, I like most everything in here. I have some suggestions on the coding style, but will put those inline.

Collaborator

@deitch deitch left a comment


Just some stylistic questions, but this really is quite good.

@@ -413,11 +416,47 @@ func convertLfnSfn(name string) (string, string, bool, bool) {
		isLFN = true
	}

	truncate := func(name string, n int) string {
		return name[0:6] + "~" + strconv.Itoa(n)
	}

I am not madly in love with inline anonymous functions here, as they will get declared every time we call convertLfnSfn, and they are harder to understand and test.

Can we pull these three out into named functions, and give them more representative names?
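
For example (just a sketch of the direction, the name is only a suggestion):

	// truncateWithTilde builds an 8.3 base name of the form NAME~N from the
	// first six characters of name.
	func truncateWithTilde(name string, n int) string {
		return name[0:6] + "~" + strconv.Itoa(n)
	}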


	random := func() string {
		return fmt.Sprintf("%06X", (rand.Int63()%0x1000000)) + "~" + strconv.Itoa(1+rand.Intn(8))
	}

random is a bit too generic of a name, and probably will confuse. We should pull this out too, but make it more specifically-named.
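
Something like this, perhaps (the name is only a suggestion):

	// randomShortNameBase produces a last-resort 8.3 base name from six random
	// hex digits plus a ~N suffix.
	func randomShortNameBase() string {
		return fmt.Sprintf("%06X", rand.Int63()%0x1000000) + "~" + strconv.Itoa(1+rand.Intn(8))
	}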

	for i < 9 && cache[truncate(shortName, i)+"."+extension] {
		i++
	}
	shortName = truncate((shortName), i)

This calls truncate() redundantly several times. Can we tighten it up a bit?

Also, why is this:

	i := 1
	for i < 9 && cache[truncate(shortName, i)+"."+extension] {
		i++
	}

and not the simpler and more idiomatic:

	var found string
	for i := 1; i < 9; i++ {
		found = truncate(shortName, i) + "." + extension
		if _, ok := cache[found]; !ok {
			break
		}
	}
	shortName = found

or similar
