Fix duplicated 8+3 filenames for long file name entries #71

Open
rtreffer wants to merge 1 commit into master

Conversation


@rtreffer rtreffer commented Aug 25, 2020

We need two pieces to generate unique 8+3 names from arbitrary long file names: a fast lookup map of existing names to detect collisions, and a way to generate alternative names when a collision occurs.

The lookup map is generated lazily when the first entry is added, so the normal loading logic does not degrade in performance.

Several existing truncation approaches are listed on https://en.wikipedia.org/wiki/8.3_filename#VFAT_and_Computer-generated_8.3_filenames:

  1. Shorten the name to 6 characters, then append ~1 .. ~9
  2. Shorten the name to 5 characters, then append ~10 .. ~99
  3. Shorten the name to 2 characters, then append 4 characters derived from a hash of the original filename

An example of a problematic folder would be the Big Buck Bunny PNG download (see https://media.xiph.org/BBB/BBB-360-png/). All files share the same prefix, and resolving collisions by sequential numbering has quadratic cost.
Hashing is usually used to limit that runtime complexity.

This PR implements a variation of 1. + 3. + random, where random is simple random-number-based file name generation used as a last resort.
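For illustration, a minimal sketch of that tiered fallback in Go could look like the following (the helper name, the taken map and the hash choice are hypothetical, not the actual PR code):

	// pickShortBase picks a free 8.3 base name for a long file name, given the
	// set of short names already used in the directory.
	// needs "fmt", "hash/fnv" and "math/rand"
	func pickShortBase(base string, taken map[string]bool) string {
		prefix := base
		if len(prefix) > 6 {
			prefix = prefix[:6]
		}
		// 1. first six characters plus ~1 .. ~9
		for i := 1; i <= 9; i++ {
			c := fmt.Sprintf("%s~%d", prefix, i)
			if !taken[c] {
				return c
			}
		}
		// 3. two characters plus four hex digits from a hash of the full name
		h := fnv.New32a()
		h.Write([]byte(base))
		for i := 1; i <= 9; i++ {
			c := fmt.Sprintf("%.2s%04X~%d", base, (h.Sum32()+uint32(i))&0xFFFF, i)
			if !taken[c] {
				return c
			}
		}
		// last resort: six random hex digits plus ~N
		for {
			c := fmt.Sprintf("%06X~%d", rand.Int63()%0x1000000, 1+rand.Intn(8))
			if !taken[c] {
				return c
			}
		}
	}

(A real implementation also has to uppercase the name, strip characters that are invalid in 8.3 names, and handle the extension separately.)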

WIP: I am still not truly happy with the collision avoidance method; I would like feedback on a better / easier approach.

After applying this patch and the other open PRs, fsck.vfat finds only one minor issue on a disk filled with complex content: the go-diskfs fat32 implementation does not track the number of free clusters. Everything else looks good from an on-disk layout perspective.

@deitch
Collaborator

deitch commented Aug 27, 2020

To be clear, you are basically addressing this todo?

@rtreffer
Author

Exactly. The issue at hand is that if I write a Raspberry Pi boot folder, there are many dtbs with similar prefixes (bcm27...-). These get written out with the same short name, and fsck.vfat treats entries with the same short name as the same file, thus complaining about a "duplicated directory entry".

The wiki page has a few nice pointers on how to solve this. The ~1 .. ~9 scheme is very human friendly IMHO. The hashing is less friendly but avoids runtime complexity issues. Both fall apart at >10k directory entries, as they eventually run out of possible names, thus random as a last resort....

The basic hashmap of all entries should be less controversial and easy to add. I might extract the collision-avoiding SFN search from the rest of the method body; that should make it well contained.
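For what it's worth, the lazily built lookup itself can stay tiny, something like this (type and field names are made up for the sketch):

	// sfnTaken reports whether an 8.3 name is already used in the directory.
	// The map is only built when the first new entry gets added, so the normal
	// directory-loading path pays no extra cost.
	func (d *directory) sfnTaken(name string) bool {
		if d.shortNames == nil {
			d.shortNames = make(map[string]bool, len(d.entries))
			for _, e := range d.entries {
				d.shortNames[e.shortName] = true
			}
		}
		return d.shortNames[name]
	}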

@deitch
Collaborator

deitch commented Aug 27, 2020

This PR implements a variation of 1. + 3. + random, where random is simple random-number-based file name generation used as a last resort.

I didn't understand what you wrote here. It looks (from the code) like you do the following:

  1. Try to do the first option, i.e. "first 6 chars + tilde + sequential digit", eventually using up 9 slots. This is what the Windows algorithm does. If all of those 9 slots are taken,
  2. Try to do the Windows NT (and later) hashing
  3. Do something with random

WIP: I am still not truly happy with the collision avoidance method; I would like feedback on a better / easier approach.

To what do you refer?

@rtreffer
Author

rtreffer commented Aug 27, 2020

I am referring to the fact that

  1. I would like to know what you think :-)
  2. I am not happy with the implementation yet

The hashing used by NT and onwards is unknown, and the Wikipedia page does not list an avoidance algorithm, but there must be one, as collisions start to become likely for a 16-bit number once you go beyond a few hundred files.
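(Back-of-the-envelope, my own arithmetic: a 16-bit value has 2^16 = 65536 possibilities, and the birthday bound puts the 50% collision point at roughly

	n \approx \sqrt{2 \cdot 2^{16} \cdot \ln 2} \approx 300

files, which is where "a few hundred" comes from.)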

@deitch
Collaborator

deitch commented Aug 27, 2020

The hashing used by NT and onwards is unknown, and the Wikipedia page does not list an avoidance algorithm, but there must be one, as collisions start to become likely for a 16-bit number once you go beyond a few hundred files

What I think? Pretty simple, actually.

  • What we have now doesn't handle even a single conflict, which means anything is better. Even just the 9-file limitation of [0-9A-Z]~[1-9] would be a material improvement.
  • Adding the second stage of hashing is even better, as it gets us somewhere in the range of hundreds or possibly thousands of files before a likely collision.

In all seriousness, this would take us from a guaranteed collision with >1 file, to a guaranteed collision with >9 files (an order of magnitude improvement), to a possible collision only with hundreds of files (a 2+ order of magnitude improvement). It isn't perfect? So what.

Now that I understand it, I like most everything in here. I have some suggestions on the coding style, but will put those inline.

Collaborator

@deitch deitch left a comment


Just some stylistic questions, but this really is quite good.

@@ -413,11 +416,47 @@ func convertLfnSfn(name string) (string, string, bool, bool) {
		isLFN = true
	}

	truncate := func(name string, n int) string {
		return name[0:6] + "~" + strconv.Itoa(n)
	}

I am not madly in love with inline anonymous functions here, as they will get declared every time we call convertLfnSfn, and they are harder to understand and test.

Can we pull these three out into named functions, and give them more representative names?
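
For example (just a sketch of the direction, the name is only a suggestion):

	// truncateWithTilde builds an 8.3 base name of the form NAME~N from the
	// first six characters of name.
	func truncateWithTilde(name string, n int) string {
		return name[0:6] + "~" + strconv.Itoa(n)
	}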


	random := func() string {
		return fmt.Sprintf("%06X", (rand.Int63()%0x1000000)) + "~" + strconv.Itoa(1+rand.Intn(8))
	}

random is a bit too generic of a name, and probably will confuse. We should pull this out too, but make it more specifically-named.
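
Something like this, perhaps (the name is only a suggestion):

	// randomShortNameBase produces a last-resort 8.3 base name from six random
	// hex digits plus a ~N suffix.
	func randomShortNameBase() string {
		return fmt.Sprintf("%06X", rand.Int63()%0x1000000) + "~" + strconv.Itoa(1+rand.Intn(8))
	}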

	for i < 9 && cache[truncate(shortName, i)+"."+extension] {
		i++
	}
	shortName = truncate((shortName), i)

This calls truncate() redundantly several times. Can we tighten it up a bit?

Also, why is this:

	i := 1
	for i < 9 && cache[truncate(shortName, i)+"."+extension] {
		i++
	}

and not the simpler and more idiomatic:

	var found string
	for i := 1; i < 9; i++ {
		found = truncate(shortName, i) + "." + extension
		if _, ok := cache[found]; !ok {
			break
		}
	}
	shortName = found

or similar
