
IOError: File name too long #1634

Closed
joehillen opened this issue Mar 11, 2014 · 27 comments

@joehillen

Got this error while using pip caching; I had to turn off caching in order for the install to finish successfully. This is on pip 1.5.4, Ubuntu 12.04 64-bit.

Downloading/unpacking backports.ssl-match-hostname (from tornado->-r requirements/standard.txt (line 15))
  Downloading backports.ssl_match_hostname-3.4.0.2.tar.gz
  Storing download in cache at /home/joe/.pip/cache/http%3A%2F%2Fpypi.internal%2Fpackages%2Fbackports.ssl_match_hostname%2Fdownload%2F1855%2Fbackports.ssl_match_hostname-3.4.0.2.tar.gz
Cleaning up...
Exception:
Traceback (most recent call last):
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/req.py", line 1197, in prepare_files
    do_download,
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/req.py", line 1375, in unpack_url
    self.session,
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/download.py", line 586, in unpack_http_url
    cache_download(cache_file, temp_location, content_type)
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/util.py", line 609, in cache_download
    fp = open(target_file+'.content-type', 'w')
IOError: [Errno 36] File name too long: '/home/joe/.pip/cache/http%3A%2F%2Fpypi.internal%2Fpackages%2Fbackports.ssl_match_hostname%2Fdownload%2F1855%2Fbackports.ssl_match_hostname-3.4.0.2.tar.gz.content-type'

What's weird is that the string it's failing on is only 166 characters.

@joehillen (Author)

It turns out it's because I'm using eCryptFS, which has a filename limit of 143 characters: http://stackoverflow.com/questions/6571435/limit-on-file-name-length-in-bash

The only solution I can think of is to shorten the file names. Maybe don't store the entire url in the filename?

Or add exception handling for this error and just skip caching for this case.

Let me know which you prefer, and I will write a patch for either.

@joehillen (Author)

I thought of a better solution.

Instead of storing the full url as the filename, hash it.

I'm thinking of doing a structure similar to how git stores its object files:

<first 2 characters of the SHA>/<rest of the SHA>/(<filename>.tar.gz|content-type|url)

The hash is on the file's full url. I forget why git does separate directories for the beginning of the SHAs, but I'm sure there is a good reason.

Finding matching files for a particular host would be easy because then you can just do:

find ~/.pip/cache/ -name url | grep pypi.python.org

This also has the advantage that you don't have to encode the url.

For backwards compatibility, you can just check for files in the old format and then convert them to the new format without having to redownload any files.
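
Concretely, something like this (a rough sketch; cache_paths is a hypothetical helper, and sha1 only because it's what git uses):

    import hashlib
    import os

    def cache_paths(cache_root, url, filename):
        # Illustrative helper, not pip's actual API: hash the full
        # download url, then split the hex digest git-style into
        # <first 2 chars>/<remaining chars>/.
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        base = os.path.join(cache_root, digest[:2], digest[2:])
        return {
            "package": os.path.join(base, filename),
            "url": os.path.join(base, "url"),
            "content-type": os.path.join(base, "content-type"),
        }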

I'd be happy to build this, but I would like some approval of the design from a maintainer before starting. It's terrible spending a ton of time writing and testing a patch just to have it ignored or rejected.

Let me know.

@piotr-dobrogost

This also has the advantage that you don't have to encode the url.

Because of?

@dstufft (Member) commented Mar 12, 2014

The reason for the prefix is to limit the number of directories/files in a single directory.

Some bikeshedding here: I'd prefer the path to be more broken up, something like...

a/b/d/3/7/abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5/<filename>

That is using sha224 (these urls might be coming from different locations, so a collision attack may be plausible, and sha224 should make that harder). It also uses a 5-directory-deep prefix instead of a single one; I don't have a particular reason for that except that I prefer it. It also includes the full hash in the final directory name instead of just the remainder.

Once you locate the filename it should verify that the url file associated with it matches the url we are looking for, and should treat a failure as a cache miss.
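
Roughly, in code (a sketch only; the names are illustrative, not a final API):

    import hashlib
    import os

    def cached_file(cache_root, url, filename):
        # Illustrative only, not pip's actual API.
        digest = hashlib.sha224(url.encode("utf-8")).hexdigest()
        # First five hex chars become nested directories (a/b/d/3/7/),
        # then the full digest as the final directory.
        base = os.path.join(cache_root, *digest[:5])
        base = os.path.join(base, digest)
        # A missing or mismatched url file is treated as a cache miss.
        try:
            with open(os.path.join(base, "url")) as fp:
                if fp.read().strip() != url:
                    return None
        except IOError:
            return None
        target = os.path.join(base, filename)
        return target if os.path.exists(target) else None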

This should also solve the same issue for #1287

@joehillen (Author)

Because the url is kept in a file named url. Here is a specific example:

The url http://pypi.internal/packages/backports.ssl_match_hostname/download/1855/backports.ssl_match_hostname-3.4.0.2.tar.gz hashes to c5a89d648fb312c4988d3cd7600434e1895cfc48.
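
That is, sha1 over the full url string:

    import hashlib

    url = "http://pypi.internal/packages/backports.ssl_match_hostname/download/1855/backports.ssl_match_hostname-3.4.0.2.tar.gz"
    print(hashlib.sha1(url.encode("utf-8")).hexdigest())
    # -> c5a89d648fb312c4988d3cd7600434e1895cfc48 (the hash quoted above)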

This creates the following directories and files:

  • ~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48/backports.ssl_match_hostname-3.4.0.2.tar.gz - the package
  • ~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48/url - the url that was hashed
  • ~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48/content-type - the content-type of the url
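
Writing an entry would be just as mechanical (again a sketch; store_download is a hypothetical helper):

    import hashlib
    import os
    import shutil

    def store_download(cache_root, url, content_type, temp_location, filename):
        # Illustrative helper, not pip's actual API.
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        base = os.path.join(cache_root, digest[:2], digest[2:])
        if not os.path.isdir(base):
            os.makedirs(base)
        # The package itself, plus the url and content-type metadata files.
        shutil.copyfile(temp_location, os.path.join(base, filename))
        with open(os.path.join(base, "url"), "w") as fp:
            fp.write(url)
        with open(os.path.join(base, "content-type"), "w") as fp:
            fp.write(content_type)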

@joehillen (Author)

@dstufft I see no reason to use a stronger hash function as there are no known attacks on sha1 (only a theoretical attack that is unproven) and even the attack on MD5 requires a large piece of data to be effective. URLs are short compared to the size requirements for an MD5 attack. Also the target controls the URLs so the attack space isn't very large.

I think git made the right choice in balancing number of directories and hash length, I would prefer to defer to their expertise.

@dstufft (Member) commented Mar 12, 2014

Sorry, but you need to argue a reason why using a weaker hash is more appropriate. The default in any software I'm willing to accept should be the strongest available. In this case sha1, sha224, sha256, sha384, and sha512 have 40-, 56-, 64-, 96-, and 128-character hex digests respectively. There are only two filesystems where the difference will matter, FATX and MINIX V3 FS, and MINIX V3 FS will function perfectly fine with sha224 too.
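
A quick check of those lengths:

    import hashlib

    for name in ("sha1", "sha224", "sha256", "sha384", "sha512"):
        print(name, len(hashlib.new(name, b"").hexdigest()))
    # sha1 40, sha224 56, sha256 64, sha384 96, sha512 128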

So there's no technical reason, afaict, to prefer the weaker hash; that's just cargo-culting what git has done.

As far as the two-letter prefix vs my scheme goes, that's just a style thing; I find the multiple nesting and a full hash at the end nicer to work with in general.

@joehillen (Author)

sha1 is Python-native. I shouldn't need more reason than that, plus the reasons I said earlier.

You're not building a crypto library here. Keep your requirements simple.

just cargo culting what git has done

Please don't insult me.

@dstufft (Member) commented Mar 12, 2014

sha-2 is native to Python as well, via the hashlib module. There's literally no more complexity in using something from the sha-2 family over sha1.

@joehillen (Author)

Ah, I thought hashlib wasn't in the standard library. Fine, sha224 it is.

@dstufft (Member) commented Mar 12, 2014

Also, your reasons only address why using sha1 isn't inherently broken; they don't provide any reasoning as to why this should use sha1 over something in the sha-2 family.

@dstufft (Member) commented Mar 12, 2014

Ah cool :) That explains it then :)

@dstufft (Member) commented Mar 12, 2014

I wonder if we ought to use a hardcoded name for the file inside the package directory too... as of right now someone could have a filename longer than 143 characters (there's no limit in PyPI etc). Although I've never seen anyone have a problem with that, so it probably doesn't matter.

@dstufft (Member) commented Mar 12, 2014

Oh, in case it wasn't obvious: bikeshedding aside, I would absolutely accept this, and I doubt any of the other maintainers would object.

@joehillen (Author)

I thought about that, but decided to leave the full file name to make it easier to search and navigate. I've never seen a 143-character library name; that would be perverse and silly.

@dstufft (Member) commented Mar 12, 2014

Yea I'm happy punting on that.

@joehillen (Author)

As for putting the full sha as the subdirectory, that's just duplicating information. Once the hash is generated, there is only one place it could possibly be, and the Python code for it is:

os.path.join(cache_path, hash[:2], hash[2:])
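
For example, with the hash from my earlier comment:

    >>> import os
    >>> h = "c5a89d648fb312c4988d3cd7600434e1895cfc48"
    >>> os.path.join("~/.pip/cache", h[:2], h[2:])
    '~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48'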

@Ivoz (Contributor) commented Mar 12, 2014

If you want a compromise between path length and hash family, imo

    hash = hashlib.sha256(url).hexdigest()[:32]

would work fine.
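
For scale (made-up url; the point is that every path component stays far below the 143-character eCryptFS limit):

    import hashlib
    import os

    url = b"http://pypi.internal/packages/example/example-1.0.tar.gz"  # made-up
    h = hashlib.sha256(url).hexdigest()[:32]
    base = os.path.join(h[:2], h[2:])
    # The longest path component is len(h) - 2 == 30 characters.
    assert max(len(part) for part in base.split(os.sep)) == 30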

@joehillen (Author)

MD5 would work fine too. I'm really not worried about collisions or security here as they're not really within the scope of this use case.

That being said, it's taboo to crop a hashed result, so I will just use the full hash. 64 characters should be fine for all known systems.

@Ivoz (Contributor) commented Mar 13, 2014

it's taboo to crop a hashed result

Would love to know who says so.

@joehillen (Author)

I'll leave that as an exercise for you to learn more about cryptographic hashing.

@Ivoz (Contributor) commented Mar 13, 2014

@joehillen as a matter of fact, that's absolute bollocks. Truncation coincidentally stops length-extension attacks on plain Merkle-Damgård constructions like sha1 and sha256, and is exactly how sha224 and sha384 are computed.

@joehillen (Author)

k

@cmclaughlin

I recently hit this on OSX, which has a 255-character filename limit. I'm also using an internal PyPI proxy, which results in pip adding a long '?remote=... target_url ...' string to the filename.

Hashing the cache directory structure sounds like an ideal fix.

@joehillen (Author)

Yeah, I want to work on this, but I've been far too busy the last month. I'm hoping I'll get some downtime soon.

@mariocesar

What is the status of this issue?

I want to add that if you install Ubuntu, or any Linux distribution that encrypts your home directory, this filename restriction comes up.

@Ivoz (Contributor) commented Apr 27, 2014

@mariocesar see dstufft's PR #1748 above; it should solve these cases when merged.

lock bot added the auto-locked label (Outdated issues that have been locked by automation) on Jun 5, 2019
lock bot locked as resolved and limited conversation to collaborators on Jun 5, 2019