
IOError: File name too long #1634

Closed
joehillen opened this issue Mar 11, 2014 · 27 comments

@joehillen

Got this error while using pip caching; I had to turn off caching in order for the install to finish successfully. This is on pip 1.5.4, Ubuntu 12.04 64-bit.

Downloading/unpacking backports.ssl-match-hostname (from tornado->-r requirements/standard.txt (line 15))
  Downloading backports.ssl_match_hostname-3.4.0.2.tar.gz
  Storing download in cache at /home/joe/.pip/cache/http%3A%2F%2Fpypi.internal%2Fpackages%2Fbackports.ssl_match_hostname%2Fdownload%2F1855%2Fbackports.ssl_match_hostname-3.4.0.2.tar.gz
Cleaning up...
Exception:
Traceback (most recent call last):
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/req.py", line 1197, in prepare_files
    do_download,
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/req.py", line 1375, in unpack_url
    self.session,
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/download.py", line 586, in unpack_http_url
    cache_download(cache_file, temp_location, content_type)
  File "/home/joe/work/project/ve/local/lib/python2.7/site-packages/pip-1.5.4-py2.7.egg/pip/util.py", line 609, in cache_download
    fp = open(target_file+'.content-type', 'w')
IOError: [Errno 36] File name too long: '/home/joe/.pip/cache/http%3A%2F%2Fpypi.internal%2Fpackages%2Fbackports.ssl_match_hostname%2Fdownload%2F1855%2Fbackports.ssl_match_hostname-3.4.0.2.tar.gz.content-type'

What's weird is that the string it's failing on is only 166 characters.

@joehillen (Author)

It turns out it's because I'm using eCryptFS, which has a filename limit of 143 characters: http://stackoverflow.com/questions/6571435/limit-on-file-name-length-in-bash

The only solution I can think of is to shorten the file names. Maybe don't store the entire url in the filename?

Or add exception handling for this error and just skip caching for this case.

Let me know which you prefer, and I will write a patch for either.

@joehillen (Author)

I thought of a better solution.

Instead of storing the full url as the filename, hash it.

I'm thinking of doing a structure similar to how git stores its object files:

<first 2 characters of the SHA>/<rest of the SHA>/(<filename>.tar.gz|content-type|url)

The hash is on the file's full url. I forget why git does separate directories for the beginning of the SHAs, but I'm sure there is a good reason.

Finding matching files for a particular host would be easy because then you can just do:

find ~/.pip/cache/ -name url | grep pypi.python.org

This also has the advantage that you don't have to encode the url.

For backwards compatibility, you can just check for files in the old format and then convert them to the new format without having to redownload any files.
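
Concretely, something like this (a rough sketch; cache_paths is a hypothetical helper, and sha1 only because it's what git uses):

    import hashlib
    import os

    def cache_paths(cache_root, url, filename):
        # Illustrative helper, not pip's actual API: hash the full
        # download url, then split the hex digest git-style into
        # <first 2 chars>/<remaining chars>/.
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        base = os.path.join(cache_root, digest[:2], digest[2:])
        return {
            "package": os.path.join(base, filename),
            "url": os.path.join(base, "url"),
            "content-type": os.path.join(base, "content-type"),
        }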

I'd be happy to build this, but I would like some approval of the design from a maintainer before starting. It's terrible spending a ton of time writing and testing a patch just to have it ignored or rejected.

Let me know.

@piotr-dobrogost

This also has the advantage that you don't have to encode the url.

Because of?

@dstufft (Member) commented Mar 12, 2014

The reason for the prefix is to limit the number of directories/files in a single directory.

Some bikeshedding here: I'd prefer the path to be more broken up, something like...

a/b/d/3/7/abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5/<filename>

That is using sha224 (these urls might be coming from different locations, so a collision attack may be plausible, and sha224 should make that harder). It also uses a 5-directory-deep prefix instead of a single one; I don't have a particular reason for that except that I prefer it. It also includes the full hash in the final directory name instead of just the remainder.

Once you locate the filename it should verify that the url file associated with it matches the url we are looking for, and should treat a failure as a cache miss.
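
Roughly, in code (a sketch only; the names are illustrative, not a final API):

    import hashlib
    import os

    def cached_file(cache_root, url, filename):
        # Illustrative only, not pip's actual API.
        digest = hashlib.sha224(url.encode("utf-8")).hexdigest()
        # First five hex chars become nested directories (a/b/d/3/7/),
        # then the full digest as the final directory.
        base = os.path.join(cache_root, *digest[:5])
        base = os.path.join(base, digest)
        # A missing or mismatched url file is treated as a cache miss.
        try:
            with open(os.path.join(base, "url")) as fp:
                if fp.read().strip() != url:
                    return None
        except IOError:
            return None
        target = os.path.join(base, filename)
        return target if os.path.exists(target) else None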

This should also solve the same issue for #1287

@joehillen (Author)

Because the url is kept in a file named url. Here is a specific example:

The url http://pypi.internal/packages/backports.ssl_match_hostname/download/1855/backports.ssl_match_hostname-3.4.0.2.tar.gz hashes to c5a89d648fb312c4988d3cd7600434e1895cfc48.
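
That is, sha1 over the full url string:

    import hashlib

    url = "http://pypi.internal/packages/backports.ssl_match_hostname/download/1855/backports.ssl_match_hostname-3.4.0.2.tar.gz"
    print(hashlib.sha1(url.encode("utf-8")).hexdigest())
    # -> c5a89d648fb312c4988d3cd7600434e1895cfc48 (the hash quoted above)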

This creates the following directories and files:

  • ~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48/backports.ssl_match_hostname-3.4.0.2.tar.gz - the package
  • ~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48/url - the url that was hashed
  • ~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48/content-type - the content-type of the url
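
Writing an entry would be just as mechanical (again a sketch; store_download is a hypothetical helper):

    import hashlib
    import os
    import shutil

    def store_download(cache_root, url, content_type, temp_location, filename):
        # Illustrative helper, not pip's actual API.
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        base = os.path.join(cache_root, digest[:2], digest[2:])
        if not os.path.isdir(base):
            os.makedirs(base)
        # The package itself, plus the url and content-type metadata files.
        shutil.copyfile(temp_location, os.path.join(base, filename))
        with open(os.path.join(base, "url"), "w") as fp:
            fp.write(url)
        with open(os.path.join(base, "content-type"), "w") as fp:
            fp.write(content_type)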

@joehillen (Author)

@dstufft I see no reason to use a stronger hash function as there are no known attacks on sha1 (only a theoretical attack that is unproven) and even the attack on MD5 requires a large piece of data to be effective. URLs are short compared to the size requirements for an MD5 attack. Also the target controls the URLs so the attack space isn't very large.

I think git made the right choice in balancing number of directories and hash length, I would prefer to defer to their expertise.

@dstufft (Member) commented Mar 12, 2014

Sorry, but you need to argue a reason why using a weaker hash is more appropriate. The default in any software I'm willing to accept should be the strongest available. In this case sha1, sha224, sha256, sha384, and sha512 have 40-, 56-, 64-, 96-, and 128-character hex digests respectively. There are only two filesystems where the difference will matter, FATX and MINIX V3 FS, and MINIX V3 FS will function perfectly fine with sha224 too.
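
A quick check of those lengths:

    import hashlib

    for name in ("sha1", "sha224", "sha256", "sha384", "sha512"):
        print(name, len(hashlib.new(name, b"").hexdigest()))
    # sha1 40, sha224 56, sha256 64, sha384 96, sha512 128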

So there's no technical reason, afaict, to prefer the weaker hash; that's just cargo-culting what git has done.

As far as the two-letter prefix vs my scheme goes, that's just a style thing; I find the multiple nesting and a full hash at the end nicer to work with in general.

@joehillen (Author)

sha1 is Python-native. I shouldn't need more reason than that, plus the reasons I said earlier.

You're not building a crypto library here. Keep your requirements simple.

just cargo culting what git has done

Please don't insult me.

@dstufft (Member) commented Mar 12, 2014

sha-2 is native to Python as well, via the hashlib module. There's literally no more complexity in using something from the sha-2 family over sha1.

@joehillen (Author)

Ah, I thought hashlib wasn't in the standard library. Fine, sha224 it is.

@dstufft (Member) commented Mar 12, 2014

Also, your reasons only address why using sha1 isn't inherently broken; they don't provide any reasoning as to why this should use sha1 over something in the sha-2 family.

@dstufft (Member) commented Mar 12, 2014

Ah cool :) That explains it then :)

@dstufft (Member) commented Mar 12, 2014

I wonder if we ought to use a hardcoded name for the file inside the package directory too... as of right now someone could have a filename longer than 143 characters (there's no limit in PyPI etc). Although I've never seen anyone have a problem with that, so it probably doesn't matter.

@dstufft (Member) commented Mar 12, 2014

Oh, in case it wasn't obvious: bikeshedding aside, I would absolutely accept this, and I doubt any of the other maintainers would object.

@joehillen (Author)

I thought about that, but decided to leave the full file name to make it easier to search and navigate. I've never seen a 143-character library name; that would be perverse and silly.

@dstufft (Member) commented Mar 12, 2014

Yea I'm happy punting on that.

@joehillen (Author)

As for putting the full sha as the subdirectory, that's just duplicating information. Once the hash is generated, there is only one place it could possibly be, and the Python code for it is:

os.path.join(cache_path, hash[:2], hash[2:])
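
For example, with the hash from my earlier comment:

    >>> import os
    >>> h = "c5a89d648fb312c4988d3cd7600434e1895cfc48"
    >>> os.path.join("~/.pip/cache", h[:2], h[2:])
    '~/.pip/cache/c5/a89d648fb312c4988d3cd7600434e1895cfc48'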

@Ivoz (Contributor) commented Mar 12, 2014

If you want a compromise between path length and hash family, imo

    hash = hashlib.sha256(url).hexdigest()[:32]

would work fine.
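
For scale (made-up url; the point is that every path component stays far below the 143-character eCryptFS limit):

    import hashlib
    import os

    url = b"http://pypi.internal/packages/example/example-1.0.tar.gz"  # made-up
    h = hashlib.sha256(url).hexdigest()[:32]
    base = os.path.join(h[:2], h[2:])
    # The longest path component is len(h) - 2 == 30 characters.
    assert max(len(part) for part in base.split(os.sep)) == 30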

@joehillen (Author)

MD5 would work fine too. I'm really not worried about collisions or security here as they're not really within the scope of this use case.

That being said, it's taboo to crop a hashed result, so I will just use the full hash. 64 characters should be fine for all known systems.

@Ivoz (Contributor) commented Mar 13, 2014

it's taboo to crop a hashed result

Would love to know who says so.

@joehillen (Author)

I'll leave that as an exercise for you to learn more about cryptographic hashing.

@Ivoz (Contributor) commented Mar 13, 2014

@joehillen as a matter of fact, that's absolute bollocks. Truncation coincidentally stops length-extension attacks on plain Merkle-Damgård constructions like sha1 and sha256, and is exactly how sha224 and sha384 are computed.

@joehillen (Author)

k

@cmclaughlin

I recently hit this on OSX, which has a 255-character filename limit. I'm also using an internal PyPI proxy, which results in pip adding a long '?remote=... target_url ...' string to the filename.

Hashing the cache directory structure sounds like an ideal fix.

@joehillen (Author)

Yeah, I want to work on this, but I've been far too busy the last month. I'm hoping I'll get some downtime soon.

@mariocesar

What is the status of this issue?

I want to add that if you install Ubuntu, or any Linux distribution that encrypts your home directory, this filename restriction comes up.

@Ivoz (Contributor) commented Apr 27, 2014

@mariocesar see dstufft's PR #1748 above; it should solve these cases when merged.

lock bot added the auto-locked label (Outdated issues that have been locked by automation) on Jun 5, 2019
lock bot locked as resolved and limited conversation to collaborators on Jun 5, 2019