
Better error messages when using ia upload --metadata incorrectly #267

Closed

JustAnotherArchivist opened this issue Sep 16, 2018 · 7 comments

@JustAnotherArchivist (Contributor)
Earlier, I used the wrong syntax with ia upload's --metadata option: I forgot the description field specifier and wrote --metadata='My description: something.' instead of --metadata='description:My description: something.'. That should obviously produce an error, and it did, but the error message was very unhelpful and led me to believe the IA S3 API was having issues, since I had had connection issues over the past few days that manifested in exactly the same error:

Traceback (most recent call last):
  File ".../lib/python3.4/site-packages/urllib3/contrib/pyopenssl.py", line 304, in _send_until_done
    return self.connection.send(data)
  File ".../lib/python3.4/site-packages/OpenSSL/SSL.py", line 1630, in send
    self._raise_ssl_error(self._ssl, result)
  File ".../lib/python3.4/site-packages/OpenSSL/SSL.py", line 1538, in _raise_ssl_error
    raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (32, 'EPIPE')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".../lib/python3.4/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File ".../lib/python3.4/site-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.4/http/client.py", line 1090, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.4/http/client.py", line 1128, in _send_request
    self.endheaders(body)
  File "/usr/lib/python3.4/http/client.py", line 1086, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python3.4/http/client.py", line 928, in _send_output
    self.send(message_body)
  File "/usr/lib/python3.4/http/client.py", line 887, in send
    self.sock.sendall(datablock)
  File ".../lib/python3.4/site-packages/urllib3/contrib/pyopenssl.py", line 316, in sendall
    sent = self._send_until_done(data[total_sent:total_sent + SSL_WRITE_BLOCKSIZE])
  File ".../lib/python3.4/site-packages/urllib3/contrib/pyopenssl.py", line 311, in _send_until_done
    raise SocketError(str(e))
OSError: (32, 'EPIPE')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".../lib/python3.4/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File ".../lib/python3.4/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File ".../lib/python3.4/site-packages/urllib3/util/retry.py", line 357, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File ".../lib/python3.4/site-packages/urllib3/packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File ".../lib/python3.4/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File ".../lib/python3.4/site-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.4/http/client.py", line 1090, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.4/http/client.py", line 1128, in _send_request
    self.endheaders(body)
  File "/usr/lib/python3.4/http/client.py", line 1086, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python3.4/http/client.py", line 928, in _send_output
    self.send(message_body)
  File "/usr/lib/python3.4/http/client.py", line 887, in send
    self.sock.sendall(datablock)
  File ".../lib/python3.4/site-packages/urllib3/contrib/pyopenssl.py", line 316, in sendall
    sent = self._send_until_done(data[total_sent:total_sent + SSL_WRITE_BLOCKSIZE])
  File ".../lib/python3.4/site-packages/urllib3/contrib/pyopenssl.py", line 311, in _send_until_done
    raise SocketError(str(e))
urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError("(32, 'EPIPE')",))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".../bin/ia", line 11, in <module>
    sys.exit(main())
  File ".../lib/python3.4/site-packages/internetarchive/cli/ia.py", line 159, in main
    sys.exit(ia_module.main(argv, session))
  File ".../lib/python3.4/site-packages/internetarchive/cli/ia_upload.py", line 225, in main
    for _r in _upload_files(item, files, upload_kwargs):
  File ".../lib/python3.4/site-packages/internetarchive/cli/ia_upload.py", line 86, in _upload_files
    response = item.upload(files, **upload_kwargs)
  File ".../lib/python3.4/site-packages/internetarchive/item.py", line 860, in upload
    request_kwargs=request_kwargs)
  File ".../lib/python3.4/site-packages/internetarchive/item.py", line 710, in upload_file
    **request_kwargs)
  File ".../lib/python3.4/site-packages/internetarchive/session.py", line 370, in send
    r = super(ArchiveSession, self).send(request, **kwargs)
  File ".../lib/python3.4/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File ".../lib/python3.4/site-packages/requests/adapters.py", line 490, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(32, 'EPIPE')",))

Note that my --metadata value did contain a colon, so I suspect ia considered it valid. Here's the exact value I specified (for this item; in Bash, hence the odd '"'"' quoting):

--metadata='XML dumps of WikiFur (wikifur.com) from <a href="http://dumps.wikifur.com/">http://dumps.wikifur.com/</a> as of 2018-09-16 11:15 UTC, retrieved using wpull 1.2.3 and the command <code>wpull --recursive --no-robots --no-verbose --reject-regex '"'"'/\?C=[NMSD];O=[AD]$'"'"' http://dumps.wikifur.com/</code>'

I assume that this was detected as an attempt to inject malicious HTML into the system or something like that. Perhaps there could be a bit more validation on the client side as to what metadata keys are acceptable?
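If the CLI splits each --metadata argument at the first colon, the mistaken invocation at the top of this issue would yield "My description" as the field name. This is a guess at the parsing logic, not the library's actual implementation:

```python
def parse_metadata_arg(arg: str) -> tuple[str, str]:
    """Hypothetical sketch: split a --metadata argument at the first colon."""
    key, _, value = arg.partition(":")
    return key, value

# With the mistaken invocation, the whole leading phrase becomes the key:
parse_metadata_arg("My description: something.")
# -> ("My description", " something.")  # "My description" is not a valid key

# With the correct syntax, only the text before the first colon is the key:
parse_metadata_arg("description:My description: something.")
# -> ("description", "My description: something.")
```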

By the way, using that value with ia metadata identifier --modify produces an error (400): Invalid Character Error.

@jjjake (Owner)

jjjake commented Sep 18, 2018

See issue #176

jjjake closed this as completed Sep 18, 2018
@JustAnotherArchivist (Contributor, Author)

Are you sure this is the same issue? The error message is the same, but I wasn't using the bulk upload method.

Also, as mentioned, I've seen that same error occasionally due to general connectivity issues (fairly frequently in the last few days, actually), but simply rerunning ia with the same command line usually worked fine. This specific case, on the other hand, was perfectly reproducible and evidently caused by that --metadata option: my upload went through fine as soon as I removed it (and I assume it would also have worked had I corrected it to start with --metadata='description:...).

@JustAnotherArchivist (Contributor, Author)

I just fell into this trap again. This is definitely not the same as #176 or the countless transient appearances of that error. Please reopen. I'm currently investigating the exact conditions.

@JustAnotherArchivist (Contributor, Author)

I managed to reduce it to a minimal reproducible example. I strongly suspect it's caused by whitespace before the first colon in the --metadata value. This command reliably fails:

ia upload test_connection_aborts_jaa_20210601 foo --metadata='collection:test_collection' --metadata='a test:foo'

If foo is a small file (e.g. echo foo >foo), the output is this:

test_connection_aborts_jaa_20210601:                                                        
 uploading foo: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  6.51MiB/s]
 error uploading foo: 

The --debug log reveals that the PUT request results in an HTTP 400 response:

2021-06-01 18:44:53,286 - internetarchive.session - DEBUG - no metadata provided for "test_collection", retrieving now.
2021-06-01 18:44:53,287 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443
2021-06-01 18:44:53,630 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /metadata/test_collection HTTP/1.1" 200 None
2021-06-01 18:44:53,631 - internetarchive.session - DEBUG - no metadata provided for "test_connection_aborts_jaa_20210601", retrieving now.
2021-06-01 18:44:53,632 - urllib3.connectionpool - DEBUG - Resetting dropped connection: archive.org
2021-06-01 18:44:53,961 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /metadata/test_connection_aborts_jaa_20210601 HTTP/1.1" 200 None
2021-06-01 18:44:53,971 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): s3.us.archive.org:443
2021-06-01 18:44:54,181 - urllib3.connectionpool - DEBUG - https://s3.us.archive.org:443 "PUT /test_connection_aborts_jaa_20210601/foo HTTP/1.1" 400 90
2021-06-01 18:44:54,183 - internetarchive.item - ERROR -  error uploading foo to test_connection_aborts_jaa_20210601, 

If foo is larger (e.g. dd if=/dev/zero of=foo bs=4M count=1), the crash in the original issue happens. The --debug log doesn't reveal anything extra:

2021-06-01 18:46:13,422 - internetarchive.session - DEBUG - no metadata provided for "test_collection", retrieving now.
2021-06-01 18:46:13,424 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443
2021-06-01 18:46:13,750 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /metadata/test_collection HTTP/1.1" 200 None
2021-06-01 18:46:13,751 - internetarchive.session - DEBUG - no metadata provided for "test_connection_aborts_jaa_20210601", retrieving now.
2021-06-01 18:46:13,752 - urllib3.connectionpool - DEBUG - Resetting dropped connection: archive.org
2021-06-01 18:46:16,655 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /metadata/test_connection_aborts_jaa_20210601 HTTP/1.1" 200 None
2021-06-01 18:46:16,664 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): s3.us.archive.org:443
2021-06-01 18:46:16,985 - internetarchive.session - ERROR - (ProtocolError('Connection aborted.', BrokenPipeError(32, 'Broken pipe')), 'https://s3.us.archive.org/test_connection_aborts_jaa_20210601/foo')

There is no hard size limit where it switches from one error to the other; for me, the change happens at around 95 kB, but a test with a 95250-byte foo randomly produces one error or the other.

I'll happily send a PR to catch this client-side before even attempting the upload, and it might as well be made more robust than just this one case. @jjjake, could you clarify exactly what the requirements for metadata keys are? I couldn't find them in the docs; that would make a good addition there as well... :-)

@jjjake (Owner)

jjjake commented Jun 8, 2021

Metadata keys must be valid XML tags, as noted here. IA-S3 metadata headers should be lowercase and are a subset of what fits in the HTTP header name space, with one exception: _ is not allowed in HTTP headers, but we use it in metadata field names. Hence the crazy hack where -- stands for _: internetarchive performs this translation when preparing the metadata headers, i.e. it converts _ in provided metadata keys to -- when building the headers.
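The rules above could be sketched as a client-side pre-check like the following. This is a hypothetical helper, not the library's code; the exact validation regex and the x-archive-meta- header prefix are assumptions here:

```python
import re

# Assumed rule: lowercase, starts with a letter, then letters, digits,
# underscores, or hyphens. The real server-side rules may differ.
KEY_RE = re.compile(r"[a-z][a-z0-9_-]*")

def metadata_header(key: str, value: str) -> tuple[str, str]:
    """Validate a metadata key and build an IA-S3-style header pair."""
    if not KEY_RE.fullmatch(key):
        raise ValueError(f"invalid metadata key: {key!r}")
    # '_' is not allowed in HTTP header names, so it is written as '--'
    # in the header and translated back on the server side.
    return (f"x-archive-meta-{key.replace('_', '--')}", value)
```

With a check like this, the key 'a test' from the repro above would be rejected up front instead of producing a broken PUT request.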

IMO, the issue here should be resolved by IA-S3 returning a valid S3 XML response that would then be parsed and surfaced to the user exactly like other S3 errors are. Currently it returns HTML, which is why no error message is surfaced. I think any and all validation should be centralized and happen at the IA-S3 level.

For whitespace in particular, though, I'm wondering if we should just strip leading/trailing whitespace and replace any other whitespace with -. What are your thoughts on that?

@JustAnotherArchivist (Contributor, Author)

Metadata keys must be valid XML tags, as noted here. IA-S3 metadata headers should be lowercase and are a subset of what fits in the HTTP header name space

Ah, duh, not sure how I missed that. Thanks.

That explains the behaviour, I guess: when the request is small enough to be received in full by the server before parsing, it notices the malformed HTTP headers and responds with a 400, but if the payload is too large, it just bails. Or perhaps it does send a 400 response even then, but the client never sees it because it hits the connection error while trying to send the next chunk of request data.

IMO, the issue here should be resolved by IA-S3 returning a valid S3 XML response that would then be parsed and surfaced to the user exactly like other S3 errors are. Currently it returns HTML, and that's why there is no error message returned. I think any and all validation should be centralized and happen at the IA-S3 level.

I certainly agree with that, though I've found that such changes often take much longer on the IA side than on the client side. And I have a feeling this might not be entirely straightforward to implement without adding 100 Continue support or similar; as hinted above, the client might otherwise never see the response at all. But if it's easy enough to do, absolutely!

For whitespace in particular though, I'm wondering if we should just strip trailing/leading whitespace and replace any other whitespace with -... what are your thoughts on that?

As a user, I'd much rather be told that my metadata field name is invalid and how to fix it than have it mangled transparently. The former teaches users how to properly name custom metadata fields; the latter leads to rather painful attempts at fixing the metadata after a mistake like the one in the original issue.

@jjjake (Owner)

jjjake commented Jun 11, 2021

@JustAnotherArchivist Thanks for the reply. I agree with everything you've said; it makes sense.

I'll try to push getting S3 to return proper errors in these cases. It sounds like something we might be able to do soon, fingers crossed.
