[DISCUSS] CouchDB Request Size Limits #1200
Conversation
I haven't read everything yet, but you are failing
OK, I've had a read through. Looks good, though I have some comments.
In 2.2.0, shouldn't we raise the default
We should support
We should enforce
I am +1 on the idea of setting
I am -0 on
The only way I'd think
As a side-note, I'd like to move towards any (and all?)
It reads well and you cover all the bases. We should obviously discuss if this is the plan, but it's a good write-up of a proposed direction.
The code is just meant to demonstrate how this could look. This PR is mainly meant for discussion.
Sorry for bringing this up; I'll open another issue to discuss it, as it's orthogonal to this discussion.
Summing up, here are the things we need to decide:
If iii., there are two options (so far, please suggest more ;):
Good write-up @janl
In general, if the cost of checking was the same, I'd think most users would rather pick the more specific limits (doc size, att size, max num of atts etc.) than think of those limits and then apply some formula to calculate a max http request size. Maybe
Replying to the summary specifically:
From a user's perspective I think 1.i is buggy and is just as random a failure as 1.ii. Since, when replicating, the only indication of a failed write is a failed doc write count bump that the majority of users don't know about, the breakage is quite insidious. It might lead to invalid backups, and it might take years before users notice the missing data in their backups. So I'd pick 1.ii as better than 1.i from this point of view. A DoS is terrible but at least immediately apparent. Attachments mysteriously disappearing during replication, only to be discovered much later, is a more serious issue. I can already imagine the "My db ate my data" blog posts.
I like the
There is one more place where we'd need a limit to completely constrain the http request size based on the more precise limits -
So I think I like 1.iii, and it seems 1.iii.2 is similar to the idea of automatically deducing a max http request size from the other limits and rejecting the update based on it. But we'd instead let users specify the more precise limits instead of using http max request size as the primary constraint. (One exception, I guess, is if users somehow have a broken proxy or some middleware that cannot cope with large requests and they need to specifically adjust the http request size.)
I think we should increase max http request size, at least for 2.2.0, and document how users can apply limits to avoid DOS attacks and how max request size, max doc size and max attachment sizes are related. This would unbreak customers with 64 MB attachments. Maybe add the
Some more random notes: Another way to handle some of the issues is to teach the replicator to bisect attachment PUTs just like it bisects _bulk_docs when it receives a 413 response:
couchdb/src/couch_replicator/src/couch_replicator_worker.erl Lines 481 to 489 in 40b9f85
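For illustration, the bisect-and-retry idea works roughly like this (a schematic Python sketch of the approach, not the actual Erlang in couch_replicator_worker.erl; `post_bulk_docs` is a hypothetical helper that performs a _bulk_docs request and returns its HTTP status):

```python
# Schematic sketch of "bisect on 413": if the target rejects a batch as too
# large, split it in half and retry each half recursively until the pieces
# fit or can no longer be split.
def flush_docs(docs, post_bulk_docs):
    if not docs:
        return 0
    status = post_bulk_docs(docs)   # hypothetical helper, returns HTTP status
    if status in (201, 202):
        return len(docs)            # whole batch accepted
    if status == 413 and len(docs) > 1:
        mid = len(docs) // 2        # request entity too large: bisect and retry
        return (flush_docs(docs[:mid], post_bulk_docs) +
                flush_docs(docs[mid:], post_bulk_docs))
    raise RuntimeError(f"_bulk_docs write failed with HTTP status {status}")
```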
Also it is good to keep in mind that the replicator could be running on a third cluster (not necessarily the target or source) and it would need to handle older or other alternative CouchDB implementations. In that respect it has to "auto-discover" settings by probing, bisecting and guessing. In the table above, technically in <2.0.0 we didn't have a max document size, only a max http request size. The setting was called max document size, but that was just a bad name, and as soon as we had a proper max document size we renamed it.
I have a concrete use case that I don't think is well-served by iii.1:
My understanding is that due to this, I'd have to choose between having an unrealistically large
I would much prefer iii.2, where
I'd like to add that if iii.1 moves forward, I'd like to have some information about what impact setting
Etc.
I agree with @nickva's comment on this issue in general. Thanks for the good analysis.
Autocalculating max_http_request_size from the other variables rather than having to set it explicitly would be nice. In that sense, 1.iii.2 meets POLA very nicely. Big +1 here.
I like Nick's idea on _bulk_docs, but we can't re-write older versions of the replicator. What would CouchDB 1.x do? At what point does it insert a break between docs in a _bulk_docs? There would certainly be performance implications if you set Couch to Jan's 3.0.0 defaults, limited _bulk_docs to ~100 docs, and then tried to replicate lots of ~1K documents. I assume that if 1.x hit a limit talking to 2.2.0, we'd just send a 413 back and it'd try again...but I'd like to test that explicitly before moving forward.
And what about other PUT/POST endpoints that can take potentially large bodies, such as POST /{db}/_design/{ddoc}/_view/{view} and the similar new endpoints introduced in a recent PR for databases? If we drop max_http_request_size, couldn't these blow up hugely? Sure, we could add a max_multi_requests parameter, but the number of settings is starting to climb....
I almost feel like we could keep max_http_request_size just for these latter two cases, IF we document that it only applies to these endpoints, AND keep that documentation up to date for any additional endpoints of this type. We could rename it max_bulk_request_size so it's clearer that it applies only to bulk operations.
I see pluses and minuses to both of the above ideas (more max_* parameters vs. max_bulk_request_size). I have a slight preference for more max_* parameters, since that moves us towards a cardinality view that is more deterministic.
@elistevens thanks, this convinces me that
@janl What are we planning to do here for 2.2.0? This is a hot topic that needs resolution before we call an RC.
The validation path is now the following: If a new doc body is > max_document_size, we throw an error. If a new attachment is > max_attachment_size, we throw an error. If the new doc body in combination with new and/or existing attachments is > max_attachment_size, we throw an error. This also sets the max_document_size to 2 GB, to restore 1.x and 2.0.x compatibility. Closes apache#1200
The validation path is now the following: If a new doc body is > max_document_size, we throw an error. If a new attachment is > max_attachment_size, we throw an error. If the new doc body in combination with new and/or existing attachments is > max_http_request_size, we throw an error. This also sets the max_document_size to 2 GB, to restore 1.x and 2.0.x compatibility. Closes apache#1200
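As a rough illustration of that order of checks, the validation path could be modelled like this (a hedged Python sketch only; the function and error names are hypothetical, not CouchDB internals):

```python
# Hypothetical model of the validation order described in the commit message.
# All sizes are in bytes; names are illustrative, not CouchDB internals.
def validate_update(doc_body_size, new_attachment_sizes, existing_attachment_sizes,
                    max_document_size, max_attachment_size, max_http_request_size):
    if doc_body_size > max_document_size:
        raise ValueError("document_too_large")
    for size in new_attachment_sizes:
        if size > max_attachment_size:
            raise ValueError("attachment_too_large")
    total = doc_body_size + sum(new_attachment_sizes) + sum(existing_attachment_sizes)
    if total > max_http_request_size:
        raise ValueError("request_too_large")
    # otherwise the update is accepted
```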
Thanks for the great discussion here. Closing this out in favour of the new proposed approach: #1253
Use the already computed (conservative) body size. Switch multipart length calculation to accept body and boundary sizes. Issue apache#1200 Issue apache#1253
Note: the text below is written in a style that would allow it to be included in the CouchDB 2.2.0 documentation and/or release notes.
CouchDB Request Size Limits
There are multiple configuration variables for CouchDB that determine request size limits. This document explains the configuration variables, how they work together, and why they exist in the first place.
Why Limit Requests by Size
Allowing requests of unlimited size to any network server is a [denial of service vector](https://en.wikipedia.org/wiki/Denial-of-service_attack). To allow safe operation of CouchDB, even on a network with hostile third parties, various request size limits exist.
The Request Size Limits
- max_http_request_size: the maximum number of bytes a request to a CouchDB server can have.
- max_document_size: the maximum number of bytes for a JSON document written to CouchDB.
- max_attachment_size: the maximum number of bytes for any one attachment written to CouchDB.

Background
There are three distinct ways of getting data into CouchDB:
In version 2.1, CouchDB started enforcing a 64 MB limit for max_http_request_size on all requests, but did not apply this to the standalone attachment API.
This had the unfortunate side effect that one could create a doc that is smaller than max_http_request_size with an attachment that is bigger than max_http_request_size. In addition, one could create a doc with two or more attachments that were each smaller than max_http_request_size but together bigger than max_http_request_size. The result is that these documents can no longer be replicated to CouchDB nodes with the same default configuration (or even to the same node).
Needless to say, this is a very unfortunate user experience: create a number of documents with attachments, and at some not immediately obvious point, replications start failing.
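To make the scenario concrete, here is a rough sketch (Python with the requests library; the URL, credentials and sizes are made up for illustration) of how the standalone attachment API lets a document grow past what a single replication request can later carry:

```python
# Sketch: two attachments are each uploaded in their own request, so each
# stays under max_http_request_size, but replication later has to ship the
# document with both attachments in one multipart request and gets a 413.
# URL, credentials and sizes are illustrative only.
import requests

base = "http://admin:password@127.0.0.1:5984/db"
forty_mb = b"x" * (40 * 1024 * 1024)

rev = requests.put(f"{base}/doc1", json={"type": "example"}).json()["rev"]

# Each standalone attachment PUT is ~40 MB, well under a 64 MB request limit.
for name in ("a.bin", "b.bin"):
    r = requests.put(f"{base}/doc1/{name}",
                     params={"rev": rev},
                     data=forty_mb,
                     headers={"Content-Type": "application/octet-stream"})
    rev = r.json()["rev"]

# doc1 now carries ~80 MB of attachments; replicating it to a target with the
# same 64 MB max_http_request_size fails, because the replicator sends the
# document and its attachments as one multipart/related request.
```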
Large Documents and Attachments
While CouchDB works reasonably well with almost any JSON document size and attachment size, the development team makes recommendations as to the various limits for ideal and optimal use. CouchDB users may vary from these recommendations, but will need to be okay with the resulting operational implications, such as increased CPU & RAM usage as well as increased latency for many core operations.
Before CouchDB 2.1.0 there were no real limits imposed, and before CouchDB 2.2.0 the available limits weren't applied uniformly, leading to surprising behaviour, as outlined in the example above.
CouchDB 2.2.0 and later aim to have a complete set of limits that avoids any unexpected behaviour, but the limits imposed won't be set by default in order to preserve backwards compatibility. Starting with CouchDB 3.0.0, the recommended limits will be set by default, and users migrating from earlier versions of CouchDB will need to adjust them if their use-case requires it. The CouchDB team might produce a utility script that would allow determining the required settings from an existing CouchDB installation, if resources can be made available for this.
Starting with CouchDB 2.2.0, the CouchDB distribution will come with an additional configuration file local.ini-recommended* with the developer-recommended defaults and explanations for what happens when these defaults are exceeded.
An alternative solution could avoid using the max_attachments_per_doc setting and reject attachment additions based on the existing doc + attachments size plus the new attachment size, but this PR/Discussion suggests that having another config value with sensible defaults here will nudge users into doing the right thing.

Limits by Version
In order to account for all use-cases and the interplay of the different APIs, CouchDB 2.2.0 introduces a new limit, max_attachments_per_document. This allows the application of a formula to show the interplay of all limits. Using this formula, any doc update (JSON or attachments) can check whether it would exceed max_http_request_size, which would cause replication to fail.

(Table: per-version values for max_http_request_size, max_document_size, max_attachment_size, and max_attachments_per_document*.)
The table shows the approximate sizes (sans HTTP multipart boundaries) for all limits. CouchDB versions earlier than 3.0.0 will still encounter the behaviour of not being able to replicate documents that have attachments that alone or together exceed max_http_request_size.
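A rough way to express that relationship in code (a sketch under the assumption that the formula is simply the sum of the per-part limits; multipart boundary overhead is ignored, and the example numbers are illustrative, not CouchDB defaults):

```python
# Sketch: a document plus all of its attachments must still fit into a single
# replication request. Multipart boundary overhead is ignored here.
def limits_are_consistent(max_document_size, max_attachment_size,
                          max_attachments_per_document, max_http_request_size):
    worst_case_request = (max_document_size +
                          max_attachments_per_document * max_attachment_size)
    return worst_case_request <= max_http_request_size

MB = 1024 * 1024
# Illustrative numbers only: an 8 MB document with up to 10 attachments of
# 16 MB each needs a max_http_request_size of at least 168 MB.
print(limits_are_consistent(8 * MB, 16 * MB, 10, 168 * MB))  # True
```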
Implementation
This draft implementation introduces the new max_attachments_per_document limit to show how it could work. Tests will need to be added to validate that all three API routes are covered (casual review suggests they are, but we do, of course, need tests). I stopped short of adding tests so we can discuss the details of this suggestion first.

To 2.2.0 or not to 2.2.0
Since we started on the 2.2.0 milestone, this might be too big a thing to discuss and finish. I’d be very okay with bumping this to 2.3.0 as long as we document the behaviour in the 2.2.0 release notes.