Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experiment with a fully async ContentSourceCompletableFuture #9975

Merged
merged 20 commits into from
Jun 30, 2023

Conversation

gregw
Copy link
Contributor

@gregw gregw commented Jun 26, 2023

Currently MultiPartFormData is only partially used and MultiPartByteRanges is not used at all. Both contain duplicate code that does not have a clear split between the responsibilities of managing the Content.Source, parsing the data and managing the CompletableFuture.

The usage of MultiPartFormData is in blocking style any bypasses the Content.Source. If the DelayedHandler.UntilMultiPartFormData was also used, then 2 attempts to parse the content would be made, one asynchronous that would be successful, but ignored and the second would just see EOF!

This PR introduces a ContentSourceCompletableFuture which attempts to extract all the common aspects of MultiPartFormData and MultiPartByteRanges. Once updated, then the usage needs to be updated so that it will follow the pattern of org.eclipse.jetty.server.FormFields#from(org.eclipse.jetty.server.Request, int, int) so that content can be read asynchronously before a blocking servlet is invoked. However it needs to be a little more complex so that context or servlet specific configuration can be used.

For use by MultiPartFormData and MultiPartByteRanges
@gregw gregw requested review from lachlan-roberts and sbordet June 26, 2023 21:48
@sbordet
Copy link
Contributor

sbordet commented Jun 27, 2023

@gregw sorry but MultiPartBytesRanges is critically used in ResourceService to handle ranged responses.
It will also be used on the client to parse those, so it's not true that it is not used.

The usage of MultiPartFormData is in blocking style any bypasses the Content.Source

Not sure what you mean, as all MultiPartFormData APIs are non-blocking?

I'm pretty much -1 on this PR, I don't see the savings, the two classes used in different contexts, and while there is a little duplication, I don't think it's worth an abstract class, also considering that the current duplication may not be a duplication anymore if other parameters are added -- for example max parse length, etc.

I also don't see your point about double parsing?
If DelayedHandler.UntilMultiPartFormData does the parsing, it should store the parts as an attribute, and other code should probe for the attribute first.
Trying to read the source again should correctly yield EOF, like it does for forms fields and getParameters().

@gregw
Copy link
Contributor Author

gregw commented Jun 27, 2023

@sbordet please take time to understand before you -1. Currently in too many places we are taking back to blocking... Which is so stupid after making so the effort to make our async so much better. By all means give feed back, but hold the veto until I've been able to answer your questions.

@gregw
Copy link
Contributor Author

gregw commented Jun 27, 2023

@sbordet wrote:
Don't like this because it makes MultiPartFormData non-reusable

MultiPartFormData is a CompletableFuture, so it has never been reusable.

and it is also weird because a constructor dependency typically implies "I need this component to work", while here the source it's not necessary at all

What I find weird is that it contains code to efficiently/asynchronously consume input from a Content.Source in the class, but then that is not actually used and instead we have hard coded blocking IO code or other Content.Source code that does the reading, creates fake chunks and then calls parse(Chunk) directly. Either the class is an IO class or it is not. It should not be used by code that is doing it's own IO. Either we have just a pure parser, or we always let the class do the IO. Since we have some great utility methods to turn other input sources into Content.Sources, then I think that is the best way to go. The less often we have to write the same IO loops is a good thing.

it is only necessary when parsing it.

That's what MultiPartFormData is! A parser! It is the MultiPartFormData.Parts class that is the thing used afterwards which is not-a parser and doesn't contain IO code. Perhaps it is a little strange that the parser class is not called "XxxParser" and contains the end product class.... but that predates this PR and I'm open to renaming and restructuring that if it makes sense.

It is also confusing when using MultiPartFormData to parse chunks, as it would happen in the client. What would be the source passed to the constructor used for?

Nope, it is the otherway around. It is confusing that we have code that is directly managing chunks to be parsed. We already have an abstraction for a sequences of Chunks, it is the Content.Source and we should just use that rather than inventing other ways to feed in chunks.

MultiPartBytesRanges is critically used in ResourceService to handle ranged responses. It will also be used on the client to parse those, so it's not true that it is not used.

See I missed that because it was using the class in a strange way. It was not calling parse(Content.Source).

I'm not saying that we get rid of the class. I'm saying that we factor out the duplicate code with MultiPartFormData and we avoid having it be used in two different modes.

The usage of MultiPartFormData is in blocking style any bypasses the Content.Source
Not sure what you mean, as all MultiPartFormData APIs are non-blocking?

The code that is reading an InputStream, converting it to Chunks and then calling parse(Chunk) is blocking. OK it will always be blocking because it is an InputStream, but it should just use InputStreamContentSource and avoid duplicating the IO code.

I'm pretty much -1 on this PR, I don't see the savings, the two classes used in different contexts,

Hey - as I said before. No veto until you've understood and given me a chance to answer your questions. Besides, it is an experiment tagged for 12.0.1, so no need to decide either way now.

and while there is a little duplication, I don't think it's worth an abstract class, also considering that the current duplication may not be a duplication anymore if other parameters are added -- for example max parse length, etc.

The IO code is duplicated in both classes, and then not used in several places where more IO code is written, duplicating either the same IO code again or other utility IO code that we have. Perhaps the IO code should not be in either class? but either way, we have far too much duplication of very similar code that is absolutely crucial to both performance and correctness to get right. Having that code in well maintained utility or base classes is a good thing to do. Managing the whole conversation between turning a Content.Source into a CompletableFuture via some kind of parser is a pattern repeated several times (even just creating a UTF8 string) and can benefit from having a single implementation. Perhaps this PR is not the right abstraction.... that's why it is labelled experimental and tagged for 12.0.1.

I also don't see your point about double parsing? If DelayedHandler.UntilMultiPartFormData does the parsing, it should store the parts as an attribute, and other code should probe for the attribute first. Trying to read the source again should correctly yield EOF, like it does for forms fields and getParameters().

OK I missed that ServletMultiPartFormData is doing the attribute lookup for the MultiPartFormData, so in that case it is would not double parse if the delayed handler (or similar) was used. But that is the wrong class to put that code in (hence I didn't find it). It should be in MultiPartFormData. Also we have a problem of how to get the multipart config to any delay style handler, so that it is correctly used if we asynchronously parse before calling the servlet. We need to solve this problem... somewhat orthogonal, but related.

@sbordet
Copy link
Contributor

sbordet commented Jun 27, 2023

MultiPartFormData was supposed to be a "container" class for both parsing and generation, like MultiPart is.

Agreed on single-use of MultiPartFormData -- it should be used on-the-fly, so passing the source to constructor is fine, or perhaps even going for static if we introduce a Parser class.

I'm not opposed to having MultiPartFormData as a "container" class and inner classes Parser and ContentSource (for generation) and Parts for results.

Agreed that parse(Chunk) is wrong -- I think it was there to be used by the client, but we have since converted the client to a pull model as well, so rather than onContent(Chunk) now has onContentSource(Content.Source), which is symmetrical to the server.

See I missed that because it was using the class in a strange way. It was not calling parse(Content.Source).

It's not using it in a strange way 😃
The server does not parse MultiPartByteRanges (typically), it generates them, so it is natural that it uses MultiPartByteRanges.ContentSource which is the generator.
The client should parse MultiPartByteRanges but we apparently don't have tests with a clear client usage.

About the veto, I think I'm using -1 too liberally as in "I don't agree with this PR, but open to discussion" -- I'll use -0 in future.

It [putting Parts as a request attribute] should be in MultiPartFormData.

I disagree. MultiPartFormData is used by the client too (as well as MultiPartByteRanges) so it should not have anything related to server-side specific stuff.
I maintain that ServletMultiPartFormData is the right place because it's server-side and Servlet specific.
For server-side core APIs, IIRC we settled on the fact that the application must do form-data parsing explicitly because we don't provide explicit APIs for parts (while Servlets do), so there is always this pseudocode for those apps:

if (content-type is multipart/form-data) {
  try lookup attribute
  else explicit parsing
} else {
  // some other request type
}

If this style is confirmed, then we either have a ServerMultiPartFormData for core where we store the attribute, or we document the single line necessary to look up the attribute, which I think is simpler, at least for now.

On the client, we don't need the attribute as nothing parses multipart if not the application itself.

Accessing the Servlet multipart config from a delayed handler I think would be hard... not a big fan of delaying TBH (I prefer lazy for this case).
If DelayedHandler.UntilMultiPartFormData can exist after ServletContextHandler great (would be able to access Servlet configuration), otherwise I think I would prefer to remove DelayedHandler.UntilMultiPartFormData, possibly to reintroduce it later after more thinking.

@gregw
Copy link
Contributor Author

gregw commented Jun 27, 2023

MultiPartFormData was supposed to be a "container" class for both parsing and generation, like MultiPart is.

There is something a bit strange, or at least different in how we have a class that is a container, a parser and a generator. We might needs some restructuring here, or perhaps just some renaming? or maybe a bit of both.

However, that might be for another PR. I started this PR because of the duplication I saw in the IO patterns. Thus my main goal is to generalise the IO that is done around these types of classes. Once that is done and agreed, I think we can look to see if this PR should also include some restructuring/renaming or if that is separate.

Agreed on single-use of MultiPartFormData -- it should be used on-the-fly, so passing the source to constructor is fine, or perhaps even going for static if we introduce a Parser class.

Or is the CompletableFuture abstraction something separate that just happens to use the parser to provide a container in the future?

I'm not opposed to having MultiPartFormData as a "container" class and inner classes Parser and ContentSource (for generation) and Parts for results.

Agreed that parse(Chunk) is wrong -- I think it was there to be used by the client, but we have since converted the client to a pull model as well, so rather than onContent(Chunk) now has onContentSource(Content.Source), which is symmetrical to the server.

See I missed that because it was using the class in a strange way. It was not calling parse(Content.Source).

It's not using it in a strange way smiley The server does not parse MultiPartByteRanges (typically), it generates them, so it is natural that it uses MultiPartByteRanges.ContentSource which is the generator. The client should parse MultiPartByteRanges but we apparently don't have tests with a clear client usage.

OK - bad choice of words. But see above, it can be a surprise to see a class that looks like it is primarily a parser, but then discover it is primarily used as a generator. Elsewhere we are very explicit with the Parser/Generator naming. We have a different pattern here, without clarity or consistency. The pattern might be fine, but we need better consistency to help with clarity.

About the veto, I think I'm using -1 too liberally as in "I don't agree with this PR, but open to discussion" -- I'll use -0 in future.

Definitely too liberal with the -1's. I also think that a draft PR of an experimental idea targeted for 2 or 3 releases in the future also doesn't need a -0. This PR or anything that comes from it is weeks if not months away from needing any ± decisions. Questions, dislikes, clarifications are all worthwhile at this stage, but a -0 helps little with a draft of an experiment.

It [putting Parts as a request attribute] should be in MultiPartFormData.

I disagree. MultiPartFormData is used by the client too (as well as MultiPartByteRanges) so it should not have anything related to server-side specific stuff. I maintain that ServletMultiPartFormData is the right place because it's server-side and Servlet specific.

Yes and no. Anything that moves to servlet land immediately needs at least 2 versions of it, so trying to keep things core is worthwhile. But then the config class itself it a servlet API class. So we need an simple/elegant way for a core handler to suck servlet specific configuration out of a context in a typeless way. Currently no good ideas for this, so 2 API dependent versions might be the best.

Accessing the Servlet multipart config from a delayed handler I think would be hard... not a big fan of delaying TBH (I prefer lazy for this case). If DelayedHandler.UntilMultiPartFormData can exist after ServletContextHandler great (would be able to access Servlet configuration), otherwise I think I would prefer to remove DelayedHandler.UntilMultiPartFormData, possibly to reintroduce it later after more thinking.

Lazy is not good enough. Once we get to servlet-land, then we are naturally blocking. We can go back to async, but not without dispatch type changes that makes anything we do less transparent.

I'm not a big fan of the DelayedHandler in its current form, as I think it is trying to do too much for too many different situations. I think we should just give up on the whole delay dispatch until the first data optimisation.. or at last just have a handler dedicated to just that. What this PR is about is ensuring that we can delay dispatch until the entire content is asynchronously consumed (and written to temporary files etc.) before invoking the servlet. Perhaps we should call this "pre-parsing" or "pre-loading" rather than "delaying"?

Pre-loading is best done in a core handler before hitting the ServletHandler. Now, after recent changes, it can be done after the ServletContextHandler, so the servlet itself will be known along with any multipart configuration.
This has always been the target of my long game for ee10: to do as much as possible in core handlers before hitting the ServletHandler. Thus we now now core sessions and core security and we can wrap core request/responses after the ServletContextHandler. All that work has been done precisely so we can do fast simple handling in core handlers for things like asynchronous reading of multipart. Hey eventually I want to get rid of the blocking code that invokes the ResourceHandler from the DefaultServlet, so that in the same way the ResourceHandler can asynchronously serve static content if it has not been filtered/wrapped.

But there is no hurry on this stuff, it is all internal and can be done after 12.0.0. I only worked on it yesterday and today as I had 4 hours on a train and didn't want to work on anything more critical.

@gregw gregw requested a review from sbordet June 28, 2023 16:05
Copy link
Contributor

@sbordet sbordet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
Needs MultiPartByteRanges to be updated in the same style.

@gregw gregw requested a review from sbordet June 28, 2023 22:16
@gregw gregw marked this pull request as ready for review June 29, 2023 13:29
@gregw
Copy link
Contributor Author

gregw commented Jun 29, 2023

@lorban @sbordet I think this is getting closer to being ready to merge. I still don't have an EagerHandler, but the from method is better setup to support that with servlet configuration now. I've removed the multipart aspect to the DelayedHandler

* An example usage to asynchronously read UTF-8 content is:
* </p>
* <pre>
* public static class FutureUtf8String extends ContentSourceCompletableFuture<String>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to replace < and > with &lt; and &gt; respectively in the template type declaration.

return from(request, request, charset, maxFields, maxLength);
}

static CompletableFuture<Fields> from(Content.Source source, Attributes attributes, Charset charset, int maxFields, int maxLength)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs some serious javadoc'ing.

@@ -138,6 +140,7 @@ private static int getRequestAttribute(Request request, String attribute)

public FormFields(Content.Source source, Charset charset, int maxFields, int maxSize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it looks like this class is meant to be instantiated with the static from*() methods, maybe the ctor should be private?

@@ -33,7 +35,7 @@
* A {@link CompletableFuture} that is completed once a {@code application/x-www-form-urlencoded}
* content has been parsed asynchronously from the {@link Content.Source}.
*/
public class FormFields extends CompletableFuture<Fields> implements Runnable
public class FormFields extends ContentSourceCompletableFuture<Fields>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the static methods lack javadoc.

* @return The parsed {@code X} instance or null if parsing is not yet complete
* @throws Throwable Thrown if there is an error parsing
*/
protected abstract X parse(Content.Chunk chunk) throws Throwable;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That method is supposed to do both parsing and accumulation. I'm tempted to suggest a new name like parseAndAccumulate or simply accumulate.

At least, this should be more clearly javadoc'ed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it just parses... and many parsers have an element of accumulation. However, an implementation is free to call chunk.retain rather than accumulate. I'll javadoc that.

Copy link
Contributor

@lorban lorban Jun 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retaining the chunks is a form of accumulation, but you're right that parsing implicitly implies some sort of accumulation too.

Ok to keep the parse name, and a bit of javadoc stating this method is supposed to do some accumulation should be enough.

* {@link org.eclipse.jetty.io.Content.Chunk#isFailure(Content.Chunk) failure chunk}
* @return True if the chunk can be ignored.
*/
protected boolean ignoreTransientFailure(Throwable cause)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not expose this for now, so I'd prefer to see this method be private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of these changes is to allow us to experiment with transient failures.
Specifically, it is removing many different read loops that might not be transient failure safe, and replacing them with this loop that is.

if (cause instanceof IOException ioException)
throw ioException;

throw new ServletException(new BadMessageException("bad multipart", cause));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not directly throwing BadMessageException? Wrapping it in ServletException to later unwrap it looks strange.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because Servlet API says that is what we should throw to allow app to catch and handle. We wrap a BadMessage to make sure we send a 400 if the app does not catch.
Yes it is cumbersome... and probably fragile with regards to the TCK for failures.

{
if (!super.handle(request, response, callback))
callback.failed(new IllegalStateException("Not Handled"));
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just return the boolean from super?
The client made a request to the wrong URI, so I'd say it's a 404.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because we have already returned true.

@@ -155,6 +155,11 @@ public ServletHolder(Class<? extends Servlet> servlet)
setHeldClass(servlet);
}

public Object getMultipartConfig()
Copy link
Contributor

@sbordet sbordet Jun 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why returning Object?

Also getMultiPartConfig() capital P.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the type, but it is lowercase p to match the Servlet API type now: MultipartConfigElement

*/
public static Parts from(ServletApiRequest request) throws IOException
public static CompletableFuture<Parts> from(ServletRequest servletRequest)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not this be HttpServletRequest? As parts are HTTP things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will have to be-a HttpServletRequest, but just going for the lowest common denominator as elsewhere in the code.


future.whenComplete((result, failure) ->
{
try
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There must be a check on failure to not call super if we are already failed.

In case of success, we want to store the parts as an attribute, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result or failure are handled by ServletApiRequest#getParts. I'll comment.

@gregw gregw requested review from lorban and sbordet June 30, 2023 09:54
Copy link
Contributor

@lorban lorban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost good to go.

@gregw gregw requested a review from lorban June 30, 2023 10:34
lorban
lorban previously approved these changes Jun 30, 2023
Copy link
Contributor

@lorban lorban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gregw gregw merged commit ec2dbe7 into jetty-12.0.x Jun 30, 2023
@gregw gregw deleted the experiment/jetty-12-ContentSourceCompletableFuture branch June 30, 2023 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants