Revisiting, speeding up, and simplifying API Umbrella's architecture #86
Comments
This is a bold and visionary move! Using Lapis and api-umbrella could be a really powerful combination to create fast, responsive API services. I wonder if both could share the same OpenResty instance?
For reference, some additional feedback was gathered via e-mail on the US Government APIs mailing list: https://groups.google.com/forum/?nomobile=true#!topic/us-government-apis/5QcbBKKD4dk
@jtmkrueger While technically a Lapis API could share the same OpenResty instance as API Umbrella, that's probably not something we would support. But you could certainly run another OpenResty instance on the same server and still proxy to that with API Umbrella with pretty minimal overhead (that's essentially what these benchmarks were doing). The reason we don't particularly want to support embedding APIs in the same server instances as API Umbrella is that I think it would add complexity to how we currently package and deploy API Umbrella. We've seen considerable benefits in making API Umbrella more of a standalone package that you just install and run as a whole, rather than the server administrator having to concern themselves with API Umbrella's internals and trying to mesh those with other pieces that may already exist on servers. If we tried to allow sharing a single OpenResty server, I think it just gets trickier to package things up or manage them in a way where we wouldn't step on each other's toes. But again, you could certainly run a separate OpenResty process on the same server or other servers to provide your APIs, and API Umbrella could proxy to that just like any other API backend.

But I think this does bring up an important point, which is that if we did pursue this prototyped rearchitecture, it would be backwards compatible from a usage and administration perspective with the current version of API Umbrella. We would just provide new binary packages that could be installed, and existing installations should be upgradeable just like any other package upgrade. Again, by treating API Umbrella as more of a standalone package, I think that has some advantages over having server admins set up all the individual components. In this case, it gives us a little more latitude to pick which dependencies we need, or even do something wild, like swap an entire component's implementation (as long as the functionality is the same), with minimal impact on users and administrators of API Umbrella.
Great writeup :-) How might the OpenResty approach affect server memory overhead, generally speaking? Is there any way to efficiently leverage Node.js libraries using the OpenResty approach?
+1 This seems like a great move, and well explained. What do you think the effect will be on the other contributors to the project (are there many that aren't NREL-based) - i.e. would the switch from Node to Lua cut down on existing external contributors? I don't think implementers of api-umbrella would have too much of a problem with an internal change that improves performance.
I think your new architecture is very sound and Nginx is definitely the right horse to be hitching your wagon to. As an added benefit with the significantly increased throughput, API Umbrella would be much better positioned to try and protect against DoS attacks. I do have similar apprehensions about Lua/OpenResty though:
All that said, what's the alternative? If we look at some of the languages designed for concurrency, there may be some suitable web server implemented in Go or Erlang that could be extended. But you're going to run into similar problems of a small ecosystem around a more esoteric web server. Not to mention that choosing Erlang would shrink your contributor base significantly. I have nothing against Erlang, but it's more of a niche language. So, let's take it as a given that Nginx is the right "container" for this proxy. To write an Nginx module, one would typically use C, but then you're losing a lot of the developer productivity benefits of a higher-level language. You could create some sort of C wrapper and embed some other concurrent scripting container, but why bother when that's what you're already getting with Lua using OpenResty. So, although I have some apprehensions about OpenResty, I don't see a better option either.
> Nginx developers are planning core JavaScript support. We're planning JavaScript configurations, using JavaScript in [an] Nginx configuration. We plan to be more efficient on these [configurations], and we plan to develop a flexible application platform. You can use JavaScript snippets inside configurations to allow more flexible handling of requests, to filter responses, to modify responses. Also, eventually, JavaScript can be used as [an] application language for Nginx.

In the meantime, assuming JS is a candidate language, would it be possible to port some of the Umbrella components using ngx_http_js?

I opened a ticket on the Nginx Trac requesting clarification of plans for and progress towards native JavaScript support.
Thanks all! Some belated responses:
This is something I plan to benchmark, so I have more concrete numbers, but very loosely speaking, I think memory usage should be quite a bit lower with this OpenResty approach. The processes weren't super memory hungry before, but the new architecture does cut down on the number of processes quite a bit, so I think it should consume less memory (since we'd no longer be running Varnish, another nginx process, or multiple node.js processes). The main nginx worker processes would consume more memory in return, but I think we should see a somewhat substantial reduction in memory overall. But again, this is something I need to properly document and benchmark.
No, not really. Since this OpenResty approach would shift everything towards Lua, we would no longer be depending on Node.js.
Thanks for bringing this to my attention. I had not heard about these JS in nginx plans before. When I Googled around, I couldn't find much more information than that quote in InfoWorld, so thanks for filing the issue asking for more details. This would certainly give us more to think about if this is happening in nginx core. However, without knowing more details, I'm somewhat apprehensive to put all our eggs in that theoretical basket. One of my primary questions would be whether or not nginx's implementation would actually work with Node.js, or if they would be doing something different (for example, I don't think the existing ngx_http_js module would support Node.js libraries). Doing something different wouldn't necessarily be a bad thing, but in that case, it seems like the library ecosystem would be back to square one, in which case I don't see a huge advantage over OpenResty.
@ghchinoy: Thanks for the feedback! Regarding contributors, that is something I was hoping to get a sense of in this issue. Currently, we don't have many external contributors to the primary code base. Of course, we'd love to see that change if there's community interest in this project, so that is why I wanted to open this issue and try to gauge the current community's take on Lua. Because even if people aren't contributing now, we'd like to avoid making a change in direction that would be seen as a big detractor from people possibly contributing in the future. So you saying you'd be fine with Lua is precisely the type of thing we're interested in knowing. Thanks for the input!
@darylrobbins: Sorry for the delay, but I appreciate the feedback! And I think your thoughts pretty much mirror my own in terms of apprehensions, alternatives, and fit. Overhauling our platform to use Lua and OpenResty certainly gives me pause, but the more I dwell on it, the more I think it feels better than some possible alternatives, and I think it's a good fit for what we're trying to accomplish with our proxy layer.
v0.9 has been released with these architecture updates. Further details: #183 https://github.com/NREL/api-umbrella/releases/tag/v0.9.0
Short(ish) Overview
I've prototyped a somewhat significant overhaul to API Umbrella's architecture that simplifies its operation and speeds things up. Because this would change quite a bit of API Umbrella's current code base, I wanted to open this issue to start gathering feedback.
The potential speed improvements appear quite substantial, and come by reducing the proxying overhead of having API Umbrella sitting in front of your underlying API backends. It's still very early, but benchmarks point to around a 10-50x reduction in our proxying overhead. On a test server, it takes the overhead of having API Umbrella sitting in front of an API from an average of 13ms down to 0.3ms (averages don't tell the whole story, though, so I'll get more into these numbers later).
Aside from the speed increases, this also simplifies API Umbrella quite a bit by reducing the number of components and dependencies. I think this could make API Umbrella easier to run, operate, and maintain. It should also perform better while using fewer server resources. And finally, I think all this simplifies the codebase and cleans up some of the more complicated pieces of functionality in our current code.
So this all sounds great, right? What's not to like? In my view, the main downside is that it would be a somewhat significant shift in architecture, and that obviously has implications and repercussions.
But before diving into the technical details, here are some random, high-level questions that come to mind:
We'd love to get your feedback on any of this, so please share if you have any thoughts (and don't feel obligated to read the rest of this short novella of technical mumbo-jumbo).
To reiterate, this is still an early prototype. The current benchmarks aren't very rigorous right now, so the speed numbers are subject to change. However, I do believe the benchmarks are in the right ballpark of how much faster this could make API Umbrella (somewhere between 10x-50x).
Longer Version
From here, I'll dive into the nitty-gritty of what our current architecture looks like and what I'm proposing. Along the way, I'll probably write far too much text and bore you to death. Sorry!
For quite a while, I've had some ideas on how to speed up and simplify API Umbrella rolling around in my head. A couple weekends ago, I decided to try to finally start playing around and put some of those ideas to use. The basic premises behind the optimizations are:
To be fair, we could optimize our current implementation using #2 and #3 without rewriting it in Lua, but there are some reasons, which I'll get to, why using Lua inside nginx makes these optimizations easier.
Aside from the speed gains, I think the other important thing to consider in this rearchitecture is operational simplicity of API Umbrella. Even if it weren't for the speed gains, I think all of these changes actually lend themselves to making API Umbrella easier to run, manage, and debug.
Current Architecture
Let's start with how things currently look:
Proposed Architecture
With these changes, we basically squash all of that down into:
How & Why
So you might be wondering why in the world we have all those pieces in the current architecture, and why we can just squash them all into one component now. I think the best way to explain the changes is to step through each component in the previous architecture, explain why it existed, and then detail how it's being handled in the new architecture.
nginx: Initial Routing / Load Balancing
Instead of incoming requests being directly handled by our Node.js Gatekeeper, this initial nginx server was in the stack for a couple primary reasons:
In the rearchitecture, we're still hitting nginx first, so this first step doesn't really change. We're just able to remove more of the pieces behind this by embedding more functionality directly inside nginx.
Node.js: Gatekeeper
This is our reverse proxy layer where we've implemented most of our custom logic. This includes things like API key validation, rate limiting, etc.
In the rearchitecture, nothing really changes about what this layer does; it just shifts the implementation to Lua instead of Node.js. That in turn allows it to be embedded inside nginx.
So why nginx and Lua instead of Node.js?
Lua vs Node.js Implementation Comparisons
If you're curious what the Lua code looks like, here are a couple of quick comparisons of equivalent features:
You'll notice that our implementation logic remains pretty similar, so the move to Lua doesn't really mean we're throwing everything out. It's largely a translation of our current code base, just in a different language. It's also given us an opportunity to clean some things up with our old codebase.
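To give a rough flavor of the style (this is an illustrative sketch only, not API Umbrella's actual code -- the shared dict, location, and key-lookup logic are all hypothetical), an API key check embedded in nginx via OpenResty might look something like this:

```nginx
# Hypothetical sketch: API key validation embedded in nginx via OpenResty.
http {
    # Local shared-memory cache of known API keys (name and size are made up).
    lua_shared_dict api_keys 10m;

    upstream api_backend {
        server 127.0.0.1:8081;
    }

    server {
        listen 8080;

        location / {
            access_by_lua_block {
                local api_key = ngx.var.arg_api_key
                    or ngx.req.get_headers()["X-Api-Key"]
                if not api_key then
                    ngx.status = ngx.HTTP_FORBIDDEN
                    ngx.say("API key required")
                    return ngx.exit(ngx.HTTP_FORBIDDEN)
                end

                -- Check the shared-memory cache; a real implementation would
                -- fall back to a non-blocking database lookup here and cache
                -- the result.
                if not ngx.shared.api_keys:get(api_key) then
                    ngx.status = ngx.HTTP_FORBIDDEN
                    ngx.say("Invalid API key")
                    return ngx.exit(ngx.HTTP_FORBIDDEN)
                end
            }

            proxy_pass http://api_backend;
        }
    }
}
```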
Varnish: HTTP Cache
This serves as the HTTP caching layer in our stack. It needs to be placed behind the Gatekeeper, so we can enforce API key validation, rate limits, etc before allowing cache hits. I've liked using Varnish in the past, so that's largely how it landed in our stack here.
In the rearchitecture, we're going to use nginx's built-in proxy_cache (or possibly ledge). In either case, it will be a cache embedded inside nginx. This is one feature I haven't tackled in the current prototype, and this area still needs more exploration. However, I think one of those two options will give us the caching capabilities we need directly inside nginx. Functionally, the cache should do the same thing (since HTTP caching is a pretty standard thing); this again just simplifies things and removes the need to also run Varnish.
One of my main hesitations previously about using nginx's built-in cache was the lack of purge capabilities. However, through plugins, I think this can be addressed. So we'd probably use something like ngx_cache_purge, nginx-selective-cache-purge-module, or Ledge, to provide a purge API endpoint for administrators.
And while Varnish's banlists may technically be superior to straight purges, and Varnish's caching capabilities may also be more robust, I like the simplicity gained by making nginx our default cache implementation. And if someone is super keen on Varnish or other caching servers, there's no reason they couldn't still run those behind the scenes instead.
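For what it's worth, here's a minimal sketch of what the nginx-based cache with a purge endpoint might look like, assuming the third-party ngx_cache_purge module is compiled in (the zone names, paths, and TTLs are all hypothetical):

```nginx
# Hypothetical sketch: nginx proxy_cache plus a purge endpoint (ngx_cache_purge).
http {
    proxy_cache_path /var/cache/api-umbrella levels=1:2 keys_zone=api_cache:10m max_size=1g;

    upstream api_backend {
        server 127.0.0.1:8081;
    }

    server {
        listen 8080;

        location / {
            proxy_cache api_cache;
            proxy_cache_key "$scheme$host$request_uri";
            # Fallback TTL for 200s when the upstream sends no cache headers;
            # Cache-Control/Expires from the API backend take precedence.
            proxy_cache_valid 200 10m;
            proxy_pass http://api_backend;
        }

        # Admin-only purge endpoint provided by the ngx_cache_purge module.
        location ~ ^/purge(/.*) {
            allow 127.0.0.1;
            deny all;
            proxy_cache_purge api_cache "$scheme$host$1$is_args$args";
        }
    }
}
```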
nginx: API Routing / Load Balancing
Finally, we get to the last piece of our current "nginx sandwich". After hitting nginx, then our Gatekeeper, and then Varnish, we route back to nginx again to perform the actual routing of API requests to the API backend servers. Instead of using the Gatekeeper or Varnish for this routing, we go back to nginx for a few reasons:
Why not implement it in X (Java, Go, Erlang, Node.js, etc)?
So if we can squash all of our functionality down into a single component, why not use something else to implement that component? Why Lua and nginx? We already have a lot of our custom proxying stack implemented in Node.js, so why not do everything there? I've touched on some of those details already for Node.js, but in all these cases it basically boils down to this: this layer of API Umbrella is just a proxy with some additional logic in it. By embedding our logic inside nginx, we get to take advantage of nginx's proven proxying capabilities and features. While there may be libraries for nice reverse proxies in these other languages, in my experience they don't provide a lot of the features we need that we get from nginx (and those features would be non-trivial to reimplement). Here are a few examples of features we get with nginx that usually aren't included in the customizable, high-level proxy libraries I see in other languages (for example, node-http-proxy):
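As a rough illustration of what I mean (the directive values below are hypothetical, not tuned recommendations), things like load balancing, failover, retries, and buffering are each a directive or two away in nginx, whereas with a library like node-http-proxy you'd be implementing much of this yourself:

```nginx
# Illustrative only: a few proxying features nginx gives us out of the box.
upstream example_api_backend {
    least_conn;                        # load balancing strategy
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080 backup;       # automatic failover target
}

server {
    listen 8080;

    location / {
        proxy_pass http://example_api_backend;
        proxy_next_upstream error timeout;  # retry another server on failure
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        proxy_buffering on;                 # shield backends from slow clients
        gzip on;                            # response compression
    }
}
```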
Benchmarks
On some simplistic benchmarks, the average overhead of having API Umbrella proxying to an API drops from around 13ms in the old stack to 0.2-0.3ms in the nginx+Lua stack.
(Note: When I'm talking about benchmarking API Umbrella's "overhead," I'm referring to how much time API Umbrella adds to a raw API call locally--this does not account for network latency between users and API Umbrella or between API Umbrella and API backends--but there's not much we can do about those, so this is about optimizing what we can at API Umbrella's layer.)
Getting back to the averages above, there's a bit more to the story than what the averages tell. The current stack can be quite a bit faster than the 13ms average indicates (it can be as fast as 2-3ms), but a quarter of the requests it serves up are consistently much slower to respond (in the 40ms range), which drives the average up. We could certainly get to the bottom of where those periodic slowdowns are coming from in the current stack (since I'm not sure we've always had those slowdowns), but a few notes on that:
Here's what I did for the tests:
Onwards to the benchmark details:
Baseline (Direct API)
Current Stack (Node.js + nginx + Varnish)
Overhead: 13.012 milliseconds on average (13.195 ms average response times for this test - 0.183 ms average response times for the baseline)
However, what's of particular note is the standard deviation and the percentile breakdowns. When I first started seeing these average overhead numbers, they seemed much higher than what I remembered the last time I was profiling API Umbrella's internals. And to some degree the average is skewed high--50% of requests are served in 2-3ms, but the problem becomes evident once you look at the percentile breakdowns--around 20% of our requests are served in the 42-45ms range. That's a not-insignificant number of requests that are much slower.
Prototyped Stack (Lua + nginx)
Overhead: 0.235 milliseconds on average (0.418 ms average response times for this test - 0.183 ms average response times for the baseline)
Also note that 98% of requests were served in less than 1ms. 99% served in ~1ms or less. The max is 41ms, but I think that represents much more of a true outlier, since on other test runs I've had the max be around 6ms.
History / Cautionary Tale
I think it's worth touching briefly on the history of API Umbrella's architecture, since there are perhaps some lessons to be learned there.
From 2010-2013, API Umbrella's core proxying was composed of Ruby EventMachine and HAProxy. In 2013, we decided to switch over to Node.js and nginx.
The switch from HAProxy to nginx is perhaps the simplest to explain: nginx has connection pooling/caching for backend keepalive servers while HAProxy does not. HAProxy is still a fantastic piece of software, and it has other features I wish nginx had, but that keepalive feature has particular performance benefits for us, since we're proxying to remote API backends in different datacenters. This ability to hold open backend server connections helps speed things up for our specific use case.
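To make that concrete, here's roughly what that feature looks like in nginx configuration (the names and numbers are hypothetical); holding idle upstream connections open avoids paying the TCP handshake cost to a remote datacenter on every request:

```nginx
# Illustrative sketch of nginx's upstream keepalive connection pooling.
upstream remote_api_backend {
    server api.example.com:80;
    keepalive 10;    # keep up to 10 idle connections open per worker

}

server {
    listen 8080;

    location / {
        proxy_pass http://remote_api_backend;
        # HTTP/1.1 and a cleared Connection header are required for
        # upstream keepalive to take effect.
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```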
The reason for switching from Ruby EventMachine to Node.js was also pretty straightforward. If you're unfamiliar with Ruby EventMachine, it offers an event-driven programming environment using the Ruby language. Node.js is also event-driven, so there are a lot of parallels in how the two function (non-blocking I/O, async callbacks, etc.); it's just a different syntax. But the primary thing Node.js has going for it is that it was built from the ground up with this non-blocking mentality in mind, and so the entire ecosystem is non-blocking (database adapters, etc.). With Ruby EventMachine, you're sort of split between two worlds--there are a plethora of standard Ruby libraries out there, but they may or may not block your application and harm your performance. There are certainly eventmachine-specific libraries available, but at the time, that ecosystem was much more nascent. It got frustrating not being able to use highly popular "normal" Ruby libraries for something like Redis connections, since those libraries performed blocking connections (or you could use them, but the performance of your app would suffer). And while there might have been an alternative eventmachine-based Redis library available, it was usually newer, not as well supported, lacking features, etc. The eventmachine scene has maybe changed over the past couple years, but at the time, it seemed like switching to Node.js was the best bet, since the entire ecosystem is oriented around non-blocking I/O.
I mention this because in some ways a Lua+nginx/OpenResty architecture would put us back in a similar situation to where we were with Ruby EventMachine. nginx+Lua is also event-based and must use non-blocking libraries. However, there are plenty of "normal" Lua libraries floating around out there that do block. For situations where blocking is a concern (which isn't all the time), you then have to really look for the OpenResty-compatible libraries, since those have been built with nginx in mind. So if Lua's ecosystem wasn't already small enough, the OpenResty ecosystem is even smaller and less mature. The libraries supported by the core OpenResty team are of great quality, but the others seem to be of varying quality and popularity. As a quick example, I've already encountered bugs with the Resty MongoDB library that I've had to patch and submit fixes for.
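To make the blocking distinction concrete, here's a sketch of a non-blocking Redis call using the stock lua-resty-redis library (the key and location below are made up): because it's built on nginx's cosockets, the worker can serve other requests while waiting on Redis, whereas a "normal" LuaSocket-based client would stall the entire worker.

```nginx
location /redis-example {
    content_by_lua_block {
        -- lua-resty-redis is built on nginx cosockets, so this call yields
        -- to the event loop instead of blocking the whole worker.
        local redis = require "resty.redis"
        local red = redis:new()
        red:set_timeout(1000)  -- 1 second

        local ok, err = red:connect("127.0.0.1", 6379)
        if not ok then
            ngx.log(ngx.ERR, "failed to connect to redis: ", err)
            return ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
        end

        local value, err = red:get("example_key")
        if err then
            ngx.log(ngx.ERR, "failed to get key: ", err)
            return ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
        end

        red:set_keepalive(10000, 100)  -- return the connection to the pool
        ngx.say(value ~= ngx.null and value or "not found")
    }
}
```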
So the idea of joining an ecosystem where some of the libraries aren't mature (or are missing entirely) doesn't make me super excited. But on the other hand, the OpenResty core is very solid, and we don't actually need a lot of other functionality or libraries in our proxy layer. Plus, some of the functionality OpenResty does provide is somewhat unique and makes our lives easier.
I think it's also worth noting that while the nginx+Lua/OpenResty ecosystem might be small, one of the main drivers of the entire platform is CloudFlare. They seem to run a lot of their services with this stack, and they do a lot of truly impressive stuff with it (and open source quite a bit of it). So it at least makes me feel a little bit better that they're invested in this stack and do a lot to contribute to the open source OpenResty world. Anecdotally, it also seems like I've seen more interest and growth in the OpenResty platform as a whole over the past couple years. This might just be due to me looking more into it recently, but I had actually looked at OpenResty a couple years ago when I was contemplating the switch to Node.js. I skipped over it at the time, since it still seemed too new and niche, but these days it seems like I see more being done with it, and I'm a little more comfortable with where the ecosystem is (albeit, it's still admittedly small).
Conclusion
So… Have you made it through my rambling (or at least skipped to the bottom)? We have an early prototype of running API Umbrella with Lua embedded inside nginx, but where do we go from here? What exactly are the pros and cons of such an architecture? Here are some I can think of:
And I guess to summarize my own personal opinion, I'm slightly favoring this nginx+Lua implementation, primarily because of how much it can simplify our server stack, which I think will make it much easier to run and maintain API Umbrella. The speed increases are also quite nice. The parts that make me nervous are the more nascent OpenResty ecosystem and whether Lua would be off-putting to potential contributors. I think those issues are surmountable, but I'm still not sold one way or another. Which is why we'd love to get your feedback!
So if you have any feedback, additional questions, comments, etc, feel free to leave them here. Or if all this has just been far too many words for you, you're welcome to take a nap.
Thanks!