Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add http proxy support to request_uri() #112

Merged
merged 4 commits into from
Dec 23, 2017

Conversation

sjakthol
Copy link
Contributor

These changes introduce support for making HTTP requests through a
forward proxy when using the high-level request_uri() API.

The new set_proxy_options() function can be used to configure HTTP proxy
settings. This method takes a table as a parameter that can include the
following fields:

  • http_proxy - an URI to a proxy that should be used for http:// requests
  • https_proxy - an URI to a proxy that should be used for https:// requests
  • no_proxy - a comma separated list of hosts / IPs that should not go through
    the configured proxy

When request_uri() is called with an http:// URI and http_proxy has been
configured for the client instance, the connection will be established to
the proxy host. Once the connection has been established, the HTTP request
is sent to the proxy host as if the peer on the other end of the connection
was the remote server. The only difference to the non-proxy case is that
the path sent to the proxy is actually a full URI instead of a relative
path.

When request_uri() is called with an https:// URI and https_proxy has been
configured for the client instance, the connection will be established to
the proxy host. Once the connection has been established, we go ahead and
perform a CONNECT request to open a TCP tunnel to the remote server. If
the proxy gives a success response, to the CONNECT request, the client
continues to perform the TLS handshake with the remote server and making
the request as if there was no proxy configured.

Some tests have been included which verify that the proxy options are
interpreted correctly in different cases and that http_proxy works as
real proxies expect. This is about as much there is to test without
actually including a real forward proxy in the test harness. Manual
testing has been performed against squid and tinyproxy.

Fixes #63

@pintsized
Copy link
Member

Thanks for this, will try to review shortly (I have a bit of a backlog to deal with).

local res, err = self:request({
method = "CONNECT",
path = destination,
headers = {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Miss User-Agent, because request method will add a default User-Agent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was not able to parse your comment. Is there an issue with how User-Agent header is set in self:request() for CONNECT requests?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really sorry so late reply to you, I think this is a issue like you said.

@mikz
Copy link

mikz commented Jul 10, 2017

Looks pretty good. I'd be worried about the HTTPS support though. Could you add tests using real TLS connection? Or I could contribute them if you want.

You can easily embed SSL certs into the Test::Nginx test: https://github.com/3scale/apicast/blob/7df0f152b79eabddad15cfa6db0ff3f71d33bb18/t/003-apicast.t#L319-L358

@@ -776,11 +776,65 @@ function _M.request_uri(self, uri, params)
if not params.path then params.path = path end
if not params.query then params.query = query end

local c, err = self:connect(host, port)
-- See if we should use a proxy to make this request
local proxy_host, proxy_port;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trailing semicolon is pretty uncommon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad habit from elsewhere. Will remove

if not c then
return nil, err
end

if proxy_uri and scheme == "https" then
Copy link

@mikz mikz Jul 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about extracting handling of the proxy_uri into own function?

Like

local c, err

if proxy_uri then
  c, err = self:connect_proxy(proxy_uri, host, port, params)  
else
  c, err = self:connect(host, port)
end

Not only that would split the code into logical segments, but also would allow use of that function as low-level API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good idea. Let's see if I can find some time to refactor this one.

if proxy_uri and scheme == "http" then
-- http proxies expect to see the full URI in the request line
if port == 80 then
params.path = scheme .. "://" .. host .. path
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the uri parameter?

What I mean is is it necessary to create new string on every request? Wouldn't it be just fine to reuse the uri parameter passed to this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between uri and this is that the uri may contain a query string but this one does not.

Both the path and the query string was extracted from the URI earlier in the function and added to the params table: https://github.com/pintsized/lua-resty-http/blob/master/lib/resty/http.lua#L801

Later on when the actual request is made, the code (https://github.com/pintsized/lua-resty-http/blob/master/lib/resty/http.lua#L275) appends params.query to the value of params.path to form the request line. If we were to set params.path = uri, the query string extracted from the URI would be appended at the end of the uri causing all kinds of breakage.

So params.path should be set to the uri without the query string. And that's what this concatenation here does.

I should probably add a comment here that explains why this is needed. It took me a few moments to figure out why did I do this here.

-- the proxy
local destination = host .. ":" .. port
local res, err = self:request({
method = "CONNECT",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to point to https://tools.ietf.org/html/rfc7231#section-4.3.6 and think about the Proxy-Authorization header. Maybe extract that header from params ?

@sjakthol
Copy link
Contributor Author

Thanks for the great comments. I will try to address them as soon as possible.

@sjakthol
Copy link
Contributor Author

On the topic of testing TLS: I tried to come up with a way to test a TLS connection over the proxy but didn't find a way to make it happen with the available tooling.

In order to make a HTTPS connection, I would need to have a forward proxy available that knows how to create a transparent TCP tunnel between two hosts when it receives a CONNECT request. nginx is not suitable for this as it refuses to process CONNECT requests altogether and they never reach the code you would have in the nginx config of the test. Another alternative would be to use the raw tcp_listen directive of Test::Nginx::Socket but I do not believe it is able to create a TCP tunnel to an nginx server and forward the TCP packets between the client and the nginx backend.

That was the reason why https over a proxy was not tested. If you have an idea how to create a forward proxy that understands CONNECT requests and can create a TCP tunnel between the client here and an nginx serving https traffic, I will be more than happy to implement such a case.

Sami Jaktholm and others added 2 commits July 18, 2017 21:33
These changes introduce support for making HTTP requests through a
forward proxy when using the high-level request_uri() API.

The new set_proxy_options() function can be used to configure HTTP proxy
settings. This method takes a table as a parameter that can include the
following fields:
* http_proxy - an URI to a proxy that should be used for http:// requests
* https_proxy - an URI to a proxy that should be used for https:// requests
* no_proxy - a comma separated list of hosts / IPs that should not go through
  the configured proxy

When request_uri() is called with an http:// URI and http_proxy has been
configured for the client instance, the connection will be established to
the proxy host. Once the connection has been established, the HTTP request
is sent to the proxy host as if the peer on the other end of the connection
was the remote server. The only difference to the non-proxy case is that
the path sent to the proxy is actually a full URI instead of a relative
path.

When request_uri() is called with an https:// URI and https_proxy has been
configured for the client instance, the connection will be established to
the proxy host. Once the connection has been established, we go ahead and
perform a CONNECT request to open a TCP tunnel to the remote server. If
the proxy gives a success response, to the CONNECT request, the client
continues to perform the TLS handshake with the remote server and making
the request as if there was no proxy configured.

Some tests have been included which verify that the proxy options are
interpreted correctly in different cases and that http_proxy works as
real proxies expect. This is about as much there is to test without
actually including a real forward proxy in the test harness. Manual
testing has been performed against squid and tinyproxy.

Fixes #63
@sjakthol
Copy link
Contributor Author

sjakthol commented Jul 18, 2017

Rebased on top of master & removed an unneeded semicolon from the code. Will try to get the larger changes done at some point in the coming weeks.

…t_proxy() method

This commit separates the logic that sets up the connection to the proxy server
to a separate connect_proxy() method. This method is provided to users as a
low-level API they can use similarly to the connect() method.

The connect_proxy() will handle the connection establishment to the proxy server
and performs the CONNECT request to setup a TCP tunnel to a https protected host.
Similar to the connect() method, it is then up to the user to take care of the
details that are relevant when using a proxy (i.e use absolute uris for http
requests and perform a TLS handshake for https connections).

There's also a new test case that verifies the CONNECT request is used properly
to establish a tunnel to the remote server when TLS is used. Due to the limitations
of the test framework, this case only considers the format of the outgoing CONNECT
request and how the code handles errors sent by the proxy. Testing a full TLS tunnel
is unfortunately not possible with the tools the test framework provides as it would
require a real reverse proxy or a method of forwarding the TCP connection after the
CONNECT request is received to a real web server that can talk TLS.
@sjakthol
Copy link
Contributor Author

sjakthol commented Sep 2, 2017

Finally managed to refactor the code a bit according to the suggested changes. There's now a low-level method connect_proxy() that is equivalent to the low-level connect() method for connections that use a proxy server. The commit message has more details about this change.

Copy link

@mikz mikz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome 🥇 👍 🎺

@domderen
Copy link

Hey guys, any chance to get this one merged? It perfectly solves the problem that we have, and it would be awesome to use it ;)

@cvillerm
Copy link

I also tested this successfully. Great work! Any hope to have this merged and published soon?

@pintsized
Copy link
Member

Hi all, many thanks for the work and testing on this. I'm really sorry for the slow response (recently started a new job). Looks good at a glance but I just need to find some time to understand the usage properly. I'll look at it as soon as I can.

Thanks for your patience!

@p0pr0ck5
Copy link
Contributor

p0pr0ck5 commented Oct 3, 2017

@sjakthol this is awesome! From what I can see this doesn't seem to cover the proxy_request method, though, as request is still called with a path that is not an absolute URI. I don't think it would be a deal breaker, but if it's not worth covering, should we add a note that the proxy_request call is unaffected by forward proxy configuration?

@@ -192,6 +211,18 @@ Note that calling this instead of `close` is "safe" in that it will conditionall

In case of success, returns `1`. In case of errors, returns `nil, err`. In the case where the conneciton is conditionally closed as described above, returns `2` and the error string `connection must be closed`.

## set_proxy_options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this function be added to the ToC?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good spot, please add to the ToC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link
Member

@pintsized pintsized left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me, although I've left a few small comments. The tests don't pass for me but I suspect for a minor reason. Happy to merge with this stuff resolved.

@@ -192,6 +211,18 @@ Note that calling this instead of `close` is "safe" in that it will conditionall

In case of success, returns `1`. In case of errors, returns `nil, err`. In the case where the conneciton is conditionally closed as described above, returns `2` and the error string `connection must be closed`.

## set_proxy_options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good spot, please add to the ToC.

@@ -914,5 +957,99 @@ function _M.proxy_response(self, response, chunksize)
until not chunk
end

function _M.set_proxy_options(self, opts)
self.proxy_opts = opts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you modify self.proxy_opts in other code paths, please take opts by value so that user tables can be reused without being modified by the module.

e.g.

self.proxy_opts = tbl_copy(opts)  -- Take by value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

local no_proxy_set = {}
-- wget allows domains in no_proxy list to be prefixed by "."
-- e.g. no_proxy=.mit.edu
for host_suffix in self.proxy_opts.no_proxy:gmatch("%.?([^,]+)") do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ngx_lua we're assured access to the ngx.re.* regex functions which can be JIT compiled. At the risk of premature optimisation (I haven't tested and compared) I'm pretty sure gmatch will be quite a bit slower here. Please consider changing these to use ngx.re.gmatch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the logic to use ngx.re functions instead of built-in gmatch. I had to make some changes to the logic here as apparently the built-in gmatch and ngx.re.gmatch use completely different regex syntax. Please check carefully that I did not make any mistakes in the process (at leasts tests are passing for me).

-- matched as either a domain which contains the hostname, or the
-- hostname itself. For example local.com would match local.com,
-- local.com:80, and www.local.com, but not www.notlocal.com.
for pos in host:gmatch("%f[^%z%.]()") do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above

# The reply cannot be successful or otherwise the client would start
# to do a TLS handshake with the proxied host and that we cannot
# do with these sockets
--- tcp_reply
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails for me (on Arch Linux)

#          got: 'CONNECT 127.0.0.1:443 HTTP/1.1
# Host: 127.0.0.1:443
# User-Agent: test_ua
#
# '
#     expected: 'CONNECT 127.0.0.1:443 HTTP/1.1
# User-Agent: test_ua
# Host: 127.0.0.1:443

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh... It seems that the order of headers in the request is not guaranteed.

I changed the test to not care about the order of headers. Or at least I think I managed to do that. Please test if it works now.

@sjakthol
Copy link
Contributor Author

Thanks for the review. I will try to fix those issues you spotted soonish.

This includes:
* Add set_proxy_options to README TOC.
* Make a copy of proxy options gotten from the user.
* Rewrite no_proxy matching to use ngx.re.gmatch and friends instead
  of Lua's built-in gmatch functions. Since the regular expression
  formats are completely different, the no_proxy matching process had
  to be modified a bit.
* Fix 14-host-header.t on systems with ipv6 support by disabling ipv6
  address resolution.
* Make 16-http-proxy.t more reliable by ignoring the order of the
  headers in the CONNECT request (the order is not guaranteed). The
  new check uses a regular expression to check that the CONNECT line
  is correct and that the headers include correct Host in some position.
@sjakthol
Copy link
Contributor Author

I believe the comments made here are now addressed.

Except for the comment by @p0pr0ck5 about proxy_request() not being affected by proxy configuration:
The documentation for set_proxy_opts() states that it only affects requests made through the request_uri() API. I can add additional notes on other methods that might trigger an HTTP request if so desired but I don't really see how useful that would be.

Copy link

@mikz mikz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 amazing work! Can't wait for this to get merged.

@hamishforbes
Copy link
Collaborator

@pintsized I think this should be fine to merge too. It only affects the simple interface and if you haven't configured any proxy stuff there's just a couple of additional is nil conditional checks.
I can't see how it would break any existing users 👍

@p0pr0ck5
Copy link
Contributor

ping @pintsized @hamishforbes, any movement here? we'd love to be able to leverage this :D

@pintsized pintsized merged commit af8b08b into ledgetech:master Dec 23, 2017
@pintsized
Copy link
Member

Merged (finally!) and released via LuaRocks and OPM as v0.12. Thanks again for everyone's input, and patience.


-- Make the connection to the given proxy
local proxy_host, proxy_port = parsed_proxy_uri[2], parsed_proxy_uri[3]
local c, err = self:connect(proxy_host, proxy_port)
Copy link

@mikz mikz Feb 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pintsized @sjakthol It should be possible to create keepalive pool for each unique destination proxy+host.

tcpsocket:connect(host,port,options_table?) supports pool in the options_table.

So setting that to string.format("%s:%s/%s:%s", proxy_host, proxy_port, host, port) should do the trick and not reuse connections between different proxy/host combinations.

If my thinking is right I'm happy to propose a patch :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that solve a real world problem?
I would've thought it would be up to the forwarding proxy to handle keepalives on upstream/downstream connections properly itself.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well yes. Any high performance system has to use keepalives or is going to run out of ephemeral ports very soon.
Doing several thousand requests per second would run through all the ephemeral ports in few seconds making the machine unusable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, keepalives are vital, but does having a unique pool for the proxied destination solve a real world problem? Surely the forwarding proxy is responsible for managing keepalives to the destination?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the proxy will be responsible for connections between proxy and upstream. But the issue is between openresty and the proxy.

openresty  <> proxy <> upstream

Imagine 100 same requests would be made serially to openresty. Openresty would open 100 connections to the proxy. And the proxy when using keepalive would open just one connection to the upstream.

Now imagine 3k requests per second. That burns through ~40k ephemeral ports in less than 15 seconds. Then openresty will no longer be able to open new connections until the previous ones are recycled.

Using keepalives is vital in high performance servers that connect to external tcp services.
Keepalives keep the number of opened connections to N of unique pairs of host:port times number of connections in parallel. Without keepalives new connection is needed for every request.

@mikz mikz mentioned this pull request May 15, 2018
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants