Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep TCP/TLS connection alive and close connection properly #3018

Merged
merged 3 commits into from
Jun 2, 2022

Conversation

jirihnidek
Copy link
Contributor

@jirihnidek jirihnidek commented Mar 31, 2022

  • BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2092234
  • The client tools try to keep TCP/TLS connection now
  • The HTTP header "Connection: keep-alive" is sent now and
    when server does not reply with "Connection: close", then
    RHSM tools try to keep existing connection. It helps to
    reduce traffic between client tools and candlepin server.
    It also reduce significantly time of executing all commands.
    E.g. The time of registration is reduced from 20 seconds
    to 13 seconds.
  • Connection using consumer authentication has to be recreated
    during registration, because new consumer certificate is
    installed
  • Tested with rhsm.service and it also works as expected
  • Improved little bit debugging of traffic and it is visible
    what type of authentication is used for communication with
    candlepin server. You can see the difference, when
    "subscription-manager version" is used vs e.g.
    "subscription-manager identity"

@cnsnyder cnsnyder requested review from a team and m-horky and removed request for a team March 31, 2022 12:15
@jirihnidek jirihnidek force-pushed the jhnidek/keep_connection_alive branch 2 times, most recently from 9da8a61 to b2c68dc Compare April 5, 2022 15:41
@jirihnidek
Copy link
Contributor Author

jirihnidek commented Apr 5, 2022

Wireshark shows that the TLS connection is not closed gracefully (client does not send close_notify alert). You can also see that subscription-manager opens two TCP/TLS connection for two REST API calls.
obrazek

Wireshark after fixing the issue. You can see that TLS connection is closed gracefully.
obrazek

@cnsnyder
Copy link
Member

cnsnyder commented Apr 5, 2022

This looks good to me! I'd be curious to hear what other folks think to make sure all involved parties are happy.

@jirihnidek jirihnidek force-pushed the jhnidek/keep_connection_alive branch from b2c68dc to 95c78f1 Compare April 13, 2022 15:03
@jirihnidek jirihnidek force-pushed the jhnidek/keep_connection_alive branch from 95c78f1 to cc2b14d Compare April 25, 2022 17:14
src/rhsm/connection.py Outdated Show resolved Hide resolved
@jirihnidek jirihnidek force-pushed the jhnidek/keep_connection_alive branch from cc2b14d to bdd4d66 Compare April 26, 2022 14:20
Copy link
Contributor

@ptoscano ptoscano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally OK, there are few things to tweak.

# Do proper TLS shutdown handshake (TLS tear down) first
if self.__conn.sock is not None:
try:
self.__conn.sock.unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's disappointing how not even shutdown() on the SSL socket will actually shutdown! python/cpython#63080

unwrap() is documented to return the original socket, so its return value should be assigned back to self.__conn.sock; unwrap() is the opposite of wrap_socket(), which encapsulates a socket object into SSLSocket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that unwrap() returns exactly the same ssl.SSLSocket object as self.__conn.sock.sock is referencing before calling unwrap(). The ssl.SSLSocket is subclass of socket. Thus I don't see any reason to rewriting it.

Comment on lines 166 to 177
if self.consumer_auth_cp is not None:
log.debug("Closing consumer authenticated connection...")
self.consumer_auth_cp.conn.close_connection()
if self.no_auth_cp is not None:
log.debug("Closing no authenticated connection...")
self.no_auth_cp.conn.close_connection()
if self.basic_auth_cp is not None:
log.debug("Closing basically authenticated connection...")
self.basic_auth_cp.conn.close_connection()
if self.keycloak_auth_cp is not None:
log.debug("Closing keycloak authenticated connection...")
self.keycloak_auth_cp.conn.close_connestion()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd make the debug messages here a bit more clear (at least for me); for example, "Closing auth/<something> connection...", where <something> is e.g. "consumer", "none", "basic", "keycloak"

src/rhsmlib/services/register.py Show resolved Hide resolved
@jirihnidek
Copy link
Contributor Author

Some references on Keep-Alive HTTP header:

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive

@@ -281,7 +281,7 @@ def get_server_versions(cp, exception_on_timeout=False):

if cp:
try:
supported_resources = get_supported_resources()
supported_resources = get_supported_resources(uep=cp)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change reduce number of TLS handshakes. Without this change: one no auth connection was used for gathering list of supported resources and then another auth/consumer connection was used fot getting status.

cp = self.cp
else:
cp = self.no_auth_cp
self.server_versions = get_server_versions(cp, exception_on_timeout=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When system is registered, then it is better to use auth/consumer connection than no auth connection.

Copy link
Contributor

@ptoscano ptoscano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly OK; there are few things to change, in addition to what I wrote inline:

  • this must be rebased, otherwise it cannot be tested properly
  • the last commit "Improved debug print a little bit." does not make sense on its own, simply squash it with whichever commit adds that debug statement
  • regarding auth_type: maybe it would be better as Enum, so it is properly typed, there are way less risks of typos in strings, and it is automatically converted to string (reduces the debug code a bit)
  • beware of the max_request_num vs max_requests_num typo! this basically nullifies the whole "max requests" logic...

Comment on lines 859 to 875
if item.startswith("timeout=") is True:
search_result = pattern.search(item)
if search_result is not None:
keep_alive_timeout = search_result.groups()[1]
if keep_alive_timeout.isdecimal() is True:
keep_alive_timeout = int(keep_alive_timeout)
else:
keep_alive_timeout = None
# Maximal number of request on one connection
if item.startswith("max=") is True:
search_result = pattern.search(item)
if search_result is not None:
max_request_num = search_result.groups()[1]
if max_request_num.isdecimal() is True:
max_request_num = int(max_request_num)
else:
max_request_num = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this way of parsing the values is more complex than it should be:

  • there is no need for regexps
  • the check for isdecimal() is redundant, as it the regexp already ensures the value is only numerical

I have a couple of ideas for this code, factorizing it in an helper function:

def parse(text, prefix, default=None):
    found = False
    ret = default
    if text.startswith(prefix):
        val = text[len(prefix):]
        found = True
        try:
            ret = int(val)
        except ValueError:
            pass
    return (ret, found)

to be called as e.g. parse(item, "timeout=").

An alternative version is:

def parse(text, prefix, default=None):
    found = False
    ret = default
    parts = text.split("=", maxsplit=1)
    if len(parts) == 2 and parts[0] == prefix:
        found = True
        try:
            ret = int(parts[1])
        except ValueError:
            pass
    return (ret, found)
``
to be called as e.g. `parse(item, "timeout")`.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated and simplified code in the method. What is wrong with regular expression?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's simply overkill and slower, as the same job can be done without.

Comment on lines 880 to 931
"""

:return:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still undocumented, i'd remove it then (i don't see any added value by this docstring)

@jirihnidek jirihnidek force-pushed the jhnidek/keep_connection_alive branch from 3c72146 to 6cdc435 Compare May 12, 2022 19:15
Copy link
Contributor

@ptoscano ptoscano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is getting better, there are still few things to fix/improve; in addition to the inline notes, there is still the wrong max_request_num typo (missing s).

CONSUMER_CERT_AUTH = enum.auto()

# Connection uses Keycloak token
KEYCLOAK_AUTH = enum.auto
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo for auto()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry. Fixed.

@@ -131,6 +133,24 @@ def __str__(self):
return "Bad certificate at %s" % self.cert_path


class ConnectionType(enum.Flag):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considering the various connection types are mutually exclusive, IMHO the type should be Enum; in addition, if the values will be the description/debug strings, it is possible to use the automatic string representation to avoid the manual string assignment on debug (see later on)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it should be enum.Enum and not enum.Flag. I don't to assign some values to BASIC_AUTH, NO_AUTH, etc. that are used for debug print in one method in another class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't to assign some values to BASIC_AUTH, NO_AUTH, etc. that are used for debug print in one method in another class.

I don't understand this sentence, sorry :-/

What I mean is that you can use strings as values for the enum items, and use them when printing for debug:

>>> class FooEnum(enum.Enum):
...    FOO_1 = "foo_1"
...    FOO_2 = "foo2?"
>>> x = FooEnum.FOO_1
>>> print(x.value)
foo_1

This way, in the (remote) case of adding a new authentication type, it is only needed to add the new enum item without adding it also to debug blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but I don't like it, because you propose to use e.g. "basic auth" as a value, but this value is specific for one method. BTW: This string is not grammatically correct. I just do NOT want to use it in the way you are proposing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jirka, no need to be that drastic :)

Considering the enum was introduced for debugging purposes, it makes sense to use it for debugging. Having a "string representation" as value would make definitely more sense than an integer value for which there is no use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, your are proposing and I'm not accepting. I opened this PR almost 2 months ago. Is it something critical that is blocking this PR from merging to main?

Comment on lines +771 to +793
if self.auth_type == ConnectionType.KEYCLOAK_AUTH:
auth = "keycloak auth"
elif self.auth_type == ConnectionType.BASIC_AUTH:
auth = "basic auth"
elif self.auth_type == ConnectionType.CONSUMER_CERT_AUTH:
auth = "consumer auth"
elif self.auth_type == ConnectionType.NO_AUTH:
auth = "no auth"
else:
auth = "undefined auth"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enum can do a string representation automatically, see above

Comment on lines 786 to 903
else:
body = None

log.debug("Making request: %s %s" % (request_type, handler))

if self.user_agent:
self.headers["User-Agent"] = self.user_agent

final_headers = self.headers.copy()
if body is None:
final_headers["Content-Length"] = "0"
if headers:
final_headers.update(headers)

self._print_debug_info_about_request(request_type, handler, final_headers, body)

@staticmethod
def parse_keep_alive_header(keep_alive_header: str) -> tuple:
"""
Try to parse 'Keep-Alive' header received from candlepin server
:param keep_alive_header: string with value of the header
:return: Tuple containing connection timeout and maximal number of requests
"""
keep_alive_timeout = None
max_requests_num = None
# Regular expression pattern represents: key=number
pattern = re.compile(r"^(.*)=(\d+)$")

items = keep_alive_header.split()

for item in items:
search_result = pattern.search(item)
if search_result is not None:
key, value = search_result.groups()
# Timeout of connection using keep-alive
if key == "timeout":
keep_alive_timeout = int(search_result.groups()[1])
# Maximal number of request on one connection
if key == "max":
max_requests_num = int(search_result.groups()[1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be simplified a bit without a regexp:

--- a/src/rhsm/connection.py
+++ b/src/rhsm/connection.py
@@ -869,21 +869,19 @@ class BaseRestLib(object):
         """
         keep_alive_timeout = None
         max_requests_num = None
-        # Regular expression pattern represents: key=number
-        pattern = re.compile(r"^(.*)=(\d+)$")
 
         items = keep_alive_header.split()
 
         for item in items:
-            search_result = pattern.search(item)
-            if search_result is not None:
-                key, value = search_result.groups()
+            parts = item.split("=", maxsplit=2)
+            if len(parts) == 2:
+                key, value = parts
                 # Timeout of connection using keep-alive
                 if key == "timeout":
-                    keep_alive_timeout = int(search_result.groups()[1])
+                    keep_alive_timeout = safe_int(value)
                 # Maximal number of request on one connection
-                if key == "max":
-                    max_requests_num = int(search_result.groups()[1])
+                elif key == "max":
+                    max_requests_num = safe_int(value)
 
         return keep_alive_timeout, max_requests_num
 

also please use a if/elif/etc chain, so it is a minor optimization

Copy link
Contributor Author

@jirihnidek jirihnidek May 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not use regular expression on server side, but it is client application and we are not going to parse many HTTP headers per second.

BTW: Your proposal does not have problem with following argument: timeout=123=456. It will produce timeout with string "123=465" and it could cause troubles later. I still believe that regular expression is more reliable in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW: Your proposal does not have problem with following argument: timeout=123=456. It will produce timeout with string "123=465" and it could cause troubles later. I still believe that regular expression is more reliable in this case.

I'm not sure I understand: even in my proposal there is the conversion to int, and "123=465" definitely makes int() raise ValueError.

I checked my proposal with the tests you added, and they work fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW: Your proposal does not have problem with following argument: timeout=123=456. It will produce timeout with string "123=465" and it could cause troubles later. I still believe that regular expression is more reliable in this case.

I'm not sure I understand: even in my proposal there is the conversion to int, and "123=465" definitely makes int() raise ValueError.

I'm sorry for confusion. I tested only small snippet of your proposal. safe_int() will not raise exception, but it will return safe_value which is default None in this case. It is probably fine, but floats and negative values should not be IMHO allowed according RFC.

I added more test cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably fine, but floats and negative values should not be IMHO allowed according RFC.

int() already rejects floats. Regarding negative values, it can be an easy addition:

def safe_positive_int(value, safe_value=None):
    num = safe_int(value, safe_value)
    if num is not None and num <= 0:
        return safe_value
    return num

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't like what you are proposing. Please stop enforce me to rewrite my code in different way. There is nothing wrong with regular expression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jirka, no need to be that drastic nor absolute :) We are having a polite conversation on the possible implementation of this, and I'm explaining my ideas with reasons, examples and code snippets. There is noone "enforcing you" in any way, I'm sorry if you feel that way.

I said earlier why I don't think it's ideal to use a regexp: while it works, it's too slow, and can be done even without.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that it is slow, but we talk about client (not server). We do not parse 10000 requests per second. We parse one response per second, when everything goes smoothly.

@ptoscano
Copy link
Contributor

Also, please ensure the proper style of the patches, reindenting with black.

@jirihnidek jirihnidek force-pushed the jhnidek/keep_connection_alive branch 2 times, most recently from 61784f9 to 78b2caf Compare May 16, 2022 13:57
* The client tools try to keep TCP/TLS connection now
* The HTTP header "Connection: keep-alive" is sent now and
  when server does not reply with "Connection: close", then
  RHSM tools try to keep existing connection. It helps to
  reduce traffic between client tools and candlepin server.
  It also reduce significantly time of executing all commands.
  E.g. The time of registration is reduced from 20 seconds
  to 13 seconds
* Connection using consumer authentication has to be recreated
  during registration, because new consumer certificate is
  installed
* Tested with rhsm.service and it also works as expected
* Improved little bit debugging of traffic and it is visible
  what type of authentication is used for communication with
  candlepin server. You can see the difference, when
  "subscription-manager version" is used vs e.g.
  "subscription-manager identity"
* Try to do TLS shutdown handshake (tear down) at the end of
  communication with candlepin server
* Improved debug print of authentication type
* Close all connections, when consumer certs are deleted
* New environment variable SUBMAN_DEBUG_TCP_IP is supported.
  When this env. var. exists, then information about TCP/IP
  is also printed. It helps to track connections.
* TODO:
  - rhsm.service do not try to reuse connections
  - When connection is inactive for too long, then sub-man
    cannot resurrect from this situation.
@jirihnidek jirihnidek force-pushed the jhnidek/keep_connection_alive branch from 78b2caf to de72a7e Compare May 24, 2022 08:35
* When connection is closed by server due to client inactivity
  and exception httplib.RemoteDisconnected is raised, then
  client close existing connections and it tries request
  once again. This happens only in situation, when server
  does not propagate keep-alive timeout using Keep-Alive
  HTTP header and timeout of server is lower than default
  timeout used by client (see following comments).
* We try to use default values of connection timeout 50 seconds
* When timeout is reached, then client closes opened connections
  and new connections are recreated. It minimize repeated
  requests due to client inactivity
* When server sends in response HTTP header Keep-Alive, then
  client tries to parse this HTTP header and timeout and
  max parameters are used
* Use `enum` for type of connections
* Added several unit tests for parsing Keep-Alive HTTP header
* Do not print status messages, when debug printing is
  enabled, because it could lead to mixing of two outputs
@jirihnidek jirihnidek force-pushed the jhnidek/keep_connection_alive branch from de72a7e to d11934d Compare May 24, 2022 09:54
Copy link
Contributor

@m-horky m-horky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything works for me. I can confirm significant speedup of 25% [1]. LGTM.

[1]: I run subscription-manager register --force three times for both main and this feature branch. Average speedup was 25.9%, min 18.4%, max 33.6%.

@m-horky m-horky merged commit 551da68 into main Jun 2, 2022
@m-horky m-horky deleted the jhnidek/keep_connection_alive branch June 2, 2022 08:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants