Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential memory leak on Jupyter Notebooks ZMQChannelHandler code #6244

Closed
Vishwajeet0510 opened this issue Dec 3, 2021 · 5 comments
Closed

Comments

@Vishwajeet0510
Copy link
Contributor

Vishwajeet0510 commented Dec 3, 2021

Description

We are using Jupyter Enterprise Gateway (v2.1.0) which uses Jupyter Notebook library as a dependency to handle the various API exposed by JEG. When a kernel is started, the Notebook UI tries to establish a Websocket connection to the kernel via JEG using the API /api/kernels/<kernel_id>/channels. The API handler for this on JEG is inherited from the Jupyter Notebook package (ZMQChannelHandler).

In case the kernel id passed in the above request URI is an invalid kernel (i.e. it is not being recognised by the JEG) and the
ZMQChannelHandler::pre_get tries to populate the _open_sessions hashmap with a self object before checking if the kernel actually exists _register_session This seems to create a memory leak if large number of such requests are made to the incorrect kernel and because the Websocket connection was never made the _open_sessions hashmap never gets cleaned up for the stale entries.

Upon debugging we found that, before checking if a kernel is valid or not, a session is being created and if the kernel is invalid that allocated memory is never released.

Steps followed

  • JEG server being run locally.
  • Artillery script to stress the JEG server with invalid requests.

Current behavior and Memory consumption

Current Code (/notebook/services/kernels/handlers.py):

@gen.coroutine
def pre_get(self):
    # authenticate first
    super(ZMQChannelsHandler, self).pre_get()
    # check session collision:
    yield self._register_session()
    # then request kernel info, waiting up to a certain time before giving up.
    # We don't want to wait forever, because browsers don't take it well when
    # servers never respond to websocket connection requests.
    kernel = self.kernel_manager.get_kernel(self.kernel_id)

Memory for JEG server (at start)

Screenshot 2021-12-03 at 7 55 02 PM

Screenshot 2021-12-03 at 7 56 12 PM

OS Version:      macOS 11.6.1 (20G224)
Report Version:  7
Analysis Tool:   /usr/bin/sample

Physical footprint:         36.5M
Physical footprint (peak):  36.5M

After running the script for 60 seconds

All virtual users finished
Summary report @ 20:00:27(+0530) 2021-12-03
  Scenarios launched:  911
  Scenarios completed: 0
  Requests completed:  0
  Mean response/sec: NaN
  Response time (msec):
    min: NaN
    max: NaN
    median: NaN
    p95: NaN
    p99: NaN
  Scenario counts:
    Sample: 911 (100%)
  Errors:
    Unexpected server response: 404: 911

Memory for JEG server (after 60 seconds)

Screenshot 2021-12-03 at 8 00 47 PM

OS Version:      macOS 11.6.1 (20G224)
Report Version:  7
Analysis Tool:   /usr/bin/sample

Physical footprint:         50.6M
Physical footprint (peak):  50.6M

Proposed Solution and Memory Consumption

If we check whether a kernel is valid or not before creating a session, the extra memory that is allocated and un-utilised can be avoided.

Proposed Change (/notebook/services/kernels/handlers.py):

@gen.coroutine
def pre_get(self):
    # authenticate first
    super(ZMQChannelsHandler, self).pre_get()
    # then request kernel info, waiting up to a certain time before giving up.
    # We don't want to wait forever, because browsers don't take it well when
    # servers never respond to websocket connection requests.
    kernel = self.kernel_manager.get_kernel(self.kernel_id)
    # check session collision:
    yield self._register_session()

Memory for JEG server (at start)

Screenshot 2021-12-03 at 8 03 44 PM

Screenshot 2021-12-03 at 8 04 46 PM

OS Version:      macOS 11.6.1 (20G224)
Report Version:  7
Analysis Tool:   /usr/bin/sample

Physical footprint:         36.6M
Physical footprint (peak):  36.6M

After running the script for 60 seconds

All virtual users finished
Summary report @ 20:07:23(+0530) 2021-12-03
  Scenarios launched:  911
  Scenarios completed: 0
  Requests completed:  0
  Mean response/sec: NaN
  Response time (msec):
    min: NaN
    max: NaN
    median: NaN
    p95: NaN
    p99: NaN
  Scenario counts:
    Sample: 911 (100%)
  Errors:
    Unexpected server response: 404: 911

Memory for JEG server (after 60 seconds)

Screenshot 2021-12-03 at 8 08 00 PM

OS Version:      macOS 11.6.1 (20G224)
Report Version:  7
Analysis Tool:   /usr/bin/sample

Physical footprint:         37.7M
Physical footprint (peak):  37.7M

Observations

  • As we see that the memory allocation for a session without checking a kernel being valid may ultimately result to possible un-utilised memory, we think it is better to check the kernel and then initialise a session.

Questions

  • Is the above understanding correct and is the proposed solution feasible?
  • Is there some underlying reason as to why session is being registered first and then kernel is being checked to be valid or not?

Environment

  • Enterprise Gateway Version [v 2.1.0]
  • Notebook Version [v 6.0.3]
  • Others [Artillery : 1.7.9]

Thanks & Regards
Vishwajeet

@rahul26goyal
Copy link

adding on to description: we are trying to understand the following:

  1. wrt https://github.com/jupyter/notebook/blob/6.0.3/notebook/services/kernels/handlers.py#L254 :
@gen.coroutine
    def _register_session(self):
        """Ensure we aren't creating a duplicate session.
        
        If a previous identical session is still open, close it to avoid collisions.
        This is likely due to a client reconnecting from a lost network connection,
        where the socket on our side has not been cleaned up yet.
        """
        self.session_key = '%s:%s' % (self.kernel_id, self.session.session)
        stale_handler = self._open_sessions.get(self.session_key)
        if stale_handler:
            self.log.warning("Replacing stale connection: %s", self.session_key)
            yield stale_handler.close()
        self._open_sessions[self.session_key] = self

self._open_sessions[self.session_key] = self why do we populate the hashmap before checking if the kernel exist...incase the kernel does not exist, the self object stored in the map would never get cleanup?

@kevin-bates
Copy link
Member

Hi @rahul26goyal - I'm sorry for the delayed response.

I believe this particular window really doesn't surface in normal operations (i.e., those not configured with a Gateway). EG introduces additional scenarios in which the notebook server's kernel management "loses track" of its kernels launched from another (potentially remote) server, into potentially other remote servers. That is, it introduces more layers where these kinds of issues can occur.

As noted in the referenced issue, I agree that it makes sense, particularly from a transactional perspective, that the kernel's existence be checked prior to session creation.

Would you like to provide a pull request to address this?

@Vishwajeet0510 Vishwajeet0510 changed the title Potential memory leak on Notebook server Potential memory leak on Jupyter Notebooks ZMQChannelHandler code Dec 8, 2021
@rahul26goyal
Copy link

thanks @kevin-bates ..I totally agree with you that EG introduces additional hops which might be causing this issue on JEG server.
Yes, we are planning to provide the fix to populate the _open_session dict only after checking if the kernel exists.
We just want to understand from the notebook community that this will not cause any additional regression on the overall lifecycle of websocket handler.
Thanks.

@Vishwajeet0510
Copy link
Contributor Author

Hello @kevin-bates

I have raised a PR for this issue. The fix will add a new session only after validation of the kernel. Let us know your views about this.

@kevin-bates
Copy link
Member

Closed via #6251

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants