When I publish data to a Pub/Sub topic, I get a memory leak in the Kubernetes pod. #395
Comments
@fodrh1201 Thanks for the report. Would it be possible to provide a minimal, runnable, self-contained code sample? We are not aware of any publisher memory leaks, thus we would first like to rule out any code outside of the client as a possible cause. BTW, one thing that caught my eye is the mutable default argument (preprocess=[]) in the callback constructors. In the code sample the callback is actually instantiated only once, but it might nevertheless be worth checking whether that's also the case in the actual code.
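(A side note on that pitfall, since it comes up often: a mutable default such as preprocess=[] is created once, when the function is defined, and is then shared by every call that falls back to the default. A minimal sketch of the usual None-sentinel idiom, reusing the names from the reported code purely for illustration:)

class Callback:
    def __init__(self, pod_name, is_work_at_stop=False, preprocess=None):
        # Use None as the sentinel instead of a mutable default; otherwise the
        # same list object would be shared by all instances created with the default.
        self.is_work_at_stop = is_work_at_stop
        self.pod_name = pod_name
        self.preprocess = preprocess if preprocess is not None else []

    def __call__(self, data):
        for fn in self.preprocess:
            data = fn(data)
        return data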
This is the Python code:

import json
from google.cloud import pubsub_v1
import asyncio
import logging
from datetime import datetime

class Callback:
    def __init__(self, pod_name, is_work_at_stop=False, preprocess=[]):
        self.is_work_at_stop = is_work_at_stop
        self.pod_name = pod_name
        self.preprocess = preprocess

    def __call__(self, data):
        for preprocess in self.preprocess:
            data = preprocess(data)
        return data


class PublishToPubsubCallback(Callback):
    def __init__(self, pod_name, project_id, topic_id, preprocess=[]):
        super(PublishToPubsubCallback, self).__init__(pod_name, preprocess=preprocess)
        self.project_id = project_id
        self.topic_id = topic_id
        self.publisher = pubsub_v1.PublisherClient()
        self.topic_path = self.publisher.topic_path(project_id, topic_id)

    # @profile
    def __call__(self, data):
        data = super().__call__(data)
        pub_data = json.dumps(data).encode('utf-8')
        self.publisher.publish(self.topic_path, pub_data,
                               origin=self.pod_name, data_type=data['data_type'],
                               symbol=data['symbol'], client_id=data['client_id'],
                               timestamp=str(data['timestamp']),
                               market_type=data['market_type'])


data = {
    'data_type': 'orderbook',
    'client_id': 'binance',
    'symbol': 'BTC/USDT',
    'timestamp': '1619494646034',
    'market_type': 'spot',
    'asks': [[53482.36, 2e-06], [53493.34, 0.025379], [53493.64, 0.01412]],
    'bids': [[53482.35, 2.882399], [53479.18, 0.000373], [53477.2, 0.406]]
}

callback = PublishToPubsubCallback('test', 'bx-datafeed', 'test')


async def run():
    while True:
        callback(data)
        await asyncio.sleep(0.1)


loop = asyncio.get_event_loop()
loop.run_until_complete(run())

This is the result of the above code. My Dockerfile is below.
I ran a slightly modified version of the example locally, with the main code moved into its own function, and measured memory consumption. It appears that the memory is indeed slowly but surely leaking. I can't yet say why; I will investigate further. A possible workaround is to run the publishing code in a subprocess (for a bounded number of iterations) and create a new subprocess when the previous one finishes. Not ideal, of course, but at least the memory will be released at regular intervals.
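A rough sketch of that workaround (it assumes the PublishToPubsubCallback class and the data dict from the report live in a hypothetical my_publisher module; the iteration count is an arbitrary placeholder):

import multiprocessing
import time

# Hypothetical module holding the PublishToPubsubCallback class and the
# sample `data` dict from the report above.
from my_publisher import PublishToPubsubCallback, data


def publish_batch(iterations):
    # Build the client inside the subprocess so that all of its memory is
    # returned to the OS when the subprocess exits.
    callback = PublishToPubsubCallback('test', 'bx-datafeed', 'test')
    for _ in range(iterations):
        callback(data)
        time.sleep(0.1)


if __name__ == "__main__":
    while True:
        worker = multiprocessing.Process(target=publish_batch, args=(10_000,))
        worker.start()
        worker.join()  # memory held by the worker is released when it exits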
It turns out the issue is reproducible even with a straightforward example that repeatedly publishes a trivial message in a loop.

import time

from google.cloud import pubsub_v1

PROJECT_ID = "set-project-id"
TOPIC_ID = "set-topic-id"


def main():
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

    print("Starting the main loop...")
    for i in range(15_000):
        if not i % 1000:
            print(f"{i: 5}: publishing a message")
        publisher.publish(topic_path, b"x")
        time.sleep(0.01)
    print("Main loop complete!")


if __name__ == "__main__":
    main()
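One way to watch the growth from inside the process (a sketch, not the exact measurement used here; tracemalloc only sees Python-level allocations, so comparing its totals against the process RSS helps separate Python allocations from native ones):

import resource
import time
import tracemalloc

from google.cloud import pubsub_v1

PROJECT_ID = "set-project-id"
TOPIC_ID = "set-topic-id"


def main():
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

    tracemalloc.start()
    for i in range(15_000):
        if not i % 1000:
            python_bytes, _ = tracemalloc.get_traced_memory()
            # ru_maxrss is in KiB on Linux (bytes on macOS).
            rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print(f"{i: 5}: Python-tracked={python_bytes / 1024:.0f} KiB, max RSS={rss}")
        publisher.publish(topic_path, b"x")
        time.sleep(0.01)


if __name__ == "__main__":
    main()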
It appears that these allocations are initiated by something outside of Python, maybe gRPC? @lidizheng Do you know if there are currently any confirmed or suspected leaks in gRPC? Do you have any free cycles to take a look in the near future? Is there anything else I can provide on top of the above minimal example? Versions:
@plamut I will take a look today or tomorrow.
@lidizheng This is the memory usage profile from running the minimal example above. The memory usage initially jumps to almost 70 MiB and then gradually increases to almost 79 MiB. I'm attaching the raw profiling data below (renamed so that it can be attached).

Update: the leak is still observable even with the actual publish RPC stubbed out:

diff --git google/pubsub_v1/services/publisher/client.py google/pubsub_v1/services/publisher/client.py
index f74e85a..d560307 100644
--- google/pubsub_v1/services/publisher/client.py
+++ google/pubsub_v1/services/publisher/client.py
@@ -605,7 +605,8 @@ class PublisherClient(metaclass=PublisherClientMeta):
         )
 
         # Send the request.
-        response = rpc(request, retry=retry, timeout=timeout, metadata=metadata,)
+        # response = rpc(request, retry=retry, timeout=timeout, metadata=metadata,)
+        response = pubsub.PublishResponse(message_ids=["123"] * len(request.messages))
 
         # Done; return the response.
         return response

It might be something in the hand-written layer that leaks memory, perhaps some rogue references to all the publish threads created when publishing each batch of messages... quite interesting.
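One crude way to check the "rogue references to publish threads" theory from the outside is to watch thread and tracked-object counts while publishing (a sketch, using the same placeholder project and topic as the minimal example):

import gc
import threading
import time

from google.cloud import pubsub_v1

PROJECT_ID = "set-project-id"
TOPIC_ID = "set-topic-id"


def main():
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

    for i in range(15_000):
        if not i % 1000:
            gc.collect()
            # Count the Thread instances the garbage collector still tracks.
            threads_alive = sum(isinstance(o, threading.Thread) for o in gc.get_objects())
            print(f"{i: 5}: active threads={threading.active_count()}, Thread objects={threads_alive}")
        publisher.publish(topic_path, b"x")
        time.sleep(0.01)


if __name__ == "__main__":
    main()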
Found a possible fix! If the client commit thread is marked as a daemon, the leak disappears. Can't say why, though. The patch:

diff --git google/cloud/pubsub_v1/publisher/_batch/thread.py google/cloud/pubsub_v1/publisher/_batch/thread.py
index 36dd3b9..d0d648e 100644
--- google/cloud/pubsub_v1/publisher/_batch/thread.py
+++ google/cloud/pubsub_v1/publisher/_batch/thread.py
@@ -207,9 +207,10 @@ class Batch(base.Batch):
 
     def _start_commit_thread(self):
         """Start a new thread to actually handle the commit."""
-
+        # NOTE: If the thread is *not* a daemon, a memory leak exists for some reason...
+        # https://github.com/googleapis/python-pubsub/issues/395#issuecomment-829910303
         commit_thread = threading.Thread(
-            name="Thread-CommitBatchPublisher", target=self._commit
+            name="Thread-CommitBatchPublisher", target=self._commit, daemon=True
         )
         commit_thread.start()
 
diff --git google/cloud/pubsub_v1/publisher/client.py google/cloud/pubsub_v1/publisher/client.py
index 7e6801d..330fecb 100644
--- google/cloud/pubsub_v1/publisher/client.py
+++ google/cloud/pubsub_v1/publisher/client.py
@@ -375,8 +375,12 @@ class Client(object):
 
     def _start_commit_thread(self):
         """Start a new thread to actually wait and commit the sequencers."""
+        # NOTE: If the thread is *not* a daemon, a memory leak exists for some reason...
+        # https://github.com/googleapis/python-pubsub/issues/395#issuecomment-829910303
         self._commit_thread = threading.Thread(
-            name="Thread-PubSubBatchCommitter", target=self._wait_and_commit_sequencers
+            name="Thread-PubSubBatchCommitter",
+            target=self._wait_and_commit_sequencers,
+            daemon=True,
         )
         self._commit_thread.start()
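Until a release with that change ships, a possible stopgap (a sketch only, not an officially supported workaround) is to apply the same daemon flag at runtime by overriding the two methods touched by the patch above; the module paths and method names are taken directly from the diff:

import threading

from google.cloud.pubsub_v1.publisher import client as publisher_client
from google.cloud.pubsub_v1.publisher._batch import thread as batch_thread


def _start_commit_thread_daemon(self):
    # Same body as the patched Batch._start_commit_thread, with daemon=True.
    commit_thread = threading.Thread(
        name="Thread-CommitBatchPublisher", target=self._commit, daemon=True
    )
    commit_thread.start()


def _start_sequencer_commit_thread_daemon(self):
    # Same body as the patched Client._start_commit_thread, with daemon=True.
    self._commit_thread = threading.Thread(
        name="Thread-PubSubBatchCommitter",
        target=self._wait_and_commit_sequencers,
        daemon=True,
    )
    self._commit_thread.start()


batch_thread.Batch._start_commit_thread = _start_commit_thread_daemon
publisher_client.Client._start_commit_thread = _start_sequencer_commit_thread_daemon

Applying the overrides before constructing any PublisherClient keeps the behavior consistent for the whole process lifetime.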
For additional context, I found a related CPython issue and the discussion around it. That issue has been closed as a duplicate of another issue, which is currently still open and reportedly affects Python 3.8+. I also did a test run with Python 3.6 and the memory usage was more or less constant, so it appears that the linked CPython issue is indeed the likely culprit.
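A Pub/Sub-free sketch that might isolate the interpreter-level behavior: start lots of short-lived non-daemon threads without joining them, the way the commit threads in the patch above are started, and watch the process RSS; if this is the same problem, flipping daemon to True should make the growth disappear:

import resource
import threading
import time


def noop():
    pass


def main():
    for i in range(100_000):
        # Short-lived threads that are started but never joined; change
        # daemon to True to compare against the workaround.
        threading.Thread(target=noop, daemon=False).start()
        if not i % 5_000:
            time.sleep(0.1)  # give outstanding threads a moment to finish
            # ru_maxrss is in KiB on Linux (bytes on macOS).
            rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print(f"{i: 6}: max RSS = {rss}")


if __name__ == "__main__":
    main()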
@plamut Thank you!! I solved the problem thanks to you.
Each time a publish request is made, the memory usage in the Kubernetes pod increases. If the publishing part is removed, memory does not increase.
The pod is continuously restarted due to OOM.
Environment details
google-cloud-pubsub version: 2.4.1

Code example