
Possible memory leak when using Google Sheets API #535

Closed
mrc0mmand opened this issue Jun 29, 2018 · 24 comments
Labels
api: sheets (Issues related to the Sheets API) · status: blocked (Resolving the issue is dependent on other work) · type: feature request ('Nice-to-have' improvement, new feature or different behavior or design)

Comments

@mrc0mmand

There seems to be a memory leak when using the google-api-client with GSheets.

Environment:

$ python --version
Python 3.6.4
$ pip show google-api-python-client
Name: google-api-python-client
Version: 1.7.3

Here's a simple reproducer (without a .client_secret.json):

#!/usr/bin/env python3

import httplib2
import os
from apiclient import discovery
from memory_profiler import profile
from oauth2client import client, tools
from oauth2client.file import Storage
from time import sleep

SCOPES = "https://www.googleapis.com/auth/spreadsheets.readonly"
# See https://cloud.google.com/docs/authentication/getting-started
CLIENT_SECRET_FILE = ".client_secret.json"
APPLICATION_NAME = "ClientDebug"
DISCOVERY_URL = "https://sheets.googleapis.com/$discovery/rest?version=v4"

def get_credentials():
    home_dir = os.path.expanduser("~")
    credential_dir = os.path.join(home_dir, ".credentials")
    flags = None
    if not os.path.exists(credential_dir):
        os.makedirs(credential_dir)
    credential_path = os.path.join(credential_dir,
                                "sheets.googleapis.com-clientdebug.json")

    store = Storage(credential_path)
    credentials = store.get()
    if not credentials or credentials.invalid:
        flow = client.flow_from_clientsecrets(CLIENT_SECRET_FILE, SCOPES)
        flow.user_agent = APPLICATION_NAME
        credentials = tools.run_flow(flow, store, flags)

    return credentials

@profile(precision=4)
def get_responses(creds):
    """Fetch spreadsheet data."""
    sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"

    http = creds.authorize(httplib2.Http())
    service = discovery.build("sheets", "v4", http=http,
                discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    result = service.spreadsheets().values().get(
            spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])

    print("Got {} rows".format(len(values)))

if __name__ == "__main__":
    creds = get_credentials()
    for i in range(0, 50):
        get_responses(creds)
        sleep(2)

For measurements I used the memory_profiler module, with the following results:

First and second iteration

Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  26.5195 MiB  26.5195 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38  26.5195 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40  26.5195 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  26.5195 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  29.2891 MiB   2.7695 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  49.5742 MiB  20.2852 MiB       result = service.spreadsheets().values().get(
    44  49.5742 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45  49.5742 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47  49.5742 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))


Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35  49.5742 MiB  49.5742 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38  49.5742 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40  49.5742 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41  49.5742 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42  49.5742 MiB   0.0000 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43  67.9922 MiB  18.4180 MiB       result = service.spreadsheets().values().get(
    44  67.9922 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45  67.9922 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47  67.9922 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))

Last iteration

Got 760 rows
Filename: ./main.py

Line #    Mem usage    Increment   Line Contents
================================================
    35 229.6055 MiB 229.6055 MiB   @profile(precision=4)
    36                             def get_responses(creds):
    37                                 """Fetch spreadsheet data."""
    38 229.6055 MiB   0.0000 MiB       sheet_id = "1TowKJrFVbT4Bfp-HFcMh_CZ5anfH0CLfmoqCz9SUr9c"
    39                             
    40 229.6055 MiB   0.0000 MiB       http = creds.authorize(httplib2.Http())
    41 229.6055 MiB   0.0000 MiB       service = discovery.build("sheets", "v4", http=http,
    42 229.6055 MiB   0.0000 MiB                   discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
    43 229.6055 MiB   0.0000 MiB       result = service.spreadsheets().values().get(
    44 229.6055 MiB   0.0000 MiB               spreadsheetId=sheet_id, range="A1:O").execute()
    45 229.6055 MiB   0.0000 MiB       values = result.get("values", [])
    46                             
    47 229.6055 MiB   0.0000 MiB       print("Got {} rows".format(len(values)))

There's clearly a memory leak, as the reproducer fetches the same data over and over again, yet the memory consumption keeps rising. Full log can be found here.

As a temporary workaround for one of my long-running applications I use an explicit garbage collector call, which mitigates this issue, at least for now:

...
import gc
...
    result = service.spreadsheets().values().get(
            spreadsheetId=sheet_id, range="A1:O").execute()
    values = result.get("values", [])

    gc.collect()
...

I went a little deeper, and the main culprit seems to be the createMethod function when it creates the dynamic batchUpdate method:

Method 'batchUpdate', approx. __doc__ size: 2886834
<class 'function'>
Filename: /home/fsumsal/venv/googleapiclient/lib64/python3.6/site-packages/googleapiclient/discovery.py

Line #    Mem usage    Increment   Line Contents
================================================
  1064     48.7 MiB     48.7 MiB     @profile
  1065                               def _add_basic_methods(self, resourceDesc, rootDesc, schema):
  1066                                 # If this is the root Resource, add a new_batch_http_request() method.
  1067     48.7 MiB      0.0 MiB       if resourceDesc == rootDesc:
...
  1086
  1087                                 # Add basic methods to Resource
  1088     48.7 MiB      0.0 MiB       if 'methods' in resourceDesc:
  1089     66.8 MiB      0.0 MiB         for methodName, methodDesc in six.iteritems(resourceDesc['methods']):
  1090     56.0 MiB      0.0 MiB           fixedMethodName, method = createMethod(
  1091     66.8 MiB     18.1 MiB               methodName, methodDesc, rootDesc, schema)
  1092     66.8 MiB      0.0 MiB           print(type(method))
  1093     66.8 MiB      0.0 MiB           self._set_dynamic_attr(fixedMethodName,
  1094     66.8 MiB      0.0 MiB                                  method.__get__(self, self.__class__))
  1095                                     # Add in _media methods. The functionality of the attached method will
  1096                                     # change when it sees that the method name ends in _media.
  1097     66.8 MiB      0.0 MiB           if methodDesc.get('supportsMediaDownload', False):
  1098                                       fixedMethodName, method = createMethod(
  1099                                           methodName + '_media', methodDesc, rootDesc, schema)
  1100                                       self._set_dynamic_attr(fixedMethodName,
  1101                                                              method.__get__(self, self.__class__))

(This method has a huge docstring.)

Nevertheless, there is probably a reference loop somewhere, as the gc.collect() call manages to collect all those unreachable objects.
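One quick way to confirm the cycle theory: gc.collect() returns the number of unreachable objects it found, so a sketch like the following (building on the reproducer above, with automatic collection disabled so cyclic garbage accumulates visibly) should print a large, recurring count on each iteration:

Code:

import gc

gc.disable()  # stop automatic collection so cyclic garbage accumulates visibly

creds = get_credentials()  # from the reproducer script above
for i in range(5):
    get_responses(creds)
    # A large, recurring count here points at reference cycles rather than
    # ordinary allocations, which refcounting alone would have freed.
    print("iteration {}: collected {} cyclic objects".format(i, gc.collect()))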

@mcdonc (Contributor) commented Jun 29, 2018

I've been able to replicate this with a variant of your script, thanks!

I think it may be possible to cache the result of createMethod so that we don't create a new closure for every call, which would make the memory leak less dramatic (probably not noticeable) and the calls faster. But before I do that, I'll try to identify the actual cycle.
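A rough sketch of that caching idea (not the library's actual code; the JSON-based cache key is an assumption, chosen only because the description arguments are dicts and therefore unhashable):

Code:

import json

_method_cache = {}  # hypothetical module-level cache

def cached_create_method(methodName, methodDesc, rootDesc, schema):
    # Key on the method name plus a stable serialization of its description;
    # rootDesc is keyed by identity since it is shared per service.
    key = (methodName, json.dumps(methodDesc, sort_keys=True), id(rootDesc))
    if key not in _method_cache:
        _method_cache[key] = createMethod(methodName, methodDesc, rootDesc, schema)
    return _method_cache[key]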

mcdonc added the status: investigating label on Jun 29, 2018
@mcdonc (Contributor) commented Jul 10, 2018

Running debugging commentary for future reference:

  • Not a function of the docstring being large; if I change createMethod to not set the docstring onto the dynamically generated method, it still leaks.

  • Causing the generated method to be a noop doesn't fix the leak.

  • Only creating "sheets" once prevents the leak too (if it's created more than
    once, the leak remains):

Code:

creds = get_credentials()
http = creds.authorize(httplib2.Http())
service = discovery.build("sheets", "v4", http=http,
            discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
sheets = service.spreadsheets()
for i in range(0, 50):
    get_responses(sheets)
    sleep(2) # no leak
  • Adding the following to discovery.Resource does not fix the leak:

Code:

def __del__(self):
  for dynamic_attr in self._dynamic_attrs:
    del self.__dict__[dynamic_attr]
  del self._dynamic_attrs
  • The objgraph library fails to show growing references after the first iteration. A further
    frustration is that importing objgraph makes the leak go away, creating a Heisenbug. This
    turns out to be because objgraph imports graphviz, and something in graphviz.. does.. something.
    But even if graphviz doesn't get imported by objgraph, calling show_growth() also causes a
    Heisenbug: calling show_growth() prevents the leak.

Code:

if __name__ == "__main__":
    creds = get_credentials()
    for i in range(0, 100):
        get_responses(creds)
        objgraph.show_growth() # prints nothing
        sleep(1)
  • I added a close method to googleapiclient.discovery.Resource implemented like so, and calling
    this method explicitly against all Resource instances created by the reproducer script
    prevents "the leak" (I put "the leak" in quotes because it still leaks references between
    each iteration, but far fewer, and those references are somehow a cycle between
    HTTP-related objects, which is likely unrelated).

Code:

def close(self):
  for dynamic_attr in self._dynamic_attrs:
    del self.__dict__[dynamic_attr]
  self._dynamic_attrs = []
  • I tried using a weakref to bind methods to resource instances dynamically, but it works a little too well.
    Whenever one of these functions is called through a method chain like
    service.spreadsheets().values(), a ValueError('Cannot call function; instance was purged')
    is raised.

Code:

def _bind_weakmethod(self, attr_name, func):
  instance_ref = weakref.ref(self)
  cls = self.__class__
  def themethod(*arg, **kw):
    instance = instance_ref()
    if instance is None:
      raise ValueError('Cannot call function; instance was purged')
    function = func.__get__(instance, cls)
    return function(*arg, **kw)
  self._set_dynamic_attr(attr_name, themethod)

And then, e.g.:

# Add basic methods to Resource
if 'methods' in resourceDesc:
  for methodName, methodDesc in six.iteritems(resourceDesc['methods']):
    fixedMethodName, method = createMethod(
        methodName, methodDesc, rootDesc, schema)
    self._bind_weakmethod(fixedMethodName, method)
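The failure is inherent to chained calls: in service.spreadsheets().values(), the intermediate Resource is a temporary that nothing holds a strong reference to once the expression moves on, so the weakref dies immediately. A self-contained illustration of the same effect (plain Python, nothing library-specific):

Code:

import weakref

class Node:
    def child(self):
        return Node()

# The objects below are temporaries, like the Resources produced mid-chain by
# service.spreadsheets(): nothing strong references them once the expression ends.
ref = weakref.ref(Node().child())
print(ref())  # None on CPython: the temporary was reclaimed right away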

mcdonc added a commit to mcdonc/google-api-python-client that referenced this issue Jul 10, 2018
…en code causes createMethod to be repeatedly called against the same input params
@mcdonc (Contributor) commented Jul 12, 2018

This is the "leak", a very simple circref. It cannot be avoided without a redesign.

[image: gapi-circref]

It is caused by implementing "dynamic methods" using the descriptor protocol, à la https://github.com/google/google-api-python-client/blob/master/googleapiclient/discovery.py#L1088

If there is a way around this circref, I don't know it.
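The cycle can be reproduced in a few lines, independent of the library: storing a bound method on its own instance (which is what _set_dynamic_attr does with method.__get__(self, self.__class__)) makes the instance reference a method whose __self__ points back at the instance:

Code:

import gc

class Resource:
    pass

def dynamic_method(self):
    return "response"

r = Resource()
# Mimic _set_dynamic_attr: bind the function and store it on the instance.
r.get = dynamic_method.__get__(r, Resource)

assert r.get.__self__ is r  # instance -> __dict__ -> bound method -> instance
del r                       # refcounting alone cannot free the pair...
print(gc.collect())         # ...but the cycle collector can: prints a nonzero count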

@mcdonc (Contributor) commented Jul 14, 2018

Now that we've figured out why we perceive there is a leak: a) is it a problem? And b) if so, what can we do about it?

Is it a problem?

Yes and no.

Theoretically, no, it's not a problem. Reference cycles in programs are normal, and Python's garbage collector will eventually trash the objects involved in cycles. Every time a Resource object is created (by calling methods on other Resource objects), we get some cycles between that Resource and its dynamic methods, and, in theory, this is fine.

Practically, yes, it's a problem. Repeated Resource creation causes the process RSS to bloat, and, on Linux at least, the memory consumed by these references is not given back to the OS due to memory fragmentation, even after the cycles are broken.

What can we do about it?

I've put in some work on a branch (https://github.com/mcdonc/google-api-python-client/tree/fix-createmethod-memleak-535) trying to make the symptoms slightly better.

Try #1 on that branch, which is now a few commits back and isn't represented by the current state of the branch, was caching the functions that become methods on Resource objects, creating only one function per unique input instead of one per call. This is not a reasonable fix, however, because the refs involved in cycles still grow: every time a Resource is instantiated, it binds itself to some number of methods, and even if the functions representing these methods are not repeatedly created, the act of binding the cached methods to each instance still creates cycles.

Try #2, which represents the current state of the branch, dynamically creates and caches one Resource class per set of inputs, instead of just caching the result of dynamic method creation. This stops using the descriptor protocol to bind dynamic methods to instances, so the only circrefs left are the same ones we'd have if each resource type had its own class in sys.modules['googleapiclient.discovery']. The number of circrefs is dramatically reduced, and RSS growth is bounded after the first call of the reproducer script (unlike master, where it grows without bound on each call, although every so often gc kicks in and brings it down a little). According to gc.set_debug(gc.DEBUG_LEAK) under Python 3.6, the length of gc.garbage is 2214 after 40 iterations of the reproducer script's for-loop, instead of master's gargantuan 45218. And I believe we could bring that down more by fixing an unrelated leak.
However, the resulting instances cannot be pickled, which is, I believe, part of their API.
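For reference, that gc.garbage measurement can be reproduced with something along these lines (a sketch; DEBUG_LEAK implies DEBUG_SAVEALL, which keeps collected cyclic objects in gc.garbage instead of freeing them):

Code:

import gc

gc.set_debug(gc.DEBUG_LEAK)  # collected cyclic objects land in gc.garbage

creds = get_credentials()
for i in range(40):          # mirrors the reproducer's for-loop
    get_responses(creds)

gc.collect()
print(len(gc.garbage))       # ~45218 on master, ~2214 on the branch, per above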

So I think we have these options:

  • Cause pickling to no longer be part of the API and use the code on the https://github.com/mcdonc/google-api-python-client/tree/fix-createmethod-memleak-535 branch. If pickling were absolutely necessary, we could create (and instruct people to use) a custom unpickler that generates the dynamic classes for us, by subclassing Unpickler à la https://docs.python.org/3/library/pickle.html#pickle.Unpickler

  • Cause Resource instances to refer to any subresource they create and expose a close() method on resources, which would recursively clear a resource's dynamic methods and those of any subresource, breaking the refcycle (see the sketch after this list). However, this method would need to be explicitly called by library users; there isn't a natural place for us to call it.

  • Do nothing. We punt and tell people to a) create as few resources as possible (don't do more work than is necessary) and b) call gc.collect() after sections of their code that have a side effect of creating lots of resources.
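For the second option, a sketch of what the recursive close() might look like, building on the close() shown earlier and assuming a hypothetical _subresources list that each Resource would append its children to (the library does not track subresources today):

Code:

def close(self):
    # Break the instance -> bound method -> instance cycles on this Resource.
    for dynamic_attr in self._dynamic_attrs:
        del self.__dict__[dynamic_attr]
    self._dynamic_attrs = []
    # Recurse into children; _subresources is hypothetical bookkeeping.
    for child in getattr(self, "_subresources", []):
        child.close()
    self._subresources = []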

@mrc0mmand (Author)

Thank you very much for the thorough analysis. As you said, a complete fix would require some time to implement, so the temporary fix using a cache along with gc.collect() is good enough for now to keep the memory consumption at bay.

Again, thanks!

@mcdonc (Contributor) commented Jul 14, 2018

@mrc0mmand Yes, in your particular case, creating "sheets" only once would make it leak so little that you won't need gc.collect().

mcdonc added the status: acknowledged label and removed status: investigating on Jul 14, 2018
@mcdonc (Contributor) commented Jul 14, 2018

@theacodes can you advise about which one of the options in #535 (comment) is most appropriate?

JustinBeckwith added the priority: p2 label and removed priority: p2+ on Jul 16, 2018
mcdonc added the status: blocked label and removed status: acknowledged on Jul 17, 2018
@hx2A commented Sep 7, 2018

@mcdonc @theacodes If I get a vote, I'd like to see the second option, adding a .close() method. I've spent the past week or so tracking down this same memory error and found my way to this page. As it happens, I specifically looked for close() methods on the Resource objects, because I knew something somewhere wasn't being released.

Adding a .close() method seems cleaner than my having to call gc.collect(). Either way I have to do something to clean up resources, and calling .close() is analogous to what we already do for files and other things.

In any case, this issue should be mentioned in the documentation and sample code, please!

JustinBeckwith added the 🚨 This issue needs some love. label on Dec 26, 2018
sduskis added the type: feature request label and removed the 🚨, priority: p2, and type: bug labels on Feb 5, 2019
@willjhenry

I have a cron job on Google App Engine that reads data in from a Google Sheet. I am noticing the same memory leak (or maybe a different one?). I tried the recommended workarounds: (1) creating the "sheets" object only once, and (2) using gc.collect(). Neither worked in my case. As a test, I changed the few lines of code that read data from a Google Sheet to read from a database table instead, and the memory leak went away.

@dcoleyoung commented Apr 30, 2020

> I have a cron job on Google App Engine that reads data in from a Google Sheet. I am noticing the same memory leak (or maybe a different one?). I tried the recommended workarounds: (1) creating the "sheets" object only once, and (2) using gc.collect(). Neither worked in my case. As a test, I changed the few lines of code that read data from a Google Sheet to read from a database table instead, and the memory leak went away.

Can you clarify your last sentence? Did you ever fix the code, or just confirm the memory leak? I'm in the same situation: an App Engine job that reads a Google Sheet and has started getting "Exceeded soft memory limit" errors. And like you, the garbage collection suggestions did not help.

@willjhenry

I never fixed it... in the short term, I used a high-memory App Engine instance so that it would take longer to hit the memory threshold, and then, as a long-term solution, I switched to Airtable instead of Google Sheets.

@dcoleyoung

> I never fixed it... in the short term, I used a high-memory App Engine instance so that it would take longer to hit the memory threshold, and then, as a long-term solution, I switched to Airtable instead of Google Sheets.

Got it - I'll look into using Airtable instead. Great suggestion. Appreciate the help.

JustinBeckwith added the api: sheets label on Jun 16, 2020
@AmosDinh commented Aug 9, 2020

My solution in the end was just to use "pure" HTTP requests.

@hx2A commented Aug 9, 2020

@AmosDinh Can you elaborate? In my project the memory leak issue has reared its ugly head again and I am looking for new approaches to deal with it.

@AmosDinh commented Aug 9, 2020

@hx2A Well, I don't use the API at all, except for creating credentials. Here is an example from my class.

def buildCredentials(self, refresh=False):
    if refresh:
        self.credentials.refresh(httplib2.Http())
    self.credentials = ServiceAccountCredentials.from_json_keyfile_name(
        "myCreds.json",
        scopes=[
            'https://www.googleapis.com/auth/drive',
            'https://www.googleapis.com/auth/spreadsheets',
        ])
    # service_account_email is assumed to be defined elsewhere in the class.
    delegated_credentials = self.credentials.create_delegated(
        service_account_email)
    access_token = delegated_credentials.get_access_token().access_token
    self.headers = {
        "Authorization": "Bearer " + access_token,
    }

and

def spreadsheetAppend(self, spreadsheetId, list2d, secondcycle=False):
    valueInputOption = "RAW"
    range = '!A:A'  # kept from the original snippet; note it shadows the builtin
    url = f"https://sheets.googleapis.com/v4/spreadsheets/{spreadsheetId}/values/{range}:append?valueInputOption={valueInputOption}"
    data = json.dumps({
        "range": range,
        "majorDimension": "ROWS",
        "values": list2d
    })
    r = requests.post(url, headers=self.headers, data=data)
    if r.status_code > 399:
        print(r, r.headers)
        # On an expired token (401), rebuild credentials once and retry.
        if r.status_code == 401 and not secondcycle:
            self.buildCredentials(refresh=True)
            self.spreadsheetAppend(spreadsheetId, list2d, secondcycle=True)
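For completeness, usage would look roughly like this, assuming the two methods live on a hypothetical SheetsClient class:

Code:

client = SheetsClient()    # hypothetical class holding the two methods above
client.buildCredentials()  # populates self.headers with a bearer token
client.spreadsheetAppend("your-spreadsheet-id", [["2020-08-09", "42"]])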

@hx2A commented Aug 9, 2020

@AmosDinh Thank you, I understand now. This is helpful.

@AmosDinh commented Aug 9, 2020

@hx2A glad I could help you

@AmosDinh commented Aug 9, 2020

Alternatively, you could run the Sheets API code in another process, which you could terminate after execution or once RAM usage hits a certain threshold. Just wanted to include that option.
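A minimal sketch of that process-isolation idea, reusing the names from the reproducer script and assuming the credentials object can be pickled across the process boundary (oauth2client credentials generally can); the OS reclaims everything the child leaked when it exits, regardless of Python-level cycles:

Code:

from multiprocessing import Process
from time import sleep

def worker(creds):
    # All Resource objects (and anything they leak) live and die with
    # this child process.
    get_responses(creds)

if __name__ == "__main__":
    creds = get_credentials()
    for i in range(50):
        p = Process(target=worker, args=(creds,))
        p.start()
        p.join()  # the child's entire address space is returned to the OS here
        sleep(2)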

@hx2A commented Aug 18, 2020

@AmosDinh I'm trying to get this working now, but I keep getting 403 'Forbidden' responses. I believe it has something to do with my service account's roles and permissions. Can you tell me how your service account is configured? My current memory-leaking code doesn't seem to be using the service account, so I need to be sure it is configured correctly.

@AmosDinh commented Aug 18, 2020

Which drive are you accessing? Your personal one, which you can reach by going to https://drive.google.com? In that case, you have to add the service account by its email to a folder as editor/owner; then you can edit or create files in that folder using the service account credentials.

@hx2A commented Aug 18, 2020

I just got it to work. I had multiple problems, but a big part of it was that I needed to share the files on my gdrive with the service account's email address. Thanks!

@marvic2409

I had the same memory leak, so I just sprinkled gc.collect() everywhere and bam, now it's manageable. I doubt that this counts as a fix, though.

@hx2A commented Oct 21, 2020

@marvic2409 you might have a slow memory leak that leaks a small number of MB every hour. For a decent-sized system, it will take some time to become a problem.

@busunkim96 (Contributor)

Circling back to this: the recommendation from #535 (comment), quoted below, is the best way to avoid the leak.

  • Only creating "sheets" once prevents the leak too (if it's created more than
    once, the leak remains):
creds = get_credentials()
http = creds.authorize(httplib2.Http())
service = discovery.build("sheets", "v4", http=http,
            discoveryServiceUrl=(DISCOVERY_URL), cache_discovery=False)
sheets = service.spreadsheets()
for i in range(0, 50):
    get_responses(sheets)
    sleep(2) # no leak

Creating multiple service objects results in (1) potential memory problems and (2) extra time spent refreshing credentials. If you're creating a service object inside a loop, or in a function that's called more than once, move it outside the loop/function so it can be reused.
