Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommend Counter() in place of defaultdict(int) #323

Closed
peterjc opened this issue Dec 16, 2022 · 7 comments
Closed

Recommend Counter() in place of defaultdict(int) #323

peterjc opened this issue Dec 16, 2022 · 7 comments

Comments

@peterjc
Copy link

peterjc commented Dec 16, 2022

I am suggesting a new check to recommend replacing defaultdict(int) with Counter(), both of which are from the standard library collections, due to the former having a major gotcha with memory usage.

Consider this test case:

from collections import Counter
from collections import defaultdict

# Profiled using;
# $ mprof run defaultdict_leak.py && mprof plot
if True:
    # Apparent memory leak
    # Over 700MB using Python 3.9.13 on macOS
    # Over 900MB using Python 3.9.15 on Linux
    counts = defaultdict(int)
else:
    # Low and flat memory
    # about 16MB using Python 3.9.13 on macOS
    # about 19MB using Python 3.9.15 on Linux
    counts = Counter()

# Setup some sparse data
for x in [12,34,56,78,90]:
    counts[str(x)] = 123 + x

threshold = 100
excesses = 0
for x in range(int(1e7)):
    if threshold < counts[str(x)]:
        excesses += counts[str(x)]
print(f"Sum of entries exceeding the threshold {threshold} is {excesses}")

The apparent memory leak with defaultdict is due to the documented behaviour of recording the missing keys in the dictionary with the default value, https://docs.python.org/3/library/collections.html#collections.defaultdict says:

... provide a default value for the given key, this value is inserted in the dictionary for the key, and returned.

...

When each key is encountered for the first time, it is not already in the mapping; so an entry is automatically created using the default_factory ...

I assume I'm not the only person to have used defaultdict without appreciating this, and in the case of defaultdict(int) there is good alternative in Counter()

peterjc referenced this issue in peterjc/thapbi-pict Dec 16, 2022
This appears to solve an apparent memory leak.
@cooperlees
Copy link
Collaborator

Howdy - Good suggestion. TIL about Counter and I use defaultdict(int) a lot. With it being a dict subclass what fixes the memory? https://docs.python.org/3/library/collections.html#collections.Counter - Couldn't see anything noted here ... I guess the bug is in defaultdict itself and not dict (otherwise there would be leaks all over python as dicts are used everywhere internally).

Has the memory problem been fixed in later pythons (i.e. in >= 3.10) with defaultdict? If not, has a bug been raised? I'm torn with this one as it feels we could get it fixed in cpython, but maybe recommending it could be worth it. Would love others opinions here ...

@peterjc
Copy link
Author

peterjc commented Dec 16, 2022

I've not confirmed on recent Python, but assume that like Python 3.9 they do the same - i.e. they probably behave the same way as per the current documentation. I don't think it is a bug, just a potentially surprising feature.

I can understand if the calling factory function to make default objects were expensive then the current behaviour may have advantages, but with defaultdict(int) then I see no benefit - only a memory leak (the unwanted keys and zero values).

@dejvidq
Copy link
Contributor

dejvidq commented Mar 16, 2023

I'm happy to help with implementing this one. As it's not clear bug and just a recommendation, maybe it should be disabled by default?

@peterjc
Copy link
Author

peterjc commented Mar 16, 2023

If this is accepted as in scope, I agree it falls under the B9## opinionated warnings rather than being on by default. Thanks!

@cooperlees
Copy link
Collaborator

Yup - I'll accept an opinionated for this - Lets just make sure we share the memory savings etc. - Thanks!

dejvidq pushed a commit to dejvidq/flake8-bugbear that referenced this issue Aug 15, 2024
dejvidq pushed a commit to dejvidq/flake8-bugbear that referenced this issue Aug 15, 2024
@dejvidq
Copy link
Contributor

dejvidq commented Aug 15, 2024

Please have a look at #489 if it's what you meant.

I checked the code given below on python 3.10 and 3.12 and in both cases using Counter instead of defaultdict(int) resulted in reducing memory usage from ~1000MB to flat ~20 MB

Btw. I'm sorry it took me so much time to finally have a look at it 😞

dejvidq pushed a commit to dejvidq/flake8-bugbear that referenced this issue Aug 17, 2024
cooperlees pushed a commit that referenced this issue Aug 20, 2024
@cooperlees
Copy link
Collaborator

Implemented in #489

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants