Slow autocompletion in python/ipython console for large DataFrame containing strings #37947

Closed
flcong opened this issue Nov 18, 2020 · 8 comments
Labels
Dependencies Required and optional dependencies

Comments

@flcong
Contributor

flcong commented Nov 18, 2020

I'm not sure if this is the right place to ask, but autocompletion in the python or ipython console seems especially slow for a large DataFrame that contains strings (object dtype).

For example, consider the following two DataFrames:

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.rand(400000,60))
df2 = pd.DataFrame([["asdfasdf"]*60]*400000)

Typing df1.<TAB> in an interactive python/ipython session completes quickly, but df2.<TAB> takes a very long time to finish (it hangs for many seconds).

I'm not sure whether this is due to pandas handling DataFrames of numbers and DataFrames of strings (object) differently, or to an issue in the interactive python/ipython session.

@jreback
Contributor

jreback commented Nov 19, 2020

Please post the output of pd.show_versions(). This is not an issue on master.

@flcong
Contributor Author

flcong commented Nov 19, 2020

INSTALLED VERSIONS
------------------
commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.8.6.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19041
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.1252

pandas           : 1.1.4
numpy            : 1.19.4
pytz             : 2020.4
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 49.6.0.post20201009
Cython           : 0.29.21
pytest           : 6.1.2
hypothesis       : None
sphinx           : 3.3.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.7
lxml.etree       : 4.6.1
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : 1.3.2
fsspec           : 0.8.4
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.3
sqlalchemy       : 1.3.20
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.51.2

Actually, it only happens if I type df2.<TAB>. If I give an initial letter, it is okay, e.g. df2.a<TAB>.

@jorisvandenbossche
Member

This is also an issue on master.

The reason is that jedi, the package that provides tab completion info for IPython, executes all attributes while gathering this info. Profiling your example shows that df2.T is the one very slow DataFrame attribute causing the slow tab completion (which is also why df2.a<TAB> doesn't have the issue).
See e.g. davidhalter/jedi#1383 for some context (that issue was about deprecation warnings shown when executing attributes, but it has the same root cause).
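
To make the point about attribute execution concrete, here is a rough sketch that evaluates every public attribute of a DataFrame and records how long each one takes. It only approximates the effect described above and is not how jedi works internally. The helper name `time_attribute_access` is made up for this illustration, and the row count is scaled down from the 400000 in the report so the script finishes quickly.

```python
import time
import warnings

import numpy as np
import pandas as pd


def time_attribute_access(obj, names=None):
    """Return {attribute name: seconds spent evaluating getattr(obj, name)}.

    Hypothetical helper for illustration only; it mimics a completer that
    evaluates attributes, which is an assumption about the effect, not a
    description of jedi's internals.
    """
    if names is None:
        names = [n for n in dir(obj) if not n.startswith("_")]
    timings = {}
    with warnings.catch_warnings():
        # Some attributes may emit deprecation warnings when accessed.
        warnings.simplefilter("ignore")
        for name in names:
            start = time.perf_counter()
            try:
                getattr(obj, name)
            except Exception:
                # Missing attributes or optional dependencies: still timed.
                pass
            timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    rows = 20_000  # assumption: smaller than the 400000 rows in the report
    df1 = pd.DataFrame(np.random.rand(rows, 60))
    df2 = pd.DataFrame([["asdfasdf"] * 60] * rows)
    for label, df in (("df1 (numbers)", df1), ("df2 (strings)", df2)):
        timings = time_attribute_access(df)
        slowest = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)[:3]
        print(label, [(name, round(sec, 4)) for name, sec in slowest])
```

Comparing the slowest entries for the two frames should show which attributes dominate; the exact numbers depend on the machine and the pandas version.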

@jorisvandenbossche
Member

In the meantime, jedi has improved (see e.g. davidhalter/jedi#520 (comment)), and I was not using the latest version myself. After upgrading jedi in my local environment from 0.15 to 0.17, the issue mostly went away.

@flcong can you check the version of jedi that you are using? (import jedi; jedi.__version__)
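
If importing jedi directly is inconvenient, the installed version can also be read from package metadata. This is a small sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints None for a package that is not installed in this environment.
for name in ("jedi", "IPython", "pandas"):
    print(name, installed_version(name))
```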

@flcong
Contributor Author

flcong commented Nov 20, 2020

Thank you. I guess it's the latest version, 0.17.2. df2.T<TAB> still gets stuck for several seconds, but I think that's fine.

@hwalinga
Contributor

hwalinga commented Dec 5, 2020

There have been improvements in Jedi, but there are still cases in which it is really slow: davidhalter/jedi#1696

And it seems this won't improve much in the future: davidhalter/jedi#1059 (comment)

Seems like pandas is a bit too complex to handle, and the current implementation of Jedi isn't designed with that in mind.

@jbrockmendel
Member

Is this actionable on our end?

jbrockmendel added the Dependencies (Required and optional dependencies) label Jun 19, 2021
@mroeschke
Member

Sounds like the most recent versions of jedi have somewhat ameliorated this issue, and it's not too evident what pandas could do, since auto-completion is handled by jedi. Closing, but happy to reopen if someone can identify what in pandas would need fixing to improve performance.

6 participants