[Performance] Data viewer can't handle large DFs #3434

Closed
FranciscoRZ opened this issue Apr 25, 2019 · 14 comments
@FranciscoRZ

Environment data

  • VS Code version: 1.33.1
  • Extension version (available under the Extensions sidebar): 2019.4.1
  • OS and version: Windows 7
  • Python version (& distribution if applicable, e.g. Anaconda): Anaconda distribution, Python 3.6.2
  • Type of virtual environment used (N/A | venv | virtualenv | conda | ...): conda
  • Relevant/affected Python packages and their versions: None

Expected behaviour

View large DataFrames (>1000 columns, >1000 rows) in under 1 minute

Actual behaviour

When opening large DFs (current is 709x3201) the Data Viewer stops at showing the structure with all values at 'loading ...' (current runtime 20 minutes).

Steps to reproduce:

  1. Create synthetic data frame: 3000 series of 700 floats each
  2. In variable explorer click view in data viewer

Logs

Output for Python in the Output panel (View→Output, change the drop-down in the upper-right of the Output panel to Python)

None

Output from Console under the Developer Tools panel (toggle Developer Tools on under Help; turn on source maps to make any tracebacks be useful by running Enable source map support for extension debugging)

Can't find relevant logs. Is 'View in Data Viewer' supposed to show up in the logs at some point?

I was really looking forward to these features, so thanks for getting them in there! However, when dealing with quantitative finance problems we often have very large dataframes, and it would be nice to be able to use the data viewer to explore them.

Best regards,

Francisco

@FranciscoRZ
Author

Update

I tried viewing just a (3226,) pandas.Series and got the following error thrown back:

Error: Failure during variable extraction:

TypeError Traceback (most recent call last)
in
78
79 # Transform this back into a string
---> 80 print(_VSCODE_json.dumps(_VSCODE_targetVariable))
81 del _VSCODE_targetVariable

C:\ProgramData\Anaconda3\envs\DEV64\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
229 cls is None and indent is None and separators is None and
230 default is None and not sort_keys and not kw):
--> 231 return _default_encoder.encode(obj)
232 if cls is None:
233 cls = JSONEncoder

C:\ProgramData\Anaconda3\envs\DEV64\lib\json\encoder.py in encode(self, o)
197 # exceptions aren't as detailed. The list call should be roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)

C:\ProgramData\Anaconda3\envs\DEV64\lib\json\encoder.py in iterencode(self, o, _one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

C:\ProgramData\Anaconda3\envs\DEV64\lib\json\encoder.py in default(self, o)
178 """
179 raise TypeError("Object of type '%s' is not JSON serializable" %
--> 180 o.__class__.__name__)
181
182 def encode(self, o):

TypeError: Object of type 'Timestamp' is not JSON serializable

So I created a DataFrame of 2 series of 700 values with a pandas.DatetimeIndex, and while I can see the values the index is not shown, which would defeat the purpose of viewing DFs when working with time series.
I'm guessing this is why the larger DF is not loaded. Is there a way around this?
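
The failure in the traceback above can be reproduced without pandas. This is a minimal sketch, assuming a plain `datetime.datetime` takes the same path through the encoder that a `pandas.Timestamp` does (Timestamp is a datetime subclass); the exact wording of the error message varies between Python versions.

```python
import datetime
import json

# Assumption: a stdlib datetime stands in for the pandas.Timestamp used as the Series name.
value = {"name": datetime.datetime(2007, 1, 31)}

try:
    json.dumps(value)
    error = None
except TypeError as exc:
    # The json module has no encoding for datetime objects, so it raises TypeError.
    error = exc

print(type(error).__name__, error)
```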

@greazer
Member

greazer commented Apr 25, 2019

Need to investigate.

@rchiodo
Contributor

rchiodo commented Apr 25, 2019

@FranciscoRZ do you have an example piece of code you can share that reproduces the failure?

At least for the DatetimeIndex, it's working fine for me.

Also what version of pandas are you using?

@rchiodo
Contributor

rchiodo commented Apr 25, 2019

For the huge amount of columns, I don't think we'll be able to support it for a while. The control we're using isn't virtualizing columns, just rows, so it adds the 3000 columns into the DOM and the nodejs process runs out of memory.

@FranciscoRZ, I think we're going to have to limit the number of columns displayed.

Here's the code I used to repro the first issue:

import pandas as pd

# Build a DataFrame of 2999 rows x 4999 columns from identical Series
cols = range(1, 5000)
ls = []
for n in range(1, 3000):
    ls.append(pd.Series(data=cols))
df = pd.DataFrame(ls)
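
Limiting the number of displayed columns, as suggested above, could look roughly like the following. This is only a sketch: `split_columns` and its `max_columns` default are hypothetical names and values, not part of the extension.

```python
def split_columns(columns, max_columns=1000):
    """Return (columns to render, number of columns left out)."""
    columns = list(columns)
    shown = columns[:max_columns]
    return shown, len(columns) - len(shown)


# 3000 column labels, of which only the first 1000 would be rendered.
shown, hidden = split_columns(range(3000), max_columns=1000)
print(len(shown), hidden)
```

The function only needs the column labels, so something like `df.columns` could be passed in directly.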

@FranciscoRZ
Author

@rchiodo, thanks for the quick response (and sorry for the late reply, I'm guessing we're in different time zones 😅 )
My pandas version is 0.24.2.

To reproduce the problem with large DFs, I used the following code:

import numpy as np
import pandas as pd
col = pd.Series(data=np.random.random_sample((700,))*100)
dfInit = {}
idx = pd.date_range('2007-01-01', periods=700, freq='M')
for i in range(3000):
     dfInit[i] = col
dfInit['idx'] = idx
df = pd.DataFrame(dfInit).set_index('idx')

--> double click df in variable explorer

From that, I reproduced the problem with the DatetimeIndex as follows:

df2 = df.iloc[:, [0,1]]

--> double click df2 in variable explorer

Here's what I get:

image

Sorry I can't be of more help, I really have no experience / knowledge of web development technologies.

Also, while working yesterday I noticed that as my variable environment grew, the variable explorer started to flicker more and more, and took a while to reload. Is it reevaluating all variables each time a variable is defined? If so, it seems like a really resource-intensive process. I haven't yet come across performance issues I can pin to this, but maybe a "refresh" button in the variable explorer would be more user friendly / resource conscious?
Anyways, just a thought. I know it may seem like I'm nitpicking, but I'm a big fan of the work you guys are doing, so thanks again and good luck with the release! 👍 💪

@rchiodo
Contributor

rchiodo commented Apr 26, 2019

Thanks for the code. That helps. Although I'm confused as to the bug for the second? It's not showing the 'TypeError: Object of type 'Timestamp' is not JSON serializable' anymore? So that part of it is fixed/not reproing?

Thanks for the feedback on the refresh too. You can disable refreshing by just collapsing the variable explorer, but yes, we do refresh every variable after every cell.

It works like a debugger does: every time you step, it updates all of the variables.

We should do some perf testing to see if there's any interference with actually executing cells.

Adding @IanMatthewHuff for an FYI.

@FranciscoRZ
Author

Sorry, I forgot to specify that one.
Working off of the last code snippet, you can reproduce the error as follows:

row = df.loc[df.index[0], :]
--> Double click row in variable explorer

row here is a pandas.Series made up of the row we specified, whose name is the index at which we extracted the row (in our case the first Timestamp of our DatetimeIndex).

I know some pandas attributes are not serializable (I'm pretty sure name is one of them) but the behavior remains somewhat confusing.
Namely, if you set the Series name to a string (row.name = "foo"), the data viewer works and foo shows up as a column name. However, it breaks when name is a Timestamp, whereas a Timestamp index doesn't break it; the index simply isn't shown. This looks to me like inconsistent behavior.
I'm guessing your data viewer requires serializable column names but not values? If so, it seems like a workaround would be to catch this specific error and not show unserializable column names until a permanent fix can be made. In any case, I think if the data viewer really can't handle pandas.Timestamps (such as in the picture above), it would be better to show the user a warning, so as to avoid confusion and more opened issues.
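
A rough sketch of that kind of workaround, assuming the name is checked before the whole variable is serialized. `json_safe_label` is a hypothetical helper, not something in the extension, and it falls back to `str()` rather than hiding the name:

```python
import datetime
import json


def json_safe_label(label):
    """Return label unchanged if json can encode it, else its str() form."""
    try:
        json.dumps(label)
        return label
    except TypeError:
        # Dates, timestamps and other custom objects end up here.
        return str(label)


print(json_safe_label("foo"))
print(json_safe_label(datetime.date(2007, 1, 31)))
```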

@FranciscoRZ
Author

Lastly, I've been playing around with the date type to see if this is a pandas related problem.
First, I converted the Timestamp to native datetime.datetime:
row.name = df.index[0].to_pydatetime()
Then, I used the date attribute of the datetime.datetime object:
row.name = df.index[0].to_pydatetime().date()
And finally I tried a numpy.datetime64:
row.name = np.datetime64(df.index[0])

It's interesting that in the case of numpy and to_pydatetime the error ended with TypeError: Object of type 'Timestamp' is not JSON serializable while in the case of to_pydatetime().date() the error was TypeError: Object of type 'date' is not JSON serializable.
However, they all threw an error, so I'm guessing that as it stands the data viewer can't handle dates at all.

So I took a quick look at the source code. I'm guessing that the Python side of the data viewer's magic happens in getJupyterVariableDataFrameInfo.py and getJupyterVariableDataFrameRows.py.
Since you're already checking the input types for Series and DataFrames by hand here [image] and here [image], maybe you could do a supplementary check for dates in the Series and DataFrames and set the default converter in the json.dumps() call to a custom converter as described here. Basically, if you find dates, the custom converter handles them; otherwise it simply raises TypeError (which is what json.dumps() does when no default argument is given).
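
A minimal sketch of such a converter, using only the standard library. `date_converter` is a hypothetical name and ISO 8601 is an assumed output format; the `default` parameter of `json.dumps()` is the real hook.

```python
import datetime
import json


def date_converter(obj):
    """Fallback for json.dumps: encode dates as ISO 8601 strings."""
    # datetime.datetime is a subclass of datetime.date; both are listed for clarity.
    if isinstance(obj, (datetime.datetime, datetime.date)):
        return obj.isoformat()
    # Anything else keeps the standard behavior of failing loudly.
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


payload = {"name": datetime.date(2007, 1, 31), "values": [1.5, 2.5]}
encoded = json.dumps(payload, default=date_converter)
print(encoded)
```

Note that `default` is only consulted for values. A date used as a dictionary key would still raise TypeError, so keys would need to be converted to strings beforehand.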

Anyways, hope this helps.

@rchiodo
Contributor

rchiodo commented Apr 29, 2019

Yes it does help. Thanks.

I'm going to split out the dataframe index and timestamp problems. They're separate from the too-many-columns problem.

Actually there's already another bug for this:
https://github.com/Microsoft/vscode-python/issues/5452

I'm going to fix the timestamp problem there too. It has to do with timestamps/datetimes having custom string formatting and pandas not using the same values as str() does.

@FranciscoRZ
Author

You're right, best to keep them separate. Thanks, and good luck!

@rchiodo
Contributor

rchiodo commented Apr 29, 2019

@FranciscoRZ the timestamp/index column problem is now fixed. You can try it out if you like from our Insiders build.

The column virtualization (or limitation) is going to take a little longer though.

@FranciscoRZ
Author

Awesome, I will!

Best regards,

Francisco

@rchiodo rchiodo self-assigned this May 23, 2019
@rchiodo
Contributor

rchiodo commented May 31, 2019

I just submitted a fix for the column virtualization. Please feel free to try it out in our next Insiders build (should be ready in about half an hour).

It should support any number of columns and rows, but it will ask if you're sure you want to open the view if there are more than 1000 columns. More than 1000 columns causes the initial bring-up to take a while, and fetching the data can take longer too (it has to turn the rows into a JSON string in order to send them to our UI, which is a function of how VS Code works).

1000 x 10000 DF takes me about 5 minutes to load.

However it also now supports filtering with expressions on numeric columns. Example:

[image: Filter]
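
The two mechanisms described above, the confirmation for wide frames and sending rows to the UI as a JSON string, can be illustrated with a small sketch. The function names, the chunking by row range, and the threshold default are assumptions for illustration, not the extension's actual code.

```python
import json


def needs_confirmation(column_count, threshold=1000):
    """True when the frame has more columns than the (assumed) threshold."""
    return column_count > threshold


def rows_to_json(rows, start, end):
    """Encode rows[start:end] as one JSON string, as if sending a chunk to the UI."""
    return json.dumps(rows[start:end])


# A tiny 5 x 3 table of plain ints stands in for DataFrame rows.
rows = [[r * 10 + c for c in range(3)] for r in range(5)]
chunk = rows_to_json(rows, 1, 3)
print(needs_confirmation(3201), chunk)
```

Serializing only the requested row range keeps each message small, but every cell still passes through the JSON encoder, which is consistent with wide frames taking longer to fetch.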

@FranciscoRZ
Author

That's really impressive, thanks a lot!

Take care,

Francisco

@rchiodo rchiodo closed this as completed Aug 6, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Aug 14, 2019
@microsoft microsoft unlocked this conversation Nov 14, 2020
@DonJayamanne DonJayamanne transferred this issue from microsoft/vscode-python Nov 14, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 7, 2021