[Performance] Data viewer can't handle large DFs #3434

FranciscoRZ · 2019-04-25T08:45:07Z

Environment data

VS Code version: 1.33.1
Extension version (available under the Extensions sidebar): 2019.4.1
OS and version: Windows 7
Python version (& distribution if applicable, e.g. Anaconda): Anaconda distribution, Python 3.6.2
Type of virtual environment used (N/A | venv | virtualenv | conda | ...): conda
Relevant/affected Python packages and their versions: None

Expected behaviour

View large DataFrames (>1000 columns, >1000 rows) in under 1 minute

Actual behaviour

When opening large DFs (current is 709x3201) the Data Viewer stops at showing the structure with all values at 'loading ...' (current runtime 20 minutes).

Steps to reproduce:

Create synthetic data frame: 3000 series of 700 floats each
In variable explorer click view in data viewer

Logs

Output for Python in the Output panel (View→Output, change the drop-down in the upper-right of the Output panel to Python)

None

Output from Console under the Developer Tools panel (toggle Developer Tools on under Help; turn on source maps to make any tracebacks be useful by running Enable source map support for extension debugging)

Can't find relevant logs. Is 'View in Data Viewer' supposed to show up in the logs at some point ?

I was really looking forward to these features, so thanks for getting them in there! However, when dealing with quantitative finance problems we often have very large dataframes, and it would be nice to be able to use the data viewer to explore them.

Best regards,

Francisco


FranciscoRZ · 2019-04-25T09:09:16Z

Update

I tried viewing just a (3226,) pandas.Series and got the following error thrown back:

Error: Failure during variable extraction:

TypeError Traceback (most recent call last)
in
78
79 # Transform this back into a string
---> 80 print(_VSCODE_json.dumps(_VSCODE_targetVariable))
81 del _VSCODE_targetVariable

C:\ProgramData\Anaconda3\envs\DEV64\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
229 cls is None and indent is None and separators is None and
230 default is None and not sort_keys and not kw):
--> 231 return _default_encoder.encode(obj)
232 if cls is None:
233 cls = JSONEncoder

C:\ProgramData\Anaconda3\envs\DEV64\lib\json\encoder.py in encode(self, o)
197 # exceptions aren't as detailed. The list call should be roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)

C:\ProgramData\Anaconda3\envs\DEV64\lib\json\encoder.py in iterencode(self, o, _one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

C:\ProgramData\Anaconda3\envs\DEV64\lib\json\encoder.py in default(self, o)
178 """
179 raise TypeError("Object of type '%s' is not JSON serializable" %
--> 180 o.__class__.__name__)
181
182 def encode(self, o):

TypeError: Object of type 'Timestamp' is not JSON serializable

So I created a DataFrame of 2 series of 700 values with a pandas.DatetimeIndex, and while I can see the values the index is not shown, which would defeat the purpose of viewing DFs when working with time series.
I'm guessing this is why the larger DF is not loaded. Is there a way around this?

greazer · 2019-04-25T21:00:32Z

Need to investigate.

rchiodo · 2019-04-25T21:39:45Z

@FranciscoRZ do you have an example piece of code you can share to create the failing problem?

At least for the datetimeIndex, it's working fine for me.

Also what version of pandas are you using?

rchiodo · 2019-04-25T21:50:18Z

For the huge amount of columns, I don't think we'll be able to support it for a while. The control we're using isn't virtualizing columns, just rows, so it adds the 3000 columns into the DOM and the nodejs process runs out of memory.

@FranciscoRZ, I think we're going to have to limit the number of columns displayed.

Here's the code I used to repro the first issue:

import pandas as pd

# 2999 rows, each a Series holding the 4999 values in cols
cols = range(1, 5000)
ls = []
for n in range(1, 3000):
  ls.append(pd.Series(data=cols))
df = pd.DataFrame(ls)
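
Until the viewer limits or virtualizes columns itself, one possible workaround on the user side is to open a slice of the columns rather than the whole frame. The sketch below is only an illustration: `column_window` and `MAX_COLUMNS` are hypothetical names, and the cap of 1000 is an assumed value, not something the extension defines.

```python
import numpy as np
import pandas as pd

MAX_COLUMNS = 1000  # hypothetical cap, chosen for illustration only


def column_window(df, start=0, width=MAX_COLUMNS):
    """Return at most `width` columns of df, beginning at position `start`."""
    return df.iloc[:, start:start + width]


# Synthetic frame of the same shape as in the report: 700 rows x 3000 columns
df = pd.DataFrame(np.random.random_sample((700, 3000)))

first_block = column_window(df)             # columns 0..999
last_block = column_window(df, start=2500)  # columns 2500..2999
```

Opening `first_block` in the data viewer instead of `df` keeps the number of columns added to the DOM bounded.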

FranciscoRZ · 2019-04-26T07:51:21Z

@rchiodo, thanks for the quick response (and sorry for the late reply, I'm guessing we're in different time zones 😅 )
My pandas version is 0.24.2.

To reproduce the problem with large DFs, I used the following code:

import numpy as np
import pandas as pd
col = pd.Series(data=np.random.random_sample((700,))*100)
dfInit = {}
idx = pd.date_range('2007-01-01', periods=700, freq='M')
for i in range(3000):
    dfInit[i] = col
dfInit['idx'] = idx
df = pd.DataFrame(dfInit).set_index('idx')

--> double click df in variable explorer

From that, I reproduced the problem with the DatetimeIndex as follows:

df2 = df.iloc[:, [0,1]]

--> double click df2 in variable explorer

Here's what I get:

Sorry I can't be of more help, I really have no experience / knowledge of web development technologies.

Also, while working yesterday I noticed that as my variable environment grew, the variable explorer started to flicker more and more, and took a while to reload. Is it reevaluating all variables each time a variable is defined? If so, it seems like a really resource-intensive process. I haven't yet come across performance issues I can pin to this, but maybe a "refresh" button in the variable explorer would be more user friendly / resource conscious?
Anyways, just a thought. I know it may seem like I'm nitpicking, but I'm a big fan of the work you guys are doing, so thanks again and good luck with the release! 👍 💪

rchiodo · 2019-04-26T16:40:07Z

Thanks for the code. That helps. Although I'm confused as to the bug for the second? It's not showing the 'TypeError: Object of type 'Timestamp' is not JSON serializable' anymore? So that part of it is fixed/not reproing?

Thanks for the feedback on the refresh too. You can disable refreshing by just collapsing the variables explorer, but yes we do refresh every variable every cell.

It works like a debugger does, every time you step it updates all of the variables.

We should do some perf testing to see if there's any interference with actually executing cells.

Adding @IanMatthewHuff for an FYI.

FranciscoRZ · 2019-04-29T08:10:52Z

Sorry, I forgot to specify that one.
Working off of the last code snippet, you can reproduce the error as follows:

row = df.loc[df.index[0], :]
--> Double click row in variable explorer

row here is a pandas.Series made up of the row we specified, whose name is the index at which we extracted the row (in our case the first Timestamp of our DatetimeIndex).

I know some pandas attributes are not serializable (I'm pretty sure name is one of them) but the behavior remains somewhat confusing.
Namely, if you set the Series name to a string (row.name = "foo"), the data viewer works and foo shows up as a column name. However, it breaks when name is a Timestamp, whereas a Timestamp index does not break the viewer; the index simply isn't shown. This looks to me like inconsistent behavior.
I'm guessing your data viewer requires serializable column names but not values? If so, it seems like a workaround would be to catch this specific error and not show unserializable column names until a permanent fix can be made. In any case, I think if the data viewer really can't handle pandas.Timestamps (such as in the picture above) it would be better to show the user a warning, to avoid confusion and further issues being opened.

FranciscoRZ · 2019-04-29T09:00:37Z

Lastly, I've been playing around with the date type to see if this is a pandas related problem.
First, I converted the Timestamp to native datetime.datetime:
row.name = df.index[0].to_pydatetime()
Then, I used the date attribute of the datetime.datetime object:
row.name = df.index[0].to_pydatetime().date()
And finally I tried a numpy.datetime64:
row.name = np.datetime64(df.index[0])

It's interesting that in the case of numpy and to_pydatetime the error ended with TypeError: Object of type 'Timestamp' is not JSON serializable while in the case of to_pydatetime().date() the error was TypeError: Object of type 'date' is not JSON serializable.
However, they all threw an error so I'm guessing as it stands the dataviewer can't handle dates at all.

So I took a quick look at the source code. I'm guessing that the Python side of the data viewer's magic happens in getJupyterVariableDataFrameInfo.py and getJupyterVariableDataFrameRows.py.
Since you're already checking by hand the input types for Series and DataFrames here

and here

maybe you could do a supplementary check for dates in the Series and Dataframes and set the default converter in the json.dumps() call to a custom converter as described here. Basically, if you find dates the default would be the custom converter otherwise simply raise TypeError (as is the default behavior of the default argument).
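
A minimal sketch of that suggestion, using only the standard library. `date_default` is a hypothetical name and this is not the extension's actual code. It relies on the fact that pandas.Timestamp subclasses datetime.datetime, which in turn subclasses datetime.date, so a single isinstance check covers all three types.

```python
import datetime
import json


def date_default(obj):
    """Hypothetical `default` hook for json.dumps: ISO-format dates, reject the rest."""
    # datetime.datetime is a subclass of datetime.date, so this covers both.
    if isinstance(obj, datetime.date):
        return obj.isoformat()
    # Mirror the stock encoder's behavior for everything else.
    raise TypeError("Object of type '%s' is not JSON serializable"
                    % type(obj).__name__)


# A stand-in for the variable description the viewer serializes
payload = {"name": datetime.datetime(2007, 1, 31), "values": [1.5, 2.5]}
encoded = json.dumps(payload, default=date_default)
```

Note that json.dumps only calls `default` for values. A date used as a dictionary key would still raise TypeError, so keys would need to be converted to strings beforehand.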

Anyways, hope this helps.

rchiodo · 2019-04-29T17:47:31Z

Yes it does help. Thanks.

I'm going to split out the dataframe index and timestamp problems. They're different from the too many columns not working.

Actually there's already another bug for this:
https://github.com/Microsoft/vscode-python/issues/5452

I'm going to fix the timestamp problem there too. It has to do with timestamps/datetimes having custom string formatting and pandas not using the same values as str() does.

FranciscoRZ · 2019-04-29T17:55:02Z

You're right, best to keep them separate. Thanks, and good luck!

rchiodo · 2019-04-29T22:16:39Z

@FranciscoRZ the timestamp/index column problem is now fixed. You can try it out if you like from our Insiders build.

The column virtualization (or limitation) is going to take a little longer though.

FranciscoRZ · 2019-04-30T08:04:41Z

Awesome, I will!

Best regards,

Francisco

rchiodo · 2019-05-31T15:44:53Z

I just submitted a fix for the column virtualization. Please feel free to try it out in our next Insiders build (should be ready in about half an hour).

It should support any number of columns and rows, but it will ask if you're sure you want to open the view if there are more than 1000 columns. More than 1000 columns causes the initial bring-up to take a while, and fetching the data can take longer too (it has to turn the rows into a JSON string in order to send them to our UI, which is a function of how VS Code works).

1000 x 10000 DF takes me about 5 minutes to load.
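
To illustrate why fetching gets slower with size, here is a rough sketch of turning rows into JSON strings one chunk at a time. How the extension actually batches rows is not described in this thread; `iter_row_chunks` and the chunk size of 100 are assumptions made for the example.

```python
import json


def iter_row_chunks(rows, chunk_size=100):
    """Yield one JSON string per block of at most `chunk_size` rows."""
    for start in range(0, len(rows), chunk_size):
        yield json.dumps(rows[start:start + chunk_size])


# 250 rows of 3 floats each, standing in for DataFrame rows
rows = [[float(r * 3 + c) for c in range(3)] for r in range(250)]
chunks = list(iter_row_chunks(rows))
```

Every value in every row ends up in a string, so the cost of each fetch grows with the number of columns as well as the number of rows.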

However it also now supports filtering with expressions on numeric columns. Example:

FranciscoRZ · 2019-06-03T08:55:40Z

That's really impressive, thanks a lot!

Take care,

Francisco

rchiodo self-assigned this May 23, 2019

rchiodo closed this as completed Aug 6, 2019

lock bot locked as resolved and limited conversation to collaborators Aug 14, 2019

microsoft unlocked this conversation Nov 14, 2020

DonJayamanne transferred this issue from microsoft/vscode-python Nov 14, 2020

github-actions bot locked as resolved and limited conversation to collaborators May 7, 2021
