BUG: appropriately handle non-string column/field names #899

Closed

Conversation

@jakevdp jakevdp commented May 24, 2018

Within Altair, both the names of data columns and the field names that reference them must be strings. This change more cleanly handles the case where someone passes a dataframe with non-string column names, or specifies fields that are not strings.

  • in parse_shorthand, raise an informative error if the column is not a string or dict
  • in sanitize_dataframe, convert non-string column names to strings

Fixes #898

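For illustration, the kind of usage this enables (a minimal sketch; the specific chart is hypothetical) is referencing auto-generated integer column names as strings:

import altair as alt
import numpy as np
import pandas as pd

# a dataframe built from a raw array gets integer column names (a RangeIndex);
# sanitize_dataframe converts them to strings so they can be referenced below
df = pd.DataFrame(np.random.rand(4, 3))

chart = alt.Chart(df).mark_point().encode(x='0:Q', y='1:Q')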
@jakevdp jakevdp added this to the 2.1 milestone May 24, 2018
jakevdp commented May 24, 2018

The only question in my mind here is whether sanitize_dataframe should convert non-string column names to strings, as I did here, or raise an error when encountering non-string column names.

I lean toward the former: existing code that uses dataframes whose unreferenced columns have non-string names currently works, and it should probably keep working.

jakevdp commented May 24, 2018

Also, as @deisi mentioned in #898, it's useful to be able to use raw data directly without manually re-mapping columns. This lets you use the strings "0", "1", etc. in those situations.

jakevdp commented May 24, 2018

We could go further and silently convert integer field names to strings in parse_shorthand as well, but I'm worried that would be confusing, because it would be hard for the user to tell that it is happening.

brainstorm commented Jun 16, 2018

Regarding sanitize_dataframe: why discard (type) metadata so early in the altair dataframe processing?

You mentioned in #941 that the data is serialized to JSON and that the type info is lost anyway. Isn't it possible for vega-lite to have some field or convention (the same way you do with :Q, :N, :O, :T) for data-type encodings coming from dataframes as well?

I would suspect that having type information only helps code downstream, but perhaps you get rid of it early to avoid opening a strongly typed can of worms? I'm not sure whether type inference, duck typing, or manual re-typing of data downstream is a worse problem to have, or if I'm totally missing the point of that design decision? :-S

Just wondering if changing this handling could help alleviate some of the points in #947.

jakevdp commented Jun 16, 2018

I'm not certain what you're after here... we sanitize dataframes just before converting them to JSON, because if we don't then the json module will raise an error. We don't convert to JSON until the specification is being generated, and that's at render time. I don't think it would be possible to do the sanitization any later in the process.
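To make the constraint concrete, here's a rough sketch of why raw dataframe values can't go straight to json (this is just an illustration of the kind of conversion sanitization has to do, not Altair's actual sanitize_dataframe):

import json
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(3),
                   "t": pd.date_range("2018-01-01", periods=3)})

# json.dumps(df.to_dict(orient="records")) would raise a TypeError here,
# because numpy int64 values and pandas Timestamps are not JSON serializable.
records = [{"x": int(row.x),          # numpy int -> plain Python int
            "t": row.t.isoformat()}   # Timestamp -> ISO 8601 string
           for row in df.itertuples(index=False)]

json.dumps(records)  # now succeeds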

For data passed as a dataframe, we infer vega-lite types and add them to the schema before this sanitization and conversion.

Vega-Lite never actually sees the data (it's just passed down with the compiled schema and interpreted by Vega), so it cannot do any sort of type inference itself.

What other handling do you have in mind?

brainstorm commented Jun 16, 2018

Alright, my bad for not understanding the whole picture then, thanks for clarifying 👍

I was just taking categoricals as an example, thinking that keeping the dataframe types could inform the sorting and ordering of column= in #940 if they were not sanitized to strings... I thought one of the reasons for that shortcoming was data types being lost in the sanitization and/or type-inference process.

As a user I was just surprised to see my pandas dataframe correctly sorted (using categoricals) when I eyeballed it, and then to have the ordering completely changed on the other end in the Altair plot. I would have expected Altair to respect types all the way down to the plot/axis presentation.

jakevdp commented Jun 16, 2018

Yeah, the fundamental issue is that JSON does not have the concept of ordered categoricals, and Vega ingests data as JSON. We could try to do something like infer the sort order from pandas dataframes the same way we infer the data type and then add sort info to the schema in the appropriate place, but that seems a bit too magic to me, and quite likely to lead to unintended side-effects.
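In the meantime, the categorical order can be handed to the encoding explicitly. A minimal sketch, assuming a version of Altair/Vega-Lite recent enough that the sort parameter accepts an explicit list of values:

import altair as alt
import pandas as pd

df = pd.DataFrame({
    "size": pd.Categorical(["small", "large", "medium"],
                           categories=["small", "medium", "large"],
                           ordered=True),
    "count": [4, 1, 3],
})

# pass the categorical order explicitly instead of relying on inference
chart = alt.Chart(df).mark_bar().encode(
    x=alt.X("size:O", sort=list(df["size"].cat.categories)),
    y="count:Q",
)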

@ellisonbg ellisonbg left a comment


Thinking a bit more about this, I am a little hesitant to hardcode the logic to always convert to strings like this. The problem is that str(o) may not be what the user was expecting, and they would have to know exactly what the converted column name is in order to type it into the spec. At the same time, I see why this is a nice default in many cases. An in-between approach would be to put this operation in a separate data transformer that a user could easily remove if it was causing problems, or even replace with custom logic.
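As a sketch of what that in-between approach might look like (illustrative only: the transformer name is made up, and the registry calls and import path assume Altair's data_transformers machinery):

import altair as alt
import pandas as pd
from altair.utils.data import default_data_transformer  # assumed import path

def stringify_columns(data):
    # convert non-string column names to strings, then hand off to the
    # default pipeline; users who dislike this behavior simply don't enable it
    if isinstance(data, pd.DataFrame):
        data = data.rename(columns=str)
    return default_data_transformer(data)

alt.data_transformers.register("stringify_columns", stringify_columns)
alt.data_transformers.enable("stringify_columns")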

@@ -106,6 +107,9 @@ def to_list_if_array(val):
         else:
             return val
 
+    # All column names must be strings
+    df.columns = map(str, df.columns)

If we want to convert column names to strings, we should probably put this line in a try/except that raises an informative error telling the user how to fix the issue.
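Something along these lines, for instance (just a sketch of the suggestion, not the final code):

try:
    df.columns = [str(c) for c in df.columns]
except Exception as err:
    raise ValueError("Unable to convert DataFrame column names to strings; "
                     "please rename the columns manually, e.g. "
                     "df.rename(columns=str)") from err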

jakevdp commented Aug 18, 2018

Yes, good point on this being a bit too eager... A safer option would be to simply error on non-string column names, so the user can do the appropriate thing...

I think the one case that automatic conversion would make sense is when someone is converting an array to a dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(4, 3))

in this case, df.columns is a RangeIndex object. What would you think about automatically converting RangeIndex to string, and erroring in every other case?

Or perhaps simply raising an error on any non-string column name would be the simplest option (explicit is better than implicit, etc.)...
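For concreteness, a rough sketch of the first option (convert a default RangeIndex, reject any other non-string column names); the helper name is hypothetical:

import pandas as pd

def check_columns(df):
    # default integer columns (a RangeIndex) get stringified automatically;
    # any other non-string column name is an error the user must fix
    if isinstance(df.columns, pd.RangeIndex):
        df.columns = [str(c) for c in df.columns]
    elif not all(isinstance(c, str) for c in df.columns):
        raise ValueError("DataFrame column names must be strings")
    return df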

ellisonbg commented Aug 20, 2018 via email

jakevdp commented Aug 21, 2018

OK, I've implemented that in #1107
