-
Notifications
You must be signed in to change notification settings - Fork 795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: appropriately handle non-string column/field names: #899
Conversation
- in parse_shorthand, raise an informative error if the column is not a string or dict - in sanitize_dataframe, convert non-string column names to strings
The only question in my mind here is whether I lean toward the former because existing code that uses dataframes with non-referenced columns with non-string names works, and should probably keep working. |
We could go further and silently convert integer field names to strings as well in |
Regarding You mentioned in #941 that data is serialized to JSON and that type info is lost anyways. Isn't it possible for vega-lite to have some field or convention (same way you do with I would suspect that it only helps code downstream to have type information but perhaps you get rid of it early to avoid opening a strongly typed can of worms? I'm not sure if type inference, duck typing or manual re-typing of data downstream is a worse problem to have or if I'm totally missing the point of that design decision? :-S Just wondering if changing this handling could help alleviating some points in #947. |
I'm not certain what you're after here... we sanitize dataframes just before converting them to JSON, because if we don't then the json module will raise an error. We don't convert to JSON until the specification is being generated, and that's at render time. I don't think it would be possible to do the sanitization any later in the process. For data passed as a dataframe, we infer vega-lite types and add them to the schema before this sanitization and conversion. Vega-Lite never actually sees the data (it's just passed down with the compiled schema and interpreted by Vega), so it cannot do any sort of type inference itself. What other handling do you have in mind? |
Alright, my bad for not understanding the whole picture then, thanks for clarifying 👍 I was just taking the categoricals as an example and thinking that keeping df types could inform sorting and ordering of As a user I was just surprised to see my pandas dataframe correctly sorted (using categoricals) when I eyeballed it and then having the ordering completely changed on the other end in the Altair plot. I would have expected Altair to respect types all the way down to the plot/axis presentation. |
Yeah, the fundamental issue is that JSON does not have the concept of ordered categoricals, and Vega ingests data as JSON. We could try to do something like infer the sort order from pandas dataframes the same way we infer the data type and then add sort info to the schema in the appropriate place, but that seems a bit too magic to me, and quite likely to lead to unintended side-effects. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In thinking a bit more about this. I am a little hesitant to hardcode the logic to convert strings always like this. The problem is that the str(o)
may not be what the user was expecting and they have to know exactly what the column name it to type it into the spec. At the same time, I see why this is a nice default in many case. An in between approach would be to put this operation in a separate data transformer that a user could easily remove if it was causing problems, or even replace with custom logic.
@@ -106,6 +107,9 @@ def to_list_if_array(val): | |||
else: | |||
return val | |||
|
|||
# All column names must be strings | |||
df.columns = map(str, df.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to convert column names to strings, we should probably put this line in a try/expect
that raises an informative error telling the user how to fix the issue.
Yes, good point on this being a bit too eager... A safer option would be to simply error on non-string column names, so the user can do the appropriate thing... I think the one case that automatic conversion would make sense is when someone is converting an array to a dataframe: import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(4, 3)) in this case, Or perhaps simply raising an error on any non-string column name would be the simplest option (explicit is better than implicit, etc.)... |
I think automatically doing the conversion for `RangeIndex` objects does
make sense.
…On Fri, Aug 17, 2018 at 10:14 PM Jake Vanderplas ***@***.***> wrote:
Yes, good point on this being a bit too eager... A safer option would be
to simply error on non-string column names, so the user can do the
appropriate thing...
I think the one case that automatic conversion would make sense is when
someone is converting an array to a dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(4, 3))
in this case, df.columns is a RangeIndex object. What would you think
about automatically converting RangeIndex to string, and erroring in
every other case?
Or perhaps simply raising an error on any non-string column name would be
the simplest option.
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#899 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABr0Fh0FGwB-pJa7fYPCWEet5Xy9RQ4ks5uR3iMgaJpZM4UMnwZ>
.
--
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
[email protected] and [email protected]
|
OK, I've implemented that in #1107 |
Both the names of data columns, and the field names that reference them, must be strings within Altair. This more cleanly addresses the case when someone passes a dataframe with non-string column names or fields that are not strings.
parse_shorthand
, raise an informative error if the column is not a string or dictsanitize_dataframe
, convert non-string column names to stringsFixes #898