BUG: appropriately handle non-string column/field names #899

Closed

Conversation

@jakevdp jakevdp commented May 24, 2018

Within Altair, both the names of data columns and the field names that reference them must be strings. This change more cleanly handles the case where someone passes a dataframe with non-string column names, or specifies fields that are not strings.

  • in parse_shorthand, raise an informative error if the column is not a string or dict
  • in sanitize_dataframe, convert non-string column names to strings

Fixes #898

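For illustration, the kind of usage this enables (a minimal sketch; the specific chart is hypothetical) is referencing auto-generated integer column names as strings:

import altair as alt
import numpy as np
import pandas as pd

# a dataframe built from a raw array gets integer column names (a RangeIndex);
# sanitize_dataframe converts them to strings so they can be referenced below
df = pd.DataFrame(np.random.rand(4, 3))

chart = alt.Chart(df).mark_point().encode(x='0:Q', y='1:Q')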
@jakevdp jakevdp added this to the 2.1 milestone May 24, 2018
jakevdp commented May 24, 2018

The only question in my mind here is whether sanitize_dataframe should convert non-string column names to strings, as I did here, or raise an error when encountering non-string column names.

I lean toward the former: existing code that uses dataframes whose unreferenced columns have non-string names currently works, and it should probably keep working.

jakevdp commented May 24, 2018

Also, as @deisi mentioned in #898, it's useful to be able to use raw data directly without manually re-mapping columns. This lets you use the strings "0", "1", etc. in those situations.

jakevdp commented May 24, 2018

We could go further and silently convert integer field names to strings in parse_shorthand as well, but I'm worried that would be confusing, because it would be hard for the user to tell that it is happening.

brainstorm commented Jun 16, 2018

Regarding sanitize_dataframe: why discard (type) metadata so early in the altair dataframe processing?

You mentioned in #941 that the data is serialized to JSON and that the type info is lost anyway. Isn't it possible for vega-lite to have some field or convention (the same way you do with :Q, :N, :O, :T) for data-type encodings coming from dataframes as well?

I would suspect that having type information only helps code downstream, but perhaps you get rid of it early to avoid opening a strongly typed can of worms? I'm not sure whether type inference, duck typing, or manual re-typing of data downstream is a worse problem to have, or if I'm totally missing the point of that design decision? :-S

Just wondering if changing this handling could help alleviate some of the points in #947.

jakevdp commented Jun 16, 2018

I'm not certain what you're after here... we sanitize dataframes just before converting them to JSON, because if we don't then the json module will raise an error. We don't convert to JSON until the specification is being generated, and that's at render time. I don't think it would be possible to do the sanitization any later in the process.
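To make the constraint concrete, here's a rough sketch of why raw dataframe values can't go straight to json (this is just an illustration of the kind of conversion sanitization has to do, not Altair's actual sanitize_dataframe):

import json
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(3),
                   "t": pd.date_range("2018-01-01", periods=3)})

# json.dumps(df.to_dict(orient="records")) would raise a TypeError here,
# because numpy int64 values and pandas Timestamps are not JSON serializable.
records = [{"x": int(row.x),          # numpy int -> plain Python int
            "t": row.t.isoformat()}   # Timestamp -> ISO 8601 string
           for row in df.itertuples(index=False)]

json.dumps(records)  # now succeeds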

For data passed as a dataframe, we infer vega-lite types and add them to the schema before this sanitization and conversion.

Vega-Lite never actually sees the data (it's just passed down with the compiled schema and interpreted by Vega), so it cannot do any sort of type inference itself.

What other handling do you have in mind?

brainstorm commented Jun 16, 2018

Alright, my bad for not understanding the whole picture then, thanks for clarifying 👍

I was just taking categoricals as an example, thinking that keeping the dataframe types could inform the sorting and ordering of column= in #940 if they were not sanitized to strings... I thought one of the reasons for that shortcoming was data types being lost in the sanitization and/or type-inference process.

As a user I was just surprised to see my pandas dataframe correctly sorted (using categoricals) when I eyeballed it, and then to have the ordering completely changed on the other end in the Altair plot. I would have expected Altair to respect types all the way down to the plot/axis presentation.

jakevdp commented Jun 16, 2018

Yeah, the fundamental issue is that JSON does not have the concept of ordered categoricals, and Vega ingests data as JSON. We could try to do something like infer the sort order from pandas dataframes the same way we infer the data type and then add sort info to the schema in the appropriate place, but that seems a bit too magic to me, and quite likely to lead to unintended side-effects.
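In the meantime, the categorical order can be handed to the encoding explicitly. A minimal sketch, assuming a version of Altair/Vega-Lite recent enough that the sort parameter accepts an explicit list of values:

import altair as alt
import pandas as pd

df = pd.DataFrame({
    "size": pd.Categorical(["small", "large", "medium"],
                           categories=["small", "medium", "large"],
                           ordered=True),
    "count": [4, 1, 3],
})

# pass the categorical order explicitly instead of relying on inference
chart = alt.Chart(df).mark_bar().encode(
    x=alt.X("size:O", sort=list(df["size"].cat.categories)),
    y="count:Q",
)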

@ellisonbg ellisonbg left a comment


Thinking a bit more about this, I am a little hesitant to hardcode the logic to always convert to strings like this. The problem is that str(o) may not be what the user was expecting, and they would have to know exactly what the converted column name is in order to type it into the spec. At the same time, I see why this is a nice default in many cases. An in-between approach would be to put this operation in a separate data transformer that a user could easily remove if it was causing problems, or even replace with custom logic.
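As a sketch of what that in-between approach might look like (illustrative only: the transformer name is made up, and the registry calls and import path assume Altair's data_transformers machinery):

import altair as alt
import pandas as pd
from altair.utils.data import default_data_transformer  # assumed import path

def stringify_columns(data):
    # convert non-string column names to strings, then hand off to the
    # default pipeline; users who dislike this behavior simply don't enable it
    if isinstance(data, pd.DataFrame):
        data = data.rename(columns=str)
    return default_data_transformer(data)

alt.data_transformers.register("stringify_columns", stringify_columns)
alt.data_transformers.enable("stringify_columns")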

@@ -106,6 +107,9 @@ def to_list_if_array(val):
         else:
             return val
 
+    # All column names must be strings
+    df.columns = map(str, df.columns)

If we want to convert column names to strings, we should probably put this line in a try/except that raises an informative error telling the user how to fix the issue.
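Something along these lines, for instance (just a sketch of the suggestion, not the final code):

try:
    df.columns = [str(c) for c in df.columns]
except Exception as err:
    raise ValueError("Unable to convert DataFrame column names to strings; "
                     "please rename the columns manually, e.g. "
                     "df.rename(columns=str)") from err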

jakevdp commented Aug 18, 2018

Yes, good point on this being a bit too eager... A safer option would be to simply error on non-string column names, so the user can do the appropriate thing...

I think the one case that automatic conversion would make sense is when someone is converting an array to a dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(4, 3))

in this case, df.columns is a RangeIndex object. What would you think about automatically converting RangeIndex to string, and erroring in every other case?

Or perhaps simply raising an error on any non-string column name would be the simplest option (explicit is better than implicit, etc.)...
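For concreteness, a rough sketch of the first option (convert a default RangeIndex, reject any other non-string column names); the helper name is hypothetical:

import pandas as pd

def check_columns(df):
    # default integer columns (a RangeIndex) get stringified automatically;
    # any other non-string column name is an error the user must fix
    if isinstance(df.columns, pd.RangeIndex):
        df.columns = [str(c) for c in df.columns]
    elif not all(isinstance(c, str) for c in df.columns):
        raise ValueError("DataFrame column names must be strings")
    return df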

ellisonbg commented Aug 20, 2018 via email

jakevdp commented Aug 21, 2018

OK, I've implemented that in #1107
