Converter class does not convert Athena string data to pandas str type #148
Comments
The problem with #118 is that the string type cannot distinguish between empty strings and nulls. Currently, all empty strings are treated as null. The dtype is probably object because the values are represented as Python string objects. I don't plan to implement PySparkCursor.
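A minimal sketch of the ambiguity described above, assuming the query result reaches pandas as CSV text parsed by pandas.read_csv (the CSV payload and column names here are made up for illustration). With default settings, a quoted empty string and a genuinely missing field both come back as missing values:

```python
import io

import pandas as pd

# Hypothetical CSV payload: row 1 holds a quoted empty string,
# row 2 holds a missing value, row 3 holds ordinary text.
csv_text = 'name,n\n"",1\n,2\nabc,3\n'

df = pd.read_csv(io.StringIO(csv_text))

# Per-row flag: is the "name" value treated as missing?
name_is_null = df["name"].isna().tolist()
print(name_is_null)
```

Once both cases have collapsed into the same missing marker, no dtype conversion afterwards can tell them apart again.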
That's my understanding too.
A
Thanks for the information. I'll look at the source code and see what I can come up with.
https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values
Good find. It seems though, that -
Any thoughts on a potential solution? Pinging @EdwardJRoss since he seems to understand the codebase better than I do.
@krishanunandy As noted upthread in Pandas the canonical way of representing strings, before the dedicated

>>> pd.Series(['a', 'b'], dtype=str).reindex([0, 1, 2])
0      a
1      b
2    NaN
dtype: object

Pandas of course is developing the new String dtype which is the ultimate solution here. You could change your custom converter to use "string" instead of str and use the new dtype. However, given your use case, it is relevant that through version 3.0, Spark has no support for pd.NA in any of the new nullable dtypes (including Int64, which PyAthena supports). See SPARK-30966 in the Spark Jira for the ticket on this. Realistically you're probably looking at Spark 3.1... maybe early 2021?... at the earliest.

My suggestion would just be to write a wrapper method that calls .as_pandas(), takes the result, loops over the cursor description to get the type of each column, and runs
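A rough sketch of such a wrapper, under stated assumptions: the per-column step is taken to be a fillna("") on string columns (chosen because the goal is handing the frame to Spark), the description is a DB-API style sequence of tuples whose first two items are the column name and type name, and the type names in STRING_TYPES are a guess at what Athena reports for string-like columns. The function name fill_string_nulls is hypothetical and not part of PyAthena.

```python
import pandas as pd

# Assumption: type names reported for string-like columns.
STRING_TYPES = {"varchar", "char", "string"}


def fill_string_nulls(df, description, fill_value=""):
    """Return a copy of df with nulls in string-typed columns replaced."""
    out = df.copy()
    for column in description:
        # DB-API description entries start with (name, type_code, ...).
        name, type_code = column[0], column[1]
        if type_code in STRING_TYPES and name in out.columns:
            out[name] = out[name].fillna(fill_value)
    return out


# Stand-ins for a real query result and its cursor description.
description = [
    ("name", "varchar", None, None, 0, 0, "UNKNOWN"),
    ("n", "integer", None, None, 10, 0, "UNKNOWN"),
]
df = pd.DataFrame({"name": ["a", None, "c"], "n": [1, 2, 3]})

result = fill_string_nulls(df, description)
print(result)
```

With a live cursor the call would presumably look like fill_string_nulls(cursor.as_pandas(), cursor.description). Note that this turns every null string into an empty string, so it only suits cases where that distinction does not matter downstream.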
First of all, thank you for creating this library! It's been immensely helpful and I've used it in multiple contexts over several years and would love to contribute - especially if it helps solve my current problem!
With pyathena=1.10.7 and pandas=1.0.5 I am running the following code with the expectation that the converter class will cast the Athena string data type as a str pandas dtype.

When I inspect the dtypes, Athena ints are converted to pandas ints, decimals are converted to floats and strings are consistently returned as object dtypes. However, Athena string NULLs are cast as NaNs, which require explicit column-by-column fillna operations. This is particularly inconvenient, since I'm trying to subsequently convert the pandas dataframe to a Spark dataframe. Now that I've typed all this out, I'm guessing this is related to #118?

Also, I'm not sure where the right place to ask this is, but are there any plans to implement a PySparkCursor for PyAthena? If not, can I help by contributing?