Pandas Column Typing #143
Comments
Yes, this is a good idea! I am not sure who has time to get to it, but if you want to make a PR, feel free!
+1! You can tag me for review. I've dealt with similar issues in my own code.
I have the time right now, but I'm not super familiar with the code. Basically, what's the fastest way to get the table meta given the database and table names? Those seem to be readily available in the existing code. I currently do the following in my code:
Hmmm. Those calls might require the table scanner. You could query the database for the info directly.
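To illustrate the direct-query route, a sketch (the query and the use of `civis.io.read_civis_sql` here are assumptions, not something specified in the comment) could pull the types from Redshift's `information_schema`:

```python
import civis

# Hypothetical sketch: read column names and SQL types straight from
# Redshift's information_schema instead of the Civis table metadata.
sql = """
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'my_schema'
      AND table_name = 'my_table'
"""
type_df = civis.io.read_civis_sql(sql, 'my_database', use_pandas=True)
redshift_types = dict(zip(type_df['column_name'], type_df['data_type']))
```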
The types will be populated without a full table scanner run. You can use something like:

```python
db_id = client.get_database_id(database)
tables = client.tables.list(database_id=db_id, schema=schema, name=table)
columns = client.tables.get(tables[0].id).columns
dtypes = {col.name: redshift_to_py_type(col.sql_type)
          for col in columns}
```
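`redshift_to_py_type` isn't defined in the thread; a minimal sketch of such a helper (the specific type names in the mapping are assumptions) might look like:

```python
def redshift_to_py_type(sql_type):
    """Hypothetical helper: map a Redshift SQL type name to a Python type
    suitable for use as a pandas converter."""
    mapping = {
        'smallint': int,
        'integer': int,
        'bigint': int,
        'decimal': float,
        'real': float,
        'double precision': float,
        'boolean': bool,
        'character': str,
        'character varying': str,
    }
    # Strip any precision suffix, e.g. "decimal(17,9)" -> "decimal"
    base_type = sql_type.split('(')[0].strip().lower()
    return mapping.get(base_type, str)
```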
Here is the function I plan on using. It's initial code, where `table_id` is passed in from the existing `if use_pandas:` block:

```python
table_meta = client.tables.get(table_id)
# The split is to handle e.g. DECIMAL(17,9)
redshift_types = {col['name']: col['sql_type'].split('(')[0]
                  for col in table_meta['columns']}
py_types = {name: SQL_PANDAS_MAP[type_]
            for name, type_ in redshift_types.items()
            if type_ not in DATE_TYPES}
date_cols = [name for name, type_ in redshift_types.items()
             if type_ in DATE_TYPES]
sortkeys = table_meta['sortkeys'].split(',')
# Not sure about the best way to handle null fields
# or how they may appear
data = pd.read_csv(url, parse_dates=date_cols,
                   index_col=sortkeys,
                   converters=py_types)
```
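On the null-field question flagged in the snippet above, one relevant wrinkle (a general pandas behavior, not something settled in the thread) is that an integer column containing nulls can't simply be forced to int: `int('')` raises, and pandas stores missing numeric values as NaN, which requires a float column. A small illustration with hypothetical data:

```python
import io
import pandas as pd

csv = "zipcode,count\n02134,3\n10001,\n"   # second row has a null count

# Mapping Redshift INTEGER straight to int would break here, since int('')
# raises a ValueError; reading nullable integer columns as float is the
# usual fallback, with missing values becoming NaN.
df = pd.read_csv(io.StringIO(csv),
                 converters={'zipcode': str},
                 dtype={'count': float})
print(df.dtypes)              # zipcode: object, count: float64
print(df['count'].tolist())   # [3.0, nan]
```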
Cool. Go ahead and make a branch and start the PR. We can do actual code review there.
This would be a major improvement. The difficulty I've had with this before is that there doesn't seem to be a good way to get the type information for a general query passed into `read_civis_sql`. That said, it would still be useful to implement this for just `read_civis`.
Oof, we might have another problem.

```python
columns = ['my_var = 1']
read_civis('table', 'database', columns=columns)
```

We simply wouldn't have the type information.
Hm. Good point. I'm not sure what the right answer is. We could give a boolean switch to enable type inference when reading with `read_civis`.
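For concreteness, a rough sketch of how such a switch could also cover the computed-columns case above (the `infer_types` name and the `_pandas_type_kwargs` helper are hypothetical, and `SQL_PANDAS_MAP`/`DATE_TYPES` are reused from the earlier snippet):

```python
def _pandas_type_kwargs(client, table_id, columns=None, infer_types=True):
    """Hypothetical helper: build keyword arguments for pd.read_csv from the
    Civis table metadata, falling back to pandas' own type inference when
    the caller opts out or when `columns` contains expressions rather than
    plain column names (e.g. "my_var = 1")."""
    if not infer_types:
        return {}

    table_meta = client.tables.get(table_id)
    redshift_types = {col['name']: col['sql_type'].split('(')[0]
                      for col in table_meta['columns']}

    # If any requested "column" isn't a real column, its type is unknown,
    # so don't force types on the result at all.
    if columns is not None and not set(columns) <= set(redshift_types):
        return {}

    wanted = set(columns) if columns is not None else set(redshift_types)
    return {
        'parse_dates': [name for name, type_ in redshift_types.items()
                        if name in wanted and type_ in DATE_TYPES],
        'converters': {name: SQL_PANDAS_MAP[type_]
                       for name, type_ in redshift_types.items()
                       if name in wanted and type_ not in DATE_TYPES},
    }
```

The `if use_pandas:` branch could then call something like `pd.read_csv(url, **_pandas_type_kwargs(client, table_id, columns))`.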
Overview: Add functionality to get column types (including dates) and sortkeys from the Redshift table metadata for pandas IO.
I modified the client to let me pass arguments to the `pd.read_csv` call on line 227 of `civis/io/_tables.py`. I gather the column types using `client.tables.get`, after which I map the SQL types to their corresponding Python types. Doing this eliminates the automatic str-to-int conversion (e.g. for zip codes) and automatically identifies the date columns. Is there any interest in adding this to the code in a more thoughtful way?
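To make the motivation concrete, here is a small, self-contained illustration (hypothetical data, not taken from the issue) of the two behaviors mentioned above: leading zeros in zip codes are lost under pandas' default inference, and date columns come back as plain strings unless `parse_dates` is used:

```python
import io
import pandas as pd

csv = "zipcode,created_at\n02134,2017-01-05\n10001,2017-02-10\n"

# Default inference: the zip code loses its leading zero and the date stays a string.
default_df = pd.read_csv(io.StringIO(csv))
print(default_df['zipcode'].tolist())        # [2134, 10001]
print(default_df['created_at'].dtype)        # object

# With explicit typing derived from the table metadata, both columns come out right.
typed_df = pd.read_csv(io.StringIO(csv),
                       converters={'zipcode': str},
                       parse_dates=['created_at'])
print(typed_df['zipcode'].tolist())          # ['02134', '10001']
print(typed_df['created_at'].dtype)          # datetime64[ns]
```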