Fix bug in converting a DH table with all columns of the same type #4566
There is a bug in the `table.to_pandas()` function that causes tables whose columns all have the same data type to be converted incorrectly. Here is an example of the bug:

Here are the resulting tables and metadata:
This is due to a small block of code that creates a set containing all of the column datatypes, checks whether that set has exactly one element, and calls `np.stack()` to collect the data into a single nd-array before calling `pd.DataFrame`. The problem is that `np.stack()` is not made aware of the desired datatypes of its elements, and they are not type-aware automatically, so the resulting nd-array has type `object`. This entire process can be bypassed, as pandas can look at the data and figure out the types correctly, regardless of whether or not all columns are the same.

I suspect that stacking things into a single nd-array, where appropriate, before calling `to_pandas()` was done for efficiency. I am curious what the set creation, checking the length of the set, creating the list to call `np.stack()` on, and calling `np.stack()` cost in terms of efficiency. This whole process adds at least two for-loops through the columns. It is not clear to me that this beats just calling `pd.DataFrame()` on the dict of data without worrying about what the types are. Someone can probably tell me more about this.

Here are the results of the above code with my fix:
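Separately from the Deephaven example, the dtype loss described in this PR can be illustrated with plain NumPy and pandas. This is only a sketch under assumptions: the column names and values are made up, and it assumes the columns are held as pandas Series with a nullable extension dtype (`boolean` here), which may not match exactly how `to_pandas()` holds its data internally.

```python
import numpy as np
import pandas as pd

# Hypothetical column data: two columns that share one nullable dtype.
data = {
    "a": pd.Series([True, False, None], dtype="boolean"),
    "b": pd.Series([False, None, True], dtype="boolean"),
}

# Path 1: stack the columns into one 2-D array, then build the frame.
# The missing values cannot be held in a plain bool array, so the
# stacked array ends up with dtype=object and the column type is lost.
stacked = np.stack([col.to_numpy() for col in data.values()], axis=1)
via_stack = pd.DataFrame(stacked, columns=list(data))

# Path 2: hand pandas the dict directly, so each column keeps its dtype.
direct = pd.DataFrame(data)

print(stacked.dtype)
print(via_stack.dtypes.to_dict())
print(direct.dtypes.to_dict())
```

In this sketch the stacked path reports `object` columns, while the direct path keeps the `boolean` dtype on both columns.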
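On the efficiency question raised above, one rough way to compare the two construction paths is a small `timeit` benchmark. The sketch below is an assumption-laden stand-in, not the Deephaven code: `build_via_stack` and `build_direct` are hypothetical helpers that only mirror the steps described in this PR, and the data shape (50 `float64` columns of 10,000 rows) is arbitrary.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 50 float64 columns of 10,000 rows each.
rng = np.random.default_rng(0)
data = {f"c{i}": rng.random(10_000) for i in range(50)}


def build_via_stack(cols):
    # Mirrors the described steps: collect the dtypes into a set, check
    # for a single dtype, then stack the columns into one 2-D array.
    dtypes = {v.dtype for v in cols.values()}
    if len(dtypes) == 1:
        stacked = np.stack(list(cols.values()), axis=1)
        return pd.DataFrame(stacked, columns=list(cols))
    return pd.DataFrame(cols)


def build_direct(cols):
    # Let pandas build the frame from the dict without inspecting dtypes.
    return pd.DataFrame(cols)


t_stack = timeit.timeit(lambda: build_via_stack(data), number=20)
t_direct = timeit.timeit(lambda: build_direct(data), number=20)
print(f"stack path:  {t_stack:.4f} s for 20 runs")
print(f"direct path: {t_direct:.4f} s for 20 runs")
```

The timings will vary with the number of columns, the dtypes, and the pandas and NumPy versions, so any conclusion should be checked against table shapes that are representative of real use.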