Construct frame with columns/index but no data #71

colinalexander
Contributor

Per issue #65, this PR allows construction of a Frame that has no data but has columns, an index, or both.

# Creation of Frame with columns but no data.
>>> sf.Frame(None, index=[], columns=(3,4,5))  # or index=None
<Frame>
<Index>   3        4        5        <int64>
<Index>
<float64> <object> <object> <object>

# Creation of Frame with index but no data.
>>> sf.Frame(None, index=(1,2), columns=[])  # or columns=None
<Frame>
<Index> <float64>
<Index>
1
2
<int64>

@codecov-io

codecov-io commented Jun 19, 2019

Codecov Report

Merging #71 into master will increase coverage by 0.08%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
+ Coverage   94.71%   94.79%   +0.08%     
==========================================
  Files          18       18              
  Lines        5354     5361       +7     
==========================================
+ Hits         5071     5082      +11     
+ Misses        283      279       -4
Impacted Files                      Coverage Δ
static_frame/core/frame.py          93.83% <100%> (+0.03%) ⬆️
static_frame/core/type_blocks.py    94.65% <0%> (+0.43%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72eb2c6...6bbade1.

Collaborator

@brandtbucher brandtbucher left a comment

Thanks for taking on this tricky one! It looks like this PR breaks a couple of things, though, so it may need some more work:

@flexatone
Contributor

Thanks, @colinalexander for this implementation. As @brandtbucher points out, supporting generators as index and column constructors is important.
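Generators matter here because they can be consumed only once and do not support len(). A minimal sketch of the concern, using hypothetical helper names (realize_labels and implied_shape are not part of static-frame): labels are realized into a tuple exactly once, and the shape is derived only afterwards.

```python
from typing import Hashable, Iterable, Optional, Tuple


def realize_labels(labels: Optional[Iterable[Hashable]]) -> Tuple[Hashable, ...]:
    """Hypothetical helper: consume an iterable of labels exactly once."""
    if labels is None:
        return ()
    return tuple(labels)  # works for lists, ranges, and one-shot generators


def implied_shape(index=None, columns=None) -> Tuple[int, int]:
    # Realize both axes first; calling len() on a generator raises TypeError.
    rows = realize_labels(index)
    cols = realize_labels(columns)
    return len(rows), len(cols)


shape = implied_shape(index=(c for c in "abc"), columns=(i for i in range(2)))
print(shape)  # (3, 2)
```

Any code that needs the length of an axis before the labels are realized will either fail or silently exhaust the generator, which is the kind of breakage the review points at.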

But I have a few additional considerations.

If we have a 2D array with one axis of size 0 ("non-fillable"), we cannot store any data, but our array will still have a type. Given that observation:

When data is set to None in the constructor, its meaning is ambiguous for a non-fillable shape. If we are creating a Frame with a non-empty index and columns ("fillable"), we can fill it with None values (or any other element), and the dtype will properly be the dtype of that element. But if the implied shape is non-fillable, supplying a non-default value for data does not make sense, and the fact that data is None should not be interpreted as meaningful or (I think) used to infer the dtype of the non-fillable array. This leads me to two ideas:

The default of data should not be None, but an object() instance created (and stored at the module level) to serve as a default. With this change: (a) creating a fillable array will require setting data (no default None assignment); (b) creating a non-fillable array with a non-default data argument will raise an exception (as we cannot use that value in a meaningful way).

This leaves the question of what type the non-fillable array should be. Here, I think we should follow NumPy, which defaults to float64 in similar scenarios. Thus, I do not think we should use np.full() with data as the fill argument.

In [8]: np.array(())                                                                                                        
Out[8]: array([], dtype=float64)

In [9]: np.empty((0,3))                                                                                                     
Out[9]: array([], shape=(0, 3), dtype=float64)
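The two ideas above could be sketched roughly as follows. This is only an illustration under stated assumptions: DATA_DEFAULT and build_values are hypothetical names, and the function stands in for the array-building part of a constructor rather than for static-frame's actual Frame.__init__.

```python
import numpy as np

# Module-level sentinel meaning "no data argument supplied".
# The name DATA_DEFAULT is hypothetical.
DATA_DEFAULT = object()


def build_values(data=DATA_DEFAULT, *, index=(), columns=()):
    """Hypothetical stand-in for the array-building part of a constructor."""
    shape = (len(index), len(columns))
    fillable = shape[0] > 0 and shape[1] > 0

    if fillable:
        if data is DATA_DEFAULT:
            # (a) a fillable shape requires an explicit fill value
            raise ValueError("data must be supplied for a fillable shape")
        return np.full(shape, data)

    if data is not DATA_DEFAULT:
        # (b) a non-fillable shape cannot make use of a fill value
        raise ValueError("data cannot be used with a non-fillable shape")
    # Follow NumPy's default dtype (float64) for empty arrays.
    return np.empty(shape)


empty = build_values(index=(), columns=(3, 4, 5))
print(empty.shape, empty.dtype)  # (0, 3) float64
```

Using a sentinel rather than None keeps None available as a legitimate fill value for fillable shapes.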

@flexatone
Contributor

Notice also that even the current support for a non-fillable shape (with an index but no columns) is flawed in that it returns a Frame with shape (0, 0), where it should be (3, 0). I think I need to enhance how TypeBlocks handles these cases; then doing this in Frame will be more straightforward.

In [18]: f = sf.Frame(None, index=(1,2,3), columns=None)                 
                                                   
In [20]: f._blocks.shape                                                                                                    
Out[20]: (0, 0)

In [22]: f.shape                                                                                                            
Out[22]: (0, 0)

In [23]: f.values                                                                                                           
Out[23]: array([], shape=(0, 0), dtype=float64)

@colinalexander colinalexander force-pushed the f/construct_frame_with_columns_but_no_data branch from 0e4bae5 to 8c18c4e Compare June 21, 2019 00:19
@colinalexander colinalexander force-pushed the f/construct_frame_with_columns_but_no_data branch from 8c18c4e to 6bbade1 Compare June 21, 2019 01:05
@colinalexander
Contributor Author

@flexatone and @brandtbucher Thanks for your detailed comments and patience. I believe this revised PR achieves most of your objectives without the need to enhance TypeBlocks. My solution is to initially instantiate the blocks with a size of (0, 0) and then to resize it after the creation of any indices.

Regarding the data type, I have filled missing data with np.nan values, which required changes to some of the existing tests. For simplicity, in those tests I replaced the np.nan values with None so that the results could be compared to the original expected values, e.g. ((3, ((1, None), (2, None))), (4, ((1, None), (2, None))), (5, ((1, None), (2, None)))).

>>> Frame(None, index=('A', 'B'), columns=range(3))
<Frame>
<Index> 0         1         2         <int64>
<Index>
A       nan       nan       nan
B       nan       nan       nan
<<U1>   <float64> <float64> <float64>
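Because NaN does not compare equal to itself, nested tuples containing NaN cannot be compared directly with expected values. The substitution described above might look something like this sketch; nan_to_none is a hypothetical helper, not code from the PR.

```python
import math


def nan_to_none(value):
    """Recursively replace float NaN with None inside nested tuples."""
    if isinstance(value, tuple):
        return tuple(nan_to_none(v) for v in value)
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


nested = ((3, ((1, float("nan")), (2, float("nan")))),)
print(nan_to_none(nested))  # ((3, ((1, None), (2, None))),)
```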

Some examples per the new functionality:

>>> Frame(None, index=range(3))
<Frame>
<Index> <float64>
<Index>
0
1
2
<int64>

>>> Frame(None, index=Index(range(3), name='idx'))
<Frame>
<Index>      <float64>
<Index: idx>
0
1
2
<int64>

>>> Frame(None, columns=range(3))
<Frame>
<Index>   0         1         2         <int64>
<Index>
<float64> <float64> <float64> <float64>

I have added some tests per test_frame_init_iter to catch the case raised by Brandt. This checks that Frames are properly created from an index or columns created by an iterable.

I have also modified existing behavior. Currently, a frame that drops all of its columns results in a Frame with shape (0, 0). Now, that new Frame will contain the same index as the original Frame, resulting in a new Frame shape of (3, 0) assuming the original index had a length of 3.
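The same behavior can be seen in plain NumPy, where selecting zero columns from a 2D array keeps the row count. This sketch uses only NumPy and is meant as an analogy for the shape semantics, not as a description of how Frame implements the selection.

```python
import numpy as np

values = np.arange(6).reshape(3, 2)  # 3 rows, 2 columns
no_columns = values[:, :0]           # keep every row, select zero columns

print(no_columns.shape)  # (3, 0)
print(len(no_columns))   # 3
```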

@flexatone
Contributor

Hi @colinalexander.

It turns out that a nice solution for this was available by expanding TypeBlocks.from_none to take a shape argument, and set that as the TypeBlocks shape (even though the TypeBlocks is empty). This permits TypeBlocks, and Frame, to report the correct unfillable shape and length (len of a (3, 0) shape is 3).

Then, in Frame.__init__, we can follow the pattern of defining a blocks_constructor that defers TypeBlocks construction until after index and columns are realized, and that constructor can use from_none. This avoids having to resize after creation.
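A toy illustration of the deferred-construction pattern follows. Everything here is an assumption made for the sketch: MiniFrame is not static-frame's Frame, and this from_none is a NumPy-backed stand-in for a shape-aware TypeBlocks.from_none, whose real implementation is in the linked commit.

```python
import numpy as np


def from_none(shape):
    """Hypothetical stand-in for a shape-aware TypeBlocks.from_none."""
    if 0 not in shape:
        raise ValueError("from_none only handles shapes with an empty axis")
    return np.empty(shape)  # float64 by default; holds no elements


class MiniFrame:
    """Toy container for the data-less case only; not static-frame's Frame."""

    def __init__(self, index=None, columns=None):
        # Defer array construction: the shape is unknown until labels are realized.
        def blocks_constructor(shape):
            return from_none(shape)

        self.index = tuple(index) if index is not None else ()
        self.columns = tuple(columns) if columns is not None else ()
        self.values = blocks_constructor((len(self.index), len(self.columns)))

    @property
    def shape(self):
        return self.values.shape

    def __len__(self):
        return self.shape[0]


f = MiniFrame(index=(i for i in (1, 2, 3)))
print(f.shape, len(f))  # (3, 0) 3
```

Because the constructor is called only after both axes are realized, generators are consumed once and no resize step is needed.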

Apologies for implementing an alternative while you were working on this PR: the approach in Frame became obvious once I updated TypeBlocks.

Please see my changes here:
4928fde

It would be great if you can recast your PR to include the new tests you have authored.

I see also that my solution has not addressed the good issue that you raise: when extracting a Frame with 0 columns the shape goes to (0, 0), regardless of the index. This might need further enhancement to support, and I will create an issue to address it.

In general, I am not sure that putting an unfillable array in TypeBlocks is a good approach, though you have shown that it seems to work.

@colinalexander
Contributor Author

Thanks Chris. What did you want to use as the datatype for None? The current change retains existing functionality (always a good thing) and also allows one to explicitly fill with np.nan. Since you raised the issue, I just thought I'd confirm the direction you'd like to take on this. The alternative would be to fill with NaN values instead of None, but that could be a breaking change.

>>> Frame(None, index=(1,2), columns=(3,4,5))
<Frame>
<Index> 3        4        5        <int64>
<Index>
1       None     None     None
2       None     None     None
<int64> <object> <object> <object>

>>> Frame(np.nan, index=(1,2), columns=(3,4,5))
<Frame>
<Index> 3         4         5         <int64>
<Index>
1       nan       nan       nan
2       nan       nan       nan
<int64> <float64> <float64> <float64>
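The dtype difference between the two outputs above mirrors what NumPy infers from the fill value. A short NumPy-only sketch (it does not use Frame, so it shows the underlying array behavior rather than static-frame's own code path):

```python
import numpy as np

shape = (2, 3)
filled_none = np.full(shape, None)   # dtype inferred from the fill value
filled_nan = np.full(shape, np.nan)

print(filled_none.dtype)  # object
print(filled_nan.dtype)   # float64
```

Switching the default fill from None to NaN would therefore change the dtype of existing results from object to float64, which is why it could be a breaking change.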

I'll just close this PR and create a new one with the additional tests from Frame construction using an iterable.
