You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It's surprising that splitobs and DataLoader make views, when they mention only getobs in their docstrings, which does not:
help?> splitobs
splitobs(data; at, shuffle=false) -> Tuple
Split the data into multiple subsets proportional to the value(s) of at.
If shuffle=true, randomly permute the observations before splitting.
Supports any datatype implementing the numobs and getobs interfaces.
[...]
julia>splitobs(ones(1,5); at=0.2)[1]
1×1view(::Matrix{Float64}, :, 1:1) with eltype Float64:1.0
julia>getobs(ones(1,5), 1:2) # what it says it does1×2 Matrix{Float64}:1.01.0
julia>obsview(ones(1,5), 1:2) # what it actually uses1×2view(::Matrix{Float64}, :, 1:2) with eltype Float64:1.01.0
help?> DataLoader
search: DataLoader DataType
DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])
An object that iterates over mini-batches of data, each mini-batch containing batchsize
observations (except possibly the last one).
Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data
object that implements the numobs and getobs methods.
[...]
Looking back through the blame, I'm guessing the idea was to not materialize large datasets. The easiest path would be to change the docs, but that doesn't solve the OneHotArrays issue. Maybe
Yes I'm sure not materialising was the goal. However, some views are more useful than others. I wonder if the default should be something like splitobs(ones(1,10); at=0.5) makes two contiguous views, which are almost as good as Arrays, but splitobs(ones(1,10); at=0.5, shuffle=true) makes copies?
The OneHotArrays issue could be solved on that side, by changing what view does. Or more narrowly by changing what obsview does.
Regardless of solution, the docstring should 100% be updated to mention some type of view is returned with ObsView being the detault.
The OneHotArrays issue could be solved on that side, by changing what view does
How straightforward do you think this would be? Since FluxML/OneHotArrays.jl#40 is mostly about performance, another idea would be adding a kwarg which controls whether a view or copy is returned.
It's surprising that
splitobs
andDataLoader
make views, when they mention onlygetobs
in their docstrings, which does not:This means that they do not preserve OneHotArrays, which is FluxML/OneHotArrays.jl#40 .
But more generally, perhaps copies are just safer?
The text was updated successfully, but these errors were encountered: