Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use value indexing by default #84

Open
andreasnoack opened this issue May 17, 2017 · 17 comments
Open

Use value indexing by default #84

andreasnoack opened this issue May 17, 2017 · 17 comments

Comments

@andreasnoack
Copy link
Member

In the single use case I have for AxisArrays right now, it would be very convenient if the default indexing of an AxisArray used the values of the indices instead of the indices of the indices. E.g. if I do

x = AxisArray(Axis{:x}(-3:3))
x[[-2,1]]

it would be handy if it returned the second and fifth element of the AxisVector. You can always refer to the indices of the indices by using the Axis type. Could this work?

@mlubin
Copy link

mlubin commented Sep 11, 2017

I've started hacking on this on a fork here.

@mbauman
Copy link
Member

mbauman commented Jan 18, 2018

I'm coming to the conclusion that supporting both by-axis-value and by-axis-index within the A[] is fraught with trouble. I think it's crucial to look at prior art here and decisions others have made.

Pandas used to support four different ways of indexing into their dataframes: df[], df.ix[], df.iloc, and df.loc. iloc is by index and loc is by value. ix used to do both, changing based upon the type of the index. They found this too confusing and decided to deprecate it completely. The more similar project for Python, xarray, has been developed more recently, and they, too, decided to completely detangle things here: http://xarray.pydata.org/en/stable/indexing.html.

If we decide to completely detangle these things we have two decisions to make:

  • What should the default A[] syntax do? Indices or labels?
  • How should we toggle between the two? We have several options here: it could be per-dimension with a macro or wrapper (like @v), or now that we have getproperty we could do the pythonic thing and use a "index accessor" property like A.iloc[]. The downside there is that it applies to all dimensions. When I dealt with the Pandas deprecation of ix, I found that it was most commonly used in cases like df.ix['label', 0] — that is, one dimension by label and one dimension by offset.

@Petr-Hlavenka
Copy link

I'm an user of AxisArrays and I like its approach a lot. On the other hand I'm just an average user with little knowledge of the inner functionality, so regarding the architectural decisions i fully trust the people like M. Bauman and Tim Holy. This post is only to show how I'm using them and why I find this package great.

I'm almost always using the AxisArrays package in combination with Unitful or with Images. In both cases I use AxisArray as a way to annotate physical meaning of the Integer indexes and dimensions. The huge advantage for me is to use AxisArray for steering the dispatch into dedicated functions, and the dispatch is really pleasure to write thanks to rdeits suggestion in discourse with constructs like

const QAxis{Name, Dim} = Axis{Name, AX} where AX <: AbstractArray{Q} where Q <: Dim
const TimeAxis = QAxis{S, Unitful.Time} where S;
const TimeArray{T, N, V} = AxisArray{T, N, V, <:Tuple{TimeAxis, Vararg{Axis}}};

const QAxisW{Name, Dim} = Axis{Name, AX} where AX <: Range{Q} where Q <: Dim
const TimeAxisW = QAxisW{S, Unitful.Time} where S;
const TimeWaveformArray{T, N, V} = AxisArray{T, N, V, <:Tuple{TimeAxisW, Vararg{Axis}}};
const TimeWaveform{T, V} = TimeWaveformArray{T, 1, V}

const EnergyAxis = QAxis{S, Unitful.Energy} where S;
const EnergySpectrumArray{T, N, V} = AxisArray{T, N, V, <:Tuple{EnergyAxis, Vararg{Axis}}};

and this definitions I use to define specialized functions - some of them expecting Axis of StepRange kind (FIR, image processing, ...), other can work with arbitrary axis values (after e.g removal of outliers) such as plotting. I know I can define my own type and do the dispatch on its type - but this way it is for free, all playing just well with the rest of Julia.

A second usecase is when I use it instead of DataFrames - just that accessing any column of my measurement dataset is so simple, my main axis (time, or energy, or length) is still present and propagated - but I can pass it to any function expecting an AbstractArray and it just works without explicitly writing any conversions, any macro, just pure math-like syntax.

So my opinion is:

  • AxisArray is a AbstractArray thus default indexing without axis specification arr[1, 1], arr[:, 1], arr[1:end-5, 1] should be of the underlying array (I've even an use case I use a sparse array for spectrum representation in a huge datacube - works great!)
  • AxisArray is accessed by "dimension name" be it :Red, :time -> kind of DataFrames-light use case
  • AxisArray is accessed by "dimension name" and an Integer index -> again index into the underlying array
  • Any access to index or range through the Axis mapping is an opt-in for methods that expect and rely on the added information provided by the Axes, can dispatch only on a specific type of Axis and can opt-in a different way of indexing, e.g. what you propose with index accessor arr.ax[:time, 5u"ns"], arr.ax[5u"ns"], arr.ax[5u"ns"..15u"ns"] for explicit axis at given value, implicit axis on value and for implicit axis by range.

My feeling is, the three characters .ax to make sure indexing is by label is wort typing and not causing big pain for readability.

But again, don;t put too much weight to my examples and suggestions.

@timholy
Copy link
Member

timholy commented Jul 5, 2018

Since the upgrade-to-0.7 train is at full speed now and the exported name axes conflicts with Base, this issue has gone critical. Time to make a decision.

My proposal is to orthogonalize the concepts as much as possible, using the following new packages:

  • NamedAxisIndexing: all this does is map a name to a dimension slot. It's effectively like Base.PermutedDimsArray except the axis names can be anything. Any other behavior is a property of the object you wrap. This package could have one wrapper NamedAxisArray <: AbstractArray (for integer-indexing) and another NamedAxisDict <: AbstractDict (for value indexing). (pity we don't have traits, then we could have just one wrapper...)
  • RangedAxisArrays: this adds range metadata to each axis. These are still AbstractArrays and you index them with integers. The axis metadata is indexed by slot number rather than symbol or anything else.
  • People who want to index by value can create a new package or standardize on one of the others (AssociativeArrays, NamedArrays, DataCubes, etc.)

Either of the latter two can be combined with NamedAxisIndexing at the user's or package author's option. That way everyone gets to leverage common functionality without there being so much angst over what things should actually mean.

We might need a small glue package (AnnotatedAxisArrays?) that defines trivial implementations of core trait functions that all of these packages can extend for their specific types.

@iamed2
Copy link
Collaborator

iamed2 commented Jul 5, 2018

I'm not sure what the difference between NamedAxisArray and NamedAxisDict is. Are these for the data arrays or for axes?

@timholy
Copy link
Member

timholy commented Jul 5, 2018

The key point is the subtype relationship, NamedAxisArray <: AbstractArray. If you create a container that indexes by axis-range values (e.g., a[3.2s, 4.4mm]), it's not so obvious that you should be calling it a subtype of AbstractArray since we make the assumption in lots of places that you can index them by integers. But you could call that kind of object an AbstractDict.

@iamed2
Copy link
Collaborator

iamed2 commented Jul 5, 2018

It sounds like NamedAxis* are for mapping dimension names to dimension numbers only. In that case, shouldn't the wrapper not care how indexing works on the wrapped structure, and just generated reordered calls to getindex and setindex? Or is there extra functionality in this hypothetical package that I'm missing?

@timholy
Copy link
Member

timholy commented Jul 5, 2018

You have to make a decision about subtyping when you define the structure:

struct NamedAxisArray{T,N,A<:AbstractArray,S<:NTuple{N,Symbol}} <: AbstractArray{T,N}
    data::A
end

# Anything that isn't an Array but supports getindex/setindex!
struct NamedAxisDict{K,V,D,S<:Tuple{Vararg{Symbol}} <: AbstractDict{K,V}
    data::D
end

but the implementations of getindex and setindex! are identical for both types. It's really just so that the wrapped objects have the appropriate isa relationships.

Like I said, if we were trait-based rather than inheritance-based then this wouldn't be an issue, since the isarray trait could be defined isarray(W::NamedAxisWrapper) = isarray(W.data).

@timholy
Copy link
Member

timholy commented Jul 5, 2018

Hmm, though I realized there's still nothing to prevent someone for asking for value-based lookup with a RangedAxisArray. Really, is there anything wrong with things the way they are now? We have atvalue and atindex, these do allow one to index by value unambiguously and mix integer-indexing and value-indexing freely. Now I'm a little less certain we need to do the split in #84 (comment).

Overall I have the perception seems that existential angst about this package not quite being right is holding it back. Personally, I don't share this angst because I really do think this package should just be an axis metadata wrapper.

@iamed2
Copy link
Collaborator

iamed2 commented Jul 5, 2018

In my mind AxisArrays has these basic levels of processing (let me know if I'm missing anything):


A. Index into an AxisArray with integers/ranges:
- like a normal AbstractArray

B. Index into an AxisArray with positional indexes:
- translate value indexes into integer/range indexes, leave non-value indexes untouched
- index into the AxisArray with the results using A

C. Index into an AxisArray with named indexes:
- translate from names to positions, fill the remaining indices with :
- index into the AxisArray with the results using B


I see B as handled with axisindices right now, which uses Dimensional and Categorical traits to try and guess at whether to return arrays or ranges of indices, and then perform the translation using code that either assumes the axis is sorted or doesn't. I think the main complication in figuring out how to structure AxisArrays is capturing this behaviour, and abstracting this procedure out is the key to simplifying AxisArrays.

Does that sound right?

@timholy
Copy link
Member

timholy commented Jul 6, 2018

I agree with most of that, but I'd say that C (named axes) is actually orthogonal to B (value-indexing), in the sense that you can use integer/range indexing with named axes.

@iamed2
Copy link
Collaborator

iamed2 commented Jul 6, 2018

I agree, that makes sense

@iamed2
Copy link
Collaborator

iamed2 commented Aug 7, 2018

Matt has pointed out that sometimes AxisArrays.to_index will produce AxisArray{T, N} indexes sometimes (where N>=2) as in the README example. Since the output dimensionality and axis names depend on the index dimensionality and type, I think this means that named indexing and at least the indexing-by-value interface need to live in the same package.

@timholy
Copy link
Member

timholy commented Aug 23, 2018

Following the lead of a proposal I just made for Interpolations, perhaps we should make A[i...] be "normal" indexing and A(position...) be value-indexing. For handling a mix, we could export a Index wrapper and use A(3.2mm, Index(5), 8.4s) as an interface.

The advantage here is that AxisArrays could continue to be "normal" AbstractArrays, using integer indexing and where axes (the one in Base, not the one here) returns an AbstactUnitRange{<:Int}.
However, other than obviating the use of end and similar constructs (though we could use https://github.com/JuliaArrays/EndpointRanges.jl), it would offer nice syntax for value-indexing, too. Perhaps the new trait name of AxisArrays.axes should be valueaxes?

@mbauman
Copy link
Member

mbauman commented Aug 23, 2018

I find the function call interface to make much more sense for interpolations (where it feels like I'm actually executing a calculation to determine the value) vs. here (where it feels more like lookup). I know the distinction there is rather fuzzy, but I find it meaningful. I'd argue that the tension we feel here is because we're trying to be two data structures at once, both a [Sorted|Ordered]CartesianDict with named dimensions and an array with named and valued axes. While we could move the by-value indexing behind a function call interface, I'm more tempted to rip it out altogether and move it into a CartesianDict package. The dictionary package could be a wrapper around this one, just exposing a different API. It'd effectively be the dense/rectangular version of JuliaDB's NDSparse.

I don't think @andyferris follows this repository, so I'm pinging you because I know you've put in lots of effort on blurring the lines between dictionaries and arrays. I'd be interested in hearing your thoughts here.

@andyferris
Copy link
Member

Thanks @mbauman for the link.

Yes, this is very interesting. I like the way this mixes up arrays, offset arrays, generalizations of dictionaries and Cartesian indices all in the one place :). A few (some of them obvious) things:

  • If you want to not quite follow the AbstractArray interface, that's great, just don't subtype AbstractArray. We can make a new supertype with its own interface; things like broadcast can be extended these days pretty easily. In this case you could use something like parent(A)[i...] for array-style indexing, no?
  • It seems challenging to have more than one kind of keys for a container. Especially when you throw in multidimensional indexing into the mix. What exactly should keys return? Already, I'm pretty confused by having both linear and Cartesian indices in AbstractArray but having arbitrary column labels (which themselves may be containers) seems like it could lead to complexity, fast.
  • If you do have more than one "kind" of key, to limit complexity it would seem nice to have a very simple transformation between them. For example A[1, 2] being equivalent to A[CartesianIndex(1, 2)], or something like that, having keys return a container of CartesianIndex - this would seem entirely self-consistent (the first transformation is predictable syntactic sugar). There's also the new A[a = 1, b = 2] syntax available for use soon, so a similar transformation for named axes seems quite reasonable.
  • When the transformation is more complex, I'd tend to use a different generic function to getindex, or it gets hard to write down and understand all the guarantees that keys and getindex should provide.
  • For AbstractDicts (and in general, any containers with keys that themselves could be containers) I feel we will need to split getindex and getindices into two generic functions and syntaxes. My understanding is that @mbauman is on the same page here, and the A.[indices] syntax should be free for the taking in v1.x?
  • I'd recommend investigating a new interface for fast iteration over the elements of one or more idexable containers (suitable for dictionaries, etc). Obviously you want your axis arrays to be fast. AbstractArrays use eachindex to return the faster of Cartesian indices or linear indices for using in your loops. Dictionaries may need tokens, broadcasting multiple over axis arrays with the same axes should use the integer indices rather than symbols, etc. For AbstractDict this is complicated by the fact that your fast "token" type could also be a valid index... thus I would suggest something along the lines of eachtoken and gettoken, or whatever.
  • The function call interface is interesting. These days I'm basically confused why arrays/dictionaries/etc aren't just maps from index to value (i.e. functions). I can only say that when I switched from MATLAB to Julia I found the Julia code easier to read because of the distinction between [] and ().

I haven't had the time to experiment with any of this lately - but what I'd like to see is basically a dictionary that can be as fast as an array, and supporting multidimensional indices (via positional or named axes - I'm hoping that Symbol for axis names should be sufficient).

@junglegobs
Copy link

It's been almost 3 years since this first opened - would it be safe to say that indexing by integers will remain a special case for AxisArrays?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants