Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Common interface for raster data/DimensionalData.jl #6

Open
meggart opened this issue Sep 13, 2019 · 69 comments
Open

Common interface for raster data/DimensionalData.jl #6

meggart opened this issue Sep 13, 2019 · 69 comments

Comments

@meggart
Copy link
Member

meggart commented Sep 13, 2019

Hello all,

it looks like there are currently several packages trying to implement some xarray or R-raster like functionality in Julia and I think I don't have a complete overview of what is actually out there. The one's I know about are:

I think every one of these approaches has its own focus, but it would be nice if we could either merge/re-use some functionality or try to use a common interface so that e.g. plotting applications can work on different representations of gridded data.

Maybe package authors could comment here on where they see their package fit into the ecosystem.

@meggart
Copy link
Member Author

meggart commented Sep 13, 2019

So let me start with https://github.com/esa-esdl/ESDL.jl . This package was originally created as a toolkit for a single dataset, so started out to be quite limited in the datasets it can handle. Its focus is on an efficient implementation of mapslices and broadcasted mapslices on chunked out-of-core datasets (including automatic parallelization). It is optimized to work on Zarr datasets and the data model is completely compatible with xarray, which means that you can directly read data that has been written by xarray's to_zarr method.

The package introduces an ad-hoc AxisArray-like type as well that has named axes and values along these axes. However, I would be very happy to exchange this backend with whatever we might come up as a common interface. My main requirement would be that it is possible for this array to lazily hold the data, since all my datasets are larger than memory. The package also comes with some rudimentary plotting functionality (see docs), but I would really like to interface some other solution, since maintenance is quite time-demanding.

For a quick feeling how the package works, have a look at e.g. these notebooks https://github.com/esa-esdl/ESDLPaperCode.jl/tree/master/notebooks .

(Pinging @Alexander-Barth )as well, who might be interested

@visr
Copy link
Member

visr commented Sep 13, 2019

For some discussion on a raster interface see also JuliaGeo/GeoInterface.jl#16.

But I think this is more about labelled arrays in julia, which is also a good discussion to have. In the basis I think this should not be geo-specific, since it is something of use for a much larger audience. Hence also the split between DimensionalData and GeoData.

Another option is IndexedDims and NamedDims as mentioned here: https://discourse.julialang.org/t/status-of-axisarrays-jl/28682/5. Perhaps it is best to participate in that thread as well, for wider visibility.

@rafaqz
Copy link
Member

rafaqz commented Sep 15, 2019

Thanks for starting this @meggart. Im also keen to move towards integrating these efforts.

ESDL.jl looks interesting, I think we've done a lot of similar work. I'll have a thorough look at it during the week. To me there are two key interfaces to agree apon. As @visr is suggesting, there are general named axis/dimension data for use across many julia domains, but there are also container types specific to the geospatial domain. The reason DimensionalData and GeoData are separate is mostly to separate these discussuions.

Hopefully we can treat DimensionalData.jl as a replaceable building block for building geospatial tools, which shouldnt have to worry about it much besides defining dimensions during object construction.

Agreeing on common spatial objects and interfaces is probably more specific to this space. I've put forward the array/stack/series components, based on my needs for modelling packages (see the GeoData.jl readme). But we can certainly restructure that to fit the cube concept. Im also focussed on making everything potentially disk-based and lazy, with the same synax as eager versions.

@meggart
Copy link
Member Author

meggart commented Sep 16, 2019

Thanks to both of you for your replies. You are right, I am mixing two separate issues here, and since our packages touch both of these issues I thought I would mention both. I will focus on the more Goe-related one for now.

Agreeing on common spatial objects and interfaces is probably more specific to this space. I've put forward the array/stack/series components, based on my needs for modelling packages (see the GeoData.jl readme). But we can certainly restructure that to fit the cube concept. Im also focussed on making everything potentially disk-based and lazy, with the same synax as eager versions.

I have looked at GeoData.jl befire and I really like the interface. The only fundamental design decision that causes me headaches is that it AbstractGeoArray subtypes AbstractArray. As you know Julia does a lot of magic by implementing many fallback methods for AbstractArrays which makes it so easy to define your own Array types. However, if your array is mapped to a compressed on-disk array with a non-standard chunking these fallbacks will be very inefficient. Just accidentally calling the fallback show method can freeze your terminal when you have opened a remote dataset on S3, because the method assumes fast random access on your data. The same is true for the default implementations of broadcasting, map, etc.

So I have decided not to subtype AbstractArray for now. However, I would really like to experiment with returning AbstractGeoArrays when getindex is called on one of my cubes, or to represent the data as DimensionalData inside a mapslices operation, as is currently supported for AxisArrays.

BTW, I think it might be a good idea to discuss in person, so maybe we could have some kind of JuliaGeo raster teleconference open for interested people to show and discuss the different approaches and move forward with a common interface.

@rafaqz
Copy link
Member

rafaqz commented Sep 17, 2019

That is a good point. I've written DimensionalData.jl with this in mind, so that you don't really need to inherit from AbstractDimensionalArray, you just need dims and rebuild methods (refdims also helps). AbstractGeoStack uses dim indexing but its not <: AbstractArray

A teleconference is a good idea, @juliohm and a few others in the was also interested in that too over at the VerySimpleRaster thread. I'm in Australian Eastern standard time but 7am-2am my time is a tolerable window most days.

Edit: I'm also using disk based backends so avoiding these default access methods will be really important, but I haven't really though it through yet or hit any practical problems. But its relevent to the choice made in DD to rely on dimensions for functionality and dispatch when possible instead of the array types, something we are simultaneously discussing in this thread:
JuliaCollections/AxisArraysFuture#1 (comment)

@juliohm
Copy link
Member

juliohm commented Sep 17, 2019

Thank you @rafaqz for pinging. I am in BRT time, and can adjust my agenda accordingly after you decide a meeting hour. We definitely need to set a video call to address this challenge. It is too big to be discussed on GitHub threads only.

@meggart
Copy link
Member Author

meggart commented Sep 18, 2019

Oh dear, time-wise I am in the middle (CET). Maybe we could try to do 8 a.m. BRT, which would correspond be 9 pm Australian Eastern and it would be around noon for Europeans. Are there any suggestions for a Date? I would be available on normal weekdays.

In order to collect this information in a single place I have started a Google doc to collect availabilities and possible topics and might serve as an agenda in the end. https://docs.google.com/document/d/1ccaSltPDb5n-bLNgWyrotsWXz2n7ckMggOIPejy_GvQ/edit?usp=sharing Please feel free to edit as you like and share with interested people. Let me know if the google docs does not work for some reason (no google account and don't want one etc...) and suggest where to do this instead.

@meggart
Copy link
Member Author

meggart commented Sep 18, 2019

hat is a good point. I've written DimensionalData.jl with this in mind, so that you don't really need to inherit from AbstractDimensionalArray, you just need dims and rebuild methods (refdims also helps). AbstractGeoStack uses dim indexing but its not <: AbstractArray

This is great and it looks like your approach is indeed very composable. I will look into the AbstractGeoStack implementation, thanks for the pointer.

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Sep 18, 2019

Thanks a lot for your great initiative, Fabian!

In my domain, grids are not necessarily aligned in the longitude and latitude directions (for example a satellite swath, https://docserver.gesdisc.eosdis.nasa.gov/public/project/Images/OMNO2_003.png or a model whose vertical grid depends on time). Should such raster data also be supported?

In the most general case, if a dataset has e.g. 4 dimensions, the arrays representing longitude, latitude, depth (or elevation) and time can also have 4 dimensions.

I think this could still be represented efficiently by having an special array type which is virtually is a 4d array but stores only a vector of the data and defines the getindex function appropriately, for example if the longitude varies only along the 1st dimension, one could have:

Base.getindex(x::Virtual4DArray,i,j,k,n) = x.small_vector[i] 

Similarly, the value could just be computed on the fly for specific projection using the corresponding formulas (e.g. http://mathworld.wolfram.com/MercatorProjection.html) so that the storage would actually just be a handful of parameters.

A georeferenced dataset would just be a collection including the actual data and arrays representing lon, lat, elevation and time (or a subset of these). Some data sets might have even more dimensions (e.g. frequency for hyperspectral satellite data, ensemble member for Monte Carlo simulation).

A possible interface could be:

value =  georeferenced_dataset.data[2,3,4] # value at the indices 2,3,4
lon = georeferenced_dataset.lon[2,3,4] # longitude of the value at the indices 2,3,4
lat = georeferenced_dataset.lat[2,3,4] # latitude of the value at the indices 2,3,4
frequency = georeferenced_dataset.frequency[2,3,4] # implemented using getproperty

Simply indexing such a data set would return a georeferenced subset.

slice_of_georeferenced_dataset =  georeferenced_dataset[:,3,4]

Multiple dispatch in julia allows us to implement optimized function, for some common special cases, for example extract all data withing a geographic bounding box if the dataset is nicely aligned along longitude and latitude (or in a given projection where the indices can be directly computed).

I am also in CET time zone.

@Alexander-Barth
Copy link
Member

I just came across the OceanGrids.jl package from @briochemc which looks quite interesting.

The tests give an idea about the interface:
https://github.com/briochemc/OceanGrids.jl/blob/master/test/runtests.jl

@rafaqz
Copy link
Member

rafaqz commented Sep 18, 2019

@Alexander-Barth. I was planning to eventually build that kind of indexing into the DimensionalData.jl dimensions - the n dimensional arrays would be stored in the Lat/Lon/Vert/Time etc structs rather than as fields on the array. A trait would then control which dimensions needed multiple indices - Ie Lat/Lon might be matrices while Time could still a vector if it doesn't affect the lat/lon coordinates. It should be flexible to represent any configuration over any dimensions.

GeoData.jl already does geo-referenced subsetting, you can easily extract an area between some lat/lon etc (using Between()) and get another GeoArray result. But it will of course be a bit more work when the dims are holding 4d arrays.

The NCDatasets implementation works for simple (axes as vectors) projections:
https://github.com/rafaqz/GeoData.jl/blob/master/src/sources/ncdatasets.jl

@Alexander-Barth
Copy link
Member

@rafaqz
This sounds quite interesting! Can you give an example what the code would look from a user's perspective (once it is implemented, if it is not already) to get for example lon/lat/time of a given index of the georeferenced raster assuming that the 1st and 2nd dimensions are not aligned with the longitude and the latitude direction? Would it the be same as I proposed above?

@rafaqz
Copy link
Member

rafaqz commented Sep 18, 2019

You can already load matrices into dims instead of a vector, but there is no method for getting the lat at a particular index, so you would have to do something like val(dims(a, Lat))[x, y, z].

But we can add a method that does that, like somemethod(a, Lat, x, y, z) if you have an idea what to call it. x, y, z could be indices 1, 7, 22 etc or Y(7), X(1), Time(At(DateTime(2005,1,10))).

Edit: It seems a little more complicated than this now I've written it out. Would the dims be called X, Y when they are not exactly Lat/Lon aligned? or still Lat/Lon?. We may need to store a second list of dims for mapping to. I actually never use these formats so I'm not really across how they are used.

@rafaqz
Copy link
Member

rafaqz commented Sep 18, 2019

@meggart 9pm is fine by me. Early next week (21st/22nd AEST is good), I will be travelling for a week or so after that but might be able to find time.

@visr
Copy link
Member

visr commented Sep 18, 2019

Would the dims be called X, Y when they are not exactly Lat/Lon aligned? or still Lat/Lon?. We may need to store a second list of dims for mapping to.

Yeah, to give you an example of a NetCDF where this is the case:

dimensions:
	x = 424 ;
	y = 412 ;
	time = UNLIMITED ; // (444 currently)
variables:
	double lon(y, x) ;
		lon:standard_name = "longitude" ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
		lon:_CoordinateAxisType = "Lon" ;
	double lat(y, x) ;
		lat:standard_name = "latitude" ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
		lat:_CoordinateAxisType = "Lat" ;
	double time(time) ;
		time:standard_name = "time" ;
		time:long_name = "time" ;
		time:bounds = "time_bnds" ;
		time:units = "day as %Y%m%d.%f" ;
		time:calendar = "standard" ;

So x, y, time are the orthogonal dimensions. And lon and lat are variables that each vary with both x and y.

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Sep 18, 2019

Thanks @visr for sharing this example. The names of the dimensions (x and y) are indeed arbitrary. The important part is that the data variable re-uses the dimensions names, for example:

	double temperature(time, y, x) ;

in NCDatasets.jl I use the recently added function NCDatasets.coord to find the coordinate by the standard name (or by units). It is required that the dimensions of the coordinate variable (lon, lat, ...) is a subset of the dimensions the data variable (e.g. temperature) as a single NetCDF file can have multiple variables representing longitude, latitude or time (which is the case for some model output where not all variables are defined at the same location, I can share an example to make this point more concret if you want).

@rafaqz
Copy link
Member

rafaqz commented Sep 18, 2019

Thanks that example really helps. Do either of you have some demo files you can link to for testing? I think GeoData.jl will break if you try to load something like that currently. But it shouldn't be too much work to handle it.

I added a few points about formats and projections to the discussion agenda.

@Alexander-Barth
Copy link
Member

Sure, here is an example from the ROMS model. The lon/lat for the variable e.g. zeta are lon_rho and lat_rho while e.g. for ubar, lon_u and lat_u. Such grid staggering is used by almost all ocean and meteorological models nowadays. I just released a new version of NCDatasets if you want to reuse the coord function.

@Alexander-Barth
Copy link
Member

But we can add a method that does that, like somemethod(a, Lat, x, y, z) if you have an idea what to call it. x, y, z could be indices 1, 7, 22 etc or Y(7), X(1), Time(At(DateTime(2005,1,10))).

While rereading you comment, are Xand Y types defined by the GeoData.jl or types that the user has to define?

@rafaqz
Copy link
Member

rafaqz commented Sep 19, 2019

Thanks for that, I'll look at the coord function too. Will there be a way of standardising this across other datasets, say GDAL and custom HDF5 formats? Or is this kind of detail mostly only in netcdf?

DimensionalData.jl defines X, Y, Z and Time, GeoData.jl defines extends this with Lat, Lon, and Vert. Dim{:x} is the verbose fallback.

Package extenders can add more with @dim MyDim "My great dimension" but users shouldn't have to very often.

@meggart
Copy link
Member Author

meggart commented Sep 19, 2019

@meggart 9pm is fine by me. Early next week (21st/22nd AEST is good), I will be travelling for a week or so after that but might be able to find time.

Regarding the call time I would suggest next Monday Sep 23 at 11am UTC (which is 1pm CEST, 8am Sao Paulo, 9pm Melbourne). I hope this fits everyone.

@juliohm
Copy link
Member

juliohm commented Sep 19, 2019

I support the proposed time for the meeting. How we can organize the agenda to optimize the time?

@rafaqz
Copy link
Member

rafaqz commented Sep 20, 2019

It works for me too. I've just been editing the agenda doc a little with ideas.

https://docs.google.com/document/d/1ccaSltPDb5n-bLNgWyrotsWXz2n7ckMggOIPejy_GvQ/edit?usp=sharing

I'm also wondering if @evetion was aware of this seeing he has also written a raster package and has good ideas for using complex projections, and @mkborregaard as his name was misspelled above

@Alexander-Barth
Copy link
Member

Unfortunately, I have a lecture to give Monday 1pm (introduction to Julia btw ;-) )

@meggart
Copy link
Member Author

meggart commented Sep 22, 2019

Unfortunately, I have a lecture to give Monday 1pm (introduction to Julia btw ;-) )

I'm also wondering if @evetion was aware of this seeing he has also written a raster package and has good ideas for using complex projections, and @mkborregaard as his name was misspelled above

Given the miss-spelling and that @Alexander-Barth is not available, I would suggest to move to a later day, it would be important to give @evetion and @mkborregaard a chance to participate who might not have seen this thread yet. Do we need a doodle?

Regarding optimizing the agenda, I am not sure. Would anyone be willing to moderater the meeting and try to keep stuff focused? Maybe @visr or @juliohm who are not directly involved in writing raster packages?

@juliohm
Copy link
Member

juliohm commented Sep 22, 2019 via email

@meggart
Copy link
Member Author

meggart commented Oct 4, 2019

During the last weeks I spent some time browsing through the different approaches of named arrays and dimensions and did some experiments and tinkering. As I mentioned before, the goal and focus of each package as well as the implementation details are quite different, while the concept of named dimensions that carry values is the recurrent theme in them.

The result of this tinkering is a more or less concrete proposal for an interface that would let the different approaches talk to each other. I tried to quickly draft this in this gist. The basic idea is that, no matter if your grid is regular or not, wen want to be able to assign some coordinates to every entry in a multidimensional array.

The gist defines a set of THTT traits for hasgrid which play nicely with DimensionalData but should also be general enough to deal with irregular grids as described by @Alexander-Barth . For my own data types it would be very easy to support this interface as well and I would be very interested in a showcase where one of my operations can run on a multifile NCDataset through this interface.

I think it would be nice to try to factor out a few additional array traits that might be important in our domain and try to export them somewhere. Here are a few suggestions:

  1. One aspect that is important to me is to allow different interpretations of grids. In the climate modelling world it is quite common that a value of a grid cell (Like air temperature) corresponds to the value exactly at the center of the grid cell. However, most (aggregated) remote sensing products should be interpreted as an average over an area defined by the cell. The same goes for the time dimension: Is my value representing an average/aggregate over a time span or is it a measurement taken at the given time stamp. IMO this should become a trait of a Dimension and might affect how we handle indexing functions like Nearest, Between and At as well as regridding/aggregations
  2. Another trait might be if random access into an array is fast or not, so that Base mapslices and other simple arithmetic is only applied on small in-memory arrays
  3. For the out-of-core arrays a trait for chunks as proposed in https://github.com/meggart/ChunkedArrayBase.jl would be essential to me
  4. Finally, to make the bridge to GeoData, a method like is_geo_x and is_geo_y on a AbstractDimension to extract the dimensions that should be interpreted as lon and lat as well as an interface to check CRSes as discussed in other threads might be added to this set of interfaces as well.

I would be happy to work towards such a set of traits to have something more concrete to discuss during the upcoming teleconference. Shall we actually try to set a new date for this?

@rafaqz
Copy link
Member

rafaqz commented Oct 4, 2019

Those are great proposals. I was just thinking about chunking today in relation to Dask/Dagger and how to generalise it over multiple source file types. ChunkedArrayBase seems like a good solution.

The nature of grid cells is also a good point. Another trait marking the nature of the dimension is needed. Near and Between currently use very simple methods just to prove the idea is reasonable, but if the coordinate/time etc represents the middle of edge of a cell their result should be different. Combining the grid regularity/irregularity with the cell type also makes sense.

But instead of traits of the array, I think RegularGrid{Center} etc should be traits of the dimensions. Forward and Reverse traits currently mark if a dimension is in reverse order and allow different dispatch in the selectors (ie to use rev=true in searchsortedfirst), and we could add another field for the grid type that works in a similar way. There is often the case where the Lat/Lon grid is irregular but time or elevation is regular, and the traits required to describe all the combinations will get out of hand if they are on the array.

Traits for load cost could be useful as well, and these could be on the array.

Lets set another date for this soon. I can be free any weekday evenings next week to fit the same schedule agreed for the last attempt.

@juliohm
Copy link
Member

juliohm commented Oct 4, 2019

The basic idea is that, no matter if your grid is regular or not, wen want to be able to assign some coordinates to every entry in a multidimensional array.

Exactly. I would also try to follow classical naming for the types a la VTK. For example irregular spatial types can be further broke down into structured (when we can still index with i,j,k), which is the case handled here, and unstructured (general meshes for which the notion of up-down-left-right doesn't make sense).

In the climate modelling world it is quite common that a value of a grid cell (Like air temperature) corresponds to the value exactly at the center of the grid cell. However, most (aggregated) remote sensing products should be interpreted as an average over an area defined by the cell.

This is one of the requirements I had in mind from a geostatistics perspective. Ideally we should have a trait system that can express the notion of volume of each element in the grid (rectangle elements): JuliaEarth/GeoStats.jl#40 This volume of the i-th element volume(grid, i) is a general trait that could be used for non-regular spatial data types. My plan in GeoStats.jl is to add this notion of volume in a next minor release, and in the case of regular grid, we can just save a single constant value for all elements in the grid type, or to make the code more clean, use FillArrays.jl. At the moment the plot recipes in GeoStatsBase.jl are misleading and plot the regular grids as if they were corner-based. We should move away from this idea because volumes of each element can be used to improve the estimates spatially (Kriging methods).

Finally, to make the bridge to GeoData, a method like is_geo_x and is_geo_y on a AbstractDimension to extract the dimensions that should be interpreted as lon and lat as well as an interface to check

I would like to generalize this to N-dimensional spaces. Just keep in mind that the majority of use cases is 2D in a map projection, with lat/lon values and so on, but these are not all use cases for a regular grid. In our field, modeling the subsurface is quite common, and we need 3D grids with properties of the rock.

Shall we actually try to set a new date for this?

For sure! I am in! We need to set a date and time. I am traveling over the weekend with limited access to the internet, but that same time proposal we attempted last time works for me weekdays.

@Alexander-Barth
Copy link
Member

So in case of a rotated grid, the dimension would be e.g. Dim{:xi_rho}, right (for the example ocean_his.nc)? How would the user/library finds the longitude of a given grid point using the proposed interface? I do not want to push "rotated grid", if you think it is out-of-scope then it is fine for me. But if the interface should support those, then it is not so clear to me how they are handled.

@rafaqz
Copy link
Member

rafaqz commented Oct 9, 2019

Yes that could be the dim name.

I think the idea is we put the trait system in place and we can fill out later to match all of these different grid types, so talking about it now is good to make sure they can be covered, but they might not get implemented until someone needs them enough to do it.

But once we have the grid trait, say it's RotatedGrid <: IrregularGrid, we can dispatch to methods that handle rotated grid, in DimensionalData it might be sel2indices(::RotatedGrid, dim, selector) = do_something where selector is At(x), Near(x) etc.

At can hold anything, normally its one lat or lon lookup value, but it could be 2 or more for matrix lookups.

The RotatedGrid type could have a field holding an AffineMap from CoordinateTransformations, which transforms the coordinates. I'm not sure exactly how to handle coordinate transformations shared by multiple dimensions, but not by all (eg not by time). We may need to use the grid traits earlier in dispatch when we have the whole tuple of dims, and just push the RegularGrid dim like time into the sel2indices(::Regular..) methods, and work on the others together in RotatedGrid methods.

That could happen without too much difficulty in DimensionalData once it's dispatching on the RegularGrid trait, it's all recursive so you can intervene anywhere fairly easily. It starts with a tuple of the existing dimensions and a tuple of the indexing dimension, and works over them recursively until you have regular indices to pass to the parent array.

So the user maybe just do:

A[Dim{:xi_rho}(At(whatever_it_needs)), Dim{:eta_rho}(At(xxx)), Time(At(DateTIme360Day(2015, 2, 2))]

Or something. It might be more complicated than that, but we can probably use dispatch to resolve it.

Edit: thinking about this more - if we limit an array to a maximum of one irregular grid type on N dims, with M regular grids on the remaining dims, which is probably realistic (is it?), we can do:

A[Irreg(At(144.3), At(-37.5)), Time(At(DateTIme360Day(2015, 2, 2))]

or order doesn't matter:

A[Irreg(Lat(At(-37.5)), Lon(At(144.3))), Time(At(DateTIme360Day(2015, 2, 2))]

@meggart
Copy link
Member Author

meggart commented Oct 10, 2019

I think the idea is we put the trait system in place and we can fill out later to match all of these different grid types, so talking about it now is good to make sure they can be covered, but they might not get implemented until someone needs them enough to do it.

Thanks @rafaqz this was exactly the plan. Try to flesh out which is generic enough to fulfill future use cases and simple enough to be integrated for existing schemes in a few lines of code. One question would be where an interface like this could live. Since it ideally is not only geo-specific, probably a new small package like DimensionalArrayTraits or similar might make sense. I would be ok with adding the traits to DimensionalData as well, but maybe uptake of the interface by other packages is larger when we keep the dependency minimal.

Regarding a new telecon date, I am currently travelling, but would be available again from next week. I will create a doodle once I am back if nobody beats me to it.

@rafaqz
Copy link
Member

rafaqz commented Oct 11, 2019

I've started implementing the types in DD for simplicity but I agree we should move them out afterwards - maybe DimensionTraits.jl instead? it will contain things that could be used by non-array data and it's a shorter name.

Edit: One question is how to deal with the span of a grid cell. I.e. it might be a Month, not an exact amount of time. The cell span is also useful for bounds() to calculate the actual edges of the dataset if the coordinates are specifying the start or center of cells. IrregrularGrid and CategoricalGrid wont have a cell span.

I've written it up like this:

struct Center{T} <: CoordType{T} 
    span::T
end

In the branch of DD exploring grid traits:
https://github.com/rafaqz/DimensionalData.jl/blob/grid/src/grid.jl

@mkborregaard
Copy link

Hi guys, thanks for including me in this conversation and sorry for being late to the party. So to answer the original question, my goal for VerySimpleRasters.jl was to make a really light pure-julia package for doing most of the stuff I need to do with rasters.

So, I'm not a geographer, I'm an ecologist, and 99% of all my uses of rasters involve accessing larger-than-memory rasters, extracting values at certain points, aggregating/resampling, cropping/masking, window operations, cost-distance-landscape calculations etc. Time series too. These are all things you can easily do with generic (e.g. image) operations on julia arrays. So the package reads and writes mmapped arrays and opens for operations on the matrix with a simple transformation from coordinates to getindex. It also has like zero deps. I thought of it as a bit of an antitheses to most gis software that has big binary dependencies in order to read and write many raster formats, and passes a lot of c pointers around.

My stakes are 1) this functionality is crucial for transferring my workflow to julia, and 2) I'd prefer to use something better than VerySimpleRasters. So I'm really happy to merge/subsume/drop any VerySimpleRasters functionality, and also to contribute within the limits of me actually not knowing very much about existing geospatial data types.

I'm really enthusiastic about the developments here - happy to join a call if I can be useful.

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Oct 11, 2019

For Fabian's interface, I think it would be necessary to add a function like gridcoordinatesname (or any shorter variant :-) ) which would help to identify that the first dimension is actually "longitude", (or latitude, elevation, time ...).

For what it is worth, here is an example of an API which uses the arbitrary grid as a general case: for e.g. a 3D data array, the coordinates are all 3D arrays too, but it uses a special type RepeatedArray is used to store them efficiently for the common case where the dimension are aligned with longitude/latitude/elevation/time....

https://gist.github.com/Alexander-Barth/41f70c292ec9cdac2543e6c6a01ef612

The advantage is that dimension names (which are not standardized) are not exposed to the user and one can more easily write generic code which work for different datasets. Optimization for the special and common cases (aligned coordinates) can be handled using julia's dispatch (but I have not tested this yet).

EDIT: The disadvantage is that users are nudged/forced to think about the arbitrary case, even is the dataset that they are using has a quite simple structure.

@mkborregaard
Copy link

For arbitrary grids, don't we just need 1) an array, and 2) a transformation function of coordinates onto array indices? Like GeoRasters has?

@Alexander-Barth
Copy link
Member

Yes, it boils down to a transformation function, but which is (in the context of netcdf files) usually defined as 2D arrays of longitude and latitude (representing the coordinates of every grid point). I am not familiar with GeoRasters to say if such cases can be handled.

@rafaqz
Copy link
Member

rafaqz commented Oct 11, 2019

I think it would be necessary to add a function like gridcoordinatesname (or any shorter variant :-)

@Alexander-Barth was just thinking this too. In DD the IrregularGrid trait would have its own dims method (which is also much shorter ;) that returns a tuple of the dimension names after the transformation.

Then we could just index with A[Lat(x), Lon(y), Time(At(t))] and not even think about the transformation from Dim{xi_rho} etc that happens underneath.

Edit: If it wasn't clear, the array dims field would itself hold the tuple (Dim{:xi_rho}, Dim{eta_rho}, Time). An IrregularGrid trait holding the transform function or matrix lookup, and the transformed dimension names tuple (Lat, Lon) would be attached to the first two dims. Time would just have a RegularGrid trait.

That would be general in terms of dimension number, order and names for both the array and transformation.

@rafaqz
Copy link
Member

rafaqz commented Oct 14, 2019

The pull request implementing something like what @meggart proposed is here:

https://github.com/rafaqz/DimensionalData.jl/pull/11/files

It's still a while off being finished but it seems fairly doable.

@juliohm
Copy link
Member

juliohm commented Oct 17, 2019

Dear all,

Did we set a date for our e-meeting? I am looking forward to brainstorming the issues we raised.

Best,

@rafaqz
Copy link
Member

rafaqz commented Oct 17, 2019

Maybe set up a doodle poll if you have it? or I think zvite is free and similar.

I'm keen to talk about GeoStats integration, its the next step in removing R from our workflow.

@juliohm
Copy link
Member

juliohm commented Oct 17, 2019

Thank you @rafaqz , I've created a zvite poll for the first time: https://zvite.co/i/PVySJie_ Please let me know if something is off.

Let's try it next week. I cannot do Monday but I marked all weekdays' mornings as available starting from Tuesday.

P.S.: I'd be happy to help with the GeoStats.jl integration ❤️

@rafaqz
Copy link
Member

rafaqz commented Oct 21, 2019

@Alexander-Barth and @meggart transformed indices work now on the grid branch of DimensionalData.jl.

https://github.com/rafaqz/DimensionalData.jl/blob/grid/test/selector.jl#L51-L84

There will be some work automating the generation of the transform functions from various file formats. I'm not sure how the netcdf format handles this, but it's fairly simple in for tif etc GDAL and the GeoArrays packages has some good examples.

Any updates on meeting times? have people filled out the https://zvite.co/i/PVySJie_ ?

@juliohm
Copy link
Member

juliohm commented Oct 21, 2019

Thank you @rafaqz for the update. Only you and @visr have voted for times in the zvite invite so far.

@mkborregaard
Copy link

mkborregaard commented Oct 21, 2019

Yeah the zvite page won't load for me
EDIT: it's because the final underscore isn't part of the hyperlink
Edit again: doesn't matter for me, I don't have time this week. Sorry for the noise.

@juliohm
Copy link
Member

juliohm commented Oct 21, 2019 via email

@visr
Copy link
Member

visr commented Oct 21, 2019

The link happens to end in an underscore. Github by default does not include that underscore in the link. That's why I edited your post @juliohm, to make an explicit []() url in markdown.

@meggart
Copy link
Member Author

meggart commented Oct 22, 2019

Thanks @visr the link works for me.

@evetion
Copy link
Member

evetion commented Oct 22, 2019

Hi all, sorry for being late to the party. In the past I have tried to extend AxisArrays.jl for spatial dimensions, but it couldn't handle irregular grids and their coordinates. I love the initiative and the ideas here and I'm happy to join the discussion.

Most of the common issues and design seem to have been mentioned/discussed at least once. I'll add one, in my tryouts to implement the indexing/slicing with coordinates in GeoArrays.jl I encountered problems when used with an Irregular grid, as the logical coordinates are not (always) an orthogonal slice anymore. The only thing that comes close is returning a Boolean mask of the grid representing locations that do fall within the requested bounds. The other option is flattening the result to a vector, losing the structure.

To make this a bit clearer, imagine indexing into a Irregular GeoArray with [Dim{foo}(1), Dim{bar}(2)] as proposed by @rafaqz. Dim{foo} is regular and sliceable with a[:, 1].

[ 1 2 3
  1 2 3
  1 2 3 ]

However, imagine Dim{bar} as

[ 1 2 2
  1 2 3
  2 3 4 ]

which can't be sliced easily, only masked. How would we handle such cases?

For arbitrary grids, don't we just need 1) an array, and 2) a transformation function of coordinates onto array indices? Like GeoRasters has?

GeoRasters has become GeoArrays, because it exports the GeoArray type. That name is now also used by GeoData but with a different definition.

Regarding the planned meeting, is there an overview of posssible dates/times? The zvite link only enables new input.

@rafaqz
Copy link
Member

rafaqz commented Oct 22, 2019

Yeah that is difficult to deal with. Masking and slicing to the bounding box could also work, especially for loading from disk as we would only grab the bounding box. Currently only single indexes are working in DD, but it just needs method for doing the masking and it will work. I was imagining applying the mask to the array using the missingval that is stored in GeoData.

And yes it would be good if we could see whats happening with zvite and decide a day

@juliohm
Copy link
Member

juliohm commented Oct 22, 2019

So far everyone can do Friday 8am BRT according to zvite:

Screenshot from 2019-10-22 19-50-06

Should we set the date/time?

@rafaqz would you like to host the meeting? I can help moderating as you suggested.

@rafaqz
Copy link
Member

rafaqz commented Oct 23, 2019

Ok that works for me

@Alexander-Barth
Copy link
Member

I will try to join too for Friday (11:00 UTC; 13:00 CET)

@juliohm
Copy link
Member

juliohm commented Oct 23, 2019

Let's set a online room for the meeting. I think the time is set already.

@rafaqz do you have suggestions on how we should do it? I will review our agenda in the Google Docs today or tomorrow at latest.

@rafaqz
Copy link
Member

rafaqz commented Oct 23, 2019

I think someone already set up some kind of room linked in the docs.

@meggart
Copy link
Member Author

meggart commented Oct 24, 2019

I set up the room some weeks ago because no alternative was suggested. I will activate it approx 1 hour prior to the meeting.

@juliohm
Copy link
Member

juliohm commented Oct 25, 2019

Just a reminder of our meeting in an hour from now. Looking forward to meeting you guys.

@evetion
Copy link
Member

evetion commented Oct 25, 2019

Made https://github.com/JuliaGeo/DimensionalArrayTraits.jl as suggested in the meeting.

@Balinus
Copy link

Balinus commented Dec 19, 2019

Awesome work! Just found out this thread. An unified interface will open up a lot of possibilities. Having out-of-core arrays capabilities baked-in will certainly be a massive step for Geospatial analysis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants