-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Common interface for raster data/DimensionalData.jl #6
Comments
So let me start with https://github.com/esa-esdl/ESDL.jl . This package was originally created as a toolkit for a single dataset, so started out to be quite limited in the datasets it can handle. Its focus is on an efficient implementation of mapslices and broadcasted mapslices on chunked out-of-core datasets (including automatic parallelization). It is optimized to work on Zarr datasets and the data model is completely compatible with xarray, which means that you can directly read data that has been written by xarray's The package introduces an ad-hoc AxisArray-like type as well that has named axes and values along these axes. However, I would be very happy to exchange this backend with whatever we might come up as a common interface. My main requirement would be that it is possible for this array to lazily hold the data, since all my datasets are larger than memory. The package also comes with some rudimentary plotting functionality (see docs), but I would really like to interface some other solution, since maintenance is quite time-demanding. For a quick feeling how the package works, have a look at e.g. these notebooks https://github.com/esa-esdl/ESDLPaperCode.jl/tree/master/notebooks . (Pinging @Alexander-Barth )as well, who might be interested |
For some discussion on a raster interface see also JuliaGeo/GeoInterface.jl#16. But I think this is more about labelled arrays in julia, which is also a good discussion to have. In the basis I think this should not be geo-specific, since it is something of use for a much larger audience. Hence also the split between DimensionalData and GeoData. Another option is IndexedDims and NamedDims as mentioned here: https://discourse.julialang.org/t/status-of-axisarrays-jl/28682/5. Perhaps it is best to participate in that thread as well, for wider visibility. |
Thanks for starting this @meggart. Im also keen to move towards integrating these efforts. ESDL.jl looks interesting, I think we've done a lot of similar work. I'll have a thorough look at it during the week. To me there are two key interfaces to agree apon. As @visr is suggesting, there are general named axis/dimension data for use across many julia domains, but there are also container types specific to the geospatial domain. The reason DimensionalData and GeoData are separate is mostly to separate these discussuions. Hopefully we can treat DimensionalData.jl as a replaceable building block for building geospatial tools, which shouldnt have to worry about it much besides defining dimensions during object construction. Agreeing on common spatial objects and interfaces is probably more specific to this space. I've put forward the array/stack/series components, based on my needs for modelling packages (see the GeoData.jl readme). But we can certainly restructure that to fit the cube concept. Im also focussed on making everything potentially disk-based and lazy, with the same synax as eager versions. |
Thanks to both of you for your replies. You are right, I am mixing two separate issues here, and since our packages touch both of these issues I thought I would mention both. I will focus on the more Goe-related one for now.
I have looked at GeoData.jl befire and I really like the interface. The only fundamental design decision that causes me headaches is that it So I have decided not to subtype BTW, I think it might be a good idea to discuss in person, so maybe we could have some kind of JuliaGeo raster teleconference open for interested people to show and discuss the different approaches and move forward with a common interface. |
That is a good point. I've written DimensionalData.jl with this in mind, so that you don't really need to inherit from AbstractDimensionalArray, you just need A teleconference is a good idea, @juliohm and a few others in the was also interested in that too over at the VerySimpleRaster thread. I'm in Australian Eastern standard time but 7am-2am my time is a tolerable window most days. Edit: I'm also using disk based backends so avoiding these default access methods will be really important, but I haven't really though it through yet or hit any practical problems. But its relevent to the choice made in DD to rely on dimensions for functionality and dispatch when possible instead of the array types, something we are simultaneously discussing in this thread: |
Thank you @rafaqz for pinging. I am in BRT time, and can adjust my agenda accordingly after you decide a meeting hour. We definitely need to set a video call to address this challenge. It is too big to be discussed on GitHub threads only. |
Oh dear, time-wise I am in the middle (CET). Maybe we could try to do 8 a.m. BRT, which would correspond be 9 pm Australian Eastern and it would be around noon for Europeans. Are there any suggestions for a Date? I would be available on normal weekdays. In order to collect this information in a single place I have started a Google doc to collect availabilities and possible topics and might serve as an agenda in the end. https://docs.google.com/document/d/1ccaSltPDb5n-bLNgWyrotsWXz2n7ckMggOIPejy_GvQ/edit?usp=sharing Please feel free to edit as you like and share with interested people. Let me know if the google docs does not work for some reason (no google account and don't want one etc...) and suggest where to do this instead. |
This is great and it looks like your approach is indeed very composable. I will look into the AbstractGeoStack implementation, thanks for the pointer. |
Thanks a lot for your great initiative, Fabian! In my domain, grids are not necessarily aligned in the longitude and latitude directions (for example a satellite swath, https://docserver.gesdisc.eosdis.nasa.gov/public/project/Images/OMNO2_003.png or a model whose vertical grid depends on time). Should such raster data also be supported? In the most general case, if a dataset has e.g. 4 dimensions, the arrays representing longitude, latitude, depth (or elevation) and time can also have 4 dimensions. I think this could still be represented efficiently by having an special array type which is virtually is a 4d array but stores only a vector of the data and defines the getindex function appropriately, for example if the longitude varies only along the 1st dimension, one could have: Base.getindex(x::Virtual4DArray,i,j,k,n) = x.small_vector[i] Similarly, the value could just be computed on the fly for specific projection using the corresponding formulas (e.g. http://mathworld.wolfram.com/MercatorProjection.html) so that the storage would actually just be a handful of parameters. A georeferenced dataset would just be a collection including the actual data and arrays representing lon, lat, elevation and time (or a subset of these). Some data sets might have even more dimensions (e.g. frequency for hyperspectral satellite data, ensemble member for Monte Carlo simulation). A possible interface could be: value = georeferenced_dataset.data[2,3,4] # value at the indices 2,3,4
lon = georeferenced_dataset.lon[2,3,4] # longitude of the value at the indices 2,3,4
lat = georeferenced_dataset.lat[2,3,4] # latitude of the value at the indices 2,3,4
frequency = georeferenced_dataset.frequency[2,3,4] # implemented using getproperty Simply indexing such a data set would return a georeferenced subset. slice_of_georeferenced_dataset = georeferenced_dataset[:,3,4] Multiple dispatch in julia allows us to implement optimized function, for some common special cases, for example extract all data withing a geographic bounding box if the dataset is nicely aligned along longitude and latitude (or in a given projection where the indices can be directly computed). I am also in CET time zone. |
I just came across the OceanGrids.jl package from @briochemc which looks quite interesting. The tests give an idea about the interface: |
@Alexander-Barth. I was planning to eventually build that kind of indexing into the DimensionalData.jl dimensions - the n dimensional arrays would be stored in the Lat/Lon/Vert/Time etc structs rather than as fields on the array. A trait would then control which dimensions needed multiple indices - Ie Lat/Lon might be matrices while Time could still a vector if it doesn't affect the lat/lon coordinates. It should be flexible to represent any configuration over any dimensions. GeoData.jl already does geo-referenced subsetting, you can easily extract an area between some lat/lon etc (using The NCDatasets implementation works for simple (axes as vectors) projections: |
@rafaqz |
You can already load matrices into dims instead of a vector, but there is no method for getting the lat at a particular index, so you would have to do something like But we can add a method that does that, like Edit: It seems a little more complicated than this now I've written it out. Would the dims be called X, Y when they are not exactly Lat/Lon aligned? or still Lat/Lon?. We may need to store a second list of dims for mapping to. I actually never use these formats so I'm not really across how they are used. |
@meggart 9pm is fine by me. Early next week (21st/22nd AEST is good), I will be travelling for a week or so after that but might be able to find time. |
Yeah, to give you an example of a NetCDF where this is the case:
So |
Thanks @visr for sharing this example. The names of the dimensions (
in |
Thanks that example really helps. Do either of you have some demo files you can link to for testing? I think GeoData.jl will break if you try to load something like that currently. But it shouldn't be too much work to handle it. I added a few points about formats and projections to the discussion agenda. |
Sure, here is an example from the ROMS model. The lon/lat for the variable e.g. |
While rereading you comment, are |
Thanks for that, I'll look at the DimensionalData.jl defines Package extenders can add more with |
Regarding the call time I would suggest next Monday Sep 23 at 11am UTC (which is 1pm CEST, 8am Sao Paulo, 9pm Melbourne). I hope this fits everyone. |
I support the proposed time for the meeting. How we can organize the agenda to optimize the time? |
It works for me too. I've just been editing the agenda doc a little with ideas. https://docs.google.com/document/d/1ccaSltPDb5n-bLNgWyrotsWXz2n7ckMggOIPejy_GvQ/edit?usp=sharing I'm also wondering if @evetion was aware of this seeing he has also written a raster package and has good ideas for using complex projections, and @mkborregaard as his name was misspelled above |
Unfortunately, I have a lecture to give Monday 1pm (introduction to Julia btw ;-) ) |
Given the miss-spelling and that @Alexander-Barth is not available, I would suggest to move to a later day, it would be important to give @evetion and @mkborregaard a chance to participate who might not have seen this thread yet. Do we need a doodle? Regarding optimizing the agenda, I am not sure. Would anyone be willing to moderater the meeting and try to keep stuff focused? Maybe @visr or @juliohm who are not directly involved in writing raster packages? |
I think it is a good idea to reschedule the meeting to include everyone. I
can also help moderate as suggested. My raster type is in GeoStatsBase.jl
called RegularGridData but definitely not general enough to handle the
features discussed here.
I also have some interesting requirements from a GeoStats perspective that
I would like to share.
…On Sun, Sep 22, 2019, 16:13 Fabian Gans ***@***.***> wrote:
Unfortunately, I have a lecture to give Monday 1pm (introduction to Julia
btw ;-) )
I'm also wondering if @evetion <https://github.com/evetion> was aware of
this seeing he has also written a raster package and has good ideas for
using complex projections, and @mkborregaard
<https://github.com/mkborregaard> as his name was misspelled above
Given the miss-spelling and that @Alexander-Barth
<https://github.com/Alexander-Barth> is not available, I would suggest to
move to a later day, it would be important to give @evetion
<https://github.com/evetion> and @mkborregaard
<https://github.com/mkborregaard> a chance to participate who might not
have seen this thread yet. Do we need a doodle?
Regarding optimizing the agenda, I am not sure. Would anyone be willing to
moderater the meeting and try to keep stuff focused? Maybe @visr
<https://github.com/visr> or @juliohm <https://github.com/juliohm> who
are not directly involved in writing raster packages?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6?email_source=notifications&email_token=AAZQW3JCUOZWFIPXQHKIQSLQK67TZA5CNFSM4IWM5CJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD7JM53Y#issuecomment-533909231>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAZQW3PEJDIFNGFOASRVDG3QK67TZANCNFSM4IWM5CJA>
.
|
During the last weeks I spent some time browsing through the different approaches of named arrays and dimensions and did some experiments and tinkering. As I mentioned before, the goal and focus of each package as well as the implementation details are quite different, while the concept of named dimensions that carry values is the recurrent theme in them. The result of this tinkering is a more or less concrete proposal for an interface that would let the different approaches talk to each other. I tried to quickly draft this in this gist. The basic idea is that, no matter if your grid is regular or not, wen want to be able to assign some coordinates to every entry in a multidimensional array. The gist defines a set of THTT traits for I think it would be nice to try to factor out a few additional array traits that might be important in our domain and try to export them somewhere. Here are a few suggestions:
I would be happy to work towards such a set of traits to have something more concrete to discuss during the upcoming teleconference. Shall we actually try to set a new date for this? |
Those are great proposals. I was just thinking about chunking today in relation to Dask/Dagger and how to generalise it over multiple source file types. ChunkedArrayBase seems like a good solution. The nature of grid cells is also a good point. Another trait marking the nature of the dimension is needed. But instead of traits of the array, I think Traits for load cost could be useful as well, and these could be on the array. Lets set another date for this soon. I can be free any weekday evenings next week to fit the same schedule agreed for the last attempt. |
Exactly. I would also try to follow classical naming for the types a la VTK. For example irregular spatial types can be further broke down into structured (when we can still index with i,j,k), which is the case handled here, and unstructured (general meshes for which the notion of up-down-left-right doesn't make sense).
This is one of the requirements I had in mind from a geostatistics perspective. Ideally we should have a trait system that can express the notion of volume of each element in the grid (rectangle elements): JuliaEarth/GeoStats.jl#40 This volume of the i-th element
I would like to generalize this to N-dimensional spaces. Just keep in mind that the majority of use cases is 2D in a map projection, with lat/lon values and so on, but these are not all use cases for a regular grid. In our field, modeling the subsurface is quite common, and we need 3D grids with properties of the rock.
For sure! I am in! We need to set a date and time. I am traveling over the weekend with limited access to the internet, but that same time proposal we attempted last time works for me weekdays. |
So in case of a rotated grid, the dimension would be e.g. Dim{:xi_rho}, right (for the example ocean_his.nc)? How would the user/library finds the longitude of a given grid point using the proposed interface? I do not want to push "rotated grid", if you think it is out-of-scope then it is fine for me. But if the interface should support those, then it is not so clear to me how they are handled. |
Yes that could be the dim name. I think the idea is we put the trait system in place and we can fill out later to match all of these different grid types, so talking about it now is good to make sure they can be covered, but they might not get implemented until someone needs them enough to do it. But once we have the grid trait, say it's
The That could happen without too much difficulty in DimensionalData once it's dispatching on the So the user maybe just do: A[Dim{:xi_rho}(At(whatever_it_needs)), Dim{:eta_rho}(At(xxx)), Time(At(DateTIme360Day(2015, 2, 2))] Or something. It might be more complicated than that, but we can probably use dispatch to resolve it. Edit: thinking about this more - if we limit an array to a maximum of one irregular grid type on N dims, with M regular grids on the remaining dims, which is probably realistic (is it?), we can do: A[Irreg(At(144.3), At(-37.5)), Time(At(DateTIme360Day(2015, 2, 2))] or order doesn't matter: A[Irreg(Lat(At(-37.5)), Lon(At(144.3))), Time(At(DateTIme360Day(2015, 2, 2))] |
Thanks @rafaqz this was exactly the plan. Try to flesh out which is generic enough to fulfill future use cases and simple enough to be integrated for existing schemes in a few lines of code. One question would be where an interface like this could live. Since it ideally is not only geo-specific, probably a new small package like Regarding a new telecon date, I am currently travelling, but would be available again from next week. I will create a doodle once I am back if nobody beats me to it. |
I've started implementing the types in DD for simplicity but I agree we should move them out afterwards - maybe DimensionTraits.jl instead? it will contain things that could be used by non-array data and it's a shorter name. Edit: One question is how to deal with the span of a grid cell. I.e. it might be a I've written it up like this:
In the branch of DD exploring grid traits: |
Hi guys, thanks for including me in this conversation and sorry for being late to the party. So to answer the original question, my goal for VerySimpleRasters.jl was to make a really light pure-julia package for doing most of the stuff I need to do with rasters. So, I'm not a geographer, I'm an ecologist, and 99% of all my uses of rasters involve accessing larger-than-memory rasters, extracting values at certain points, aggregating/resampling, cropping/masking, window operations, cost-distance-landscape calculations etc. Time series too. These are all things you can easily do with generic (e.g. image) operations on julia arrays. So the package reads and writes mmapped arrays and opens for operations on the matrix with a simple transformation from coordinates to getindex. It also has like zero deps. I thought of it as a bit of an antitheses to most gis software that has big binary dependencies in order to read and write many raster formats, and passes a lot of c pointers around. My stakes are 1) this functionality is crucial for transferring my workflow to julia, and 2) I'd prefer to use something better than VerySimpleRasters. So I'm really happy to merge/subsume/drop any VerySimpleRasters functionality, and also to contribute within the limits of me actually not knowing very much about existing geospatial data types. I'm really enthusiastic about the developments here - happy to join a call if I can be useful. |
For Fabian's interface, I think it would be necessary to add a function like For what it is worth, here is an example of an API which uses the arbitrary grid as a general case: for e.g. a 3D data array, the coordinates are all 3D arrays too, but it uses a special type https://gist.github.com/Alexander-Barth/41f70c292ec9cdac2543e6c6a01ef612 The advantage is that dimension names (which are not standardized) are not exposed to the user and one can more easily write generic code which work for different datasets. Optimization for the special and common cases (aligned coordinates) can be handled using julia's dispatch (but I have not tested this yet). EDIT: The disadvantage is that users are nudged/forced to think about the arbitrary case, even is the dataset that they are using has a quite simple structure. |
For arbitrary grids, don't we just need 1) an array, and 2) a transformation function of coordinates onto array indices? Like GeoRasters has? |
Yes, it boils down to a transformation function, but which is (in the context of netcdf files) usually defined as 2D arrays of longitude and latitude (representing the coordinates of every grid point). I am not familiar with GeoRasters to say if such cases can be handled. |
@Alexander-Barth was just thinking this too. In DD the Then we could just index with Edit: If it wasn't clear, the array That would be general in terms of dimension number, order and names for both the array and transformation. |
The pull request implementing something like what @meggart proposed is here: https://github.com/rafaqz/DimensionalData.jl/pull/11/files It's still a while off being finished but it seems fairly doable. |
Dear all, Did we set a date for our e-meeting? I am looking forward to brainstorming the issues we raised. Best, |
Maybe set up a doodle poll if you have it? or I think zvite is free and similar. I'm keen to talk about GeoStats integration, its the next step in removing R from our workflow. |
Thank you @rafaqz , I've created a zvite poll for the first time: https://zvite.co/i/PVySJie_ Please let me know if something is off. Let's try it next week. I cannot do Monday but I marked all weekdays' mornings as available starting from Tuesday. P.S.: I'd be happy to help with the GeoStats.jl integration ❤️ |
@Alexander-Barth and @meggart transformed indices work now on the grid branch of DimensionalData.jl. https://github.com/rafaqz/DimensionalData.jl/blob/grid/test/selector.jl#L51-L84 There will be some work automating the generation of the transform functions from various file formats. I'm not sure how the netcdf format handles this, but it's fairly simple in for tif etc GDAL and the GeoArrays packages has some good examples. Any updates on meeting times? have people filled out the https://zvite.co/i/PVySJie_ ? |
Yeah the zvite page won't load for me |
Anyone else having issues with the zvite page? It's my first time using the
service...
…On Mon, Oct 21, 2019, 11:24 Michael Krabbe Borregaard < ***@***.***> wrote:
Yeah the zvite page won't load for me
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6?email_source=notifications&email_token=AAZQW3OAYXBDXHT5BVMTM3DQPW3SVA5CNFSM4IWM5CJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEB2QB2Q#issuecomment-544538858>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAZQW3OK5JGDP7P5VD73GZDQPW3SVANCNFSM4IWM5CJA>
.
|
The link happens to end in an underscore. Github by default does not include that underscore in the link. That's why I edited your post @juliohm, to make an explicit |
Thanks @visr the link works for me. |
Hi all, sorry for being late to the party. In the past I have tried to extend Most of the common issues and design seem to have been mentioned/discussed at least once. I'll add one, in my tryouts to implement the indexing/slicing with coordinates in To make this a bit clearer, imagine indexing into a Irregular
However, imagine
which can't be sliced easily, only masked. How would we handle such cases?
GeoRasters has become GeoArrays, because it exports the GeoArray type. That name is now also used by GeoData but with a different definition. Regarding the planned meeting, is there an overview of posssible dates/times? The zvite link only enables new input. |
Yeah that is difficult to deal with. Masking and slicing to the bounding box could also work, especially for loading from disk as we would only grab the bounding box. Currently only single indexes are working in DD, but it just needs method for doing the masking and it will work. I was imagining applying the mask to the array using the And yes it would be good if we could see whats happening with zvite and decide a day |
So far everyone can do Friday 8am BRT according to zvite: Should we set the date/time? @rafaqz would you like to host the meeting? I can help moderating as you suggested. |
Ok that works for me |
I will try to join too for Friday (11:00 UTC; 13:00 CET) |
Let's set a online room for the meeting. I think the time is set already. @rafaqz do you have suggestions on how we should do it? I will review our agenda in the Google Docs today or tomorrow at latest. |
I think someone already set up some kind of room linked in the docs. |
I set up the room some weeks ago because no alternative was suggested. I will activate it approx 1 hour prior to the meeting. |
Just a reminder of our meeting in an hour from now. Looking forward to meeting you guys. |
Made https://github.com/JuliaGeo/DimensionalArrayTraits.jl as suggested in the meeting. |
Awesome work! Just found out this thread. An unified interface will open up a lot of possibilities. Having out-of-core arrays capabilities baked-in will certainly be a massive step for Geospatial analysis. |
Hello all,
it looks like there are currently several packages trying to implement some xarray or R-raster like functionality in Julia and I think I don't have a complete overview of what is actually out there. The one's I know about are:
I think every one of these approaches has its own focus, but it would be nice if we could either merge/re-use some functionality or try to use a common interface so that e.g. plotting applications can work on different representations of gridded data.
Maybe package authors could comment here on where they see their package fit into the ecosystem.
The text was updated successfully, but these errors were encountered: