Connecting with numpytiles #37

Open
jhamman opened this issue Nov 10, 2020 · 31 comments

@jhamman
Member

jhamman commented Nov 10, 2020

The folks @planetlabs (@cholmes, @theduckylittle, and @qscripter) have just released a new project called numpytiles: https://github.com/planetlabs/numpytiles-spec. It's a cool little project designed to make multidimensional arrays usable in the browser. It also has some obvious overlap with Zarr that may be worth exploring.


Pinging @rabernat, @kylebarron, and @freeman-lab who have all engaged in a bit of prior conversation about this.

@kylebarron

Specifically see https://twitter.com/rabernat/status/1319723780076376064 and its descendants for a small bit of twitter discussion.

@joshmoore
Member

Big 👍 for rescuing these conversations from twitter.

@rabernat

rabernat commented Nov 10, 2020

I regret coming on too strong on twitter, arguing that numpytiles should be "deprecated". Twitter is not the best platform for subtle discussion. 😞

Really what I meant was, I crave interoperability between all the cool tools that are part of this numpytiles ecosystem and zarr cloud-based data. 😁

Let me try to explain what I mean. IIUC, numpytiles is a spec that replaces, say .png or .jpg as a data format within a tiled map. If you limit your view to a single image / tile, and you consider the format to be completely self-contained to that file, then indeed there is no connection to zarr. However, the existence of a tile server implies that there will be many such files, organized in a hierarchy like this:

http://tileserver.com/{z}/{x}/{y}.npy

For a particular z value (zoom level), tiles are accessible at paths such as

http://tileserver.com/{z}/2/3.npy

At this point, it is looking like a 2D zarr array. A 2D Zarr array served over HTTP has chunks like

http://zarrserver.com/myarray/2.3

(I have transposed x and y following standard python C ordering of dimensions, but this is trivial.)
In zarr-developers/zarr-python#395, and in the V3 spec, we have discussed the need for a nested cloud store, which would allow the paths to become

http://zarrserver.com/myarray/2/3

We could imagine making a separate array for each zoom level and naming these arrays with numbers, such that myarray becomes 0. At that point, we have fully reproduced the layout of a tiled map server within a zarr group.
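
To make that layout concrete, a hypothetical listing of such a group might look like this (the array names and the nested chunk paths are purely illustrative):

http://zarrserver.com/.zgroup
http://zarrserver.com/0/.zarray    <- metadata for the zoom-0 array
http://zarrserver.com/0/0/0        <- its single chunk
http://zarrserver.com/1/.zarray    <- metadata for the zoom-1 array
http://zarrserver.com/1/2/3        <- the chunk at row 2, column 3 of zoom level 1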

Now what about the files themselves? Numpy tiles (e.g. 3.npy) are a limited subset of the numpy file spec. Specifically, they are regular 2D or 3D arrays in C order with a limited range of datatypes. Such data are guaranteed to be representable as a zarr chunk!

The big difference between an npy file (and, by extension, a numpytile) and a zarr chunk is that, with npy, the metadata for decoding the array contents lives inside the file, while with zarr, it lives in an external JSON document, the .zarray file. According to the zarr spec, this file must be available at the root of the array:

http://zarrserver.com/myarray/.zarray

An example .zarray for a tiled map server might look like

{
    "chunks": [
        3,
        256,
        256
    ],
    "compressor": {},
    "dtype": "<f8",
    "fill_value": "NaN",
    "filters": [],
    "order": "C",
    "shape": [
        3,
        4092,
        4092
    ],
    "zarr_format": 2
}

The only real difference on the client side is that the client application only has to fetch this metadata once for the entire array. When it fetches individual chunks (e.g. 3/2), it gets a pure binary blob, which it decodes according to the metadata from .zarray. It doesn't matter whether the zarr array is materialized on disk or generated lazily, on the fly, and served via an API (e.g. xpublish).

I disagree with the claim I heard on twitter that NPY files are somehow more standard or widely used than zarr chunks. The reason is that, in this context (no compressors or filters), the zarr chunk is just a flat binary file, the simplest and most universal way of transmitting scientific data. To decode it you need to know its shape, its dtype, and its ordering (in this case, fixed as C order).
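
As a minimal sketch of what that means for a client (assuming the example server above and an uncompressed chunk), one .zarray fetch provides everything needed to decode any chunk with plain numpy:

# Minimal sketch: fetch .zarray once, then decode raw chunks with numpy.
# The URLs and the chunk key are the illustrative ones from the example above.
import json
import urllib.request
import numpy as np

base = "http://zarrserver.com/myarray"
meta = json.loads(urllib.request.urlopen(f"{base}/.zarray").read())

raw = urllib.request.urlopen(f"{base}/2.3").read()          # one chunk: a flat binary blob
chunk = np.frombuffer(raw, dtype=meta["dtype"])             # dtype like "<f8"
chunk = chunk.reshape(meta["chunks"], order=meta["order"])  # chunk shape, C order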


Why does this matter? Because we would like Zarr to become a leading standard for cloud-based scientific data, with a wide ecosystem of cloud-native tools. There are already some efforts at building js-based visualizers for zarr data, mostly in the biomedical realm. For example, the amazing VIV project by @manzt (github, demo)

Viv is a JavaScript library providing utilities for rendering primary imaging data. Viv supports WebGL-based multi-channel rendering of both pyramidal and non-pyramidal images. The rendering components of Viv are provided as Deck.gl layers, making it easy to compose images with existing layers and efficiently update rendering properties within a reactive paradigm.

Try the demo--is this not essentially the same thing a tiled map app is trying to do?!? The coordinates and context differ, but the fundamentals are the same.

Imagine the sort of amazing stuff we could build if we were able to harness the aggregate creativity and technical chops of the geospatial and bioimaging worlds! (@freeman-lab is perhaps uniquely poised to comment on this.)

So let's work towards aligning! It won't happen tomorrow, but within one year, could the tiled map ecosystem develop the capability to work natively with appropriately formatted Zarr data? That would be an excellent goal.

@davidbrochart

I'm very interested in zarr supporting tiles, but I'd also like not to duplicate data. In zarr-developers/zarr-specs#80 I'm imagining an API where:

arr[z, y, x]  # for a 2D array

would fetch the data for a zoom z, followed by the usual indices (this is generalizable to ND arrays, not just 2D). For z=0, there would be only one chunk for the whole data. For z=1 there would be 4 (2**N for an N-dimensional array), and the chunks would be encoded in such a way that in order to get the data for z=1 you would need to add up all the chunks for the previous zoom levels (z=0 and z=1). That way, you use less bandwidth for a low zoom level (lower resolution), and more bandwidth for a high zoom level (higher resolution).
But I don't think this can be implemented by providing a custom store to zarr, because of the chunk sum operation. It should be easy to do by wrapping the zarr array object, but then I'm not sure it will be usable by e.g. xarray.
Any interest in this?

@rabernat

@davidbrochart IIUC it sounds like you are talking about a way to generate pyramidal tiles on the fly. (Pyramids have come up elsewhere, e.g. zarr-developers/zarr-specs#50, zarr-developers/zarr-python#520). While this sounds very useful, I'd like to suggest that we pursue that in a separate issue.

This issue is about possible alignment between zarr and the numpytiles specification. From the point of view of the spec, it does not matter how the data are generated--on the fly, materialized to disk, etc. I'd prefer not to confuse these points.

@davidbrochart

Sure, sorry for the noise 😄

@vincentsarago

👋 So just to be clear, the Planet project is a specification trying to normalize a transmission format, from the raw data to the web client.

We use the NumpyTile format to create web map tiles dynamically from COG (or whatever cloud-friendly format GDAL can read), and then, instead of applying rescaling/colormap to the data to create a visual image (PNG), we return the data in NumpyTiles (float, complex...) so the web client can work with the raw values (right now @kylebarron is working on a deckGL plugin).

Let me try to explain what I mean. IIUC, numpytiles is a spec that replaces, say .png or .jpg as a data format within a tiled map. If you limit your view to a single image / tile, and you consider the format to be completely self-contained to that file, then indeed there is no connection to zarr.

👌

However, the existence of a tile server implies that there will be many such files, organized in a hierarchy like this

In our case, we create the tiles dynamically from a big file, so we do not store the .npy file

@rabernat

rabernat commented Nov 10, 2020

In our case, we create the tiles dynamically from a big file, so we do not store the .npy file

Understood. You do not store any npy data on disk. As I hope I made clear in my comment, Zarr over http likewise does not have to be materialized on disk. The Zarr data can be served via an API, with the chunks generated on demand. Xpublish is an example of this.

I see the question of on-disk vs dynamically generated as orthogonal to the questions about formats and specifications. Am I missing something?

@manzt
Member

manzt commented Nov 10, 2020

The Zarr data can be served via an API, with the chunks generated on demand. Xpublish is an example of this.

Yes, instead of converting each tile to a .npy blob in memory (which contains the metadata for the chunk), you can just create the .zarray metadata (JSON) once for your large nd-array and return contiguous buffers directly to the client.

# overly-simplified pseudo-code
def handle_request(key):
    if ".zarray" in key:
        return array_json  # JSON metadata describing the entire array, created once

    x, y = key.split('.')  # here just a simple, large 2D array
    arr = read_region(x, y)  # custom function to read a region from the image as an np array

    if compressor:
        return compressor.encode(arr.tobytes())  # can dynamically compress the data as well!

    return arr.tobytes()  # just return the uncompressed buffer directly!

EDIT: Similar to xpublish, I've experimented with a library called simple-zarr-server that converts any zarr-python store into an HTTP server (https://github.com/manzt/simple-zarr-server). If you were to implement your COG reader as a zarr-python store, you could view your data in napari with dask or via a web-client with something like simple-zarr-server (or xpublish) to forward tile requests:

store = COGStore(file, tilesize=512, compressor=Blosc()) # can configure custom tilesize / compressor for tiles

# in python 
z = zarr.open(store)
z[:] # use numpy syntax to return chunks

# as a webserver
from simple_zarr_server import serve
serve(z) # HTTP endpoint for zarr on localhost:8000

Colab example using a custom store (for openslide-tiff image) to serve image over HTTP as zarr.

@kylebarron

First, I see the reason for "Numpy Tile"'s existence as the need for a simple N-D array format that's very easy to parse in JS. Image formats like jpg/png don't support arbitrary data types, and sending GeoTIFFs to the browser directly requires bringing in a new complex dependency (geotiff.js), which is 135kb minified.

Most tiled datasets require some sort of external metadata file. Vector and raster datasets in the Mapbox ecosystem often use TileJSON files, which at a minimum describe the valid zoom levels for the dataset and the URL paths for each tile. The Numpy Tile spec doesn't touch this broader metadata at all.

It seems like it would be possible to use Zarr for many of these tiled cases where you need to fetch some sort of metadata file anyways.

Here are some discussion points in my view for the use case of tiled map data:

  • In @rabernat's comment above, he said that each zoom level would be considered an independent Zarr array. The downside of this is that a new metadata file would need to be fetched at every zoom level. For this use case, fetching one Zarr metadata file that describes all zoom levels would be preferable.
  • Is there a specification for geospatial referencing in zarr? I noticed that the mur-sst .zattrs file has keys for {easternmost,westernmost}_longitude, {northernmost,southernmost}_latitude, geospatial_{lon,lat}_resolution and spatial_resolution, but are these standard? I'd guess that this dataset is implicitly in WGS84/EPSG:4326, but there's no metadata key that defines the datum. How would a Zarr dataset in another CRS be described?
  • XYZ indexing: Most web maps use the EPSG:3857/web mercator projection, and make their requests to the server using the OSM/XYZ indexing system. It seems like Zarr blocks always need to start from 0, but if the data's bounds aren't the entire globe, the server would need to convert between XYZ indexing and indexing within the Zarr dataset (see the sketch after this list).
  • I looked into Zarr a bit when I was poking around with the mur-sst dataset on AWS. I was turned away when I saw that that dataset's block size is somewhat large, up to 100MB or so I think, and too large to request directly from the browser. I didn't consider before the potential for a "virtual" Zarr dataset where the chunks are generated on demand.
  • Compression: if compression is used, then you need to bring in a new dependency, though @manzt's numcodecs.js looks promising and supports Blosc. If uncompressed, I'd guess that you could just call new TypedArrayType(arrayBuffer) in JS and get the C-ordered array?
  • In terms of geospatial referencing, related work might be the recent OGC TileMatrixSet specification?
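
On the XYZ-indexing point above, the translation is essentially an offset once the dataset's origin tile is known; a hypothetical sketch (none of these names come from any spec):

# Hypothetical sketch: translate an OSM/XYZ tile request into a zarr chunk key,
# assuming one zarr array per zoom level whose chunk (0, 0) corresponds to the
# dataset's north-west corner tile (origin_x, origin_y) at that zoom level.
def xyz_to_chunk_key(x, y, z, origin_x, origin_y):
    row = y - origin_y  # chunk row within the zarr array
    col = x - origin_x  # chunk column within the zarr array
    if row < 0 or col < 0:
        raise KeyError(f"tile {z}/{x}/{y} is outside the dataset bounds")
    return f"{row}.{col}"  # zarr v2 flat chunk key, e.g. "2.3"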

@cholmes

cholmes commented Nov 11, 2020

Thanks for pushing us towards interoperability @rabernat! I'm definitely all for aligning standards as much as is reasonably possible.

I think I'm still wrapping my head around what interoperability here would look like, to understand if it truly makes sense and what that path would actually be. And if there are real wins from having them be the same format.

My mental model had been that zarr + COG are cloud-native formats, and though they can be accessed directly from javascript, there are cases where it makes sense to have a tile server running close by (lambda- or k8s-based for scalability) to help get data to the browser. And indeed, on learning that most zarr files had large chunks (as Kyle mentioned), that reinforced my view more - store in zarr, optimized for direct scientific use cases on the cloud, but then run a tile server to break it up into the chunks our browser-based mapping libraries already understand.

So the thing I'm getting my head around is the same thing Kyle mentioned - this idea of a 'virtual' zarr dataset generating chunks on demand. Is that a common thing today? And then also trying to understand using that for the established 'web map tiles' use case. I guess it could be cool if zarr parsers could talk directly to any geospatial tile server? But does it make sense? If we wanted to go down this path, I think it'd involve trying to add an extension to the new ogc api - tiles that would generate a .zarray file at the base of every tileset. If a client wanted to support the zarr-tiles format then it would know to go there (or I suppose we could include that same metadata in the /tiles resource - but we'd want the .zarray file to exist for zarr clients that don't want to worry about tiles). And then we'd have to work through all the points Kyle raises.

It still feels a bit 'weird' to me, like we're maybe trying too hard to combine two things that have pretty similar goals and made pretty similar choices. And I'm just not sure what the exact 'win' is. Cloud-native formats to me are all about putting the compute close to the data. Supporting cross-cloud interoperability when your compute is not next to the data is a nice side-effect of interoperability. But sending data as efficiently as possible to the browser on a desktop or mobile device feels like a 'different' problem.

The lower hanging fruit interoperability that I'm surprised doesn't exist (or maybe I just didn't find it) is a GDAL driver for Zarr. That would make much more sense to me as a place to start, so that the general GIS toolchain can read zarr directly. And then we don't need to go through tile services for that interoperability. Like it feels like there's a baseline level of geo tools with zarr data that we should tackle first.

That said, I'm all in support of trying to evolve these things to come together, and to have a tiles format that is based on zarr chunks. From the Planet perspective, I think we'll complete this 'release' of the spec and open source library. But we don't mean for it to be a 'big' thing, just a small piece that is compatible with XYZ / WMTS tiles, as well as the coming ogc api - tiles. We see COGs as the 'big thing' that we support (and remain interested in Zarr); numpy tiles are really a minor thing to work with our tile services.

I could see this evolving into a zarr-tiles standard, a sort of 'numpy tiles 2.0', once we sort out all the details of how it actually fits into tile services.

@rabernat

Thanks so much to @manzt, @vincentsarago, @kylebarron, and @cholmes for weighing in on this. I have so much admiration and respect for the work you all are doing...it's a real honor to have this discussion!

The most important takeaway from what I have read above seems to be the argument that "numpytiles is a relatively minor thing" and therefore probably not worth a lot of effort to align. We remain very interested in browser-based visualization, so my concern was that a big ecosystem would emerge around numpytiles that would not interoperate with zarr. Given the limited scope and ambitions for numpytiles, I see now that this is probably not a big concern.

That said, let me respond to some specific points.

  • each zoom level would be considered an independent Zarr array. The downside of this is that a new metadata file would need to be fetched at every zoom level

That is true for the current zarr spec (v2.5). Going forward, the v3 spec will likely have an extension that supports image pyramids (see zarr-developers/zarr-specs#50). This is a major need for the bioimaging community as well.

  • Is there a specification for geospatial referencing in zarr?

No. Zarr is a generic container for chunked, compressed arrays with metadata (similar to HDF5). What we need is a community standard or convention, on top of the base Zarr spec, for encoding geospatial data. In Pangeo, we encode our data following NetCDF / CF conventions, which have a convention for CRS. From Zarr's perspective, this is all considered "user metadata". OGC is currently reviewing Zarr as a community standard. By the same token, is there a specification for geospatial referencing in NPY files? No, of course not. It's a low level container for array data. So that seems like a slightly inconsistent comparison.

  • It seems like Zarr blocks always need to start from 0, but if the data's bounds aren't the entire globe, the server would need to convert between XYZ indexing and indexing within the Zarr dataset.

Yes this sounds like a significant challenge. The Zarr spec supports the concept of missing chunks, which are defined as filled by a constant fill_value. Maybe this could help mitigate the problem?

  • was turned away when I saw that that dataset's block size is somewhat large, up to 100MB or so I think, and too large to request directly from the browser. I didn't consider before the potential for a "virtual" Zarr dataset where the chunks are generated on demand.

As we discussed on Twitter, Zarr has absolutely no inherent preference about chunks. They can be as small or as large as the application demands. Those particular chunks were chosen with backend bulk processing in mind, not interactive visualization. On-the-fly, lambda-based rechunking of a zarr dataset would be a really cool application to develop!

  • Compression: if compression is used, then you need to bring in a new dependency,

Compression is optional. It improves speed and storage / bandwidth utilization at the cost of complexity. Most existing zarr datasets use compression, but it can be turned off easily.
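
For reference, a small sketch with the zarr-python v2 API (the shape and chunking here just mirror the earlier .zarray example):

# Sketch: create a zarr array with compression disabled, so every stored chunk
# is a flat, C-ordered binary blob (zarr-python v2 API).
import zarr

z = zarr.open(
    "example.zarr",
    mode="w",
    shape=(3, 4092, 4092),
    chunks=(3, 256, 256),
    dtype="<f8",
    compressor=None,  # no compressor: chunks are written as raw bytes
)
z[0, :256, :256] = 1.0  # touches a single chunk; the rest stays at the fill value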

So the thing I'm getting my head around is the same thing Kyle mentioned - this idea of a 'virtual' zarr dataset generating chunks on demand. Is that a common thing today?

This is not particularly common, although it definitely does work! We saw two example libraries that do this (simple-zarr-server and xpublish) in the comment above. The client doesn't need to know or care about whether the chunks are on disk or generated dynamically. I think there is a lot of potential here for building APIs. But of course none of this is vetted or approved by any standardization body.

It still feels a bit 'weird' to me, like we're maybe trying too hard to combine two things that have pretty similar goals and made pretty similar choices. And I'm just not sure what the exact 'win' is.

This is a convincing point. I guess I tried to explain that in my earlier post. The "win" would be that tools for interactive browser-based visualization of multiscale chunked array data could share a common file format and set of javascript libraries. However, that idea neglects the fact that there is already an extensive ecosystem of tileset viewers, which has no real incentive to refactor its operation around this goal.

The lower hanging fruit interoperability that I'm surprised doesn't exist (or maybe I just didn't find it) is a GDAL driver for Zarr. That would make much more sense to me as a place to start, so that the general GIS toolchain can read zarr directly.

Yes, this is a great idea. I would personally have no idea where to start though. I guess we would need a motivated person who wants to use Zarr data from GDAL land.

@kylebarron

By the same token, is there a specification for geospatial referencing in NPY files? No, of course not. It's a low level container for array data. So that seems like a slightly inconsistent comparison

My argument is that Numpy Tiles' metadata describe a single tile whereas Zarr metadata describes an entire dataset. Numpy Tiles don't need geospatial referencing because that's assumed to be external (just as it would be for PNG/JPG/MVT), but since Zarr describes the whole dataset, it needs to include geospatial referencing.

  • It seems like Zarr blocks always need to start from 0, but if the data's bounds aren't the entire globe, the server would need to convert between XYZ indexing and indexing within the Zarr dataset.

Yes this sounds like a significant challenge. The Zarr spec supports the concept of missing chunks, which are defined as filled by a constant fill_value. Maybe this could help mitigate the problem?

I don't think it's necessarily a challenge as long as you know the geospatial referencing of the dataset. For example, the TileMatrixSet spec is very similar in that its internal indexing starts from (0, 0), but since you know the real-world position of that point, it's not hard to change coordinate systems.

Regarding chunksize, compression: I think my point here is the difference between Zarr as a storage format and Zarr as an API. Zarr as a storage format needs to make these decisions for backend processing; the API server could rechunk to the client's preferred dimensions.
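
To make that split concrete, a rough sketch (all names hypothetical) of an API layer serving small tiles out of a store with much larger chunks:

# Hypothetical sketch: serve small, browser-friendly tiles out of a zarr store
# whose on-disk chunks are large and optimized for backend processing.
import zarr

source = zarr.open("big-chunks.zarr", mode="r")  # e.g. ~100 MB chunks on disk
TILE = 256

def get_tile(row, col):
    y0, x0 = row * TILE, col * TILE
    tile = source[y0:y0 + TILE, x0:x0 + TILE]  # zarr only reads the chunks it needs
    return tile.tobytes()  # raw C-ordered buffer for the client to decode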

@jhamman
Member Author

jhamman commented Nov 12, 2020

Hi all! Thanks for the super informative discussion here. (My apologies for getting the party started before going dark for a few days)

I have a few questions after reading through all of this.

  1. We have learned that Numpytiles are more analogous to Zarr chunks than Zarr arrays, and that the primary difference in the specification of the two objects is whether the array metadata is stored internally (.npy) or externally (.zarray). When uncompressed chunks are created by zarr, they are openable via numpy directly:

    arr = np.fromfile('./foo/0/0/0', dtype=dtype).reshape(shape)

    Just as you could with a numpytile:

    arr = np.load('./foo/0/0/0.npy')

    My question for numpytile folks is: how often is the metadata different from one tile to the next (e.g. 0/0/0 -> 0/0/1)? My impression is that in most cases, they are going to be the same array shape/dtype/etc and that the reading and parsing of the second tile's metadata is duplicative.

  2. We are super interested in exploring the many opportunities of zarr in the browser. I know @kylebarron has looked into zarr a bit but seems to have found existing resources (data and/or software) to be lacking or not properly oriented to his application. So what are we missing to properly explore this space? Would it help to have a public zarr dataset with smaller uncompressed chunks (256x256)?

Again, thanks all for the thoughtful and productive discussion!

@kylebarron

  1. how often is the metadata different from one tile to the next (e.g. 0/0/0 -> 0/0/1)?

Agreed, very rarely/never. And the Numpy format includes only enough metadata to read the data back into the original array, but no metadata to describe what the data is. Aside from shape and geospatial referencing, you have no way to know which bands correspond to which wavelength, and if one of the bands is a data mask.

  2. found existing resources (data and/or software) to be lacking

I was interested in direct reading of large datasets straight from the browser. Due to chunksize issues, that wasn't possible with the one dataset I looked at, but zarr.js looks to be perfectly adequate. A demo dataset with small chunks (256px) might be nice; I don't think it's important for it to be uncompressed... numcodecs.js looks to support the normal gzip/blosc.

@manzt
Member

manzt commented Nov 12, 2020

I know @kylebarron has looked into zarr a bit but seems to have found existing resources (data and/or software) to be lacking or not properly oriented to his application.

Agreed. Zarr.js aims to be a feature complete implementation of Zarr (including indexing and slicing of the full nd-array), but for web-based visualization applications, you really just want to load a tile/chunk by key (e.g. y.x)

The equivalent of arr = np.fromfile('./foo/0/0/0', dtype=dtype).reshape(shape)

In that case, Zarr.js is certainly adequate, but it's a large, bulky dependency whose internals can't be totally tree-shaken. This means that there is a high "cost" to bringing on Zarr.js as a dependency for a project, which web-developers are especially opposed to.

The conversation here (and experience with our own applications) led me to think about the minimal amount of JavaScript required to load a Zarr array "chunk" by key. Right now I'm experimenting with zarr-lite. It's a single (dependency-less) JS file (~2kb minified / ~1kb gzipped), and it has an identical API to Zarr.js for loading a "chunk", but that's it:

import { openArray } from 'zarr-lite';
import HTTPStore from 'zarr-lite/httpStore';

// import { openArray } from 'zarr';
// import { HTTPStore } from 'zarr'; // could use any store from Zarr.js!

(async () => {
  const store = new HTTPStore('http://localhost:8080/data.zarr');
  const z = await openArray({ store });
  console.log(z.dtype, z.shape, z.chunks);
  // "<i4", [10, 1000, 1000], [5, 500, 500]

  const chunk = await z.getRawChunk('0.0.0'); // get chunk by key; can also use [0, 0, 0];
  console.log(chunk);
  // {
  //   data: Int32Array(1250000),
  //   shape: [5, 500, 500],
  //   strides: [250000, 500, 1],
  // }
})();

Please see this interactive example for more info: https://observablehq.com/@manzt/using-zarr-lite

NOTE: Codecs (via numcodecs.js) are loaded dynamically at runtime only if needed. The default is to pull modules from a CDN, but this behavior can be overridden by updating the registry with user-provided functions to import codecs.

@davidbrochart

So I guess zarr-lite implements what we call the store API in Zarr v3, and especially the readable store API.

@manzt
Member

manzt commented Nov 12, 2020

So I guess zarr-lite implements what we call the store API in Zarr v3, and especially the readable store API.

Perhaps in part? Reading what you sent, I don't see where the store is responsible for 1.) decoding a binary blob (if compression is used) and 2.) creating a view of a decoded buffer on get.

@davidbrochart

You're right, the store is not responsible for compression/decompression, it only deals with raw bytes. And no view either. So yes, zarr-lite is more than the store API.

@lbrindze

A while ago I started experimenting with cutting tiles on the fly on top of a redis-backed zarr store. The resulting efforts ended up in this project (https://github.com/lbrindze/angle_grinder), which is half baked to say the least. Given what is being discussed here, I think it would be fairly trivial to repurpose the on-the-fly drawing of PNGs to just dynamically cutting a zarr chunk + custom .zattrs that would make up the tile. The mercantile lib already does all the tile math for you, which is awesome!

I would be more than happy to adapt this into a tile endpoint feature for Xpublish if there is interest in that.

At least in the consumer weather application space, a lot of folks are moving away from png tiles and just sending off numerical data and letting the client draw it. This most often happens independent of data storage format. The 2 most common ways I've seen of doing this, based on a small survey I did for an internal company presentation a couple months ago, was to shove the tile into protobuf or json frames and send the data that way. Each one I found was a completely different implementation with no standard interoperability of data tiles between these products.

@rabernat

Thanks so much for sharing @lbrindze!

At least in the consumer weather application space, a lot of folks are moving away from png tiles and just sending off numerical data and letting the client draw it. This most often happens independent of data storage format. The 2 most common ways I've seen of doing this, based on a small survey I did for an internal company presentation a couple months ago, was to shove the tile into protobuf or json frames and send the data that way. Each one I found was a completely different implementation with no standard interoperability of data tiles between these products.

👆 THIS is why we need to be having this discussion. There is a broader opportunity here to develop standards for browser-based viz of numerical array data.

@jhamman
Member Author

jhamman commented Nov 12, 2020

To respond to @kylebarron's last comments first...

Agreed, very rarely/never. And the Numpy format includes only enough metadata to read the data back into the original array, but no metadata to describe what the data is. Aside from shape and geospatial referencing, you have no way to know which bands correspond to which wavelength, and if one of the bands is a data mask.

Perhaps this is where zarr has something to offer. The metadata associated with a collection of chunks is shared in the .zarray and/or .zattrs keys. The various Zarr APIs and libraries built on top of them are then responsible for interpreting the array metadata.

I've put a smallish sample zarr dataset in an open-access Azure cloud bucket.

  • root: https://carbonplan.blob.core.windows.net/carbonplan-share/zarr-demo
  • nftd array: https://carbonplan.blob.core.windows.net/carbonplan-share/zarr-demo/nftd/.zarray
  • chunks are 1x256x256 and there is no compression turned on

I've also uploaded the notebook I used to make this array, which includes a path to a COG version of the same dataset: https://gist.github.com/jhamman/2a95102567025b3b69e2873a89aaed22
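
For anyone who wants to poke at the demo array from Python, a quick sketch (assumes fsspec with its HTTP filesystem installed):

# Sketch: open the demo array above over plain HTTP with zarr-python + fsspec.
import fsspec
import zarr

url = "https://carbonplan.blob.core.windows.net/carbonplan-share/zarr-demo/nftd"
store = fsspec.get_mapper(url)
z = zarr.open(store, mode="r")
print(z.shape, z.chunks, z.dtype)  # chunks should come back as (1, 256, 256), uncompressed
tile = z[0, :256, :256]  # reads exactly one chunk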


Now to @lbrindze's comments. I'd be super excited to see the development of a tile endpoint to Xpublish. @benbovy has recently refactored the plugin architecture for new API endpoints and we'd be happy to give you any pointers needed. Check out his twitter thread from just today.

@manzt
Member

manzt commented Nov 12, 2020

I've put a smallish sample zarr dataset in an open-access Azure cloud bucket.

Hmm I'm getting errors with the public access datasets. Are they available via HTTP?

$ curl -L https://carbonplan.blob.core.windows.net/carbonplan-share/zarr-demo/nftd/.zarray
# <?xml version="1.0" encoding="utf-8"?><Error><Code>ResourceNotFound</Code><Message>The specified resource does not exist.
# RequestId:964c9a75-c01e-0010-162d-b95a85000000
# Time:2020-11-12T19:56:39.1862516Z</Message></Error>% 

@jhamman
Member Author

jhamman commented Nov 12, 2020

@manzt - my apologies, the bucket was missing a part of its public setting. Should work now.

@manzt
Member

manzt commented Nov 12, 2020

Ah thank you! BTW, I don't think the metadata in nftd/.zattrs (and thus the consolidated metadata) is valid JSON. "nftd/.zattrs".nodatavals contains [ NaN ] not ["NaN"].

I notice that the default fill_value in .zarray uses the string "NaN", whereas the bare NaN in .zattrs is likely due to json.dumps in Python, which allows NaN by default (allow_nan=True). Clearly not an issue for decoding in Python, but important for cross-language compatibility, where this isn't standard: in JavaScript, JSON.parse(utf8_decodedBuffer) throws an error if NaN is present.

Update: I've made an observable notebook using zarr-lite with these data: https://observablehq.com/@manzt/nftd-zarr-lite-example

@jhamman
Member Author

jhamman commented Nov 12, 2020

@manzt - very cool. On the subject of the NaN strings, let's follow up here: zarr-developers/zarr-python#429

@benbovy

benbovy commented Nov 13, 2020

Great discussion here! It took some time to catch up, but I enjoyed reading it.

I guess that choosing between zarr and numpytiles really only matters if we need to deal with those stores and/or files directly on the client side?

On a general note, I agree that standards will greatly help towards better interoperability between web applications dealing with scientific, array data. But I also think that it would be better to have a handful of standards, each optimal for a given situation, rather than trying to come up with a single standard that more-or-less fits all use cases.

Nowadays, with tools like FastAPI, it's really easy to develop lightweight backends that are adapted to different situations. Supporting multiple specs would not add much maintenance burden IMO. The server and client sides have different constraints and we can easily decouple that. For example, load a zarr store of large, compressed data chunks in the backend, and then serve it via the API as dynamically generated, uncompressed, small tiles with another standard format for data and/or metadata.

@lbrindze, I had a brief look at your angle_grinder code. Like @jhamman, I'd love to see this implemented in Xpublish! I think that 80% of your code can be reused as-is! One thing that Xpublish does not support is loading new datasets using the API (i.e., like your load_netcdf endpoint), instead the datasets are loaded (perhaps lazily) when the application is launched.

@lbrindze

@lbrindze, I had a brief look at your angle_grinder code. Like @jhamman, I'd love to see this implemented in Xpublish! I think that 80% of your code can be reused as-is! One thing that Xpublish does not support is loading new datasets using the API (i.e., like your load_netcdf endpoint), instead the datasets are loaded (perhaps lazily) when the application is launched.

Happy to take a look at submitting a PR over the weekend here. I will follow up with something in Xpublish's project directly :)

@clbarnes

We use the NumpyTile format to create web map tiles dynamically from COG (or whatever cloud-friendly format GDAL can read), and then, instead of applying rescaling/colormap to the data to create a visual image (PNG), we return the data in NumpyTiles (float, complex...) so the web client can work with the raw values (right now @kylebarron is working on a deckGL plugin).

This use case seems to have some overlap with one or both of h2n5, a web server which encodes 2D slices of N5 volumes into a regular image, and n5-wasm, which is used to decode and slice fetched N5 chunks for use as 2D images in CATMAID (see here).

@davidbrochart

The lower hanging fruit interoperability that I'm surprised doesn't exist (or maybe I just didn't find it) is a GDAL driver for Zarr. That would make much more sense to me as a place to start, so that the general GIS toolchain can read zarr directly.

Yes, this is a great idea. I would personally have no idea where to start though. I guess we would need a motivated person who wants to use Zarr data from GDAL land.

Maybe xtensor-zarr could be used to implement a GDAL driver for Zarr. It is written in C++, though I'm not sure this makes it a better candidate than zarr-python.
xtensor-io can read GDAL datasets too.

@jhamman
Member Author

jhamman commented Sep 27, 2021

Just a quick note here to share some recent progress using Zarr for multiscale array visualization: https://carbonplan.org/blog/maps-library-release

cc @freeman-lab, @katamartin
