
dataset info in .json format #2656

Closed
rabernat opened this issue Jan 6, 2019 · 9 comments

Comments

@rabernat
Contributor

rabernat commented Jan 6, 2019

I am exploring the world of SpatioTemporal Asset Catalogs (STAC), in which all datasets are described using JSON / GeoJSON:

The STAC specification aims to standardize the way geospatial assets are exposed online and queried.

I am thinking about how to put the sort of datasets that xarray deals with into STAC items (see https://github.com/radiantearth/stac-spec). This would be particularly valuable in the context of Pangeo and the zarr-based datasets we have been putting in cloud storage.

For this purpose, it would be very useful to have a concise summary of an xarray dataset's contents (minus the actual data) in .json format. I'm talking about the kind of info we currently get from the .info() method, which is designed to mirror the CDL output of ncdump -h.

For example

import numpy as np
import xarray as xr

ds = xr.Dataset({'foo': ('x', np.ones(10, 'f8'), {'units': 'm s-1'})},
                coords={'x': ('x', np.arange(10), {'units': 'm'})},
                attrs={'conventions': 'made up'})
ds.info()
xarray.Dataset {
dimensions:
	x = 10 ;

variables:
	float64 foo(x) ;
		foo:units = m s-1 ;
	int64 x(x) ;
		x:units = m ;

// global attributes:
	:conventions = made up ;
}

I would like to be able to do ds.info(format='json') and see something like this

{
 "coords": {
  "x": {
   "dims": [
    "x"
   ],
   "attrs": {
    "units": "m"
   }
  }
 },
 "attrs": {
  "conventions": "made up"
 },
 "dims": {
  "x": 10
 },
 "data_vars": {
  "foo": {
   "dims": [
    "x"
   ],
   "attrs": {
    "units": "m s-1"
   }
  }
 }
}

This is what I get by doing print(json.dumps(ds.to_dict(), indent=2)) and manually stripping out all the data fields. So an alternative API might be something like ds.to_dict(data=False).
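
For illustration, a minimal sketch of that manual stripping step, using a hypothetical strip_data helper on the dict returned by ds.to_dict() (reusing the ds from the example above):

import json

def strip_data(obj):
    # Recursively drop the 'data' entries from the nested dict returned by
    # ds.to_dict(), keeping only dims, attrs, and variable names.
    if isinstance(obj, dict):
        return {k: strip_data(v) for k, v in obj.items() if k != 'data'}
    return obj

print(json.dumps(strip_data(ds.to_dict()), indent=2))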

If anyone is aware of an existing spec for expressing Common Data Language in JSON, we should probably use that instead of inventing our own. But I think some version of this would be a very useful addition to xarray.

@shoyer
Member

shoyer commented Jan 6, 2019

I like the look of ds.to_dict(data=False). I'm pretty sure this topic has come up before -- there is definitely value in having a standard way to express the schema of xarray.Dataset objects.

ds.info(format='json') is another option, though a slightly weird feature of ds.info() is that it writes to a buffer rather than returning an object. This is because otherwise the result would be displayed with quotation marks around it (the repr of a string) when you type it at the command line.
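
As a side note, since info() writes to a buffer (stdout by default), its output can be captured as a plain string; a minimal sketch:

import io

buf = io.StringIO()
ds.info(buf)              # the CDL-style summary is written into the buffer
summary = buf.getvalue()  # ...and retrieved as a str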

@rabernat
Contributor Author

rabernat commented Jan 6, 2019

I will ping @dopplershift of Unidata, my go-to for all things netCDF. 😉 Ryan, do you know of any work in this area? The best I could find by googling is this thread from the netCDF mailing list.

@jhamman
Member

jhamman commented Jan 6, 2019

Just to say, I really like this idea. I think I prefer the ds.to_dict syntax but feel like we could also work on making ds.info() more useful through some relatively simple changes.

@rabernat rabernat mentioned this issue Jan 7, 2019
@dopplershift
Contributor

I'm not aware of any standard out there for JSON representation of netCDF, but I know it's been at least (briefly) discussed. @WardF, anything out there you're aware of?

Another spelling of this could be ds.to_dict(header_only=True), which I only suggest to mirror ncdump -h.

@rabernat
Contributor Author

Since my PR was merged, I have discovered two different JSON representations of netCDF.

Oops!

@jhamman
Member

jhamman commented Feb 12, 2019

It would be good to figure out whether either of these is actually in use. It's not too late to update your implementation.

@rafa-guedes
Contributor

Would it make sense to have to_json / from_json methods that would take care of datetime serialisation?
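
One hypothetical shape such methods could take, sketched on top of the existing to_dict with a datetime-aware JSON encoder (to_json here is not an existing xarray method):

import datetime
import json

import numpy as np

class DatetimeEncoder(json.JSONEncoder):
    # Fall back to ISO 8601 strings for datetime-like values.
    def default(self, o):
        if isinstance(o, (datetime.datetime, datetime.date)):
            return o.isoformat()
        if isinstance(o, np.datetime64):
            return np.datetime_as_string(o)
        return super().default(o)

def to_json(ds, **kwargs):
    # Hypothetical wrapper: serialize the to_dict() representation of a Dataset.
    return json.dumps(ds.to_dict(), cls=DatetimeEncoder, **kwargs)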

@shoyer
Member

shoyer commented Jan 8, 2020

Would it make sense to have to_json / from_json methods that would take care of datetime serialisation?

What's the right way to serialize datetime objects in JSON?

One option would be to add an encode_times option to to_dict that creates the units attribute.
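
A rough sketch of what that could produce, done by hand here with a fixed epoch (this assumes a ds with a datetime64 'time' coordinate; encode_times itself is only a proposal):

import numpy as np

epoch = np.datetime64('1970-01-01T00:00:00')
seconds = (ds['time'].values - epoch) / np.timedelta64(1, 's')
encoded_time = {
    'dims': ['time'],
    'attrs': {'units': 'seconds since 1970-01-01 00:00:00'},
    'data': seconds.tolist(),  # plain floats, safe for json.dumps
}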

@rafa-guedes
Contributor

rafa-guedes commented Jan 8, 2020

Pandas has a date_format option in to_json to serialise dates as either ISO 8601 or epoch. The encode_times option to to_dict could also be useful...
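
For reference, the pandas behaviour mentioned above:

import pandas as pd

s = pd.Series([0.0, 1.0], index=pd.date_range('2019-01-01', periods=2))
s.to_json(date_format='iso')    # timestamps as ISO 8601 strings
s.to_json(date_format='epoch')  # timestamps as milliseconds since the epoch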
