Challenge 12 - Compression of Geospatial Data with Varying Information Density #3

Open
EsperanzaCuartero opened this issue Feb 24, 2023 · 5 comments
Labels
Stream 1 Software Development for Earth Sciences

Comments

@EsperanzaCuartero
Contributor

EsperanzaCuartero commented Feb 24, 2023

Challenge 12 - Compression of Geospatial Data with Varying Information Density

Stream 1 - Software Developments for Earth Sciences

Goal

Development of a compression method that adapts to the information density of the data

  • Implement compression and bitinformation retrieval on a chunk basis
  • Analyse the information density of climate variables across time and space
  • Study the optimal chunk size depending on different features such as hurricanes, the Gulf Stream, and precipitating clouds, and on different resolutions (large-eddy simulation vs. GCM)
  • Generally improve xbitinfo performance

Mentors and skills

  • Mentors: Miha Razinger, Juan Jose Dominguez (both ECMWF), Milan Klöwer (MIT), Hauke Schulz (University of Washington)
  • Skills required:
    • Python
    • Git
    • Familiarity with xarray
    • Zarr and Dask are beneficial

Note: Only nationals or residents from the ECMWF Member States and Co-operating States are eligible to participate (see Terms and Conditions).


Challenge description

Geospatial data can vary in its information density from one part of the world to another. A dataset containing streets will be very dense in cities but contain little information in remote places like the Alps or even the ocean. The same is true for datasets about the ocean or the atmosphere. The variability of sea surface temperatures and currents is much larger in the vicinity of the Gulf Stream than in the middle of the Atlantic basin. This variability might also change in time. A hurricane, for example, has a lot of variability in winds, temperature and rain rates, and it also travels across entire ocean basins.

The challenge of this project is to improve xbitinfo so that it preserves the natural variability of these features without storing random noise where the real information density is low. In particular, this means that the number of bits that must be preserved during compression changes with location. A hurricane has a different information density than a same-sized area in the steadily blowing trade-wind regimes. The compressibility of climate data can therefore change drastically in time and space, which we want to exploit.
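
To make "the number of bits preserved" concrete, here is a minimal sketch of rounding a float32 value to a chosen number of mantissa bits using only the Python standard library. The helper names (`float32_to_bits`, `bits_to_float32`, `bitround`) are hypothetical and not part of xbitinfo, and the tie-breaking rule (round half away from zero) is an assumption that may differ from what xbitinfo does.

```python
import struct


def float32_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float32(b: int) -> float:
    """Inverse of float32_to_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def bitround(x: float, keepbits: int) -> float:
    """Round a finite float32 value to `keepbits` mantissa bits.

    Illustrative helper only: ties are rounded away from zero, which is an
    assumption and may not match the rule used by xbitinfo.
    """
    if not 0 <= keepbits <= 23:
        raise ValueError("float32 has 23 mantissa bits")
    drop = 23 - keepbits              # number of trailing mantissa bits to discard
    b = float32_to_bits(x)
    if drop == 0:
        return bits_to_float32(b)
    half = 1 << (drop - 1)            # half of the last kept bit
    mask = 0xFFFFFFFF ^ ((1 << drop) - 1)
    # Adding `half` rounds to nearest; the mask zeroes the discarded bits.
    return bits_to_float32((b + half) & mask)


print(bitround(3.14159, 7))   # 7 mantissa bits kept
print(bitround(3.14159, 15))  # 15 mantissa bits kept
```

The zeroed trailing bits are what a subsequent lossless compressor can exploit. In a location-dependent scheme, a chunk covering a hurricane would presumably be rounded with a larger `keepbits` than a chunk in the trade-wind region.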

Currently in the bitinformation framework, preserving all real information requires applying the maximum information content calculated by xbitinfo to the entire dataset. However, bitinformation can also be calculated on subsets, so that the ‘boring’ parts can be compressed more efficiently.
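
The idea of computing bitinformation on subsets can be sketched in pure Python as follows. This is a simplified illustration, not xbitinfo's implementation: it estimates, for each of the 32 bit positions of a float32, the mutual information between adjacent values, and then derives the number of mantissa bits to keep per chunk. The function names (`bitinformation`, `keepbits`, `chunk_keepbits`) and the 99% information level are assumptions made for the example, and the sketch omits any filtering of statistically insignificant information, which a real implementation would likely need.

```python
import math
import struct


def float32_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bitinformation(values):
    """Mutual information (in bits) between each bit position of adjacent values.

    Index 0 is the sign bit, 1-8 are exponent bits, 9-31 are mantissa bits.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    bits = [float32_to_bits(v) for v in values]
    n = len(bits) - 1                     # number of adjacent pairs
    info = []
    for pos in range(31, -1, -1):         # most significant bit first
        counts = [[0, 0], [0, 0]]         # joint histogram of (bit, next bit)
        for a, b in zip(bits, bits[1:]):
            counts[(a >> pos) & 1][(b >> pos) & 1] += 1
        mi = 0.0
        for i in (0, 1):
            for j in (0, 1):
                if counts[i][j]:
                    p = counts[i][j] / n
                    p_i = (counts[i][0] + counts[i][1]) / n
                    p_j = (counts[0][j] + counts[1][j]) / n
                    mi += p * math.log2(p / (p_i * p_j))
        info.append(max(mi, 0.0))
    return info


def keepbits(info, inflevel=0.99):
    """Mantissa bits needed to retain `inflevel` of the total information."""
    total = sum(info)
    if total == 0.0:
        return 0                          # no information: keep no mantissa bits
    cumulative = 0.0
    for k, value in enumerate(info):
        cumulative += value
        if cumulative >= inflevel * total:
            return max(k + 1 - 9, 0)      # subtract sign and exponent bits
    return 23


def chunk_keepbits(values, chunk_size, inflevel=0.99):
    """Compute keepbits separately for consecutive chunks of `values`.

    Assumes every chunk contains at least two values.
    """
    return [
        keepbits(bitinformation(values[i:i + chunk_size]), inflevel)
        for i in range(0, len(values), chunk_size)
    ]


# A constant ("boring") chunk followed by a chunk with a step change.
data = [1.0] * 8 + [1.0] * 4 + [1.5] * 4
print(chunk_keepbits(data, 8))  # → [0, 1]
```

With a single global analysis, the whole array would be stored with the larger number of bits. The per-chunk result shows how the constant chunk could be stored with fewer.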

Xbitinfo is an open-source Python package that enables lossy compression of geospatial data based on its information content. Embedded in the Pangeo ecosystem, xbitinfo builds on xarray and Dask and allows fast compression and analysis of various data formats, including netCDF and Zarr. Xbitinfo addresses the challenge of the increasingly large, chunked datasets that are being created as more compute power becomes available. Climate simulations at sub-kilometre resolution with petabytes of output are just one example where xbitinfo can help keep a dataset manageable.

The successful applicant will refine the application of xbitinfo to data subsections (chunks) and improve our ability to compress spatially and temporally varying fields. Furthermore, the applicant will learn about information theory and software engineering from international mentors.

References:

@EsperanzaCuartero EsperanzaCuartero added the Stream 1 Software Development for Earth Sciences label Feb 24, 2023
@EsperanzaCuartero EsperanzaCuartero changed the title from "Challenge 3 -Compression of Geospatial Data with Varying Information Density" to "Challenge 12 -Compression of Geospatial Data with Varying Information Density" Feb 27, 2023
@edwardhartnett

Note that we recently added some compression features to netCDF, including support for lossy compression and for the faster Zstandard compression library. These may be helpful to those working on this challenge. For more details see: https://www.researchgate.net/publication/365006139_NetCDF_Compression_Improvements

@milankl

milankl commented Mar 13, 2023

Amazing, thanks Ed. Great summary!

@ayoubft

ayoubft commented Apr 3, 2023

Hello there!

I came across this project and it immediately caught my attention. The idea seems very interesting and I would love to learn more about it. I am writing to express my keen interest in this project.

I started drafting a proposal and, during my research, found that this project is also listed as a Google Summer of Code (GSoC) project.

Please let me know if there are any updates regarding the project, considering that the GSoC deadline is April 4th.

@milankl

milankl commented Apr 3, 2023

Hi Ayoub! Thanks for your interest!! Yes, indeed, we also got this project into the Google Summer of Code, meaning that it is possible to get funding through either track. Note the different deadlines though. We therefore expect two participants (one from Code for Earth, one from Summer of Code) to work on xbitinfo simultaneously. Depending on the proposals, we will then define the individual projects in discussion with the participants so that they are somewhat independent of one another. For us mentors it makes no difference whether you get accepted through Summer of Code or Code for Earth, but the programmes are distinct and there's only funding to accept one participant from each.

So yes, please write down your ideas and interests in a proposal and apply! You can also pick up ideas from the project ideas we wrote down for GSoC. In the end, we would like to see that you have understood the challenge, have ideas for how to solve it, and are motivated to work on this during the summer.

@ayoubft

ayoubft commented Apr 3, 2023

Thank you so much Milan for your response and for clarifying the details about the project and the funding options available.
I've already taken a look at the project idea listed for GSoC, and I'm excited to continue working on my proposal. This project is a fantastic opportunity to learn and develop new skills, and I'm eager to understand the challenge and come up with innovative ideas for how to solve it.


8 participants