reordering of cells; add more description
observingClouds committed Dec 28, 2023
1 parent 0b962ff commit c06fcfb
Showing 1 changed file with 120 additions and 116 deletions.
236 changes: 120 additions & 116 deletions docs/chunking.ipynb
@@ -15,7 +15,9 @@
"source": [
"Geospatial data can vary in its information density from one part of the world to another. A dataset containing streets will be very dense in cities but contains little information in remote places like the Alps or even the ocean. The same is also true for datasets about the ocean or the atmosphere.\n",
"\n",
"Currently in the bitinformation framework, to preserve all real information, the maximum information content calculated by `xbitinfo` needs to be used for the entire dataset. However, bitinformation can also be calculated on subsets, such that the ‘boring’ parts can therefore be more efficiently compressed. This notebook portrays how to do it."
"By default, the number of bits that need to be kept (`keepbits`) to preserve the requested amount of information is determined from the entire dataset. This approach doesn't always give the best compression rates, because it keeps more bits than necessary in regions with anomalously low information density. The following steps show how `keepbits` can be computed and applied on subsets. In this case, the subsets are the dataset's chunks.\n",
"\n",
"This work is a result of ECMWF Code4Earth 2023. Please have a look at the [presentation of this project](https://youtu.be/IOi4XvECpsQ?si=hwZkppNRa-J2XVZ9) for additional details."
]
},
{
@@ -614,121 +616,6 @@
"ds"
]
},
{
"cell_type": "markdown",
"id": "b9e8fe5a-2e4e-4dfd-8026-0991e9988668",
"metadata": {},
"source": [
"## Saving to `NetCDF` file"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e011a900-5da2-40be-a292-d81a0cafcd6d",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_24883/1840452313.py:2: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs\n",
" ds.to_netcdf(\"0.air_original.nc\")\n"
]
}
],
"source": [
"# Saving the dataset as NetCDF file\n",
"ds.to_netcdf(\"0.air_original.nc\")"
]
},
{
"cell_type": "markdown",
"id": "2b98628e-cbcb-4018-8565-4c0324cf2d61",
"metadata": {},
"source": [
"## Compress with `to_compressed_netcdf`"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "99d02f35-85fc-4a8d-94a0-880ac2ffbb72",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ayoubf/Projects/xbitinfo/xbitinfo/save_compressed.py:121: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs\n",
" self._obj.to_netcdf(\n"
]
}
],
"source": [
"# Compress and save the dataset as NetCDF file\n",
"ds.to_compressed_netcdf(\"1.air_compressed_all.nc\")"
]
},
{
"cell_type": "markdown",
"id": "5f5aae30-4a0a-401c-9018-9e34626c3d2c",
"metadata": {},
"source": [
"## Compress with bitrounding"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "fdf077dd-6494-4c38-9461-5ea7ac370a01",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d7a346d6cbd14460890ae3be8ec11ff0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Get bitinformation of the dataset along the 'longitude' dimension\n",
"info_per_bit = xb.get_bitinformation(ds, dim=\"lon\", implementation=\"python\")\n",
"\n",
"# Get the number of bits necessary to keep 99% of information in our dataset\n",
"keepbits = xb.get_keepbits(info_per_bit, 0.99)\n",
"\n",
"# Round the dataset using the keepbits number\n",
"ds_bitrounded = xb.xr_bitround(ds, keepbits)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d9fd85fd-e1f8-46f4-8236-39450dbc665e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Compress and save the bitrounded dataset as NetCDF file\n",
"ds_bitrounded.to_compressed_netcdf(\"2.air_bitrounded_compressed.nc\")"
]
},
{
"cell_type": "markdown",
"id": "6b1b95de-f8e5-45c3-be3b-0555a67efb77",
@@ -997,6 +884,123 @@
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "b9e8fe5a-2e4e-4dfd-8026-0991e9988668",
"metadata": {},
"source": [
"## Reference compression\n",
"For comparison with other compression approaches, the dataset is also saved as:\n",
"- uncompressed netCDF\n",
"- lossless compressed netCDF\n",
"- lossy compressed netCDF while preserving 99% of bitinformation"
]
},
{
"cell_type": "markdown",
"id": "a77919ff",
"metadata": {},
"source": [
"### Saving to uncompressed `NetCDF` file"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e011a900-5da2-40be-a292-d81a0cafcd6d",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_24883/1840452313.py:2: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs\n",
" ds.to_netcdf(\"0.air_original.nc\")\n"
]
}
],
"source": [
"# Saving the dataset as NetCDF file\n",
"ds.to_netcdf(\"0.air_original.nc\")"
]
},
{
"cell_type": "markdown",
"id": "2b98628e-cbcb-4018-8565-4c0324cf2d61",
"metadata": {},
"source": [
"### Saving as compressed NetCDF with `to_compressed_netcdf`"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "99d02f35-85fc-4a8d-94a0-880ac2ffbb72",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ayoubf/Projects/xbitinfo/xbitinfo/save_compressed.py:121: SerializationWarning: saving variable air with floating point data as an integer dtype without any _FillValue to use for NaNs\n",
" self._obj.to_netcdf(\n"
]
}
],
"source": [
"# Compress and save the dataset as NetCDF file\n",
"ds.to_compressed_netcdf(\"1.air_compressed_all.nc\")"
]
},
{
"cell_type": "markdown",
"id": "5f5aae30-4a0a-401c-9018-9e34626c3d2c",
"metadata": {},
"source": [
"### Saving while preserving 99% of the information using the bitrounding algorithm"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "fdf077dd-6494-4c38-9461-5ea7ac370a01",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d7a346d6cbd14460890ae3be8ec11ff0",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/1 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Get bitinformation of the dataset along the 'longitude' dimension\n",
"info_per_bit = xb.get_bitinformation(ds, dim=\"lon\", implementation=\"python\")\n",
"\n",
"# Get the number of bits necessary to keep 99% of information in our dataset\n",
"keepbits = xb.get_keepbits(info_per_bit, 0.99)\n",
"\n",
"# Round the dataset using the keepbits number\n",
"ds_bitrounded = xb.xr_bitround(ds, keepbits)\n",
"\n",
"# Compress and save the bitrounded dataset as NetCDF file\n",
"ds_bitrounded.to_compressed_netcdf(\"2.air_bitrounded_compressed.nc\")"
]
},
{
"cell_type": "markdown",
"id": "d3b60c66-252d-48a6-af93-a00c9ca8f0ba",

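The notebook text in this commit describes determining `keepbits` per chunk rather than once for the whole dataset. As a rough illustration of that idea, here is a minimal standard-library sketch that rounds float32 values to a given number of mantissa bits and applies a different `keepbits` to each chunk. It is not xbitinfo's implementation: the helper names `bitround` and `bitround_chunks`, the sample values, and the per-chunk `keepbits` are all hypothetical. In the notebook, `xb.get_keepbits` and `xb.xr_bitround` do this work on xarray objects.

```python
import math
import struct


def bitround(value: float, keepbits: int) -> float:
    """Round a value, stored as float32, to `keepbits` mantissa bits.

    Uses round-to-nearest with ties to even. Hypothetical helper for
    illustration only; not the xbitinfo implementation.
    """
    if not 0 <= keepbits <= 23:
        raise ValueError("float32 has 23 explicit mantissa bits")
    if not math.isfinite(value):
        return value  # leave NaN and infinities untouched
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    drop = 23 - keepbits  # number of trailing mantissa bits to discard
    if drop > 0:
        half = 1 << (drop - 1)
        lsb = (bits >> drop) & 1  # last kept bit, used to break ties to even
        bits = (bits + half - 1 + lsb) & 0xFFFFFFFF
        bits &= ~((1 << drop) - 1) & 0xFFFFFFFF  # zero the discarded bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bitround_chunks(chunks, keepbits_per_chunk):
    """Apply a separate keepbits value to each chunk (a list of floats)."""
    return [
        [bitround(v, kb) for v in chunk]
        for chunk, kb in zip(chunks, keepbits_per_chunk)
    ]


# Hypothetical example: the first chunk is assumed to carry little
# information (few bits kept), the second is assumed to need more.
chunks = [[280.1, 280.2], [1.3, 2.7]]
keepbits_per_chunk = [3, 10]
rounded = bitround_chunks(chunks, keepbits_per_chunk)
print(rounded)
```

Zeroing the trailing mantissa bits is what makes the data more compressible for a lossless compressor applied afterwards; choosing `keepbits` per chunk lets regions with low information density discard more bits than a single dataset-wide value would allow.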