Merge pull request Blosc#279 from omaech/remote_proxy_tutorial
Remote Proxy tutorial using fetch and __getitem__
FrancescAlted authored Oct 8, 2024
2 parents 235b843 + 16241cc commit eec90b2
Showing 2 changed files with 373 additions and 0 deletions.
373 changes: 373 additions & 0 deletions doc/getting_started/tutorials/05.remote_proxy.ipynb
@@ -0,0 +1,373 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a33c4f0335308f35",
"metadata": {},
"source": [
"# Using Proxies for Efficient Handling of Remote Multidimensional Data with Blosc2\n",
"\n",
"Next, in this tutorial, we will explore the key differences between the `fetch` and `__getitem__` methods when working with data in a Blosc2 proxy. Through this comparison, we will not only understand how each method optimizes data access but also measure the time of each operation to evaluate their performance. \n",
"\n",
"Additionally, we will monitor the size of the local file, ensuring that it matches the expected size based on the compressed size of the chunks, allowing us to verify the efficiency of data management. Get ready to dive into the fascinating world of data caching!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "92755a11cc34e834",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-08T08:42:54.896377Z",
"start_time": "2024-10-08T08:42:52.998694Z"
}
},
"outputs": [],
"source": [
"import blosc2\n",
"import asyncio\n",
"from blosc2 import ProxyNDSource\n",
"import time\n",
"import os"
]
},
{
"cell_type": "markdown",
"id": "5ee57ce91fc28bbd",
"metadata": {},
"source": [
"## Proxy Class for Data Access\n",
"The Proxy class is a design pattern that acts as an intermediary between a client and a real data containers, enabling more efficient access to the latter. Its primary objective is to provide a caching mechanism for effectively accessing data stored in remote or large containers that utilize the **ProxySource** or **ProxyNDSource** interfaces. \n",
"\n",
"These containers are divided into chunks (data blocks), and the proxy is responsible for downloading and storing only the requested chunks, progressively filling the cache as the user accesses the data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "bab50ca19740a1aa",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-08T08:43:00.603510Z",
"start_time": "2024-10-08T08:43:00.589048Z"
}
},
"outputs": [],
"source": [
"def get_file_size(filepath):\n",
" \"\"\"Returns the file size in megabytes.\"\"\"\n",
" return os.path.getsize(filepath) / (1024 * 1024)\n",
"\n",
"\n",
"class MyProxySource(ProxyNDSource):\n",
" def __init__(self, data):\n",
" self.data = data\n",
" print(f\"Data shape: {self.shape}, chunks: {self.chunks}, dtype: {self.dtype}\")\n",
"\n",
" @property\n",
" def shape(self):\n",
" return self.data.shape\n",
"\n",
" @property\n",
" def chunks(self):\n",
" return self.data.chunks\n",
"\n",
" @property\n",
" def blocks(self):\n",
" return self.data.blocks\n",
"\n",
" @property\n",
" def dtype(self):\n",
" return self.data.dtype\n",
"\n",
" # This method must be present\n",
" def get_chunk(self, nchunk):\n",
" return self.data.get_chunk(nchunk)\n",
"\n",
" # This method is optional\n",
" async def aget_chunk(self, nchunk):\n",
" await asyncio.sleep(0.1) # Simulate an asynchronous operation\n",
" return self.data.get_chunk(nchunk)"
]
},
{
"cell_type": "markdown",
"id": "32fffd14035b20c4",
"metadata": {},
"source": "Next, we will establish a connection to a multidimensional array stored remotely on the server (https://demo.caterva2.net/) using the [Blosc2 library](https://www.blosc.org/python-blosc2/index.html). The remote_array object will represent this dataset on the server, enabling us to access the information without the need to load all the data into local memory at once."
},
{
"cell_type": "code",
"execution_count": 3,
"id": "aa92e842ec2a2fd7",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-08T08:43:10.073998Z",
"start_time": "2024-10-08T08:43:09.567381Z"
}
},
"outputs": [],
"source": [
"urlbase = \"https://demo.caterva2.net/\"\n",
"path = \"example/lung-jpeg2000_10x.b2nd\"\n",
"remote_array = blosc2.C2Array(path, urlbase=urlbase)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "9360ba9e4f946fe0",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-08T08:43:11.259755Z",
"start_time": "2024-10-08T08:43:11.238191Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data shape: (10, 1248, 2689), chunks: (1, 1248, 2689), dtype: uint16\n",
"<class 'blosc2.proxy.Proxy'>\n",
"Initial local file size: 321 bytes\n"
]
}
],
"source": [
"# Define a local file path to save the proxy container\n",
"local_path = \"local_proxy_container.b2nd\"\n",
"# Delete the file if it already exists.\n",
"blosc2.remove_urlpath(local_path)\n",
"source = MyProxySource(remote_array)\n",
"proxy = blosc2.Proxy(source, urlpath=local_path)\n",
"print(type(proxy))\n",
"initial_size = get_file_size(local_path)\n",
"print(f\"Initial local file size: {os.path.getsize(local_path)} bytes\")"
]
},
{
"cell_type": "markdown",
"id": "19b226b63acc7f59",
"metadata": {},
"source": "As can be seen, the local container is just a few hundreds of bytes in size, which is significantly smaller than the remote dataset (around 64 MB). This is because the local container only contains metadata about the remote dataset, such as its shape, chunks, and data type, but not the actual data. The proxy will download the data from the remote source as needed, storing it in the local container for future access."
},
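{
"cell_type": "markdown",
"id": "b7f3a2c9d4e5f601",
"metadata": {},
"source": "As a quick check, we can read the proxy's metadata without downloading any chunks. The following is a minimal sketch, assuming that the `Proxy` object mirrors the `shape` and `dtype` of its source (as the printout above suggests):"
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8a4b3d2e5f60712",
"metadata": {},
"outputs": [],
"source": [
"# Metadata is available locally; these accesses should not download any chunks\n",
"# (assuming the Proxy mirrors its source's shape and dtype)\n",
"print(f\"Proxy shape: {proxy.shape}, dtype: {proxy.dtype}\")\n",
"print(f\"Local file size: {get_file_size(local_path):.6f} MB\")  # still metadata only"
]
},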
{
"cell_type": "markdown",
"id": "32260c8fd2969107",
"metadata": {},
"source": [
"## Fetching data with a Proxy\n",
"The `fetch` function is designed to return the local proxy, which serves as a cache for the requested data. This proxy, while representing the remote container, allows only a portion of the data to be initialized, with the rest potentially remaining empty or undefined (e.g., `slice_data[1:3, 1:3]`).\n",
"\n",
"In this way, `fetch` downloads only the specific data that is required, which reduces the amount of data stored locally and optimizes the use of resources. This method is particularly useful when working with large datasets, as it allows for the efficient handling of multidimensional data."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ae1babeebf0a75ee",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-08T08:43:51.053692Z",
"start_time": "2024-10-08T08:43:49.190803Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Time to fetch: 1.49 s\n",
"File size after fetch (2 chunks): 1.28 MB\n",
"[[[15712 13933 18298 ... 21183 22486 20541]\n",
" [18597 21261 23925 ... 22861 21008 19155]]\n",
"\n",
" [[ 0 0 0 ... 0 0 0]\n",
" [ 0 0 0 ... 0 0 0]]]\n"
]
}
],
"source": [
"# Fetch a slice of the data from the proxy\n",
"t0 = time.time()\n",
"slice_data = proxy.fetch(slice(0, 2))\n",
"t1 = time.time() - t0\n",
"print(f\"Time to fetch: {t1:.2f} s\")\n",
"print(f\"File size after fetch (2 chunks): {get_file_size(local_path):.2f} MB\")\n",
"print(slice_data[1:3, 1:3])"
]
},
{
"cell_type": "markdown",
"id": "38960b586bd84851",
"metadata": {},
"source": [
"Above, using the `fetch` function with a slice involves downloading data from a chunk that had not been previously requested. This can lead to an increase in the local file size as new data is loaded.\n",
"\n",
"In the previous result, only 2 chunks have been downloaded and initialized, which is reflected in the array with visible numerical values, as seen in the section `[[15712 13933 18298 ... 21183 22486 20541], [18597 21261 23925 ... 22861 21008 19155]]`. These represent data that are ready to be processed. On the other hand, the lower part of the array, `[[0 0 0 ... 0 0 0], [0 0 0 ... 0 0 0]]`, shows an uninitialized section (normally filled with zeros).\n",
" \n",
"This indicates that those chunks have not yet been downloaded or processed. The `fetch` function could eventually fill these chunks with data when requested, replacing the zeros (which indicate uninitialized data) with the corresponding values:\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "937180b9469272ae",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-08T08:44:13.262876Z",
"start_time": "2024-10-08T08:44:12.132677Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Time to fetch: 0.93 s\n",
"File size after fetch (1 chunk): 1.92 MB\n",
"[[[15712 13933 18298 ... 21183 22486 20541]\n",
" [18597 21261 23925 ... 22861 21008 19155]]\n",
"\n",
" [[16165 14955 19889 ... 21203 22518 20564]\n",
" [18610 21264 23919 ... 20509 19364 18219]]]\n"
]
}
],
"source": [
"# Fetch a slice of the data from the proxy\n",
"t0 = time.time()\n",
"slice_data2 = proxy.fetch((slice(2, 3), slice(6, 7)))\n",
"t1 = time.time() - t0\n",
"print(f\"Time to fetch: {t1:.2f} s\")\n",
"print(f\"File size after fetch (1 chunk): {get_file_size(local_path):.2f} MB\")\n",
"print(slice_data[1:3, 1:3])"
]
},
{
"cell_type": "markdown",
"id": "209d8b62d81e33d8",
"metadata": {},
"source": "Now the `fetch` function has downloaded another two additional chunks, which is reflected in the local file size. The print show how all the slice `[1:3, 1:3]` has been initialized with data, while the rest of the array may remain uninitialized."
},
{
"cell_type": "markdown",
"id": "4069a43a15ae3980",
"metadata": {},
"source": [
"## Data access using `__getitem__`\n",
"The `__getitem__` function in the Proxy class is similar to `fetch` in that it allows for the retrieval of specific data from the remote container. However, `__getitem__` returns a NumPy array, which can be used to access specific subsets of the data. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "4f4fb754d2c34a48",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-08T08:44:27.127563Z",
"start_time": "2024-10-08T08:44:25.512306Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Proxy __getitem__ time: 1.611 s\n",
"[[[16540 15270 20144 ... 20689 21494 19655]\n",
" [17816 21097 24378 ... 21449 20582 19715]]\n",
"\n",
" [[16329 14563 18940 ... 20186 20482 19166]\n",
" [17851 21656 25461 ... 23705 21399 19094]]]\n",
"<class 'numpy.ndarray'>\n",
"File size after __getitem__ (2 chunks): 3.20 MB\n"
]
}
],
"source": [
"# Using __getitem__ to get a slice of the data\n",
"t0 = time.time()\n",
"result = proxy[5:7, 1:3]\n",
"t1 = time.time() - t0\n",
"print(f\"Proxy __getitem__ time: {t1:.3f} s\")\n",
"print(result)\n",
"print(type(result))\n",
"print(f\"File size after __getitem__ (2 chunks): {get_file_size(local_path):.2f} MB\")"
]
},
{
"cell_type": "markdown",
"id": "a6cb08b7108e8e76",
"metadata": {},
"source": "See? New data has been downloaded and initialized, as shown by the array values and the size of the local file. The `__getitem__` function has accessed the data in the chunks, and put the slice in the `result` array, which is now available for processing. The local file size has increased due to the new data that has been downloaded and stored in the cache.\n"
},
{
"cell_type": "markdown",
"id": "6377016f45b2796",
"metadata": {},
"source": [
"## Differences between `fetch` and `__getitem__`\n",
"\n",
"<img src=\"images/remote_proxy.png\" alt=\"Descripción de la imagen\" width=\"800\"/>\n",
"\n",
"Although `fetch` and `__getitem__` have distinct functions, they work together to facilitate efficient access to data. `fetch` manages the loading of data into the local cache by checking if the necessary chunks are available. If they are not, it downloads them from the remote source for future access.\n",
"\n",
"On the other hand, `__getitem__` handles the indexing and retrieval of data through a **NumPy** array, allowing access to specific subsets. Before accessing the data, `__getitem__` calls `fetch` to ensure that the necessary chunks are in the cache. If the data is not present in the cache, `fetch` takes care of downloading it from its original location (for example, from disk or an external source). This ensures that when `__getitem__` performs the indexing operation, it has immediate access to the data without interruptions.\n",
"\n",
"An important detail is that, while both fetch and __getitem__ ensure the necessary data is available, they may download more information than required because they download entire chunks. However, this can be advantageous when accessing large remote arrays, as it avoids downloading the whole dataset at once. "
]
},
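{
"cell_type": "markdown",
"id": "d9b5c4e3f6071823",
"metadata": {},
"source": "To make the distinction concrete, here is a minimal sketch (reusing the `proxy` created above) that contrasts the two access paths: `fetch` returns the local container acting as the cache, while `__getitem__` returns the requested slice as a NumPy array:"
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0c6d5f4a7182934",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"cache = proxy.fetch(slice(0, 1))  # ensures chunk 0 is cached; returns the local container\n",
"values = proxy[0:1]               # same region, materialized as a NumPy array\n",
"print(type(cache))\n",
"print(type(values))\n",
"# Both expose the same underlying data\n",
"print(np.array_equal(cache[0:1], values))"
]
},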
{
"cell_type": "markdown",
"id": "432c728702703cd8",
"metadata": {},
"source": [
"## About the remote dataset\n",
"\n",
"The remote dataset is available at: https://demo.caterva2.net/roots/example/lung-jpeg2000_10x.b2nd?roots=example. You may want to explore the data values by clicking on the *Data* tab; this dataset is actually a tomography of a lung, which you can visualize by clicking on the *Tomography* tab. Finally, by clicking on the **Download** button, it can be downloaded locally in case you want to experiment more with the data.\n",
"\n",
"As we have seen, every time that we dowloaded a chunk, the size of the local file increased by a fix amount (around 0.64 MB). This is because the chunks (whose size is around 6.4 MB) are compressed with the `Codec.GROK` codec, which has been configured to reduce the size of the data by a *constant* factor of 10. This means that the compressed data occupies only one-tenth of the space that it would occupy without compression. This reduction in data size optimizes both storage and transfer, as data is always handled in a compressed state when downloading or storing images, which accelerates the transfer process.\n",
"\n"
]
},
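{
"cell_type": "markdown",
"id": "f1d7e6a5b8293045",
"metadata": {},
"source": "We can sanity-check that factor of 10 with a quick back-of-the-envelope computation: one chunk holds `1 * 1248 * 2689` values of dtype `uint16` (2 bytes each), about 6.4 MiB uncompressed, and one-tenth of that matches the ~0.64 MB growth we observed per downloaded chunk:"
},
{
"cell_type": "code",
"execution_count": null,
"id": "a2e8f7b6c9304156",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Uncompressed size of one chunk: (1, 1248, 2689) values of uint16\n",
"chunk_nbytes = 1 * 1248 * 2689 * np.dtype(np.uint16).itemsize\n",
"print(f\"Uncompressed chunk size: {chunk_nbytes / 2**20:.2f} MiB\")\n",
"print(f\"At a constant compression factor of 10: {chunk_nbytes / 10 / 2**20:.2f} MiB\")"
]
},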
{
"cell_type": "markdown",
"id": "c508507d74434ecd",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"This tutorial has highlighted how the efficient integration of the **Proxy** class in **Blosc2**, combined with the `fetch` and `__getitem__` functions, optimizes access to multidimensional data. This combination of techniques not only enables the handling of large volumes of information more agilely, but also maximizes storage and processing resources, which is crucial in data-intensive environments and in scientific or industrial applications that require high efficiency and performance."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}