utils/cached_file: reduce latency (and increase overhead) of partially-cached reads

Currently, `cached_file::stream` (used only by index_reader, to read
index pages) works as follows.

Assume that the caller requested a read of the range [pos, pos + size).
Then:

- If the first page of the requested range is uncached,
  the entire [pos, pos + size) range is read from disk (even if some
  later pieces of it are cached), the resulting pages are added to the cache,
  and the read completes (most likely) from the cached pages.
- If the first page of the read is cached, then the rest of the read
  is handled page-by-page, in a sequential loop, serving each page
  either from cache (if present) or from disk.

For example, assume that pages 0, 1, 2, 3, 4 are requested.

If exactly pages 1 and 2 are cached, then `stream` will read the entire [0, 4] range
from disk, insert the missing pages 0, 3, and 4 into the cache, and then serve the
read from the cached pages.

If exactly pages 0 and 3 are cached, then it will serve 0 from cache,
then it will read 1 from disk and insert it into cache,
then it will read 2 from disk and insert it into cache,
then it will serve 3 from cache,
then it will read 4 from disk and insert it into cache.

If exactly the first page is cached, a 128 kiB read (32 pages) turns
into 31 sequential I/O read ops.

This is weird, and doesn't look intended. In one case, we are reading even pages
we already have, just to avoid fragmenting the read, and in the other case
we are reading pages one-by-one (sequentially!) even if they are neighbours.

I'm not sure if cached_file should minimize IOPS or byte throughput,
but the current state is surely suboptimal. Even if its read strategy
is somehow optimal, it should still at least coalesce contiguous reads
and perform the non-contiguous reads in parallel.

This patch leans into minimizing IOPS. After the patch, we serve
as many front pages from the cache as we can, but when we see
an uncached page, we read the entire remainder of the read from disk.

In effect, it is as if we trimmed the read request by its longest cached prefix
and then performed the rest using the pre-patch logic.

For example, if exactly pages 0 and 3 are cached,
then we serve 0 from cache,
then we read [1, 4] from disk and insert everything into cache.

For partially-cached files, this will result in more bytes read
from disk, but fewer IOPS. This might be a bad thing. But if so,
then we should lean the other way in a more explicit and efficient
way than we currently do.

Closes scylladb#20935
michoecho authored and tgrabiec committed Oct 4, 2024
1 parent af12499 commit 882a3c6
Showing 1 changed file with 35 additions and 10 deletions.
utils/cached_file.hh
@@ -154,15 +154,16 @@ private:
     offset_type _last_page_size;
     page_idx_type _last_page;
 private:
-    future<cached_page::ptr_type> get_page_ptr(page_idx_type idx,
+    // Returns (page, true) if the page was cached, and (page, false) if the page was uncached.
+    future<std::pair<cached_page::ptr_type, bool>> get_page_ptr(page_idx_type idx,
             page_count_type read_ahead,
             tracing::trace_state_ptr trace_state) {
         auto i = _cache.lower_bound(idx);
         if (i != _cache.end() && i->idx == idx) {
             ++_metrics.page_hits;
             tracing::trace(trace_state, "page cache hit: file={}, page={}", _file_name, idx);
             cached_page& cp = *i;
-            return make_ready_future<cached_page::ptr_type>(cp.share());
+            return make_ready_future<std::pair<cached_page::ptr_type, bool>>(cp.share(), true);
         }
         tracing::trace(trace_state, "page cache miss: file={}, page={}, readahead={}", _file_name, idx, read_ahead);
         ++_metrics.page_misses;
@@ -197,14 +198,15 @@ private:
             utils::get_local_injector().inject("cached_file_get_first_page", []() {
                 throw std::bad_alloc();
             });
-            return first_page;
+            return std::pair<cached_page::ptr_type, bool>(std::move(first_page), false);
         });
     }
-    future<temporary_buffer<char>> get_page(page_idx_type idx,
+    // Returns (page, true) if the page was cached, and (page, false) if the page was uncached.
+    future<std::pair<temporary_buffer<char>, bool>> get_page(page_idx_type idx,
             page_count_type count,
             tracing::trace_state_ptr trace_state) {
-        return get_page_ptr(idx, count, std::move(trace_state)).then([] (cached_page::ptr_type cp) {
-            return cp->get_buf();
+        return get_page_ptr(idx, count, std::move(trace_state)).then([] (std::pair<cached_page::ptr_type, bool> cp) {
+            return std::make_pair(cp.first->get_buf(), cp.second);
         });
     }
 public:
Expand Down Expand Up @@ -298,6 +300,27 @@ public:
? std::make_optional(_permit->consume_memory(size))
: std::nullopt;
}
void shrink_size_hint(bool page_was_cached) {
if (page_was_cached) {
// If the page was cached, shrink the _size_hint by page_size,
// but don't reduce it below page_size.
_size_hint = std::max(_size_hint, 2 * page_size) - page_size;
} else {
// If the page was uncached, then get_page read the entire _size_hint bytes from disk,
// (in one I/O operation) and inserted the read pages into the cache.
// We will most likely serve the remainder of the stream from them.
//
// But if some of those pages happen to be evicted before we complete the read
// (this shouldn't really happen in practice, because in practice stay in cache
// for much, much longer than any read takes, but still), we don't want to read
// something on the order of _size_hint again, as that could result, in theory,
// in a quadratic amount of work.
//
// So in the very unlikely chance that we will have to re-read something from disk,
// let's do it page-by-page.
_size_hint = page_size;
}
}
public:
// Creates an empty stream.
stream()
@@ -324,9 +347,9 @@ public:
         }
         auto units = get_page_units(_size_hint);
         page_count_type readahead = div_ceil(_size_hint, page_size);
-        _size_hint = page_size;
         return _cached_file->get_page(_page_idx, readahead, _trace_state).then(
-                [units = std::move(units), this] (temporary_buffer<char> page) mutable {
+                [units = std::move(units), this] (std::pair<temporary_buffer<char>, bool> read_result) mutable {
+            auto page = std::move(read_result.first);
             if (_page_idx == _cached_file->_last_page) {
                 page.trim(_cached_file->_last_page_size);
             }
@@ -338,6 +361,7 @@ public:
             page.trim_front(_offset_in_page);
             _offset_in_page = 0;
             ++_page_idx;
+            shrink_size_hint(read_result.second);
             return page;
         });
     }
@@ -352,16 +376,17 @@ public:
         }
         auto units = get_page_units(_size_hint);
         page_count_type readahead = div_ceil(_size_hint, page_size);
-        _size_hint = page_size;
         return _cached_file->get_page_ptr(_page_idx, readahead, _trace_state).then(
-                [this, units = std::move(units)] (cached_page::ptr_type page) mutable {
+                [this, units = std::move(units)] (std::pair<cached_page::ptr_type, bool> read_result) mutable {
+            auto page = std::move(read_result.first);
             size_t size = _page_idx == _cached_file->_last_page
                 ? _cached_file->_last_page_size
                 : page_size;
             units = get_page_units(page_size);
             page_view buf(_offset_in_page, size - _offset_in_page, std::move(page), std::move(units));
             _offset_in_page = 0;
             ++_page_idx;
+            shrink_size_hint(read_result.second);
             return buf;
         });
     }
