utils/cached_file: reduce latency (and increase overhead) of partially-cached reads

Currently, `cached_file::stream` (used only by index_reader, to read
index pages) works as follows.

Assume that the caller requested a read of the range [pos, pos + size).
Then:

- If the first page of the requested range is uncached,
  the entire [pos, pos + size) range is read from disk (even if some
  later pieces of it are cached), the resulting pages are added to the cache,
  and the read completes (most likely) from the cached pages.
- If the first page of the read is cached, then the rest of the read
  is handled page-by-page, in a sequential loop, serving each page
  either from cache (if present) or from disk.

For example, assume that pages 0, 1, 2, 3, 4 are requested.

If exactly pages 1 and 2 are cached, then `stream` will read the entire [0, 4] range
from disk, insert the missing pages 0, 3, and 4 into the cache, and then serve the
read from the cached pages.

If exactly pages 0 and 3 are cached, then it will serve 0 from cache,
then it will read 1 from disk and insert it into cache,
then it will read 2 from disk and insert it into cache,
then it will serve 3 from cache,
then it will read 4 from disk and insert it into cache.

If exactly the first page is cached, a 128 kiB read (32 pages) turns
into 31 sequential I/O read ops.

This is weird, and doesn't look intended. In one case, we are reading even pages
we already have, just to avoid fragmenting the read, and in the other case
we are reading pages one-by-one (sequentially!) even if they are neighbours.

I'm not sure if cached_file should minimize IOPS or byte throughput,
but the current state is surely suboptimal. Even if its read strategy
is somehow optimal, it should still at least coalesce contiguous reads
and perform the non-contiguous reads in parallel.

This patch leans into minimizing IOPS. After the patch, we serve
as many front pages from the cache as we can, but when we see
an uncached page, we read the entire remainder of the read from disk.

In effect, it is as if we trimmed the read request by its longest cached prefix
and then performed the rest using the pre-patch logic.

For example, if exactly pages 0 and 3 are cached,
then we serve 0 from cache,
then we read [1, 4] from disk and insert everything into cache.

For partially-cached files, this will result in more bytes read
from disk, but fewer IOPS. This might be a bad thing. But if so,
then we should lean the other way in a more explicit and efficient
way than we currently do.

Closes scylladb#20935
michoecho authored and tgrabiec committed Oct 4, 2024
1 parent af12499 commit 882a3c6
Showing 1 changed file with 35 additions and 10 deletions.
utils/cached_file.hh
@@ -154,15 +154,16 @@ private:
     offset_type _last_page_size;
     page_idx_type _last_page;
 private:
-    future<cached_page::ptr_type> get_page_ptr(page_idx_type idx,
+    // Returns (page, true) if the page was cached, and (page, false) if the page was uncached.
+    future<std::pair<cached_page::ptr_type, bool>> get_page_ptr(page_idx_type idx,
             page_count_type read_ahead,
             tracing::trace_state_ptr trace_state) {
         auto i = _cache.lower_bound(idx);
         if (i != _cache.end() && i->idx == idx) {
             ++_metrics.page_hits;
             tracing::trace(trace_state, "page cache hit: file={}, page={}", _file_name, idx);
             cached_page& cp = *i;
-            return make_ready_future<cached_page::ptr_type>(cp.share());
+            return make_ready_future<std::pair<cached_page::ptr_type, bool>>(cp.share(), true);
         }
         tracing::trace(trace_state, "page cache miss: file={}, page={}, readahead={}", _file_name, idx, read_ahead);
         ++_metrics.page_misses;
@@ -197,14 +198,15 @@ private:
             utils::get_local_injector().inject("cached_file_get_first_page", []() {
                 throw std::bad_alloc();
             });
-            return first_page;
+            return std::pair<cached_page::ptr_type, bool>(std::move(first_page), false);
         });
     }
-    future<temporary_buffer<char>> get_page(page_idx_type idx,
+    // Returns (page, true) if the page was cached, and (page, false) if the page was uncached.
+    future<std::pair<temporary_buffer<char>, bool>> get_page(page_idx_type idx,
             page_count_type count,
             tracing::trace_state_ptr trace_state) {
-        return get_page_ptr(idx, count, std::move(trace_state)).then([] (cached_page::ptr_type cp) {
-            return cp->get_buf();
+        return get_page_ptr(idx, count, std::move(trace_state)).then([] (std::pair<cached_page::ptr_type, bool> cp) {
+            return std::make_pair(cp.first->get_buf(), cp.second);
         });
     }
 public:
Expand Down Expand Up @@ -298,6 +300,27 @@ public:
? std::make_optional(_permit->consume_memory(size))
: std::nullopt;
}
void shrink_size_hint(bool page_was_cached) {
if (page_was_cached) {
// If the page was cached, shrink the _size_hint by page_size,
// but don't reduce it below page_size.
_size_hint = std::max(_size_hint, 2 * page_size) - page_size;
} else {
// If the page was uncached, then get_page read the entire _size_hint bytes from disk,
// (in one I/O operation) and inserted the read pages into the cache.
// We will most likely serve the remainder of the stream from them.
//
// But if some of those pages happen to be evicted before we complete the read
// (this shouldn't really happen in practice, because in practice stay in cache
// for much, much longer than any read takes, but still), we don't want to read
// something on the order of _size_hint again, as that could result, in theory,
// in a quadratic amount of work.
//
// So in the very unlikely chance that we will have to re-read something from disk,
// let's do it page-by-page.
_size_hint = page_size;
}
}
public:
// Creates an empty stream.
stream()
@@ -324,9 +347,9 @@ public:
         }
         auto units = get_page_units(_size_hint);
         page_count_type readahead = div_ceil(_size_hint, page_size);
-        _size_hint = page_size;
         return _cached_file->get_page(_page_idx, readahead, _trace_state).then(
-                [units = std::move(units), this] (temporary_buffer<char> page) mutable {
+                [units = std::move(units), this] (std::pair<temporary_buffer<char>, bool> read_result) mutable {
+            auto page = std::move(read_result.first);
             if (_page_idx == _cached_file->_last_page) {
                 page.trim(_cached_file->_last_page_size);
             }
@@ -338,6 +361,7 @@ public:
             page.trim_front(_offset_in_page);
             _offset_in_page = 0;
             ++_page_idx;
+            shrink_size_hint(read_result.second);
             return page;
         });
     }
@@ -352,16 +376,17 @@ public:
         }
         auto units = get_page_units(_size_hint);
         page_count_type readahead = div_ceil(_size_hint, page_size);
-        _size_hint = page_size;
         return _cached_file->get_page_ptr(_page_idx, readahead, _trace_state).then(
-                [this, units = std::move(units)] (cached_page::ptr_type page) mutable {
+                [this, units = std::move(units)] (std::pair<cached_page::ptr_type, bool> read_result) mutable {
+            auto page = std::move(read_result.first);
             size_t size = _page_idx == _cached_file->_last_page
                 ? _cached_file->_last_page_size
                 : page_size;
             units = get_page_units(page_size);
             page_view buf(_offset_in_page, size - _offset_in_page, std::move(page), std::move(units));
             _offset_in_page = 0;
             ++_page_idx;
+            shrink_size_hint(read_result.second);
             return buf;
         });
     }
