Move non-critical metadata out of Session #595

Open · piodul opened this issue on Nov 15, 2022 · 2 comments

Labels: API-breaking, area/metadata, performance

piodul (Collaborator) commented on Nov 15, 2022:

(Proposal based on the discussion in #574, credit to @wyfo for the idea)

By default, the Session fetches full schema metadata, but only a small part of it is necessary for the session to work properly: currently, we only need keyspace replication info and per-table partitioners. If the number of items in the schema is large, the metadata may be large as well. Since not everybody needs this information, fetching it can waste cluster load, bandwidth and RAM.

Currently, we allow preventing non-essential metadata from being fetched via the fetch_schema_metadata configuration option, and restricting the keyspaces for which any metadata is fetched via keyspaces_to_fetch. However, those options are opt-out rather than opt-in, and it's hard to manage the lifetime of the metadata (somebody might want to fetch schema metadata once, consume it somehow, and then deallocate it; that's not possible right now).
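
For context, the current opt-out configuration looks roughly like this on the session builder. This is a minimal sketch assuming the builder methods named above (fetch_schema_metadata, keyspaces_to_fetch); exact signatures may differ between driver versions:

```rust
use scylla::{Session, SessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Opt out of fetching non-essential schema metadata and restrict
    // metadata fetching to a single keyspace (opt-out, not opt-in).
    let session: Session = SessionBuilder::new()
        .known_node("127.0.0.1:9042")
        .fetch_schema_metadata(false)
        .keyspaces_to_fetch(["my_keyspace"])
        .build()
        .await?;

    let _ = session; // ... use the session ...
    Ok(())
}
```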

To reduce the waste, we could move the logic that fetches full metadata out of the Session; the session would keep only the keyspace replication info and per-table partitioners. The ability to fetch full metadata could live in a separate module, with abstractions that give better control over what metadata is fetched, when it is fetched, and how long it lives.

Some things to consider before designing/implementing the solution:

  • The API of the CPP driver requires the metadata to be available at all times and to be updated when the session performs schema changes. Considering the cpp-rust-driver project, we need to make it possible to emulate the current semantics. We could provide a metadata manager object which manages metadata in a similar way to what the session currently does. The session would have to be able to push schema change events to the manager and synchronously wait for the manager to update the schema. A rough sketch of such a manager follows.
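
A purely hypothetical sketch of such a manager; none of these types exist in the driver, and all names (MetadataManager, FullSchemaMetadata, SchemaChangeEvent) are invented for illustration:

```rust
// Hypothetical sketch only; none of these types exist in the driver today.
use std::sync::Arc;
use tokio::sync::{mpsc, watch};

/// Invented placeholder for a full schema snapshot.
pub struct FullSchemaMetadata;

/// Invented placeholder for a schema change event pushed by the session.
pub struct SchemaChangeEvent;

pub struct MetadataManager {
    /// Latest fetched snapshot, observable by users.
    snapshot: watch::Receiver<Arc<FullSchemaMetadata>>,
    /// Channel on which the session pushes schema change events.
    events: mpsc::Sender<SchemaChangeEvent>,
}

impl MetadataManager {
    /// Called by the session after it performs a schema change: push the
    /// event, then wait until a background task publishes a refreshed
    /// snapshot, emulating the CPP driver's "always up to date" semantics.
    pub async fn on_schema_change(&mut self, event: SchemaChangeEvent) {
        let _ = self.events.send(event).await;
        let _ = self.snapshot.changed().await;
    }

    /// Users read the current snapshot and may drop it when done, which
    /// gives explicit control over the metadata's lifetime.
    pub fn current(&self) -> Arc<FullSchemaMetadata> {
        self.snapshot.borrow().clone()
    }
}
```
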
piodul added the performance label on Nov 15, 2022
piodul added the API-breaking label on Mar 28, 2023
piodul added this to the 1.0.0 milestone on Apr 5, 2023
roydahan modified the milestones: 1.0.0, 0.12.0 on Nov 12, 2023
Lorak-mmk self-assigned this on Nov 15, 2023
avelanarius modified the milestones: 0.12.0, 0.13.0 on Jan 15, 2024
roydahan (Collaborator) commented:

Is this still on track for milestone 0.13.0?

Lorak-mmk (Collaborator) commented:

I'm trying to think about how to approach this. Some constraints and observations, many of them already mentioned by you:

  • The driver needs some metadata for prepared statements to work: keyspaces with their replication strategies and tables with their partitioners. Let's call this obligatory metadata; the rest will be additional metadata (see the type sketch after this list).
  • Some users need access to metadata and some don't, so additional metadata shouldn't be fetched unconditionally.
  • A user may only need access to some parts of the metadata, so it should be possible to filter what we fetch. Imo, the higher the granularity of fetching, the better.
  • If a user has a lot of keyspaces with a lot of tables, then even the obligatory metadata may be large, yet the user may only plan to query a few tables; fetching all of the obligatory metadata would then be a waste. It would be nice to add some form of optional table/keyspace allowlist for obligatory metadata.
  • The user should have more control over the lifetime of additional metadata; it should not always stay allocated forever.
  • At the same time, it should be possible to preserve the current behavior of having metadata continuously updated.
  • We should avoid doing unnecessary queries, like fetching some parts of the metadata twice.
  • When metadata is continuously updated by the driver (as is currently the case), obligatory and additional metadata should be consistent. It would be weird and counterintuitive if additional metadata had some table but obligatory metadata hadn't fetched it yet (or vice versa).
  • Note: metadata will be inconsistent (or will fail to fetch) if we execute different fetching queries against different nodes. Right now everything is executed on the control connection.
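
To make the obligatory/additional split concrete, here is one hypothetical way the types could be carved up; all names are invented for illustration:

```rust
// Hypothetical illustration of the split; all names are invented.
use std::collections::HashMap;

/// What the session must always keep so that token-aware routing and
/// prepared statements keep working.
pub struct ObligatoryMetadata {
    /// Keyspace name -> replication strategy options.
    pub replication: HashMap<String, HashMap<String, String>>,
    /// (keyspace, table) -> partitioner name.
    pub partitioners: HashMap<(String, String), String>,
}

/// Everything else (columns, UDTs, views, ...), fetched on demand and
/// droppable by the user to reclaim memory.
pub struct AdditionalMetadata;
```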

To me it looks very hard, if even possible, to satisfy all of those constraints.
If we move additional metadata out of the session, it won't have access to the control connection, so it will be difficult to provide consistency, for two reasons:

  1. A different connection is used, and different nodes may return different schemas, IIUC.
  2. The obligatory schema will be fetched at different times (periodically and in response to events) than the additional schema.

Such an outside metadata fetcher would also need some additional APIs if it were to avoid refetching keyspaces and tables.

If someone has a nice idea on how to make this split, I'm listening. If not, maybe it's better to keep metadata fetching generally as-is, but provide (see the sketch after this list):

  • better, more granular filters for fetching,
  • the possibility to remove some or all additional metadata to reclaim memory.
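
A hedged sketch of what those two improvements could look like; the filter type and the method names (refresh_metadata_filtered, clear_additional_metadata) are hypothetical, and the Session below is a stand-in so the sketch is self-contained:

```rust
// Hypothetical API sketch; nothing here exists in the driver today.

/// Invented filter describing which parts of the schema to fetch.
pub struct MetadataFilter {
    /// None = fetch all keyspaces.
    pub keyspaces: Option<Vec<String>>,
    /// None = fetch all tables of the selected keyspaces.
    pub tables: Option<Vec<(String, String)>>,
    pub fetch_udts: bool,
    pub fetch_views: bool,
}

/// Stand-in for scylla::Session, so the sketch compiles on its own.
pub struct Session;

impl Session {
    /// Refresh only the parts of the schema selected by the filter,
    /// on the control connection, so results stay consistent.
    pub async fn refresh_metadata_filtered(&self, _filter: &MetadataFilter) {
        // ...
    }

    /// Drop additional metadata to reclaim memory; obligatory metadata
    /// (replication info, partitioners) is kept so routing keeps working.
    pub fn clear_additional_metadata(&self) {
        // ...
    }
}
```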

@wprzytula @piodul
