docs: exhaustive overview of statements & best practices
In order to avoid API misuse, much knowledge is now shared in a
structured way of tables, and best practices are described to aid users.
wprzytula committed Aug 29, 2024
1 parent d70603d commit 5b39d8f
Showing 2 changed files with 120 additions and 33 deletions.
53 changes: 43 additions & 10 deletions docs/source/queries/paged.md
@@ -2,9 +2,31 @@
Sometimes query results might be so big that one prefers not to fetch them all at once,
e.g. to reduce latency and/or memory footprint.
Paged queries allow receiving the whole result page by page, with a configurable page size.
In fact, most SELECT queries should be paged, to avoid heavy load on the cluster and a large memory footprint.

`Session::query_iter` and `Session::execute_iter` take a [simple query](simple.md)
or a [prepared query](prepared.md) and return an `async` iterator over result `Rows`.
> ***Warning***\
> Issuing unpaged SELECTs (`Session::query_unpaged` or `Session::execute_unpaged`)
> may have dramatic performance consequences! **BEWARE!**\
> If the result set is big (or, e.g., there are a lot of tombstones), those atrocities can happen:
> - cluster may experience high load,
> - queries may time out,
> - the driver may devour a lot of RAM,
> - latency will likely spike.
>
> Stay safe. Page your SELECTs.
## `RowIterator`

The automated way to achieve that is `RowIterator`. It always fetches and enables access to one page,
while prefetching the next one. This limits latency and is a convenient abstraction.
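
The prefetching idea can be illustrated with a small model. This is only a sketch under stated assumptions, not the driver's implementation: `Page`, `spawn_page_worker` and `iterate_rows` are hypothetical names, a plain thread stands in for the background worker, and rows are just numbers.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Hypothetical stand-in for one page of rows fetched from the cluster.
type Page = Vec<u32>;

/// Spawns a worker that "fetches" pages of `page_size` rows out of `total_rows`
/// and hands them over through a bounded channel.
/// With a buffer of one slot, the worker can prepare the next page while the
/// consumer is still reading the page it already received.
fn spawn_page_worker(total_rows: u32, page_size: u32) -> Receiver<Page> {
    assert!(page_size > 0, "page size must be positive");
    let (tx, rx) = sync_channel::<Page>(1);
    thread::spawn(move || {
        let mut next_row: u32 = 0;
        while next_row < total_rows {
            let end = next_row.saturating_add(page_size).min(total_rows);
            let page: Page = (next_row..end).collect();
            next_row = end;
            // `send` blocks while the buffer slot is occupied;
            // it fails (and the worker stops) if the consumer is gone.
            if tx.send(page).is_err() {
                return;
            }
        }
    });
    rx
}

/// Flattens the stream of pages into a stream of rows,
/// so the caller never deals with page boundaries.
fn iterate_rows(total_rows: u32, page_size: u32) -> impl Iterator<Item = u32> {
    spawn_page_worker(total_rows, page_size).into_iter().flatten()
}
```

Calling `iterate_rows(10, 3)` yields the ten rows in order, while the worker only ever runs a bounded number of pages ahead of the consumer. The extra thread and channel in this model also hint at why such machinery carries overhead.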

> ***Note***\
> `RowIterator` is quite heavy machinery, introducing considerable overhead. Therefore,
> don't use it for statements that do not benefit from paging. In particular, avoid using it
> for non-SELECTs.
At the API level, `Session::query_iter` and `Session::execute_iter` take a [simple query](simple.md)
or a [prepared query](prepared.md), respectively, and return an `async` iterator over result `Rows`.

> ***Warning***\
> In case of unprepared variant (`Session::query_iter`) if the values are not empty
@@ -22,7 +44,6 @@ Use `query_iter` to perform a [simple query](simple.md) with paging:
# use scylla::Session;
# use std::error::Error;
# async fn check_only_compiles(session: &Session) -> Result<(), Box<dyn Error>> {
use scylla::IntoTypedRows;
use futures::stream::StreamExt;

let mut rows_stream = session
@@ -45,7 +66,6 @@ Use `execute_iter` to perform a [prepared query](prepared.md) with paging:
# use scylla::Session;
# use std::error::Error;
# async fn check_only_compiles(session: &Session) -> Result<(), Box<dyn Error>> {
use scylla::IntoTypedRows;
use scylla::prepared_statement::PreparedStatement;
use futures::stream::StreamExt;

@@ -106,10 +126,10 @@ let _ = session.execute_iter(prepared, &[]).await?; // ...
# }
```

### Passing the paging state manually
It's possible to fetch a single page from the table, extract the paging state
from the result and manually pass it to the next query. That way, the next
query will start fetching the results from where the previous one left off.
## Manual paging
It's possible to fetch a single page from the table, and manually pass paging state
to the next query. That way, the next query will start fetching the results
from where the previous one left off.

On a `Query`:
```rust
@@ -197,5 +217,18 @@ loop {
```
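
The control flow of a manual paging loop can also be shown without a cluster. The sketch below is a self-contained model, not the driver's API: `PagingState`, `fetch_single_page` and `sum_with_manual_paging` are hypothetical, and an in-memory slice plays the role of the table.

```rust
/// Hypothetical paging state: an opaque position at which the next page starts.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PagingState(usize);

/// Hypothetical single-page fetch over an in-memory "table".
/// Returns the rows of one page and, if more rows remain,
/// the state to pass to the next call.
fn fetch_single_page(
    table: &[i32],
    page_size: usize,
    state: Option<PagingState>,
) -> (Vec<i32>, Option<PagingState>) {
    let start = state.map_or(0, |s| s.0).min(table.len());
    let end = start.saturating_add(page_size).min(table.len());
    let rows = table[start..end].to_vec();
    let next = if end < table.len() {
        Some(PagingState(end))
    } else {
        None
    };
    (rows, next)
}

/// Manual paging loop: holds only one page in memory at a time and
/// returns (number of pages fetched, sum of all rows).
fn sum_with_manual_paging(table: &[i32], page_size: usize) -> (usize, i64) {
    assert!(page_size > 0, "page size must be positive");
    let mut state = None;
    let mut pages = 0;
    let mut sum = 0i64;
    loop {
        let (rows, next) = fetch_single_page(table, page_size, state);
        pages += 1;
        sum += rows.iter().map(|&r| i64::from(r)).sum::<i64>();
        match next {
            // More rows remain: continue from where this page ended.
            Some(s) => state = Some(s),
            // No paging state returned: this was the last page.
            None => break,
        }
    }
    (pages, sum)
}
```

In this model each iteration waits for its page before processing it, which mirrors the latency cost on page boundaries that manual paging pays in exchange for lower overhead.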

### Performance
Performance is the same as in non-paged variants.\
For the best performance use [prepared queries](prepared.md).
For the best performance use [prepared queries](prepared.md).
See [query types overview](queries.md).

## Best practices

| Query result fetching | Unpaged | Paged manually | Paged automatically |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| Exposed Session API | `{query,execute}_unpaged` | `{query,execute}_single_page` | `{query,execute}_iter` |
| Working | get all results in a single CQL frame, into a single Rust struct | get one page of results in a single CQL frame, into a single Rust struct | upon high-level iteration, fetch consecutive CQL frames and transparently iterate over their rows |
| Cluster load | potentially **HIGH** for large results, beware! | normal | normal |
| Driver overhead | low - simple frame fetch | low - simple frame fetch | considerable - `RowIteratorWorker` is a separate tokio task |
| Feature limitations | none | none | speculative execution not supported |
| Driver memory footprint | potentially **BIG** - all results have to be stored at once! | small - only one page stored at a time | small - at most constant number of pages stored at a time |
| Latency | potentially **BIG** - all results have to be generated at once! | considerable on page boundary - new page needs to be fetched | small - next page is always pre-fetched in background |
| Suitable operations     | - in general: operations with empty result set (non-SELECTs)<br/> - as possible optimisation: SELECTs with LIMIT clause | - for advanced users who prefer more control over paging, with less overhead of `RowIteratorWorker`  | - in general: all SELECTs                                                                         |
100 changes: 77 additions & 23 deletions docs/source/queries/queries.md
@@ -1,26 +1,80 @@
# Making queries

This driver supports all query types available in Scylla:
* [Simple queries](simple.md)
* Easy to use
* Poor performance
* Primitive load balancing
* [Prepared queries](prepared.md)
* Need to be prepared before use
* Fast
* Properly load balanced
* [Batch statements](batch.md)
* Run multiple queries at once
* Can be prepared for better performance and load balancing
* [Paged queries](paged.md)
* Allows to read result in multiple pages when it might be so big that one
prefers not to fetch it all at once
* Can be prepared for better performance and load balancing

Additionally there is special functionality to enable `USE KEYSPACE` queries:
[USE keyspace](usekeyspace.md)

Queries are fully asynchronous - you can run as many of them in parallel as you wish.
# Making queries - best practices

The driver supports all kinds of statements supported by ScyllaDB. The following tables aim to bridge DB concepts and the driver's API.
They include recommendations on which API to use in what cases.

## Kinds of CQL statements (from the CQL protocol point of view):

| Kind of CQL statement | Single | Batch |
|-----------------------|---------------------|------------------------------------------|
| Prepared | `PreparedStatement` | `Batch` filled with `PreparedStatement`s |
| Unprepared | `Query` | `Batch` filled with `Query`s |

This is **NOT** strictly related to the content of the CQL query string.

> ***Interesting note***\
> In fact, any kind of CQL statement could contain any CQL query string.
> Yet, some of such combinations don't make sense and will be rejected by the DB.
> For example, SELECTs in a Batch are nonsense.
### [Unprepared](simple.md) vs [Prepared](prepared.md)

> ***GOOD TO KNOW***\
> Each time a statement is executed by sending a query string to the DB, it needs to be parsed. The driver does not parse CQL, so it treats query strings as opaque.\
> There is an option to *prepare* a statement, i.e. have the DB parse it once and associate it with an ID. After preparation, it's enough for the driver to send the ID
> and the DB already knows what operation to perform - no more expensive parsing necessary! Moreover, upon preparation the driver receives valuable data for load balancing,
> enabling advanced load balancing (so better performance!) of all further executions of that prepared statement.\
> ***Key takeaway:*** always prepare statements that you are going to execute multiple times.
| Statement comparison | Unprepared | Prepared |
|----------------------|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| Exposed Session API | `query_*` | `execute_*` |
| Usability | execute CQL statement string directly | need to be separately prepared before use, in-background repreparations if statement falls off the server cache |
| Performance | poor (statement parsed each time) | good (statement parsed only upon preparation) |
| Load balancing | primitive (random choice of a node/shard) | advanced (proper node/shard, optimisations for LWT statements) |
| Suitable operations | one-shot operations | repeated operations |
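
The parsing cost difference in the table above can be made concrete with a toy model. This is a hedged sketch, not how ScyllaDB or the driver work internally: `MockServer` and `compare_parse_counts` are hypothetical, and "parsing" is reduced to a counter plus a string transformation.

```rust
use std::collections::HashMap;

/// Hypothetical server that counts how many times it has to parse a query string.
#[derive(Default)]
struct MockServer {
    parse_count: usize,
    prepared: HashMap<u64, String>, // statement ID -> "parsed" statement
    next_id: u64,
}

impl MockServer {
    /// Stand-in for the expensive parsing step.
    fn parse(&mut self, query: &str) -> String {
        self.parse_count += 1;
        query.trim().to_uppercase()
    }

    /// Unprepared path: the query string is parsed on every execution.
    fn query(&mut self, query: &str) -> String {
        self.parse(query)
    }

    /// Prepared path, step 1: parse once and hand back an ID.
    fn prepare(&mut self, query: &str) -> u64 {
        let parsed = self.parse(query);
        let id = self.next_id;
        self.next_id += 1;
        self.prepared.insert(id, parsed);
        id
    }

    /// Prepared path, step 2: execute by ID, with no parsing involved.
    fn execute(&self, id: u64) -> Option<&String> {
        self.prepared.get(&id)
    }
}

/// Runs the same statement `n` times both ways and returns
/// (parses on the unprepared path, parses on the prepared path).
fn compare_parse_counts(n: usize) -> (usize, usize) {
    let cql = "select a from ks.t where pk = ?";

    let mut unprepared = MockServer::default();
    for _ in 0..n {
        unprepared.query(cql);
    }

    let mut prepared = MockServer::default();
    let id = prepared.prepare(cql);
    for _ in 0..n {
        assert!(prepared.execute(id).is_some());
    }

    (unprepared.parse_count, prepared.parse_count)
}
```

In this model, 100 repeated executions cost 100 parses on the unprepared path and a single parse on the prepared one, which is the reason for preparing anything executed more than once.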

### Single vs [Batch](batch.md)

| Statement comparison | Single | Batch |
|----------------------|-------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Exposed Session API | `query_*`, `execute_*` | `batch` |
| Usability | simple setup | need to aggregate statements and binding values to each is more cumbersome |
| Performance | good (DB is optimised for handling single statements) | good for small batches, may be worse for larger (also: higher risk of request timeout due to big portion of work) |
| Load balancing | advanced if prepared, else primitive | advanced if prepared **and ALL** statements in the batch target the same partition, else primitive |
| Suitable operations  | most operations                                       | - a list of operations that needs to be executed atomically (batch LightWeight Transaction)<br/> - a batch of operations targeting the same partition (as an advanced optimisation)  |

## CQL statements - operations (based on what the CQL string contains):

| CQL data manipulation statement | Recommended statement kind | Recommended Session operation |
|------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
| SELECT | `PreparedStatement` if repeated, `Query` if once | `{query,execute}_iter` (or `{query,execute}_single_page` in a manual loop for performance / more control) |
| INSERT, UPDATE | `PreparedStatement` if repeated, `Query` if once, `Batch` if multiple statements are to be executed atomically (LightWeight Transaction) | `{query,execute}_unpaged` (paging is irrelevant, because the result set of such operation is empty) |
| CREATE/DROP {KEYSPACE, TABLE, TYPE, INDEX,...} | `Query`, `Batch` if multiple statements are to be executed atomically (LightWeight Transaction) | `query_unpaged` (paging is irrelevant, because the result set of such operation is empty) |

### [Paged](paged.md) vs Unpaged query

> ***GOOD TO KNOW***\
> SELECT statements return a [result set](result.md), possibly a large one. Therefore, paging is available to fetch it in chunks, relieving load on the cluster and lowering latency.\
> ***Key takeaways:***\
> For SELECTs, you had better **avoid unpaged queries**.\
> For non-SELECTs, the unpaged API is preferred.
| Query result fetching | Unpaged | Paged |
|-----------------------|-------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Exposed Session API | `{query,execute}_unpaged` | `{query,execute}_single_page`, `{query,execute}_iter` |
| Usability | get all results in a single CQL frame, so into a [single Rust struct](result.md) | need to fetch multiple CQL frames and iterate over them - using driver's abstractions (`{query,execute}_iter`) or manually (`{query,execute}_single_page` in a loop) |
| Performance           | - for large results, puts **high load on the cluster**<br/> - for small results, the same as paged                      | - for large results, relieves the cluster<br/> - for small results, the same as unpaged                                                                              |
| Memory footprint | potentially big - all results have to be stored at once | small - at most constant number of pages are stored by the driver at the same time |
| Latency | potentially big - all results have to be generated at once | small - at most one chunk of data must be generated at once, so latency of each chunk is small |
| Suitable operations | - in general: operations with empty result set (non-SELECTs)</br> - as possible optimisation: SELECTs with LIMIT clause | - in general: all SELECTs |

For more detailed comparison and more best practices, see [doc page about paging](paged.md).

### Queries are fully asynchronous - you can run as many of them in parallel as you wish.

## `USE KEYSPACE`:
There is special functionality to enable [USE keyspace](usekeyspace.md).

```{eval-rst}
.. toctree::