Commit e5e160c (parent af07387)

[_transactions2] Part 39: ADR Part 1 - Background, Principles, Physical Representations (#3909)

* tx2 adr outline
* intro to transactions table
* explanation of how cassandra tx table works
* Principles
* tickets encoding explanation
* waffles
* physical tickets
* choosing np/pq
* add inflight clause
* Streamlining putUnlessExists
* Add sectioning
* Add Thrift examples
* Comment on order preservation
* Fix stray punctuation
* Clarify multinode contention and paxos section
* Add tickets encoding diagram

Showing 2 changed files with 286 additions and 0 deletions.
docs/source/miscellaneous/doc/adr/0016-use-tickets-encoding-for-transactions.md
# 16. Use the tickets encoding for the transactions table (_transactions2)

Date: 05/04/2019

## Status

The technical decision has been accepted.
This architectural decision record is still a work in progress.

## Context

### The Transactions Table in AtlasDB

In AtlasDB, the ``_transactions`` table keeps track of whether a transaction that started at a given timestamp
was committed or aborted (and, if it committed, its commit timestamp). Logically, this table is a
mapping of longs to longs, with a special value of ``-1`` meaning that the transaction was aborted. Transactions that
are in-flight and have yet to either commit or abort will not have an entry in the table.

| startTimestamp | commitTimestamp |
|----------------|-----------------|
| 20             | 33              |
| 28             | 42              |
| 37             | -1              |
| 3141592        | 3141595         |

This table is accessed via the ``TransactionService`` class in AtlasDB, which offers a simple interface:

```java
public interface TransactionService {
    @CheckForNull
    Long get(long startTimestamp);

    Map<Long, Long> get(Iterable<Long> startTimestamps);

    void putUnlessExists(long startTimestamp, long commitTimestamp) throws KeyAlreadyExistsException;
}
```

In practice, both the single- and multi-timestamp versions of ``get`` may read from this table, though
non-null results may be cached. ``putUnlessExists`` performs a put-unless-exists operation against the table; this
is supported by a Thrift ``cas`` call for Cassandra KVS, and by ``INSERT ... IF NOT EXISTS`` SQL statements
in relational KVSes.

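For illustration, a committer might drive this interface roughly as follows. This is a hedged sketch: only ``TransactionService`` and ``KeyAlreadyExistsException`` come from the interface above, and the surrounding method name and error handling are hypothetical.

```java
// Hypothetical caller-side sketch of the putUnlessExists contract described above.
static long commitOrLearnFate(TransactionService transactions, long startTs, long commitTs) {
    try {
        transactions.putUnlessExists(startTs, commitTs); // atomic: first writer wins
        return commitTs;
    } catch (KeyAlreadyExistsException e) {
        // Some other actor already recorded this transaction's fate (e.g. rolled it back);
        // read the authoritative value rather than overwriting it.
        Long existing = transactions.get(startTs);
        if (existing == null || existing == -1) {
            throw new IllegalStateException("Transaction " + startTs + " was aborted", e);
        }
        return existing;
    }
}
```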

### Physical Representation in Cassandra

All AtlasDB tables have a similar schema in Cassandra:

- there is a blob partition key used to store rows, called the ``key``
- the clustering key has two more components:
  - a blob for columns, called ``column1``
  - a big-integer for the timestamp, called ``column2``
- blob values.

In the ``_transactions`` table, ``column1`` is always a single byte corresponding to the string ``t``, and
``column2`` is a special value of ``-1``. In practice, we don't pay much attention to these values.
More interestingly, the ``key`` is a VAR_LONG encoding of the start timestamp, and the ``value``
is similarly a VAR_LONG encoding of the commit timestamp.

The Cassandra representation of the ``_transactions`` table introduced above may look as follows.

| key        | column1 | column2 | value                  |
|------------|---------|---------|------------------------|
| 0x14       | 0x74    | -1      | 0x21                   |
| 0x1c       | 0x74    | -1      | 0x2a                   |
| 0x25       | 0x74    | -1      | 0xff80ffffffffffffffff |
| 0xe02fefd8 | 0x74    | -1      | 0xe02fefdb             |

The details of VAR_LONG encoding are fiddly, but it exhibits two desirable properties:

1. VAR_LONG encoding is order-preserving over non-negative numbers; that is, if 0 <= ts1 < ts2
   then VAR_LONG(ts1) < VAR_LONG(ts2) and vice versa. This is significant, because it allows one to perform
   range scans of the transactions table (given that timestamps in Atlas are positive longs) in a straightforward
   way.
2. VAR_LONG encoding supports negative numbers, meaning that we can use the same encoding for both positive timestamps
   (for transactions that successfully committed) and ``-1`` (for transactions that were aborted).

However, our choice of encoding also has some issues. Two particularly relevant ones are as follows:

1. Under VAR_LONG encoding, numbers that are near to each other numerically will also be very close in byte-space
   (notice that the encoded forms of 3141592 and 3141595 only differ in their three lowest-order bits). Furthermore,
   most writes to the transactions table will take place at numbers that are numerically close, because these
   correspond to actively running transactions, and thus at keys that are close in byte-space. Considering that
   Cassandra uses consistent hashing to handle data partitioning, at any given point in time the majority of writes
   to the cluster will end up going to the same node ('hot-spotting'). This is undesirable, because we lose
   horizontal scalability; writes are bottlenecked on a single node regardless of the size of the cluster.
2. VAR_LONG encoding is not particularly efficient for our purposes in that it fails to exploit some characteristics
   of the distribution of our data. In particular:
   1. The special value ``-1`` is particularly large; under VAR_LONG encoding, negative longs will always encode to 10
      bytes.
   2. Storing the full value of the commit timestamp is also wasteful, given that we know that it is generally slightly
      higher than the value of the start timestamp.

### Principles for a Good Transaction Service

We thus define four principles that we can use to help guide our decisions as to what makes an implementation of
a transaction service a good one.

1. **Horizontal Scalability (HS)**: A good transactions service must be horizontally scalable; that is, it should be
   possible to increase write bandwidth by increasing the number of database nodes and/or service nodes that are
   performing writes.
2. **Compact Representation (CR)**: A good transactions service does not use excessive disk space.
   In addition to saving on disk usage, a compact representation improves the ability for query results to
   be cached in memory, and can reduce the time taken to execute queries if less data needs to be read from disk.
3. **Range Scans (RS)**: A good transactions service supports range scans (without needing to read the entire table
   and post-filter it). This is used in various backup and restore workflows in AtlasDB, and it is important that we
   are able to execute restores in a timely fashion.
4. **Reasonable Usage Patterns (RUP)**: A good transactions service, if built on an underlying service, must follow
   standard principles as to what constitutes a reasonable usage pattern of that underlying service.

## Decision

Implement the tickets encoding strategy, along with other features needed to support its efficient operation.
The strategy and supporting features are introduced in the following sections.

### Tickets Encoding Strategy: Logical Overview

We divide the domain of positive longs into disjoint *partitions* of constant size - we call this size the
*partitioning quantum* (PQ). These partitions are contiguous ranges that start at a multiple of PQ - thus, the first
partition consists of timestamps from 0 to PQ - 1, the second from PQ to 2 * PQ - 1, and so on.

We assign a constant number of rows to a partition (NP), and seek to distribute start timestamps as evenly as possible
among these NP rows as numbers increase. In practice, the least significant bits of the timestamp will be used as
the row number; we would thus store the value associated with the timestamps k, NP + k, 2NP + k and so on in the same
row, for a given partition and value of k in the interval [0, NP). To disambiguate between these timestamps, we use
dynamic column keys - a VAR_LONG encoding of an offset derived from the timestamp's position within its partition
(made precise below).

More formally, for a given timestamp TS, we proceed as follows (where / denotes integer division):

- we identify which row R TS belongs to; this is given by (TS / PQ) * NP + (TS % PQ) % NP.
- we identify the column C TS belongs to; this is given by (TS % PQ) / NP.

Notice that given R and C, we can similarly decode the original TS:

- we identify the relevant partition P; this is given by R / NP.
- we identify the offset from the column key, O1; this is given by C * NP.
- we identify the offset from the second part of the row key, O2; this is given by R % NP.
- the original timestamp is then P * PQ + O1 + O2.

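As a sanity check on this arithmetic, the following sketch (not the production code; the class and method names are ours, and PQ and NP are simply parameters) implements both directions and shows that the round trip recovers the original timestamp.

```java
// Illustrative round trip for the row/column arithmetic above.
final class TicketsArithmeticSketch {
    private final long partitioningQuantum; // PQ
    private final long rowsPerQuantum;      // NP

    TicketsArithmeticSketch(long partitioningQuantum, long rowsPerQuantum) {
        this.partitioningQuantum = partitioningQuantum;
        this.rowsPerQuantum = rowsPerQuantum;
    }

    long row(long ts) {
        // R = (TS / PQ) * NP + (TS % PQ) % NP
        return (ts / partitioningQuantum) * rowsPerQuantum + (ts % partitioningQuantum) % rowsPerQuantum;
    }

    long column(long ts) {
        // C = (TS % PQ) / NP
        return (ts % partitioningQuantum) / rowsPerQuantum;
    }

    long timestamp(long row, long column) {
        long partition = row / rowsPerQuantum;       // P
        long columnOffset = column * rowsPerQuantum; // O1
        long rowOffset = row % rowsPerQuantum;       // O2
        return partition * partitioningQuantum + columnOffset + rowOffset;
    }

    public static void main(String[] args) {
        TicketsArithmeticSketch sketch = new TicketsArithmeticSketch(1_000_000, 100);
        long ts = 3_141_592L;
        long recovered = sketch.timestamp(sketch.row(ts), sketch.column(ts));
        // Prints: 3141592 -> row 392, column 1415 -> 3141592
        System.out.println(ts + " -> row " + sketch.row(ts) + ", column " + sketch.column(ts)
                + " -> " + recovered);
    }
}
```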
It may be easier to think of the timestamp being written as a 3-tuple (P, C, O2), where the row component is the
pair (P, O2) and the dynamic column key is C. If NP divides PQ, then there is a bijection between timestamps and such
3-tuples, where C ranges from 0 to PQ / NP (exclusive) and O2 ranges from 0 to NP (exclusive). Furthermore, this
bijection is order-preserving if ordering over the 3-tuples is interpreted lexicographically.

The following diagram illustrates how this works for PQ = 1,000,000 and NP = 100.

![Illustration of the tickets encoding strategy](0016-tickets-encoding.png)

### Physical Implementation of Tickets

We store information about transactions committed under the new encoding scheme in the ``_transactions2`` table in
Cassandra. Given that we want to avoid hot-spotting and ensure horizontal scalability, we need to ensure that the rows
we may be writing data to are well distributed in byte-space. We thus reverse the bits of each row number before
encoding it.

```java
private static byte[] encodeRowName(long startTimestamp) {
    long row = (startTimestamp / PARTITIONING_QUANTUM) * ROWS_PER_QUANTUM
            + (startTimestamp % PARTITIONING_QUANTUM) % ROWS_PER_QUANTUM;
    return PtBytes.toBytes(Long.reverse(row));
}
```

We are using a fixed-long encoding here, which uses a constant 8 bytes (instead of a variable 1 to 9 bytes). This was
chosen to ensure that range scans are supported, as the reversed forms of variable-length encodings tend not to be
amenable to range scans.

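For completeness, a sketch of the inverse operation. The helper below is hypothetical, and it assumes that ``PtBytes.toBytes`` writes the long in big-endian order, as its HBase counterpart does.

```java
// Hypothetical inverse of encodeRowName: recover the logical row number from the stored 8 bytes.
private static long decodeRowNumber(byte[] rowName) {
    long reversedRow = java.nio.ByteBuffer.wrap(rowName).getLong(); // big-endian read
    return Long.reverse(reversedRow); // Long.reverse is its own inverse
}
```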
For the dynamic column keys, we simply use VAR_LONG encoding.

We changed the encoding of values as well. For transactions that successfully committed, we use a delta encoding
scheme, storing the VAR_LONG of the difference between the commit and start timestamps, which is
typically a small positive number. This helps to keep the size of the table down. Separately, for transactions that
were aborted, we store an empty byte array as our special value instead of a negative number, which would be
unnecessarily large.

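The following sketch illustrates this value scheme. It substitutes a toy 7-bits-per-byte varint for AtlasDB's actual VAR_LONG encoding (whose byte layout differs), so the class and helper names here are purely illustrative; the point is the small delta for committed transactions and the empty array for aborted ones.

```java
import java.io.ByteArrayOutputStream;

final class CommitValueSketch {
    // Toy varint (7 bits per byte); NOT AtlasDB's VAR_LONG, used only to show relative sizes.
    private static byte[] toyVarLong(long n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        do {
            int b = (int) (n & 0x7f);
            n >>>= 7;
            if (n != 0) {
                b |= 0x80; // more bytes follow
            }
            out.write(b);
        } while (n != 0);
        return out.toByteArray();
    }

    // Committed: store the (typically small) delta; aborted: store an empty byte array.
    static byte[] encodeCommitValue(long startTimestamp, long commitTimestamp) {
        if (commitTimestamp == -1) {
            return new byte[0];
        }
        return toyVarLong(commitTimestamp - startTimestamp);
    }

    public static void main(String[] args) {
        System.out.println(encodeCommitValue(3141592, 3141595).length); // 1 byte: delta of 3
        System.out.println(encodeCommitValue(37, -1).length);           // 0 bytes: aborted
    }
}
```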
### Choosing PQ and NP

We choose values of PQ and NP based on characteristics of the key-value service in which we are storing the timestamp
data, recalling principle RUP. To simplify the discussion in this section, we assume NP divides PQ.

For Cassandra, following the Atlas team's recommended Cassandra best practices, we seek to bound the size of an
individual row by 100 MB. Considering that a VAR_LONG takes at most 9 bytes for positive integers, and we
can explicitly use an empty byte array to represent a transaction that failed to commit, we can estimate the
maximum size of a row for a given choice of values of PQ and NP.

Notice that the number of timestamps we actually need to store in a row is given by PQ / NP (since the PQ timestamps
in a partition are spread evenly across its NP rows). For a given start/commit timestamp pair, the row key occupies
8 bytes; that said, in SSTables the row key only needs to be represented once. We thus focus on the column keys and
values.

The column key is a VAR_LONG encoded number that is bounded by PQ / NP, as that's the largest offset we might actually
store. The value could theoretically go up to a full 9 bytes for a large positive number, though in practice it is
likely to be considerably smaller.

We selected values of PQ = 25,000,000 and NP = 16. Under this configuration, each row stores at most 1,562,500
start-commit timestamp pairs. Thus, the column keys within a row only go up to 1,562,500, and VAR_LONG encoding is
able to represent these within 3 bytes. We do need to account for a bit more space as Cassandra needs to create a
composite buffer to include the ``column2`` part of our physical row, but this should not take more than an additional
9 bytes. We thus have 12 bytes for the column key and 9 bytes for the value, leading to a total of 21 bytes per
start-commit timestamp pair. Each row is then bounded by about 32.8 MB, which leaves us quite a bit of headroom.

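The bound is straightforward arithmetic; a quick sketch of the calculation, using the per-pair byte estimates above, follows.

```java
final class RowSizeEstimate {
    public static void main(String[] args) {
        long partitioningQuantum = 25_000_000L; // PQ
        long rowsPerQuantum = 16;               // NP
        long pairsPerRow = partitioningQuantum / rowsPerQuantum; // 1,562,500

        long columnKeyBytes = 12; // ~3-byte VAR_LONG offset plus composite-buffer overhead
        long valueBytes = 9;      // worst-case VAR_LONG value

        long rowBytes = pairsPerRow * (columnKeyBytes + valueBytes);
        System.out.printf("row bound: %.2f MB%n", rowBytes / 1_000_000.0); // ~32.81 MB
    }
}
```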
It is worth mentioning that, for implementation reasons, it is very unlikely that a single row will actually contain
the full 1,562,500 start-commit timestamp pairs, and values are likely to be only 2 or 3 bytes rather than
9 bytes. In any case, even under these worst-case assumptions we still avoid generating excessively wide rows.

### Streamlining putUnlessExists

The putUnlessExists operation is performed at a serial consistency level in Cassandra, meaning that reads and writes
go through Paxos for consensus. Cassandra's Thrift API exposes a check-and-set operation:

```thrift
CASResult cas(1:required binary key,
              2:required string column_family,
              3:list<Column> expected,
              4:list<Column> updates,
              5:required ConsistencyLevel serial_consistency_level=ConsistencyLevel.SERIAL,
              6:required ConsistencyLevel commit_consistency_level=ConsistencyLevel.QUORUM)
    throws (1:InvalidRequestException ire, 2:UnavailableException ue, 3:TimedOutException te)
```

This was sufficient in the original transactions schema, because each row key only stores information about one
start timestamp; a putUnlessExists operation is then a CAS from an empty row to a row that has one column.
However, this is not sufficient for transactions2, because each row may contain data about multiple start
timestamps. The ``cas`` API requires us to provide a list of the existing columns, and we do not know that set
beforehand.

We considered alternatives: reading the existing row and then adding the new columns, or using the CQL API (since the
behaviour of INSERT IF NOT EXISTS for columns matches the semantics we want). However, both of these solutions were
found to have unacceptable performance in benchmarking.

We thus decided to extend the Thrift interface to add support for a multi-column put-unless-exists operation that
has the semantics we want. This differs from a CAS with an empty expected list, in that it succeeds as long as the
existing columns in the column family for the provided key do not overlap with the set of columns being added.

```thrift
CASResult put_unless_exists(1:required binary key,
                            2:required string column_family,
                            3:list<Column> updates,
                            4:required ConsistencyLevel serial_consistency_level=ConsistencyLevel.SERIAL,
                            5:required ConsistencyLevel commit_consistency_level=ConsistencyLevel.QUORUM)
    throws (1:InvalidRequestException ire, 2:UnavailableException ue, 3:TimedOutException te)
```

#### Multinode Contention and Residues

This extension is an improvement, but it still runs into issues when clients (whether across multiple service nodes or
on the same node) issue multiple requests in parallel, because each put_unless_exists request requires a round of
Paxos. Cassandra maintains Paxos sequences at the level of a partition (key), so these requests contend as far as
Paxos is concerned, even if the columns being written are actually disjoint. Internally, Cassandra nodes are trying to
apply updates to the partition; whether these updates are applied, and the order in which they take place, is agreed
on using Paxos. Although the nodes will accept multiple proposals if they don't conflict, only one round of consensus
can be committed at a time (since updates are conditional). Also, Cassandra uses a leaderless implementation of
Paxos, meaning that the 'dueling proposers' issue might slow an individual round of the protocol down if multiple
nodes are trying to propose values concurrently.

Batching requests on the client side for each partition could be useful (see the sketch below), though that approach
is still limited in that performance would be poor for services with many nodes.

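One possible shape for such client-side batching (a hypothetical helper, not the actual implementation): group pending commits by the _transactions2 row they map to, so that each contended row needs only a single multi-column put_unless_exists.

```java
import java.util.HashMap;
import java.util.Map;

final class CommitBatchingSketch {
    private static final long PARTITIONING_QUANTUM = 25_000_000L;
    private static final long ROWS_PER_QUANTUM = 16;

    // Groups start -> commit timestamp pairs by the row they land in under the tickets encoding,
    // so a client can issue one put_unless_exists per row rather than one per transaction.
    static Map<Long, Map<Long, Long>> groupByRow(Map<Long, Long> startToCommit) {
        Map<Long, Map<Long, Long>> byRow = new HashMap<>();
        startToCommit.forEach((start, commit) -> {
            long row = (start / PARTITIONING_QUANTUM) * ROWS_PER_QUANTUM
                    + (start % PARTITIONING_QUANTUM) % ROWS_PER_QUANTUM;
            byRow.computeIfAbsent(row, unused -> new HashMap<>()).put(start, commit);
        });
        return byRow;
    }
}
```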
### Cassandra Table Tuning

#### Bloom Filters

#### Index Intervals

#### Compression Chunk Length

### Cell Loader V2

### Live Migrations and the Coordination Service

## Consequences

### Write Performance

### Read Performance

### Data Compression

### Backup and Restore

### Cassandra Dependencies

## Alternatives Considered

### Use TimeLock / other Paxos mechanism for transactions

### The Row-Tickets Algorithm

## Future Work

### DbKVS and Transactions2