runtime+storage: integrate disk storage
With this change, the disk backend (badger) becomes properly available for use with the OPA runtime: it can be configured using the `storage.disk` key in OPA's config (see included documentation). When enabled,

- any data or policies stored with OPA will persist over restarts
- per-query metrics related to disk usage are reported
- Prometheus metrics per storage operation are exported

The main intention behind this feature is to optimize memory usage: OPA can now operate on more data than fits into the allotted memory resources. It is NOT meant to be used as a primary source of truth: there are no backup/restore or disaster recovery procedures -- you MUST secure the means to restore the data stored with OPA's disk storage yourself.

See also open-policy-agent#4014. Future improvements around bundle loading are planned.

Some notes on the details:

storage/disk: impose the same locking regime used with inmem

With this setup, we ensure that:

- there is only one open write txn at a time
- there can be any number of open read txns at a time
- writes are blocked while reads are in flight
- during a commit (and while triggers are run), no read txns can be created

This is to ensure the same atomic policy update semantics when using "disk" as we have with "inmem". We're essentially opting out of badger's concurrency control and transactionality guarantees, because we cannot piggyback on them to get the atomic updates we want. There might be other ways -- using subscribers, and blocking in some other place -- but this one seems preferable since it mirrors inmem.

Part of the problem is ErrTxnTooLarge, and committing and renewing txns when it occurs: that, while being the prescribed solution to txns growing too big, also means that reads could see half of the "logical" transaction committed while the rest is still being processed.

Another approach would have been to use `WriteBatch`, but that won't let us read from the batch, only apply Set and Delete operations.
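The read dependency is the crux: replacing a subtree means first scanning existing keys under its prefix and deleting them, which `WriteBatch` cannot do. A minimal stdlib-only sketch of that scan-and-delete step, using a plain Go map as a stand-in for an iterator over badger's keyspace (the helper name `deletePrefix` is hypothetical, not OPA's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// deletePrefix removes every key under prefix from kv and reports how
// many were deleted. In the real store this is an iterator walk inside
// the write txn, since badger offers no DropPrefix on a txn (or in the
// WriteBatch API).
func deletePrefix(kv map[string]string, prefix string) int {
	n := 0
	for k := range kv {
		if strings.HasPrefix(k, prefix) {
			delete(kv, k)
			n++
		}
	}
	return n
}

func main() {
	kv := map[string]string{
		"/foo/bar/baz": "1",
		"/foo/bar/qux": "2",
		"/foo/other":   "3",
	}
	deleted := deletePrefix(kv, "/foo/bar/")
	fmt.Println(deleted, len(kv)) // 2 1
}
```

The point of the sketch: the delete set is only discoverable by reading, which is why the implementation needs a real transaction rather than a write-only batch.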
We currently need to read (via an iterator) to figure out if we need to delete keys to replace something in the store. There is no DropPrefix operation on the badger txn, or in the WriteBatch API.

storage/disk: remove commit-and-renew-txn code for txn-too-big errors

That code would break the transactional guarantees we care about: while there can only be one write transaction at a time, read transactions may happen while a write txn is underway -- with the commit-and-reset logic, those would read partial data.

Now, the error is returned to the caller. The maximum txn size depends on the size of the memtables and could be tweaked manually. In general, the caller should try to push multiple smaller increments of the data.

storage/disk: implement noop MakeDir

The MakeDir operation as implemented in the backend-agnostic storage code has become an issue with the disk store: to write /foo/bar/baz, we'd have to read /foo (among other subdirs), and that can be _much_ work for the disk backend. With inmem, it's cheap, so this wasn't problematic before.

Some of the storage/disk/txn.go logic had to be adjusted to properly do the MakeDir steps implicitly. The index argument added to patch() in storage/disk/txn.go was necessary to keep the error messages conforming to the previous code path: previously, conflicts (arrays indexed as objects) would be surfaced in the MakeDir step; now they are entangled with the patch calculation.

storage/disk: check ctx.Err() in List/Get operations

This won't abort reading a single key, but it will abort iterations.

storage/disk: support patterns in partitions

There is a potential clash here: "*", the path wildcard, is itself a valid path section. However, this only affects the case where a user wants a partition at /foo/*/bar and really means a literal "*", not the wildcard. Storing data at /foo/*/bar with a literal "*" won't be treated differently than storing something at /foo/xyz/bar.
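The wildcard clash can be made concrete with a small matching sketch, where "*" in a partition matches any single path segment (the `matches` helper is hypothetical and simplified, not OPA's actual partition-matching code):

```go
package main

import "fmt"

// matches reports whether path falls under partition, where a "*"
// segment in the partition matches any single path segment. Because
// "*" is also a legal literal segment, a path containing a literal
// "*" is indistinguishable from one matched by the wildcard.
func matches(partition, path []string) bool {
	if len(partition) != len(path) {
		return false
	}
	for i := range partition {
		if partition[i] != "*" && partition[i] != path[i] {
			return false
		}
	}
	return true
}

func main() {
	part := []string{"foo", "*", "bar"}
	fmt.Println(matches(part, []string{"foo", "xyz", "bar"})) // true: wildcard segment
	fmt.Println(matches(part, []string{"foo", "*", "bar"}))   // true: literal "*" matches too
}
```

Both calls return true, which is exactly the clash described above: the store cannot tell a literal "*" apart from a wildcard match.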
storage/disk: keep per-txn-type histograms of stats

This is done by reading off the metrics on commit and shovelling their numbers into the Prometheus collector.

NOTE: if you were to share a metrics object among multiple transactions, the results would be skewed, as it's not reset. However, our server handlers don't do that.

storage/disk: opt out of badger's conflict detection

With only one write transaction in flight at any time, the situation that badger guards against cannot happen: a transaction writing to a key after the current, to-be-committed transaction has last read that key from the store. Since it can't happen, we can ignore the bookkeeping involved. This improves the time it takes to overwrite existing keys.

Signed-off-by: Stephan Renatus <[email protected]>
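The read-metrics-on-commit bookkeeping described above can be sketched with stdlib types only (all names here are hypothetical; OPA's real code records into Prometheus histograms via its metrics package, and this sketch substitutes a plain slice of observations per histogram):

```go
package main

import "fmt"

// txnMetrics accumulates counters over the lifetime of one txn.
// It is never reset, so sharing it across txns would skew results.
type txnMetrics struct {
	counters map[string]int64
}

func newTxnMetrics() *txnMetrics {
	return &txnMetrics{counters: map[string]int64{}}
}

func (m *txnMetrics) incr(name string) { m.counters[name]++ }

// collector keeps one "histogram" (a slice of observations) per
// txn type and metric name, keyed as "<txnType>_<name>".
type collector struct {
	observations map[string][]int64
}

// observeCommit reads the metrics off a finished txn and records one
// observation per counter into the matching per-txn-type histogram.
func (c *collector) observeCommit(txnType string, m *txnMetrics) {
	for name, v := range m.counters {
		key := txnType + "_" + name
		c.observations[key] = append(c.observations[key], v)
	}
}

func main() {
	c := &collector{observations: map[string][]int64{}}
	m := newTxnMetrics() // fresh metrics object per txn
	m.incr("disk_written_keys")
	m.incr("disk_written_keys")
	c.observeCommit("write", m)
	fmt.Println(c.observations["write_disk_written_keys"]) // [2]
}
```

Each commit contributes exactly one observation per counter, which is why a fresh metrics object per transaction is required for the histograms to be meaningful.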