Disk-based Storage #4014

Closed
5 of 6 tasks
tsandall opened this issue Nov 15, 2021 · 7 comments

tsandall commented Nov 15, 2021

Now that OPA includes a disk-based implementation of the storage.Store interface, we can work on integrating that implementation into the OPA server. This issue tracks the work that needs to be done; it will be updated as unknowns surface:

  • identify and refactor internal code that relies on reading from transaction in the commit hook (this isn't supported by the disk-based store)
  • identify and refactor internal code that relies on data paths in transaction events (or improve disk-based store to record data paths for transactions)
  • expose configuration to enable disk-based store (minimally storage directory and partitions)
  • instantiate disk-based store in runtime instead of in-memory store based on configuration
  • investigate optimizations to avoid loading bundles fully into memory and instead stream-to-disk
  • investigate whether server integration tests can be run with disk-based storage to show conformance
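For the "expose configuration" item above, the feature eventually landed behind a `storage.disk` key in OPA's configuration file (see the commit message later in this thread). A sketch of what such a config could look like; the field names `directory`, `auto_create`, and `partitions` are assumptions here and should be checked against the released documentation:

```yaml
storage:
  disk:
    # Field names below are assumed -- verify against the released docs.
    directory: /var/lib/opa   # where the disk store keeps its files
    auto_create: true         # create the directory if it is missing
    partitions:               # data paths stored under separate keys
      - /users/*
```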

floriangasc commented Nov 16, 2021

Hi @tsandall,

If you need help with anything, maybe I can contribute.

  1. Expose configuration to enable the disk-based store (minimally storage directory and partitions): I have written a little code that adds CLI arguments for the disk configuration.
  2. Instantiate the disk-based store in the runtime instead of the in-memory store based on configuration: I have started a small function that chooses the store type at runtime according to the configuration; maybe it can serve as a base.

If you want, I can paste the code (or open a PR) if it would be useful. It's not a big part of the problem, I think.

The first two points (the refactoring) may be a little hard for me. As for the stream-to-disk optimization, I have taken a look, but there seems to be a lot of code involved, so it's also a little hard. But if I can help you with some task, why not.

tsandall (Member, Author) commented:

Hi @floriangasc - I agree that the first two points are probably the hardest. I think we need to get those sorted out first since they're the highest risk. I'm hoping that we'll work on this in the next 2-3 months.

@tsandall tsandall changed the title from "Enable disk-based storage the OPA server" to "Disk-based Storage" on Dec 8, 2021
floriangasc (Contributor) commented:

Hi @tsandall,

I am going to try OPA in production (in mirroring mode) with the disk store enabled. Here is some feedback; you may already know this, but in case it helps:

  1. The disk store implementation, badger, uses a non-negligible amount of RAM, especially when writes occur often (every 5-10 minutes). In our case, the same amount of RAM is consumed with and without badger. The causes differ, but from the OS point of view it ends up being the same amount of RAM. The main contributor I found is badger's memtables (thanks to pprof). I tried the conservative RAM settings found in some issues, but it still consumes too much RAM to run as a service sidecar (more than 6 GB).
  2. I have read somewhere in the OPA GitHub repo that some badger properties should be customizable. I think a few more should be exposed; I needed to use these:
o = o.WithIndexCacheSize(100 << 20). // 100 MiB
	WithNumMemtables(1).
	WithMemTableSize(32 << 20). // 32 MiB
	WithBaseTableSize(1 << 20). // 1 MiB
	WithBlockCacheSize(1 << 20) // 1 MiB

Some useful links I found while researching badger:
https://github.com/dgraph-io/ristretto#Config
https://discuss.dgraph.io/t/memtable-memory-usage-causes-oom/9059
https://dgraph.io/docs/badger/get-started/#in-memory-mode-diskless-mode/
https://github.com/dgraph-io/badger/blob/master/options.go
dgraph-io/badger#1304

The disk store's RAM consumption is not coupled to the amount of data loaded; it is more a minimum amount of RAM that badger needs to work (which varies with the data loaded, accessed, and updated).

Finally, after studying the OPA GitHub issues about this kind of problem, I chose to develop a plugin that uses the OPA downloader to fetch bundles locally. The bundles are not activated; they are just present locally in the sidecar. A bundle is activated on demand, based on the input, for a defined duration (5 or 10 minutes, tuned from production feedback). Now OPA consumes 512 MB of RAM in the worst case, and policies are evaluated in 10 microseconds in the worst case. Bundles are kept up to date via long polling, and RAM consumption no longer depends on all the users, groups, and so on present in the database, only on those connected to the specific service. So the plugin is the most promising solution for us; I abandoned the disk store because of its RAM consumption.

If you want any more information, let me know. Thanks for all your work; OPA has very good code (easy to read, learn, and modify), and it solves a lot of problems for us in a clean way.


tsandall commented Jan 26, 2022

@floriangasc sorry for the delayed reply--this got buried in updates over the holidays.

Is it possible to share the code you used to stress test the disk store in OPA? I didn't observe the behaviour you're describing when I tested badger almost a year ago but perhaps I missed something.

EDIT: Also, I'd just like to say thank you for the detailed writeup of what you observed!

@srenatus srenatus self-assigned this Feb 22, 2022
srenatus added a commit to srenatus/opa that referenced this issue Mar 30, 2022
With this change, the disk backend (badger) becomes available for
use with the OPA runtime properly:

It can be configured using the `storage.disk` key in OPA's config
(see included documentation).

When enabled,
- any data or policies stored with OPA will persist over restarts
- per-query metrics related to disk usage are reported
- Prometheus metrics per storage operation are exported

The main intention behind this feature is to optimize memory usage:
OPA can now operate on more data than fits into the allotted memory
resources. It is NOT meant to be used as a primary source of truth:
there are no backup/restore or disaster recovery procedures -- you
MUST secure the means to restore the data stored with OPA's disk
storage yourself.

See also open-policy-agent#4014. Future improvements around bundle loading are
planned.

Some notes on details:

storage/disk: impose same locking regime used with inmem

With this setup, we'll ensure:

- there is only one open write txn at a time
- there are any number of open read txns at a time
- writes are blocked when reads are inflight
- during a commit (and triggers being run), no read txns can be created

This is to ensure the same atomic policy update semantics when using
"disk" as we have with "inmem". We're basically opting out of badger's
concurrency control and transactionality guarantees, because we
cannot piggyback on them to ensure the atomic update we want.

There might be other ways -- using subscribers, and blocking in some
other place -- but this one seems preferable since it mirrors inmem.

Part of the problem is ErrTxnTooLarge, and committing and renewing
txns when it occurs: that approach -- the prescribed solution to txns
growing too big -- also means that reads can see half of the "logical"
transaction committed while the rest is still being processed.

Another approach would have been using `WriteBatch`, but that won't
let us read from the batch, only apply Set and Delete operations.
We currently need to read (via an iterator) to figure out if we
need to delete keys to replace something in the store.  There is
no DropPrefix operation on the badger txn, or the WriteBatch API.

storage/disk: remove commit-and-renew-txn code for txn-too-big errors

This would break transactional guarantees we care about: while there
can be only one write transaction at a time, read transactions may
happen while a write txn is underway -- with this commit-and-reset
logic, those would read partial data.

Now, the error will be returned to the caller. The maximum txn size
depends on the size of memtables, and could be tweaked manually.
In general, the caller should try to push multiple smaller increments
of the data.

storage/disk: implement noop MakeDir

The MakeDir operation as implemented in the backend-agnostic storage
code has become an issue with the disk store: to write /foo/bar/baz,
we'd have to read /foo (among other subdirs), and that can be _much_
work for the disk backend. With inmem, it's cheap, so this wasn't
problematic before.

Some of the storage/disk/txn.go logic had to be adjusted to properly
do the MakeDir steps implicitly.

The index argument addition to patch() in storage/disk/txn.go was
necessary to keep the error messages conforming to the previous
code path: previously, conflicts (arrays indexed as objects) would
be surfaced in the MakeDir step, now it's entangled with the patch
calculation.

storage/disk: check ctx.Err() in List/Get operations

This won't abort reading a single key, but it will abort iterations.

storage/disk: support patterns in partitions

There is a potential clash here: "*", the path wildcard, is
a valid path section. However, it only affects the case when
a user would want to have a partition at

    /foo/*/bar

and would really mean "*", and not the wildcard.

Storing data at /foo/*/bar with a literal "*" won't be treated
differently from storing something at /foo/xyz/bar.

storage/disk: keep per-txn-type histograms of stats

This is done by reading off the metrics on commit, and shovelling
their numbers into the prometheus collector.

NOTE: if you were to share a metrics object among multiple transactions,
the results would be skewed, as it's not reset. However, our server
handlers don't do that.

storage/disk: opt out of badger's conflict detection

With only one write transaction in flight at any time, the situation
that badger guards against cannot happen:

A transaction has written to a key after the current, to-be-committed
transaction has last read that key from the store.

Since it can't happen, we can ignore the bookkeeping involved. This
improves the time it takes to overwrite existing keys.

Signed-off-by: Stephan Renatus <[email protected]>
srenatus added a commit that referenced this issue Mar 30, 2022
damienjburks added a commit to damienjburks/opa that referenced this issue Apr 1, 2022
Signed-off-by: Damien Burks <[email protected]>

removing space from testcase

Signed-off-by: Damien Burks <[email protected]>

website: make local dev and PR preview not build everything (open-policy-agent#4474)

With this change, the work done for local development, and for the per-PR netlify preview, changes:

It will no longer include the website content for versions other than edge, the current working tree.

We thus save ourselves the time, and the flakiness, involved in

- checking if github has release binaries for all the versions
- checking out their sources
- fetching the release binaries to pre-hydrate old versions' live-blocks.

The previously-used, documented make target should still be intact.

Fixes open-policy-agent#4379 to some extent, I hope.

* docs/website: remove "latest" binary from opa versions cache

Having a stale binary here -- one called "latest" but not actually
being "latest" -- causes issues like this: when building the website
content for the (real) latest version, the script would take the
old (previous-latest) binary, and fail because that binary didn't
know the latest future keywords.

Signed-off-by: Stephan Renatus <[email protected]>

website: fix download link redirect (open-policy-agent#4482)

Fixes the bug introduced in open-policy-agent#4474 causing the setup-opa action to fail
to retrieve the "latest" opa binary.

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): bump github.com/yashtewari/glob-intersection to v0.1.0 (open-policy-agent#4481)

https://github.com/yashtewari/glob-intersection/releases/tag/v0.1.0

Signed-off-by: Stephan Renatus <[email protected]>

docs/website: fix /docs redirect (open-policy-agent#4483)

Signed-off-by: Stephan Renatus <[email protected]>

website: fix template to produce proper redirects (open-policy-agent#4484)

Also redirect /docs -> /docs/edge in preview

Signed-off-by: Stephan Renatus <[email protected]>

docs/envoy: include new configurable gRPC msg sizes (open-policy-agent#4459)

Docs for open-policy-agent/opa-envoy-plugin#323

Signed-off-by: Elliot Maincourt <[email protected]>

website/live-blocks: update caniuse/browserlist

Signed-off-by: Stephan Renatus <[email protected]>

website/live-blocks: don't call the github api to determine release asset urls

Signed-off-by: Stephan Renatus <[email protected]>

website: don't run generate twice (open-policy-agent#4488)

Signed-off-by: Stephan Renatus <[email protected]>

ast: Fix print call rewriting in else rules

The compiler was accidentally checking/rewriting print calls in the
body of else rules before the implicit args of the else rule were
processed. As a result, the compiler was generating false-positive
errors for refs to undeclared args. The root of the problem was the
usage of WalkBodies on the top-level rule (which implicitly walks the
bodies of else rules under the top-level rule). With this change, the
compiler will call WalkBodies on the head and body of each rule rather
than the entire rule itself (which includes the else chain).

Fixes open-policy-agent#4489

Signed-off-by: Torin Sandall <[email protected]>

Update maintainer status, add Will Beason (open-policy-agent#4465)

Signed-off-by: Max Smythe <[email protected]>

Co-authored-by: Will Beason <[email protected]>

Make maintainers list alphabetical (open-policy-agent#4491)

Signed-off-by: Max Smythe <[email protected]>

plugins/logs: Fix broken retry logic

We were incorrectly resetting the retry counter after
every error condition instead of using the incremented
value. As a result, retry delay would always be 0s.
This meant that if OPA encountered an error while
uploading decision logs it would immediately retry
instead of doing an exponential backoff.

Fixes: open-policy-agent#4486

Signed-off-by: Ashutosh Narkar <[email protected]>

logging: mask authorization header value in debug logs (open-policy-agent#4496)

Fixes open-policy-agent#4495

Signed-off-by: Anders Eknert <[email protected]>

Fixes open-policy-agent#4376 (open-policy-agent#4494)

Do note though that this does not change how multiple
with statements are grouped. Although I agree with that,
it's IMHO a separate feature request, while the spacing
issue is a bug.

Signed-off-by: Anders Eknert <[email protected]>

runtime: improve log output for binary response (open-policy-agent#4498)

This change omits the response body when using compression on metrics endpoint and when pprof is enabled.

Signed-off-by: Christian Altamirano <[email protected]>

build: bump golang: 1.17 -> 1.18

No change to go.mod's `go` stanza, so no changes in code compatibility.

However, it's used for building our docker images and release
binaries, and for fuzz testing in our nightly workflow.

Some test-related changes with the dns lookup built-in function's
error handling; and the hardcoded signature. Running

    go test ./topdown -run TestTopdownJWTEncodeSignECWithSeedReturnsSameSignature -count 10000

makes me believe that for whatever reason the signature changed,
it's at least stable.

topdown/http_test: Test-only change to accommodate this change in Go (https://go.dev/doc/go1.18):

    Certificate.Verify now uses platform APIs to verify certificate
    validity on macOS and iOS when it is called with a nil
    VerifyOpts.Roots or when using the root pool returned from
    SystemCertPool.

We're keeping the old message for go <= 1.17; in a silly-simple way.

Also:

* ci: build and test two old golang version on macos|linux

  We'll drop golang 1.15, keep one unsupported version (1.16).

Signed-off-by: Stephan Renatus <[email protected]>

ci: remove go-fuzz, use native go 1.18 fuzzer

Signed-off-by: Stephan Renatus <[email protected]>

build(dev-deps): update live-scripts deps: mocha, nanoid (open-policy-agent#4500)

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): bump nanoid in /docs/website/scripts/live-blocks (open-policy-agent#4501)

update bob_token and alice_token in markdown (open-policy-agent#4504)

Signed-off-by: yongen.pan <[email protected]>

topdown/eval: copy without modifying expr, update test/e2e/concurrency (open-policy-agent#4503)

What we previously did turned into a race condition with multiple
concurrent calls to /v1/compile.

With a change introduced with 0.38.0 (the `every` keyword), the
`nil` Terms of an `ast.Expr` node was surfaced: previously, it would
go unnoticed, but could potentially have yielded bad results.

The effect of this change is proven by a new e2e test that would
fail on the code we had previously.

Signed-off-by: Stephan Renatus <[email protected]>

uploading latest changes

Signed-off-by: Damien Burks <[email protected]>

fixing test cases

Signed-off-by: Damien Burks <[email protected]>

sdk: avoid using different state (open-policy-agent#4505)

I noticed that when operating on opa.state, locking is usually done to avoid
a race, whereas here opa.state is used directly. By comparing the previous
changes, I found that open-policy-agent#4240 changed the previous behaviour.

This change adjusts that: we ensure that we work on the state as read via
the mutex-protected `s := *opa.state`.

Signed-off-by: Iceber Gu <[email protected]>

build(deps): update opentelemetry-go: 1.5.0 -> 1.6.1 (metrics 0.28.0) (open-policy-agent#4467)

https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.0
https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1
https://github.com/open-telemetry/opentelemetry-go-contrib/releases/tag/v1.6.0

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): open opentelemetry-go 1.6.0 -> 1.6.1 (open-policy-agent#4512)

This is a left-over from the previous bump, 0856120.

https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1

Signed-off-by: Stephan Renatus <[email protected]>

sdk/opa_test: increase max delta (0.5s) (open-policy-agent#4513)

When running on GHA, we've found this test to often fail on macos-latest:
it would not be functionally wrong, but it wasn't able to finish in
the allotted time. Now, the maximum delta has been increased a lot (10ms to
500ms). That's a lot, but it's still good enough to ensure that the context
passed to Stop() is the one that matters for shutdown.

Signed-off-by: Stephan Renatus <[email protected]>

docs: Add clarifications for docs describing bundle signing and delta features

This commit adds some clarification to the bundle doc regarding
the format of bundle signatures and manifest behavior for delta
bundles.

Signed-off-by: Ashutosh Narkar <[email protected]>

runtime: change processed file event log level to info (open-policy-agent#4514)

"Processed file watch event." will now be logged at level "info" (was "warn").

Signed-off-by: Anders Eknert <[email protected]>

runtime/logging: only suppress payloads for handlers that compress responses (open-policy-agent#4502)

* runtime/logging: only suppress payloads for handlers that compress responses

To get compressed responses, two things need to be true:

1. The client must accept compressed responses
2. The handler must reply with a compressed response

For general API requests, (1) holds most of the time. (2) is only true
for the metrics endpoint at the moment, since the 3rd-party library we
use for serving the prometheus endpoint will do compression.

* runtime/logging: remove dead code

The http.Hijack stuff was related to a watch feature removed in
open-policy-agent@186ef99

dropInputParam was only used by its tests.

Signed-off-by: Stephan Renatus <[email protected]>

storage: code cosmetics

Signed-off-by: Stephan Renatus <[email protected]>

storage: allow implementations to override MakeDir

Signed-off-by: Stephan Renatus <[email protected]>

runtime+storage: integrate disk storage

Signed-off-by: Stephan Renatus <[email protected]>

docs/management-bundles: remove quotations with info boxes (open-policy-agent#4517)

Small follow-up to open-policy-agent#4515, using info boxes instead of generic markup.

Signed-off-by: Stephan Renatus <[email protected]>

website: fix community site (open-policy-agent#4522)

* remove hugo templating

This page doesn't go through hugo at all, so these things would
end up on the website.

A quick fix is reverting the bits that added the templating. The
downside is that local dev and PR previews will link to
/docs/latest/ecosystem, which doesn't exist there.

* link Documentation and Download to /docs

For /docs/ (not /docs/latest) we have a redirect in local dev and
PR preview mode:

    /docs -> /docs/edge

so this will be a little less broken: only ecosystem is a 404 link
on the preview.

Signed-off-by: Stephan Renatus <[email protected]>

integrations: Add Emissary-Ingress (open-policy-agent#4523)

This PR adds the Emissary ingress to the Ecosystem page.
The blog section contains link to an article which guides on Emissary ingress and OPA integrations.

Signed-off-by: Tayyab J <[email protected]>

storage/disk: wildcard partition validation, docs caveat (open-policy-agent#4519)

A bunch of smaller follow-up tasks to open-policy-agent#4381.

* storage/disk_test: check invalid patches with wildcard partition, too
* docs/disk: add caveat re: bundles loaded into memory
* storage/disk: auto-manage /system partitions

If these are found in the user-provided partitions, we'll error out.

* storage/disk: pretty-print partitions with "*" instead of %2A
* storage/disk: respect wildcard-replacement in partition validation

It is now allowed to replace a partition like

    /foo/bar

by

    /foo/*

also if multiple wildcards are used.

Caveats:

You cannot add a wildcard partition like /*/*, since it would overlap
the managed "/system/*" partition.

When attempting to go back from /foo/* to /foo/bar, an error is
raised _unconditionally_ -- we could check the existing data, but
currently don't.

* storage/disk: check prefix when adding wildcard partitions

The previously done check would have falsely returned that there is no problem
when adding a wildcard partition: lookup of "/foo/*" with '*' not interpreted
as a wildcard, but as a string, would yield a not-found, even if there was any
data under /foo/.

Now, we'll check the prefix-until-wildcard. It's more cautious than
theoretically necessary, but safe.

Signed-off-by: Stephan Renatus <[email protected]>

website/style: fix code block bottom margin (open-policy-agent#4526)

The text following regular code blocks was too close. Now, the bottom
margin is the same as used with live-blocks, 1.5rem (24px).

Signed-off-by: Stephan Renatus <[email protected]>

docs: Add CloudFormation Hook tutorial (open-policy-agent#4525)

Obviously missing is deployment of OPA to AWS. This will come in the next iteration.

Signed-off-by: Anders Eknert <[email protected]>

Prepare v0.39.0 release (open-policy-agent#4524)

Signed-off-by: Stephan Renatus <[email protected]>

Prepare v0.40.0 development (open-policy-agent#4527)

Signed-off-by: Stephan Renatus <[email protected]>

website: fix styra support link

Signed-off-by: Torin Sandall <[email protected]>

adopters: Add Wealthsimple (open-policy-agent#4530)

Signed-off-by: hannahyeates <[email protected]>

ci: fix rego check (open-policy-agent#4532)

* build/policies: format using 0.39.0
* workflow/pull-request: use edge opa for rego PR checks

Signed-off-by: Stephan Renatus <[email protected]>

website: remove right margin on sidebar (open-policy-agent#4529)

* remove right margin on sidebar
* replace the removed margin with padding

Signed-off-by: orweis <[email protected]>

website: Remove unused variables to avoid error in strict mode (open-policy-agent#4534)

To Fix:
```
2 errors occurred:
policy.rego:13: rego_compile_error: assigned var header unused
policy.rego:13: rego_compile_error: assigned var signature unused
```

Signed-off-by: panpan0000 <[email protected]>

website: show yellow banner for old version (open-policy-agent#4533)

Hopefully this gets more attention than the yellow "older" button
in the version picker.

Signed-off-by: Stephan Renatus <[email protected]>

fixing disk diagnostic test case
@tsandall
Member Author

tsandall commented Apr 2, 2022

@srenatus just to double check... the last task on the list above ("investigate whether server integration tests can be run with disk-based storage to show conformance") is complete, right? Perhaps we should close this issue and file a new one to track the optimizations/improvements for bundle loading.

@srenatus
Contributor

srenatus commented Apr 4, 2022

Yes, a select few have been changed to run with both inmem and disk storage; notably those checking the PUT /v1/data API. They yielded a good number of issues, which were resolved via #4381.

Agreed, I'll close this one in favour of an issue focusing on bundles.

@floriangasc
Contributor

floriangasc commented Apr 4, 2022

@tsandall

Hi. Sorry for the delay, I had forgotten to respond. I used Istio's traffic-mirroring capability.

  1. I put OPA as a sidecar of a dedicated testing ingress gateway (openresty/nginx) k8s deployment.
  2. I configured OPA to "listen" to our production bundle server (long polling).
  3. A new bundle, or a new version of a bundle, appeared every 5 minutes, and OPA evaluated around 50-100 requests per second.
  4. The testing ingress gateway k8s deployment received all mirrored traffic (GET requests) from the real ingress gateway k8s deployment.
  5. It ran for about five days (or something like that) before we decided to switch to another approach.

ps:

  1. If you have more questions, don't hesitate; I will answer more quickly. Sorry again for the delay :(
  2. We have put this project on pause due to other priorities and some other complexity not related to OPA. One such complexity is generating the bundle: AWS disk IOPS is very limited by default, so we optimized our bundle generation and bought more IOPS. Retrospectively, I wonder whether OPA's (badger's) high RAM consumption came from this AWS limitation: since it couldn't write to disk fast enough, it may have used more RAM. Because we weren't aware of this during the testing phase, I don't have a definite answer.
  3. Next week I will update OPA to the latest version, fix some bugs in our plugin, and put the first part of the project into production. Thanks a lot for your work on OPA; it's a superb tool, with code that's easy to read and understand :)

anderseknert pushed a commit to damienjburks/opa that referenced this issue Apr 5, 2022
Signed-off-by: Damien Burks <[email protected]>

removing space from testcase

Signed-off-by: Damien Burks <[email protected]>

website: make local dev and PR preview not build everything (open-policy-agent#4474)

With this change, the work done for local development, and for the per-PR netlify preview, changes:

It will no longer include the website content for versions other than edge, the current working tree.

We thus save the time, and avoid the flakiness, involved with

- checking if github has release binaries for all the versions
- checking out their sources
- fetching the release binaries to pre-hydrate old versions' live-blocks.

The previously-used, documented make target should still be intact.

Fixes open-policy-agent#4379 to some extent, I hope.

* docs/website: remove "latest" binary from opa versions cache

Having a stale binary here -- one called "latest" but not actually
being "latest" -- causes issues like this: when building the website
content for the (real) latest version, the script would take the
old (previous-latest) binary, and fail because that binary didn't
know the latest future keywords.

Signed-off-by: Stephan Renatus <[email protected]>

website: fix download link redirect (open-policy-agent#4482)

Fixes the bug introduced in open-policy-agent#4474 causing the setup-opa action to fail
to retrieve the "latest" opa binary.

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): bump github.com/yashtewari/glob-intersection to v0.1.0 (open-policy-agent#4481)

https://github.com/yashtewari/glob-intersection/releases/tag/v0.1.0

Signed-off-by: Stephan Renatus <[email protected]>

docs/website: fix /docs redirect (open-policy-agent#4483)

Signed-off-by: Stephan Renatus <[email protected]>

website: fix template to produce proper redirects (open-policy-agent#4484)

Also redirect /docs -> /docs/edge in preview

Signed-off-by: Stephan Renatus <[email protected]>

docs/envoy: include new configurable gRPC msg sizes (open-policy-agent#4459)

Docs for open-policy-agent/opa-envoy-plugin#323

Signed-off-by: Elliot Maincourt <[email protected]>

website/live-blocks: update caniuse/browserlist

Signed-off-by: Stephan Renatus <[email protected]>

website/live-blocks: don't call the github api to determine release asset urls

Signed-off-by: Stephan Renatus <[email protected]>

website: don't run generate twice (open-policy-agent#4488)

Signed-off-by: Stephan Renatus <[email protected]>

ast: Fix print call rewriting in else rules

The compiler was accidentally checking/rewriting print calls in the
body of else rules before the implicit args of the else rule were
processed. As a result, the compiler was generating false-positive
errors for refs to undeclared args. The root of the problem was the
usage of WalkBodies on the top-level rule (which implicitly walks the
bodies of else rules under the top-level rule). With this change, the
compiler will call WalkBodies on the head and body of each rule rather
than the entire rule itself (which includes the else chain).
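For illustration, a hypothetical policy of the shape that triggered the false positive (pre-1.0 Rego syntax; `x` is the implicit arg of the else rule):

```rego
package example

f(x) = 1 {
	x == "a"
} else = 2 {
	print(x)  # this ref to the implicit arg x was falsely flagged as undeclared
}
```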

Fixes open-policy-agent#4489

Signed-off-by: Torin Sandall <[email protected]>

Update maintainer status, add Will Beason (open-policy-agent#4465)

Signed-off-by: Max Smythe <[email protected]>

Co-authored-by: Will Beason <[email protected]>

Make maintainers list alphabetical (open-policy-agent#4491)

Signed-off-by: Max Smythe <[email protected]>

plugins/logs: Fix broken retry logic

We were incorrectly resetting the retry counter after
every error condition instead of using the incremented
value. As a result, retry delay would always be 0s.
This meant that if OPA encountered an error while
uploading decision logs it would immediately retry
instead of doing an exponential backoff.

Fixes: open-policy-agent#4486

Signed-off-by: Ashutosh Narkar <[email protected]>

logging: mask authorization header value in debug logs (open-policy-agent#4496)

Fixes open-policy-agent#4495

Signed-off-by: Anders Eknert <[email protected]>

Fixes open-policy-agent#4376 (open-policy-agent#4494)

Do note though that this does not change how multiple
with statements are grouped. Although I agree with that,
it's IMHO a separate feature request, while the spacing
issue is a bug.

Signed-off-by: Anders Eknert <[email protected]>

runtime: improve log output for binary response (open-policy-agent#4498)

This change omits the response body when using compression on the metrics endpoint and when pprof is enabled.

Signed-off-by: Christian Altamirano <[email protected]>

build: bump golang: 1.17 -> 1.18

No change to go.mod's `go` stanza, so no changes in code compatibility.

However, it's used for building our docker images and release
binaries, and for fuzz testing in our nightly workflow.

Some test-related changes with the dns lookup built-in function's
error handling; and the hardcoded signature. Running

    go test ./topdown -run TestTopdownJWTEncodeSignECWithSeedReturnsSameSignature -count 10000

makes me believe that, whatever the reason for the signature change,
it's at least stable.

topdown/http_test: Test-only change to accommodate this change in Go (https://go.dev/doc/go1.18):

    Certificate.Verify now uses platform APIs to verify certificate
    validity on macOS and iOS when it is called with a nil
    VerifyOpts.Roots or when using the root pool returned from
    SystemCertPool.

We're keeping the old message for go <= 1.17; in a silly-simple way.

Also:

* ci: build and test two old golang version on macos|linux

  We'll drop golang 1.15, keep one unsupported version (1.16).

Signed-off-by: Stephan Renatus <[email protected]>

ci: remove go-fuzz, use native go 1.18 fuzzer

Signed-off-by: Stephan Renatus <[email protected]>

build(dev-deps): update live-scripts deps: mocha, nanoid (open-policy-agent#4500)

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): bump nanoid in /docs/website/scripts/live-blocks (open-policy-agent#4501)

update bob_token and alice_token in markdown (open-policy-agent#4504)

Signed-off-by: yongen.pan <[email protected]>

topdown/eval: copy without modifying expr, update test/e2e/concurrency (open-policy-agent#4503)

What we previously did turned into a race condition with multiple
concurrent calls to /v1/compile.

With a change introduced with 0.38.0 (the `every` keyword), the
`nil` Terms of an `ast.Expr` node was surfaced: previously, it would
go unnoticed, but could potentially have yielded bad results.

The effect of this change is proven using a new e2e test that would
fail on the code we had previously.

Signed-off-by: Stephan Renatus <[email protected]>

uploading latest changes

Signed-off-by: Damien Burks <[email protected]>

fixing test cases

Signed-off-by: Damien Burks <[email protected]>

sdk: avoid using different state (open-policy-agent#4505)

I noticed that when operating on opa.state, locking is usually done to avoid
a race, whereas here opa.state is used directly. By comparing the previous
changes, I found that open-policy-agent#4240 changed the previous behaviour.

This change adjusts that: we ensure that we work on the state as read via
the mutex-protected `s := *opa.state`.

Signed-off-by: Iceber Gu <[email protected]>

build(deps): update opentelemetry-go: 1.5.0 -> 1.6.1 (metrics 0.28.0) (open-policy-agent#4467)

https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.0
https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1
https://github.com/open-telemetry/opentelemetry-go-contrib/releases/tag/v1.6.0

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): open opentelemetry-go 1.6.0 -> 1.6.1 (open-policy-agent#4512)

This is a left-over from the previous bump, 0856120.

https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1

Signed-off-by: Stephan Renatus <[email protected]>

sdk/opa_test: increase max delta (0.5s) (open-policy-agent#4513)

When running on GHA, we've found this test to often fail on macos-latest:
it would not be functionally wrong, but it wasn't able to finish in
the allotted time. Now, the maximum delta has been increased a lot (10ms to
500ms). That's a lot, but it's still good enough to ensure that the context
passed to Stop() is the one that matters for shutdown.

Signed-off-by: Stephan Renatus <[email protected]>

docs: Add clarifications for docs describing bundle signing and delta features

This commit adds some clarification to the bundle doc regarding
the format of bundle signatures and manifest behavior for delta
bundles.

Signed-off-by: Ashutosh Narkar <[email protected]>

runtime: change processed file event log level to info (open-policy-agent#4514)

"Processed file watch event." will now be logged at level "info" (was "warn").

Signed-off-by: Anders Eknert <[email protected]>

runtime/logging: only suppress payloads for handlers that compress responses (open-policy-agent#4502)

* runtime/logging: only suppress payloads for handlers that compress responses

To get compressed responses, two things need to be true:

1. The client must accept compressed responses
2. The handler must reply with a compressed response

For general API requests, (1) holds most of the time. (2) is only true
for the metrics endpoint at the moment, since the 3rd-party library we
use for serving the prometheus endpoint will do compression.

* runtime/logging: remove dead code

The http.Hijack stuff was related to a watch feature removed in
open-policy-agent@186ef99

dropInputParam was only used by its tests.

Signed-off-by: Stephan Renatus <[email protected]>

storage: code cosmetics

Signed-off-by: Stephan Renatus <[email protected]>

storage: allow implementations to override MakeDir

Signed-off-by: Stephan Renatus <[email protected]>

runtime+storage: integrate disk storage

With this change, the disk backend (badger) becomes available for
use with the OPA runtime properly:

It can be configured using the `storage.disk` key in OPA's config
(see included documentation).

When enabled,
- any data or policies stored with OPA will persist over restarts
- per-query metrics related to disk usage are reported
- Prometheus metrics per storage operation are exported
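A minimal configuration sketch, assuming the `storage.disk` fields documented for this feature (the directory and partition values here are examples, not defaults):

```yaml
storage:
  disk:
    directory: /var/opa   # where badger keeps its files
    auto_create: true     # create the directory if it does not exist
    partitions:
    - /users              # example: partition stored data under /users
```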

The main intention behind this feature is to optimize memory usage:
OPA can now operate on more data than fits into the allotted memory
resources. It is NOT meant to be used as a primary source of truth:
there are no backup/restore or disaster recovery procedures -- you
MUST secure the means to restore the data stored with OPA's disk
storage yourself.

See also open-policy-agent#4014. Future improvements around bundle loading are
planned.

Some notes on details:

storage/disk: impose same locking regime used with inmem

With this setup, we'll ensure:

- there is only one open write txn at a time
- there are any number of open read txns at a time
- writes are blocked when reads are inflight
- during a commit (and triggers being run), no read txns can be created

This is to ensure the same atomic policy update semantics when using
"disk" as we have with "inmem". We're basically opting out of badger's
concurrency control and transactionality guarantees, because we
cannot piggyback on them to ensure the atomic update we want.

There might be other ways -- using subscribers, and blocking in some
other place -- but this one seems preferable since it mirrors inmem.

Part of the problem is ErrTxnTooLarge, and committing and renewing
txns when it occurs: that, which is the prescribed solution to txns
growing too big, also means that reads can see half of the "logical"
transaction having been committed, while the rest is still getting
processed.

Another approach would have been using `WriteBatch`, but that won't
let us read from the batch, only apply Set and Delete operations.
We currently need to read (via an iterator) to figure out if we
need to delete keys to replace something in the store.  There is
no DropPrefix operation on the badger txn, or the WriteBatch API.

storage/disk: remove commit-and-renew-txn code for txn-too-big errors

This would break transactional guarantees we care about: while there
can be only one write transaction at a time, read transactions may
happen while a write txn is underway -- with this commit-and-reset
logic, those would read partial data.

Now, the error will be returned to the caller. The maximum txn size
depends on the size of memtables, and could be tweaked manually.
In general, the caller should try to push multiple smaller increments
of the data.

storage/disk: implement noop MakeDir

The MakeDir operation as implemented in the backend-agnostic storage
code has become an issue with the disk store: to write /foo/bar/baz,
we'd have to read /foo (among other subdirs), and that can be _much_
work for the disk backend. With inmem, it's cheap, so this wasn't
problematic before.

Some of the storage/disk/txn.go logic had to be adjusted to properly
do the MakeDir steps implicitly.

The index argument addition to patch() in storage/disk/txn.go was
necessary to keep the error messages conforming to the previous
code path: previously, conflicts (arrays indexed as objects) would
be surfaced in the MakeDir step, now it's entangled with the patch
calculation.

storage/disk: check ctx.Err() in List/Get operations

This won't abort reading a single key, but it will abort iterations.

storage/disk: support patterns in partitions

There is a potential clash here: "*", the path wildcard, is
a valid path section. However, it only affects the case when
a user would want to have a partition at

    /foo/*/bar

and would really mean "*", and not the wildcard.

Storing data at /foo/*/bar with a literal "*" won't be treated
differently than storing something at /foo/xyz/bar.

storage/disk: keep per-txn-type histograms of stats

This is done by reading off the metrics on commit, and shovelling
their numbers into the prometheus collector.

NOTE: if you were to share a metrics object among multiple transactions,
the results would be skewed, as it's not reset. However, our server
handlers don't do that.

storage/disk: opt out of badger's conflict detection

With only one write transaction in flight at any time, the situation
that badger guards against cannot happen:

A transaction has written to a key after the current, to-be-committed
transaction has last read that key from the store.

Since it can't happen, we can ignore the bookkeeping involved. This
improves the time it takes to overwrite existing keys.

Signed-off-by: Stephan Renatus <[email protected]>

removing additional space

Signed-off-by: Damien Burks <[email protected]>

removing test files

Signed-off-by: Damien Burks <[email protected]>

fixing linting issue

Signed-off-by: Damien Burks <[email protected]>

adding test case for logging

Signed-off-by: Damien Burks <[email protected]>

ecosystem: adding permit.io (open-policy-agent#4531)

Signed-off-by: Oz Radiano <[email protected]>

docs: CloudFormation fixes (open-policy-agent#4541)

* Fix step headers to use h3
* Add warning about boolean values conversion

Signed-off-by: Anders Eknert <[email protected]>
rokkiter pushed a commit to rokkiter/opa that referenced this issue Apr 18, 2022