Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TLS CA too many open files - decisionlog upload #4486

Closed
iamatwork opened this issue Mar 25, 2022 · 1 comment · Fixed by #4493
Closed

TLS CA too many open files - decisionlog upload #4486

iamatwork opened this issue Mar 25, 2022 · 1 comment · Fixed by #4493
Labels

Comments

@iamatwork
Copy link

OPA v0.37.2
config:

services:
decisionlogger:
url: https://XXX
tls:
ca_cert: /opt/XXX/certs/myCA.crt
decision_logs:
reporting:
min_delay_seconds: 15
service: decisionlogger

When uploading decisionlogs fails (with authorization error), after a few tries, OPA goes bonkers and tries to upload the logs without any delay (configured minimum is 15s). This fails with " /opt/XXX/certs/myCA.crt: too many open files"

{"level":"error","msg":"log upload failed: Post "https://XXX:9200/opa_log/_doc\": dial tcp: lookup XXX on XXX: no such host.","plugin":"decision_logs","time":"2022-03-25T13:24:54Z"}
{"level":"debug","msg":"Waiting 0s before next upload/retry.","plugin":"decision_logs","time":"2022-03-25T13:24:54Z"}
{"headers":{"Content-Encoding":["gzip"],"Content-Type":["application/json"],"User-Agent":["Open Policy Agent/0.37.2 (linux, amd64)"]},"level":"debug","method":"POST","msg":"Sending request.","time":"2022-03-25T13:24:54Z","url":"https://XXXX"}
{"headers":{"Content-Length":["544"],"Content-Type":["application/json; charset=UTF-8"],"Www-Authenticate":["Basic realm="security" charset="UTF-8"","Bearer realm="security"","ApiKey"]},"level":"debug","method":"POST","msg":"Received response.","status":"401 Unauthorized","time":"2022-03-25T13:24:54Z","url":"https://XXX:9200"}
{"level":"error","msg":"log upload failed, server replied with not authorized.","plugin":"decision_logs","time":"2022-03-25T13:24:54Z"}
{"level":"debug","msg":"Waiting 0s before next upload/retry.","plugin":"decision_logs","time":"2022-03-25T13:24:54Z"}
{"level":"error","msg":"log upload failed: open /opt/XXX/certs/myCA.crt: too many open files.","plugin":"decision_logs","time":"2022-03-25T13:24:54Z"}
{"level":"debug","msg":"Waiting 0s before next upload/retry.","plugin":"decision_logs","time":"2022-03-25T13:24:54Z"}
{"level":"error","msg":"log upload failed: open /opt/XXX/certs/myCA.crt: too many open files.","plugin":"decision_logs","time":"2022-03-25T13:24:54Z"}

@iamatwork iamatwork added the bug label Mar 25, 2022
@ashutosh-narkar ashutosh-narkar added the investigating Issues being actively investigated label Mar 25, 2022
@ashutosh-narkar
Copy link
Member

Looking at your config I would have first expected an error like

error: config error: reporting configuration missing 'max_delay_seconds' in decision_logs

Looks like we are not properly setting the retry value in the delay calculation which causes this. We'll add a fix soon. Thanks for reporting.

@ashutosh-narkar ashutosh-narkar removed the investigating Issues being actively investigated label Mar 25, 2022
ashutosh-narkar added a commit to ashutosh-narkar/opa that referenced this issue Mar 25, 2022
We were incorrectly resetting the retry counter after
every error condition instead of using the incremented
value. As a result, retry delay would always be 0s.
This meant that if OPA encountered an error while
uploading decision logs it would immediately retry
instead of doing an exponential backoff.

Fixes: open-policy-agent#4486

Signed-off-by: Ashutosh Narkar <[email protected]>
ashutosh-narkar added a commit that referenced this issue Mar 25, 2022
We were incorrectly resetting the retry counter after
every error condition instead of using the incremented
value. As a result, retry delay would always be 0s.
This meant that if OPA encountered an error while
uploading decision logs it would immediately retry
instead of doing an exponential backoff.

Fixes: #4486

Signed-off-by: Ashutosh Narkar <[email protected]>
christian1607 pushed a commit to christian1607/opa that referenced this issue Mar 26, 2022
We were incorrectly resetting the retry counter after
every error condition instead of using the incremented
value. As a result, retry delay would always be 0s.
This meant that if OPA encountered an error while
uploading decision logs it would immediately retry
instead of doing an exponential backoff.

Fixes: open-policy-agent#4486

Signed-off-by: Ashutosh Narkar <[email protected]>
damienjburks added a commit to damienjburks/opa that referenced this issue Apr 1, 2022
Signed-off-by: Damien Burks <[email protected]>

removing space from testcase

Signed-off-by: Damien Burks <[email protected]>

website: make local dev and PR preview not build everything (open-policy-agent#4474)

With this change, the work done for local development, and the per-PR netlify preview changes:

It will no longer include the website stuff for versions other than edge, the current working tree.

We thus save us the time, and the flakiness, involved with

- checking if github has release binaries for all the versions
- checking out their sources
- fetching the release binaries to pre-hydrate old versions' live-blocks.

The previously-used, documented make target should still be intact.

Fixes open-policy-agent#4379 to some extent, I hope.

* docs/website: remove "latest" binary from opa versions cache

Having a stale binary here -- one called "latest" but not actually
being "latest" -- causes issues like this: when building the website
content for the (real) latest version, the script would take the
old (previous-latest) binary, and fail because that binary didn't
know the latest future keywords.

Signed-off-by: Stephan Renatus <[email protected]>

website: fix download link redirect (open-policy-agent#4482)

Fixes the bug introduced in open-policy-agent#4474 causing the setup-opa action to fail
to retrieve the "latest" opa binary.

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): bump github.com/yashtewari/glob-intersection to v0.1.0 (open-policy-agent#4481)

https://github.com/yashtewari/glob-intersection/releases/tag/v0.1.0

Signed-off-by: Stephan Renatus <[email protected]>

docs/website: fix /docs redirect (open-policy-agent#4483)

Signed-off-by: Stephan Renatus <[email protected]>

website: fix template to produce proper redirects (open-policy-agent#4484)

Also redirect /docs -> /docs/edge in preview

Signed-off-by: Stephan Renatus <[email protected]>

docs/envoy: include new configurable gRPC msg sizes (open-policy-agent#4459)

Docs for open-policy-agent/opa-envoy-plugin#323

Signed-off-by: Elliot Maincourt <[email protected]>

website/live-blocks: update caniuse/browserlist

Signed-off-by: Stephan Renatus <[email protected]>

website/live-blocks: don't call the github api to determine release asset urls

Signed-off-by: Stephan Renatus <[email protected]>

website: don't run generate twice (open-policy-agent#4488)

Signed-off-by: Stephan Renatus <[email protected]>

ast: Fix print call rewriting in else rules

The compiler was accidentally checking/rewriting print calls in the
body of else rules before the implicit args of the else rule were
processed. As a result, the compiler was generating false-positive
errors for refs to undeclared args. The root of the problem was the
usage of WalkBodies on the top-level rule (which implicitly walks the
bodies of else rules under the top-level rule). With this change, the
compiler will call WalkBodies on the head and body of each rule rather
than the entire rule itself (which includes the else chain).

Fixes open-policy-agent#4489

Signed-off-by: Torin Sandall <[email protected]>

Update maintainer status, add Will Beason (open-policy-agent#4465)

Signed-off-by: Max Smythe <[email protected]>

Co-authored-by: Will Beason <[email protected]>

Make maintainers list alphabetical (open-policy-agent#4491)

Signed-off-by: Max Smythe <[email protected]>

plugins/logs: Fix broken retry logic

We were incorrectly resetting the retry counter after
every error condition instead of using the incremented
value. As a result, retry delay would always be 0s.
This meant that if OPA encountered an error while
uploading decision logs it would immediately retry
instead of doing an exponential backoff.

Fixes: open-policy-agent#4486

Signed-off-by: Ashutosh Narkar <[email protected]>

logging: mask authorization header value in debug logs (open-policy-agent#4496)

Fixes open-policy-agent#4495

Signed-off-by: Anders Eknert <[email protected]>

Fixes open-policy-agent#4376 (open-policy-agent#4494)

Do note though that this does not change how multiple
with statements are grouped. Although I agree with that,
it's IMHO a separate feature request, while the spacing
issue is a bug.

Signed-off-by: Anders Eknert <[email protected]>

runtime: improve log output for binary response (open-policy-agent#4498)

This change omits the response body when using compression on metrics endpoint and when pprof is enabled.

Signed-off-by: Christian Altamirano <[email protected]>

build: bump golang: 1.17 -> 1.18

No change to go.mod's `go` stanza, so no changes in code compatibility.

However, it's used for building our docker images and release
binaries, and for fuzz testing in our nightly workflow.

Some test-related changes with the dns lookup built-in function's
error handling; and the hardcoded signature. Running

    go test ./topdown -run TestTopdownJWTEncodeSignECWithSeedReturnsSameSignature -count 10000

makes me believe that for whatever reason the signature changed,
it's at least stable.

topdown/http_test: Test-only change to accomodate this change in Go (https://go.dev/doc/go1.18):

    Certificate.Verify now uses platform APIs to verify certificate
    validity on macOS and iOS when it is called with a nil
    VerifyOpts.Roots or when using the root pool returned from
    SystemCertPool.

We're keeping the old message for go <= 1.17; in a silly-simple way.

Also:

* ci: build and test two old golang version on macos|linux

  We'll drop golang 1.15, keep one unsupported version (1.16).

Signed-off-by: Stephan Renatus <[email protected]>

ci: remove go-fuzz, use native go 1.18 fuzzer

Signed-off-by: Stephan Renatus <[email protected]>

build(dev-deps): update live-scripts deps: mocha, nanoid (open-policy-agent#4500)

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): bump nanoid in /docs/website/scripts/live-blocks (open-policy-agent#4501)

update bob_token and alice_token in markdown (open-policy-agent#4504)

Signed-off-by: yongen.pan <[email protected]>

topdown/eval: copy without modifying expr, update test/e2e/concurrency (open-policy-agent#4503)

What we previously did turned into a race condition with multiple
concurrent calls to /v1/compile.

With a change introduced with 0.38.0 (the `every` keyword), the
`nil` Terms of an `ast.Expr` node was surfaced: previously, it would
go unnoticed, but could potentially have yielded bad results.

The effect of this change is proven using a new e2e test that would
fail on the code we had previous.

Signed-off-by: Stephan Renatus <[email protected]>

uploading latest changes

Signed-off-by: Damien Burks <[email protected]>

fixing test cases

Signed-off-by: Damien Burks <[email protected]>

sdk: avoid using different state (open-policy-agent#4505)

I noticed that when operating on opa.state, locking is usually done to avoid
a race, whereas here opa.state is used directly. By comparing the previous
changes, I found that open-policy-agent#4240 changed the previous behaviour.

This change adjusts that: we ensure that we work on the state as read via
the mutex-protected `s := *opa.state`.

Signed-off-by: Iceber Gu <[email protected]>

build(deps): update opentelemetry-go: 1.5.0 -> 1.6.1 (metrics 0.28.0) (open-policy-agent#4467)

https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.0
https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1
https://github.com/open-telemetry/opentelemetry-go-contrib/releases/tag/v1.6.0

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): open opentelemetry-go 1.6.0 -> 1.6.1 (open-policy-agent#4512)

This is a left-over from the previous bump, 0856120.

https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1

Signed-off-by: Stephan Renatus <[email protected]>

sdk/opa_test: increase max delta (0.5s) (open-policy-agent#4513)

When running on GHA, we've found this test to often fail on macos-latest:
It would not functionally be wrong, but it also wasn't able to finish in
the alloted time. Now, the maximum delta has been increased a lot (10ms to
500ms). It's much, but it's still good enough to ensure that the context
passed to Stop() is the one that matters for shutdown.

Signed-off-by: Stephan Renatus <[email protected]>

docs: Add clarifications for docs describing bundle signing and delta features

This commit adds some clarification to the bundle doc regarding
the format of bundle signatures and manifest behavior for delta
bundles.

Signed-off-by: Ashutosh Narkar <[email protected]>

runtime: change processed file event log level to info (open-policy-agent#4514)

"Processed file watch event." will now be logged at level "info" (was "warn").

Signed-off-by: Anders Eknert <[email protected]>

runtime/logging: only suppress payloads for handlers that compress responses (open-policy-agent#4502)

* runtime/logging: only suppress payloads for handlers that compress responses

To get compressed responses, two things need to be true:

1. The client must accept compressed responses
2. The handler must reply with a compressed response

For general API requests, (1) holds most of the time. (2) is only true
for the metrics endpoint at them moment, since the 3rd party library we
use for serving the prometheus endpoint will do compression.

* runtime/logging: remove dead code

The http.Hijack stuff was related to a watch feature removed in
open-policy-agent@186ef99

dropInputParam was only used by its tests.

Signed-off-by: Stephan Renatus <[email protected]>

storage: code cosmetics

Signed-off-by: Stephan Renatus <[email protected]>

storage: allow implementations to override MakeDir

Signed-off-by: Stephan Renatus <[email protected]>

runtime+storage: integrate disk storage

With this change, the disk backend (badger) becomes available for
use with the OPA runtime properly:

It can be configured using the `storage.disk` key in OPA's config
(see included documentation).

When enabled,
- any data or policies stored with OPA will persist over restarts
- per-query metrics related to disk usage are reported
- Prometheus metrics per storage operation are exported

The main intention behind this feature is to optimize memory usage:
OPA can now operate on more data than fits into the allotted memory
resources. It is NOT meant to be used as a primary source of truth:
there are no backup/restore or desaster recovery procedures -- you
MUST secure the means to restore the data stored with OPA's disk
storage by yourself.

See also open-policy-agent#4014. Future improvements around bundle loading are
planned.

Some notes on details:

storage/disk: impose same locking regime used with inmem

With this setup, we'll ensure:

- there is only one open write txn at a time
- there are any number of open read txns at a time
- writes are blocked when reads are inflight
- during a commit (and triggers being run), no read txns can be created

This is to ensure the same atomic policy update semantics when using
'disk" as we have with "inmem". We're basically opting out of badger's
currency control and transactionality guarantees. This is because we
cannot piggy back on that to ensure the atomic update we want.

There might be other ways -- using subscribers, and blocking in some
other place -- but this one seems preferrable since it mirrors inmem.

Part of the problem is ErrTxnTooLarge, and committing and renewing
txns when it occurs: that, which is the prescribed solution to txns
growing too big, also means that reads can see half of the "logical"
transaction having been committed, while the rest is still getting
processed.

Another approach would have been using `WriteBatch`, but that won't
let us read from the batch, only apply Set and Delete operations.
We currently need to read (via an iterator) to figure out if we
need to delete keys to replace something in the store.  There is
no DropPrefix operation on the badger txn, or the WriteBatch API.

storage/disk: remove commit-and-renew-txn code for txn-too-big errors

This would break transactional guarantees we care about: while there
can be only one write transaction at a time, read transactions may
happen while a write txn is underway -- with this commit-and-reset
logic, those would read partial data.

Now, the error will be returned to the caller. The maximum txn size
depends on the size of memtables, and could be tweaked manually.
In general, the caller should try to push multiple smaller increments
of the data.

storage/disk: implement noop MakeDir

The MakeDir operation as implemented in the backend-agnostic storage
code has become an issue with the disk store: to write /foo/bar/baz,
we'd have to read /foo (among other subdirs), and that can be _much_
work for the disk backend. With inmem, it's cheap, so this wasn't
problematic before.

Some of the storage/disk/txn.go logic had to be adjusted to properly
do the MakeDir steps implicitly.

The index argument addition to patch() in storage/disk/txn.go was
necessary to keep the error messages conforming to the previous
code path: previously, conflicts (arrays indexed as objects) would
be surfaced in the MakeDir step, now it's entangled with the patch
calculation.

storage/disk: check ctx.Err() in List/Get operations

This won't abort reading a single key, but it will abort iterations.

storage/disk: support patterns in partitions

There is a potential clash here: "*", the path wildcard, is
a valid path section. However, it only affects the case when
a user would want to have a partition at

    /foo/*/bar

and would really mean "*", and not the wildcard.

Storing data at /foo/*/bar with a literal "*" won't be treated
differently than storing something at /fo/xyz/bar.

storage/disk: keep per-txn-type histograms of stats

This is done by reading off the metrics on commit, and shovelling
their numbers into the prometheus collector.

NOTE: if you were to share a metrics object among multiple transactions,
the results would be skewed, as it's not reset. However, our server
handlers don't do that.

storage/disk: opt out of badger's conflict detection

With only one write transaction in flight at any time, the situation
that badger guards against cannot happen:

A transaction has written to a key after the current, to-be-committed
transaction has last read that key from the store.

Since it can't happen, we can ignore the bookkeeping involved. This
improves the time it takes to overwrite existing keys.

Signed-off-by: Stephan Renatus <[email protected]>

docs/management-bundles: remove quotations with info boxes (open-policy-agent#4517)

Small follow-up to open-policy-agent#4515, using info boxes instead of generic markup.

Signed-off-by: Stephan Renatus <[email protected]>

website: fix community site (open-policy-agent#4522)

* remove hugo templating

This page doesn't go through hugo at all, so these things would
end up on the website.

A quick fix is reverting the bits that added the templating. The
downside is that local dev and PR previews will link to
/docs/latest/ecosystem, which doesn't exist there.

* link Documentation and Download to /docs

For /docs/ (not /docs/latest) we have a redirect in local dev and
PR preview mode:

    /docs -> /docs/edge

so this will be a little less broken: only ecosystem is a 404 link
on the preview.

Signed-off-by: Stephan Renatus <[email protected]>

integrations: Add Emissary-Ingress (open-policy-agent#4523)

This PR adds the Emissary ingress to the Ecosystem page.
The blog section contains link to an article which guides on Emissary ingress and OPA integrations.

Signed-off-by: Tayyab J <[email protected]>

storage/disk: wildcard partition validation, docs caveat (open-policy-agent#4519)

A bunch of smaller follow-up tasks to open-policy-agent#4381.

* storage/disk_test: check invalid patches with wildcard partition, too
* docs/disk: add caveat re: bundles loaded into memory
* storage/disk: auto-manage /system partitions

If these are found in the user-provided partitions, we'll error out.

* storage/disk: pretty-print partitions with "*" instead of %2A
* storage/disk: respect wildcard-replacement in partition validation

It is now allowed to replace a partition like

    /foo/bar

by

    /foo/*

also if multiple wildcards are used.

Caveats:

You cannot add a wildcard partition like /*/*, since it would overlap
the managed "/system/*" partition.

When attempting to go back from /foo/* to /foo/bar, an error is
raised _unconditionally_ -- we could check the existing data, but
currently don't.

* storage/disk: check prefix when adding wildcard partitions

The previously done check would have falsely returned that there is no problem
when adding a wildcard partition: lookup of "/foo/*" with '*' not interpreted
as a wildcard, but as a string, would yield a not-found, even if there was any
data under /foo/.

Now, we'll check the prefix-until-wildcard. It's more cautious than
theoretically necessary, but safe.

Signed-off-by: Stephan Renatus <[email protected]>

website/style: fix code block bottom margin (open-policy-agent#4526)

The text following regular code blocks was too close. Now, the bottom
margin is the same was used with live-blocks, 1.5rem (24px).

Signed-off-by: Stephan Renatus <[email protected]>

docs: Add CloudFormation Hook tutorial (open-policy-agent#4525)

Obviously missing is deployment of OPA to AWS. This will come in the next iteration.

Signed-off-by: Anders Eknert <[email protected]>

Prepare v0.39.0 release (open-policy-agent#4524)

Signed-off-by: Stephan Renatus <[email protected]>

Prepare v0.40.0 development (open-policy-agent#4527)

Signed-off-by: Stephan Renatus <[email protected]>

website: fix styra support link

Signed-off-by: Torin Sandall <[email protected]>

adopters: Add Wealthsimple (open-policy-agent#4530)

Signed-off-by: hannahyeates <[email protected]>

ci: fix rego check (open-policy-agent#4532)

* build/policies: format using 0.39.0
* workflow/pull-request: use edge opa for rego PR checks

Signed-off-by: Stephan Renatus <[email protected]>

website: remove right margin on sidebar (open-policy-agent#4529)

* remove right margin on sidebar
* replace the removed margin with padding

Signed-off-by: orweis <[email protected]>

website: Remove unused variables to avoid error in strict mode(open-policy-agent#4534)

To Fix:
```
2 errors occurred:
policy.rego:13: rego_compile_error: assigned var header unused
policy.rego:13: rego_compile_error: assigned var signature unused
```

Signed-off-by: panpan0000 <[email protected]>

website: show yellow banner for old version (open-policy-agent#4533)

Hopefully this is gets more attention than the yellow "older" button
in the version picker.

Signed-off-by: Stephan Renatus <[email protected]>

fixing disk diagnostic test case
anderseknert pushed a commit to damienjburks/opa that referenced this issue Apr 5, 2022
Signed-off-by: Damien Burks <[email protected]>

removing space from testcase

Signed-off-by: Damien Burks <[email protected]>

website: make local dev and PR preview not build everything (open-policy-agent#4474)

With this change, the work done for local development, and the per-PR netlify preview changes:

It will no longer include the website stuff for versions other than edge, the current working tree.

We thus save us the time, and the flakiness, involved with

- checking if github has release binaries for all the versions
- checking out their sources
- fetching the release binaries to pre-hydrate old versions' live-blocks.

The previously-used, documented make target should still be intact.

Fixes open-policy-agent#4379 to some extent, I hope.

* docs/website: remove "latest" binary from opa versions cache

Having a stale binary here -- one called "latest" but not actually
being "latest" -- causes issues like this: when building the website
content for the (real) latest version, the script would take the
old (previous-latest) binary, and fail because that binary didn't
know the latest future keywords.

Signed-off-by: Stephan Renatus <[email protected]>

website: fix download link redirect (open-policy-agent#4482)

Fixes the bug introduced in open-policy-agent#4474 causing the setup-opa action to fail
to retrieve the "latest" opa binary.

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): bump github.com/yashtewari/glob-intersection to v0.1.0 (open-policy-agent#4481)

https://github.com/yashtewari/glob-intersection/releases/tag/v0.1.0

Signed-off-by: Stephan Renatus <[email protected]>

docs/website: fix /docs redirect (open-policy-agent#4483)

Signed-off-by: Stephan Renatus <[email protected]>

website: fix template to produce proper redirects (open-policy-agent#4484)

Also redirect /docs -> /docs/edge in preview

Signed-off-by: Stephan Renatus <[email protected]>

docs/envoy: include new configurable gRPC msg sizes (open-policy-agent#4459)

Docs for open-policy-agent/opa-envoy-plugin#323

Signed-off-by: Elliot Maincourt <[email protected]>

website/live-blocks: update caniuse/browserlist

Signed-off-by: Stephan Renatus <[email protected]>

website/live-blocks: don't call the github api to determine release asset urls

Signed-off-by: Stephan Renatus <[email protected]>

website: don't run generate twice (open-policy-agent#4488)

Signed-off-by: Stephan Renatus <[email protected]>

ast: Fix print call rewriting in else rules

The compiler was accidentally checking/rewriting print calls in the
body of else rules before the implicit args of the else rule were
processed. As a result, the compiler was generating false-positive
errors for refs to undeclared args. The root of the problem was the
usage of WalkBodies on the top-level rule (which implicitly walks the
bodies of else rules under the top-level rule). With this change, the
compiler will call WalkBodies on the head and body of each rule rather
than the entire rule itself (which includes the else chain).

Fixes open-policy-agent#4489

Signed-off-by: Torin Sandall <[email protected]>

Update maintainer status, add Will Beason (open-policy-agent#4465)

Signed-off-by: Max Smythe <[email protected]>

Co-authored-by: Will Beason <[email protected]>

Make maintainers list alphabetical (open-policy-agent#4491)

Signed-off-by: Max Smythe <[email protected]>

plugins/logs: Fix broken retry logic

We were incorrectly resetting the retry counter after
every error condition instead of using the incremented
value. As a result, retry delay would always be 0s.
This meant that if OPA encountered an error while
uploading decision logs it would immediately retry
instead of doing an exponential backoff.

Fixes: open-policy-agent#4486

Signed-off-by: Ashutosh Narkar <[email protected]>

logging: mask authorization header value in debug logs (open-policy-agent#4496)

Fixes open-policy-agent#4495

Signed-off-by: Anders Eknert <[email protected]>

Fixes open-policy-agent#4376 (open-policy-agent#4494)

Do note though that this does not change how multiple
with statements are grouped. Although I agree with that,
it's IMHO a separate feature request, while the spacing
issue is a bug.

Signed-off-by: Anders Eknert <[email protected]>

runtime: improve log output for binary response (open-policy-agent#4498)

This change omits the response body when using compression on metrics endpoint and when pprof is enabled.

Signed-off-by: Christian Altamirano <[email protected]>

build: bump golang: 1.17 -> 1.18

No change to go.mod's `go` stanza, so no changes in code compatibility.

However, it's used for building our docker images and release
binaries, and for fuzz testing in our nightly workflow.

Some test-related changes with the dns lookup built-in function's
error handling; and the hardcoded signature. Running

    go test ./topdown -run TestTopdownJWTEncodeSignECWithSeedReturnsSameSignature -count 10000

makes me believe that for whatever reason the signature changed,
it's at least stable.

topdown/http_test: Test-only change to accomodate this change in Go (https://go.dev/doc/go1.18):

    Certificate.Verify now uses platform APIs to verify certificate
    validity on macOS and iOS when it is called with a nil
    VerifyOpts.Roots or when using the root pool returned from
    SystemCertPool.

We're keeping the old message for go <= 1.17; in a silly-simple way.

Also:

* ci: build and test two old golang version on macos|linux

  We'll drop golang 1.15, keep one unsupported version (1.16).

Signed-off-by: Stephan Renatus <[email protected]>

ci: remove go-fuzz, use native go 1.18 fuzzer

Signed-off-by: Stephan Renatus <[email protected]>

build(dev-deps): update live-scripts deps: mocha, nanoid (open-policy-agent#4500)

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): bump nanoid in /docs/website/scripts/live-blocks (open-policy-agent#4501)

update bob_token and alice_token in markdown (open-policy-agent#4504)

Signed-off-by: yongen.pan <[email protected]>

topdown/eval: copy without modifying expr, update test/e2e/concurrency (open-policy-agent#4503)

What we previously did turned into a race condition with multiple
concurrent calls to /v1/compile.

With a change introduced with 0.38.0 (the `every` keyword), the
`nil` Terms of an `ast.Expr` node was surfaced: previously, it would
go unnoticed, but could potentially have yielded bad results.

The effect of this change is proven using a new e2e test that would
fail on the code we had previous.

Signed-off-by: Stephan Renatus <[email protected]>

uploading latest changes

Signed-off-by: Damien Burks <[email protected]>

fixing test cases

Signed-off-by: Damien Burks <[email protected]>

sdk: avoid using different state (open-policy-agent#4505)

I noticed that when operating on opa.state, locking is usually done to avoid
a race, whereas here opa.state is used directly. By comparing the previous
changes, I found that open-policy-agent#4240 changed the previous behaviour.

This change adjusts that: we ensure that we work on the state as read via
the mutex-protected `s := *opa.state`.

Signed-off-by: Iceber Gu <[email protected]>

build(deps): update opentelemetry-go: 1.5.0 -> 1.6.1 (metrics 0.28.0) (open-policy-agent#4467)

https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.0
https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1
https://github.com/open-telemetry/opentelemetry-go-contrib/releases/tag/v1.6.0

Signed-off-by: Stephan Renatus <[email protected]>

build(deps): open opentelemetry-go 1.6.0 -> 1.6.1 (open-policy-agent#4512)

This is a left-over from the previous bump, 0856120.

https://github.com/open-telemetry/opentelemetry-go/releases/tag/v1.6.1

Signed-off-by: Stephan Renatus <[email protected]>

sdk/opa_test: increase max delta (0.5s) (open-policy-agent#4513)

When running on GHA, we've found this test to often fail on macos-latest:
It would not functionally be wrong, but it also wasn't able to finish in
the alloted time. Now, the maximum delta has been increased a lot (10ms to
500ms). It's much, but it's still good enough to ensure that the context
passed to Stop() is the one that matters for shutdown.

Signed-off-by: Stephan Renatus <[email protected]>

docs: Add clarifications for docs describing bundle signing and delta features

This commit adds some clarification to the bundle doc regarding
the format of bundle signatures and manifest behavior for delta
bundles.

Signed-off-by: Ashutosh Narkar <[email protected]>

runtime: change processed file event log level to info (open-policy-agent#4514)

"Processed file watch event." will now be logged at level "info" (was "warn").

Signed-off-by: Anders Eknert <[email protected]>

runtime/logging: only suppress payloads for handlers that compress responses (open-policy-agent#4502)

* runtime/logging: only suppress payloads for handlers that compress responses

To get compressed responses, two things need to be true:

1. The client must accept compressed responses
2. The handler must reply with a compressed response

For general API requests, (1) holds most of the time. (2) is only true
for the metrics endpoint at them moment, since the 3rd party library we
use for serving the prometheus endpoint will do compression.

* runtime/logging: remove dead code

The http.Hijack stuff was related to a watch feature removed in
open-policy-agent@186ef99

dropInputParam was only used by its tests.

Signed-off-by: Stephan Renatus <[email protected]>

storage: code cosmetics

Signed-off-by: Stephan Renatus <[email protected]>

storage: allow implementations to override MakeDir

Signed-off-by: Stephan Renatus <[email protected]>

runtime+storage: integrate disk storage

With this change, the disk backend (badger) becomes available for
use with the OPA runtime properly:

It can be configured using the `storage.disk` key in OPA's config
(see included documentation).

When enabled,
- any data or policies stored with OPA will persist over restarts
- per-query metrics related to disk usage are reported
- Prometheus metrics per storage operation are exported

The main intention behind this feature is to optimize memory usage:
OPA can now operate on more data than fits into the allotted memory
resources. It is NOT meant to be used as a primary source of truth:
there are no backup/restore or desaster recovery procedures -- you
MUST secure the means to restore the data stored with OPA's disk
storage by yourself.

See also open-policy-agent#4014. Future improvements around bundle loading are
planned.

Some notes on details:

storage/disk: impose same locking regime used with inmem

With this setup, we'll ensure:

- there is only one open write txn at a time
- there are any number of open read txns at a time
- writes are blocked when reads are inflight
- during a commit (and triggers being run), no read txns can be created

This is to ensure the same atomic policy update semantics when using
'disk" as we have with "inmem". We're basically opting out of badger's
currency control and transactionality guarantees. This is because we
cannot piggy back on that to ensure the atomic update we want.

There might be other ways -- using subscribers, and blocking in some
other place -- but this one seems preferrable since it mirrors inmem.

Part of the problem is ErrTxnTooLarge, and committing and renewing
txns when it occurs: that, which is the prescribed solution to txns
growing too big, also means that reads can see half of the "logical"
transaction having been committed, while the rest is still getting
processed.

Another approach would have been using `WriteBatch`, but that won't
let us read from the batch, only apply Set and Delete operations.
We currently need to read (via an iterator) to figure out if we
need to delete keys to replace something in the store.  There is
no DropPrefix operation on the badger txn, or the WriteBatch API.

storage/disk: remove commit-and-renew-txn code for txn-too-big errors

This would break transactional guarantees we care about: while there
can be only one write transaction at a time, read transactions may
happen while a write txn is underway -- with this commit-and-reset
logic, those would read partial data.

Now, the error will be returned to the caller. The maximum txn size
depends on the size of memtables, and could be tweaked manually.
In general, the caller should try to push multiple smaller increments
of the data.

storage/disk: implement noop MakeDir

The MakeDir operation as implemented in the backend-agnostic storage
code has become an issue with the disk store: to write /foo/bar/baz,
we'd have to read /foo (among other subdirs), and that can be _much_
work for the disk backend. With inmem, it's cheap, so this wasn't
problematic before.

Some of the storage/disk/txn.go logic had to be adjusted to properly
do the MakeDir steps implicitly.

The index argument addition to patch() in storage/disk/txn.go was
necessary to keep the error messages conforming to the previous
code path: previously, conflicts (arrays indexed as objects) would
be surfaced in the MakeDir step, now it's entangled with the patch
calculation.

storage/disk: check ctx.Err() in List/Get operations

This won't abort reading a single key, but it will abort iterations.

storage/disk: support patterns in partitions

There is a potential clash here: "*", the path wildcard, is
a valid path section. However, it only affects the case when
a user would want to have a partition at

    /foo/*/bar

and would really mean "*", and not the wildcard.

Storing data at /foo/*/bar with a literal "*" won't be treated
differently than storing something at /fo/xyz/bar.

storage/disk: keep per-txn-type histograms of stats

This is done by reading off the metrics on commit, and shovelling
their numbers into the prometheus collector.

NOTE: if you were to share a metrics object among multiple transactions,
the results would be skewed, as it's not reset. However, our server
handlers don't do that.

storage/disk: opt out of badger's conflict detection

With only one write transaction in flight at any time, the situation
that badger guards against cannot happen:

A transaction has written to a key after the current, to-be-committed
transaction has last read that key from the store.

Since it can't happen, we can ignore the bookkeeping involved. This
improves the time it takes to overwrite existing keys.

Signed-off-by: Stephan Renatus <[email protected]>

docs/management-bundles: remove quotations with info boxes (open-policy-agent#4517)

Small follow-up to open-policy-agent#4515, using info boxes instead of generic markup.

Signed-off-by: Stephan Renatus <[email protected]>

website: fix community site (open-policy-agent#4522)

* remove hugo templating

This page doesn't go through hugo at all, so these things would
end up on the website.

A quick fix is reverting the bits that added the templating. The
downside is that local dev and PR previews will link to
/docs/latest/ecosystem, which doesn't exist there.

* link Documentation and Download to /docs

For /docs/ (not /docs/latest) we have a redirect in local dev and
PR preview mode:

    /docs -> /docs/edge

so this will be a little less broken: only ecosystem is a 404 link
on the preview.

Signed-off-by: Stephan Renatus <[email protected]>

integrations: Add Emissary-Ingress (open-policy-agent#4523)

This PR adds the Emissary ingress to the Ecosystem page.
The blog section contains link to an article which guides on Emissary ingress and OPA integrations.

Signed-off-by: Tayyab J <[email protected]>

storage/disk: wildcard partition validation, docs caveat (open-policy-agent#4519)

A bunch of smaller follow-up tasks to open-policy-agent#4381.

* storage/disk_test: check invalid patches with wildcard partition, too
* docs/disk: add caveat re: bundles loaded into memory
* storage/disk: auto-manage /system partitions

If these are found in the user-provided partitions, we'll error out.

* storage/disk: pretty-print partitions with "*" instead of %2A
* storage/disk: respect wildcard-replacement in partition validation

It is now allowed to replace a partition like

    /foo/bar

by

    /foo/*

also if multiple wildcards are used.

Caveats:

You cannot add a wildcard partition like /*/*, since it would overlap
the managed "/system/*" partition.

When attempting to go back from /foo/* to /foo/bar, an error is
raised _unconditionally_ -- we could check the existing data, but
currently don't.

* storage/disk: check prefix when adding wildcard partitions

The previously done check would have falsely returned that there is no problem
when adding a wildcard partition: lookup of "/foo/*" with '*' not interpreted
as a wildcard, but as a string, would yield a not-found, even if there was any
data under /foo/.

Now, we'll check the prefix-until-wildcard. It's more cautious than
theoretically necessary, but safe.

Signed-off-by: Stephan Renatus <[email protected]>

website/style: fix code block bottom margin (open-policy-agent#4526)

The text following regular code blocks was too close. Now, the bottom
margin is the same was used with live-blocks, 1.5rem (24px).

Signed-off-by: Stephan Renatus <[email protected]>

docs: Add CloudFormation Hook tutorial (open-policy-agent#4525)

Obviously missing is deployment of OPA to AWS. This will come in the next iteration.

Signed-off-by: Anders Eknert <[email protected]>

Prepare v0.39.0 release (open-policy-agent#4524)

Signed-off-by: Stephan Renatus <[email protected]>

Prepare v0.40.0 development (open-policy-agent#4527)

Signed-off-by: Stephan Renatus <[email protected]>

website: fix styra support link

Signed-off-by: Torin Sandall <[email protected]>

adopters: Add Wealthsimple (open-policy-agent#4530)

Signed-off-by: hannahyeates <[email protected]>

ci: fix rego check (open-policy-agent#4532)

* build/policies: format using 0.39.0
* workflow/pull-request: use edge opa for rego PR checks

Signed-off-by: Stephan Renatus <[email protected]>

website: remove right margin on sidebar (open-policy-agent#4529)

* remove right margin on sidebar
* replace the removed margin with padding

Signed-off-by: orweis <[email protected]>

website: Remove unused variables to avoid error in strict mode(open-policy-agent#4534)

To Fix:
```
2 errors occurred:
policy.rego:13: rego_compile_error: assigned var header unused
policy.rego:13: rego_compile_error: assigned var signature unused
```

Signed-off-by: panpan0000 <[email protected]>

website: show yellow banner for old version (open-policy-agent#4533)

Hopefully this is gets more attention than the yellow "older" button
in the version picker.

Signed-off-by: Stephan Renatus <[email protected]>

fixing disk diagnostic test case

removing additional space

Signed-off-by: Damien Burks <[email protected]>

removing test files

Signed-off-by: Damien Burks <[email protected]>

fixing linting issue

Signed-off-by: Damien Burks <[email protected]>

adding test case for logging

Signed-off-by: Damien Burks <[email protected]>

ecosystem: adding permit.io (open-policy-agent#4531)

Signed-off-by: Oz Radiano <[email protected]>

docs: CloudFormation fixes (open-policy-agent#4541)

* Fix step headers to use h3
* Add warning about boolean values conversion

Signed-off-by: Anders Eknert <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants