Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport of Fix issue with peer stream node cleanup. into release/1.14.x #17246

Conversation

hc-github-team-consul-core
Copy link
Collaborator

Backport

This PR is auto-generated from #17235 to be assessed for backporting due to the inclusion of the label backport/1.14.

The below text is copied from the body of the original PR.


This commit encompasses a few problems that are closely related due to their proximity in the code.

  1. The peerstream utilizes node IDs in several locations to determine which nodes / services / checks should be cleaned up or created. While VM deployments with agents will likely always have a node ID, agentless uses synthetic nodes and does not populate the field. This means that for consul-k8s deployments, all services were likely bundled together into the same synthetic node in some code paths (but not all), resulting in strange behavior. The Node.Node field should be used instead as a unique identifier, as it should always be populated.

  2. The peerstream cleanup process for unused nodes uses an incorrect query for node deregistration. This query is NOT namespace aware and results in the node (and corresponding services) being deregistered prematurely whenever it has zero default-namespace services and 1+ non-default-namespace services registered on it. This issue is tricky to find due to the incorrect logic mentioned in 1, combined with the fact that the affected services must be co-located on the same node as the currently deregistering service for this to be encountered.

  3. The stream tracker did not understand differences between services in different namespaces and could therefore report incorrect numbers. It was updated to utilize the full service name to avoid conflicts and return proper results.


Overview of commits

Paul Glass and others added 30 commits March 7, 2023 14:05
* NET-2954: Improve integration tests CI execution time

* fix ci

* remove comments and modify config file
* Update changelog with Consul patch releases 1.13.7, 1.14.5, 1.15.1

* Bump submodules from latest patch release

* Forgot one
* added a backport-checker GitHub action

* Update .github/workflows/backport-checker.yml
* Upgrade ember-intl

* Add changelog

* Add yarn lock
* Add namespace file with build tag for OSS tests

* Remove TODO comment
* jira pr check filter out dependabot and oss/ent merges
Add peer locality to discovery chains
* fixes for unsupported partitions field in CRD metadata block

* Apply suggestions from code review

Co-authored-by: Luke Kysow <[email protected]>

---------

Co-authored-by: Luke Kysow <[email protected]>
* Consul WAN Fed with Vault Secrets Backend document updates

* Corrected dc1-consul.yaml and dc2-consul.yaml file highlights

* Update website/content/docs/k8s/deployment-configurations/vault/wan-federation.mdx

Co-authored-by: trujillo-adam <[email protected]>

* Update website/content/docs/k8s/deployment-configurations/vault/wan-federation.mdx

Co-authored-by: trujillo-adam <[email protected]>

---------

Co-authored-by: trujillo-adam <[email protected]>
Co-authored-by: Ashvitha Sridharan <[email protected]>
Co-authored-by: Freddy <[email protected]>

Add a new envoy flag: "envoy_hcp_metrics_bind_socket_dir", a directory
where a unix socket will be created with the name
`<namespace>_<proxy_id>.sock` to forward Envoy metrics.

If set, this will configure:
- In bootstrap configuration a local stats_sink and static cluster.
  These will forward metrics to a loopback listener sent over xDS.

- A dynamic listener listening at the socket path that the previously
  defined static cluster is sending metrics to.

- A dynamic cluster that will forward traffic received at this listener
  to the hcp-metrics-collector service.


Reasons for having a static cluster pointing at a dynamic listener:
- We want to secure the metrics stream using TLS, but the stats sink can
  only be defined in bootstrap config. With dynamic listeners/clusters
  we can use the proxy's leaf certificate issued by the Connect CA,
  which isn't available at bootstrap time.

- We want to intelligently route to the HCP collector. Configuring its
  addreess at bootstrap time limits our flexibility routing-wise. More
  on this below.

Reasons for defining the collector as an upstream in `proxycfg`:
- The HCP collector will be deployed as a mesh service.

- Certificate management is taken care of, as mentioned above.

- Service discovery and routing logic is automatically taken care of,
  meaning that no code changes are required in the xds package.

- Custom routing rules can be added for the collector using discovery
  chain config entries. Initially the collector is expected to be
  deployed to each admin partition, but in the future could be deployed
  centrally in the default partition. These config entries could even be
  managed by HCP itself.
This commit adds a sameness-group config entry to the API and structs packages. It includes some validation logic and a new memdb index that tracks the default sameness-group for each partition. Sameness groups will simplify the effort of managing failovers / intentions / exports for peers and partitions.

Note that this change purely to introduce the configuration entry and does not include the full functionality of sameness-groups.
If a CA config update did not cause a root change, the codepath would return early and skip some steps which preserve its intermediate certificates and signing key ID. This commit re-orders some code and prevents updates from generating new intermediate certificates.
* Add copyright headers to UI files

* Ensure copywrite file ignores external libs
* docs(discovery): typo

* docs(discovery): EOF and trim lines

---------

Co-authored-by: trujillo-adam <[email protected]>
This commit fixes an issue where trust bundles could not be read
by services in a non-default namespace, unless they had excessive
ACL permissions given to them.

Prior to this change, `service:write` was required in the default
namespace in order to read the trust bundle. Now, `service:write`
to a service in any namespace is sufficient.
* Add known issues to Raft WAL docs.

* Refactor update based on review feedback
@hc-github-team-consul-core hc-github-team-consul-core requested review from sarahethompson and modrake and removed request for a team May 8, 2023 18:13
@hc-github-team-consul-core hc-github-team-consul-core force-pushed the backport/derekm/NET-3007/fix-peer-stream-cleanup/friendly-caring-krill branch from e934397 to 10b7a69 Compare May 8, 2023 18:13
@hc-github-team-consul-core hc-github-team-consul-core enabled auto-merge (squash) May 8, 2023 18:13
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto approved Consul Bot automated PR

@github-actions github-actions bot added pr/dependencies PR specifically updates dependencies of project theme/acls ACL and token generation theme/agent-cache Agent Cache theme/api Relating to the HTTP API interface theme/certificates Related to creating, distributing, and rotating certificates in Consul theme/cli Flags and documentation for the CLI interface theme/config Relating to Consul Agent configuration, including reloading theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies theme/consul-terraform-sync Relating to Consul Terraform Sync and Network Infrastructure Automation theme/contributing Additions and enhancements to community contributing materials theme/envoy/xds Related to Envoy support theme/health-checks Health Check functionality theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics theme/telemetry Anything related to telemetry or observability theme/tls Using TLS (Transport Layer Security) or mTLS (mutual TLS) to secure communication theme/ui Anything related to the UI type/ci Relating to continuous integration (CI) tooling for testing or releases type/docs Documentation needs to be created/updated/clarified labels May 8, 2023
@hashi-derek hashi-derek disabled auto-merge May 8, 2023 18:14
@hashi-derek hashi-derek added the pr/do-not-merge PR cannot be merged in its current form. label May 8, 2023
@hashi-derek hashi-derek closed this May 8, 2023
@hashi-derek
Copy link
Member

Forcing this backport to close due to strange history of 800+ commits appearing. Will look into why the automation caused this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/dependencies PR specifically updates dependencies of project pr/do-not-merge PR cannot be merged in its current form. theme/acls ACL and token generation theme/agent-cache Agent Cache theme/api Relating to the HTTP API interface theme/certificates Related to creating, distributing, and rotating certificates in Consul theme/cli Flags and documentation for the CLI interface theme/config Relating to Consul Agent configuration, including reloading theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies theme/consul-terraform-sync Relating to Consul Terraform Sync and Network Infrastructure Automation theme/contributing Additions and enhancements to community contributing materials theme/envoy/xds Related to Envoy support theme/health-checks Health Check functionality theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics theme/telemetry Anything related to telemetry or observability theme/tls Using TLS (Transport Layer Security) or mTLS (mutual TLS) to secure communication theme/ui Anything related to the UI type/ci Relating to continuous integration (CI) tooling for testing or releases type/docs Documentation needs to be created/updated/clarified
Projects
None yet
Development

Successfully merging this pull request may close these issues.