multitenant: network page of console doesn't load for application tenant #115022

Open
ajstorm opened this issue Nov 23, 2023 · 3 comments
Labels
A-cluster-observability: Related to cluster observability
A-multitenancy: Related to multi-tenancy
branch-release-23.2: Used to mark GA and release blockers, technical advisories, and bugs for 23.2
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-testcluster: Issues found or occurred on a test cluster, i.e. a long-running internal cluster
P-2: Issues/test failures with a fix SLA of 3 months
T-db-server

Comments

@ajstorm
Collaborator

ajstorm commented Nov 23, 2023

This is what you get when you try to look at the Network page of the console from the application tenant:

image

From the system tenant, it works fine:

image

If we can't render this page properly from the application tenant, it shouldn't appear in the sidebar.

Jira issue: CRDB-33845

Epic CRDB-38968

@ajstorm ajstorm added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-multitenancy Related to multi-tenancy T-multitenant Issues owned by the multi-tenant virtual team T-cluster-observability A-cluster-observability Related to cluster observability labels Nov 23, 2023
@ajstorm ajstorm added the O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster label Nov 23, 2023
@kevin-v-ngo

Running into this issue as well

@ajstorm ajstorm added P-1 Issues/test failures with a fix SLA of 1 month release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 28, 2023

blathers-crl bot commented Nov 30, 2023

Hi @ajstorm, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@ajstorm ajstorm added GA-blocker branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 30, 2023
stevendanna added a commit to stevendanna/cockroach that referenced this issue Dec 6, 2023
The network page doesn't work inside a virtual cluster yet.  Rather
than just presenting a spinner, here we add a warning to the page.

Informs cockroachdb#115022

Release note: None
craig bot pushed a commit that referenced this issue Dec 6, 2023
114616: logstore: sync sideloaded storage directories r=erikgrinaker a=pavelkalinnikov

This PR ensures that the hierarchy of directories/files created by the sideloaded log storage is properly synced.

Previously, only the "leaf" files in this hierarchy were fsync-ed. Even though this guarantees that the files' content and metadata are synced, it still does not guarantee that the references to these files are durable. For example, the Linux man page for `fsync` [^1] says:

```
Calling fsync() does not necessarily ensure that the entry in the
directory containing the file has also reached disk.  For that an
explicit fsync() on a file descriptor for the directory is also
needed.
```

This means that these files can be lost after a system crash or power-off. This leads to issues because:

1. Pebble WAL syncs are not atomic with the sideloaded file syncs. It is thus possible that raft log metadata "references" a sideloaded file and gets synced, but the file itself is not yet synced. A power off/on at this point leads to an internal inconsistency, and can result in a crash loop when raft tries to load these entries to apply them and/or send them to other replicas.

2. The durability of entry files is used as a pre-condition to sending raft messages that trigger committing these entries. A coordinated power off/on on a majority of replicas can thus lead to losing committed entries and unrecoverable loss-of-quorum.

This PR fixes the above issues by syncing the parents of all the directories and files that the sideloaded storage creates.
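
The directory-sync idea described above can be illustrated with a minimal sketch. This is not the CockroachDB implementation (which is written in Go); it is a Python illustration that assumes a POSIX system where a directory can be opened read-only and passed to `fsync`. The names `fsync_dir` and `durable_write` are hypothetical and exist only for this example.

```python
import os
import tempfile
from pathlib import PurePath


def fsync_dir(path: str) -> None:
    """fsync a directory so that entries created inside it reach disk.

    Assumes a POSIX system; opening a directory this way fails on Windows.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_write(root: str, rel_path: str, data: bytes) -> str:
    """Hypothetical helper: write data to root/rel_path durably.

    Every directory created along the way has its parent synced, and the
    file's parent directory is synced after the file itself.
    """
    parts = PurePath(rel_path).parts
    current = root
    # Create missing directories one level at a time. After creating a
    # directory, sync its parent so the new directory entry is durable.
    for name in parts[:-1]:
        child = os.path.join(current, name)
        if not os.path.isdir(child):
            os.mkdir(child)
            fsync_dir(current)
        current = child

    full = os.path.join(current, parts[-1])
    with open(full, "wb") as f:
        f.write(data)
        f.flush()
        # Sync the file's content and metadata (the "leaf" sync).
        os.fsync(f.fileno())
    # Sync the containing directory so the reference to the file is durable.
    fsync_dir(current)
    return full
```

Syncing only the leaf file corresponds to the behavior before the fix; the additional `fsync_dir` calls correspond to syncing the parents of each created directory and file.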

[^1]: https://man7.org/linux/man-pages/man2/fsync.2.html

Part of #114411

Epic: none

Release note (bug fix): this commit fixes a durability bug in raft log storage, caused by incorrect syncing of filesystem metadata. It was possible to lose writes of a particular kind (`AddSSTable`) that is e.g. used by `RESTORE`. This loss was possible only under power-off or OS crash conditions. As a result, CRDB could enter a crash loop on restart. In the worst case of a correlated power-off/crash across multiple nodes this could lead to loss of quorum or data loss.

115689: ui: add warning to network page when unavailable r=maryliag a=stevendanna

The network page doesn't work inside a virtual cluster yet.  Rather than just presenting a spinner, here we add a warning to the page. Additionally, it simplifies the text of the warning presented on the Advanced Debug page.

Informs #115022

Release note: None

115705: kv,admission: only log empty admission warning for non-release builds r=aadityasondhi a=aadityasondhi

This error message, while useful for debugging, spams the logs with a stack trace which can be distracting when reading the logs.

Since AC defaults to skip when there is an empty header, this is not a concern, unless we see real-world performance impact (which we have not).

This patch removes it from release builds while we figure out all the sources for missing headers.

Informs #112680

Release note: None

Co-authored-by: Pavel Kalinnikov
Co-authored-by: Steven Danna
Co-authored-by: Aaditya Sondhi
blathers-crl bot pushed a commit that referenced this issue Dec 8, 2023
The network page doesn't work inside a virtual cluster yet.  Rather
than just presenting a spinner, here we add a warning to the page.

Informs #115022

Release note: None
@stevendanna stevendanna added P-2 Issues/test failures with a fix SLA of 3 months and removed GA-blocker P-1 Issues/test failures with a fix SLA of 1 month labels Dec 12, 2023
@stevendanna
Collaborator

We've now disabled this page in MT mode so I removed the release blocker and changed this to P-2.

@stevendanna stevendanna removed the T-multitenant Issues owned by the multi-tenant virtual team label Jul 9, 2024