rocksdb_db_options won't apply #309

Closed
wenhaocs opened this issue Oct 3, 2023 · 3 comments
wenhaocs commented Oct 3, 2023

My own simulation (v1.6.3)

(1) I checked that max_background_jobs is at the RocksDB default value of 2.

(2) Then I changed it to 3 via kubectl edit.
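For the record, the change was made roughly like this. The spec.storaged.config field path is my reading of the NebulaCluster spec, so treat it as an assumption:

# sketch only: open the NebulaCluster (namespace and name taken from the operator logs below)
kubectl -n nebula edit nebulacluster nebula

# then set the flag under the storaged config section (field path assumed):
#   spec:
#     storaged:
#       config:
#         rocksdb_db_options: '{"max_background_jobs":"3"}'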

(3) Since this flag is dynamic, it does not trigger a restart. The options of the existing RocksDB instance stay the same.

(4) Now I create a space; the newly created RocksDB instance picks up the new value (works as expected).

(5) I then add a new static flag. Why is there no restart? (Update: because the default value is already 4096.)

When I use curl to check the flags in storaged, the value is updated. But this is not a dynamic flag, so it should have triggered a restart.
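The check is a plain GET against the storaged HTTP port (19779), the same /flags endpoint the operator polls in the logs further down; something like the following, with grep only to narrow the output:

# same /flags endpoint the operator queries on port 19779
curl -s http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags | grep rocksdb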

(6) After I update the static flag, a restart is triggered.

I1003 23:01:45.341053       1 nebula_cluster_controller.go:162] Start to reconcile NebulaCluster
I1003 23:01:45.377849       1 storaged_updater.go:185] pod [nebula/nebula-storaged-2] leader count is 0, ready for rolling update
I1003 23:01:45.377925       1 helper.go:231] dynamic flags: map[rocksdb_db_options:{"max_subcompactions":"3","max_background_jobs":"3"} wal_ttl:500]
E1003 23:01:45.380769       1 nebula_cluster_control.go:124] reconcile storaged cluster failed: update storaged cluster nebula-storaged dynamic flags failed: get http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags response body is empty
I1003 23:01:45.395742       1 nebulacluster.go:119] NebulaCluster [nebula/nebula] updated successfully
I1003 23:01:45.395768       1 nebula_cluster_controller.go:173] NebulaCluster [nebula/nebula] reconcile details: waiting for nebulacluster ready
E1003 23:01:45.395777       1 nebula_cluster_controller.go:184] NebulaCluster [nebula/nebula] reconcile failed: update storaged cluster nebula-storaged dynamic flags failed: get http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags response body is empty
I1003 23:01:45.395781       1 nebula_cluster_controller.go:143] Finished reconciling NebulaCluster [nebula/nebula] (54.860001ms), result: {false 5s}
I1003 23:01:50.396939       1 nebula_cluster_controller.go:162] Start to reconcile NebulaCluster
I1003 23:01:50.503747       1 storaged_updater.go:185] pod [nebula/nebula-storaged-2] leader count is 0, ready for rolling update
I1003 23:01:50.504058       1 helper.go:231] dynamic flags: map[rocksdb_db_options:{"max_subcompactions":"3","max_background_jobs":"3"} wal_ttl:500]
E1003 23:01:50.506426       1 nebula_cluster_control.go:124] reconcile storaged cluster failed: update storaged cluster nebula-storaged dynamic flags failed: get http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags response body is empty
I1003 23:01:50.522574       1 nebulacluster.go:119] NebulaCluster [nebula/nebula] updated successfully
I1003 23:01:50.522594       1 nebula_cluster_controller.go:173] NebulaCluster [nebula/nebula] reconcile details: waiting for nebulacluster ready
E1003 23:01:50.522601       1 nebula_cluster_controller.go:184] NebulaCluster [nebula/nebula] reconcile failed: update storaged cluster nebula-storaged dynamic flags failed: get http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags response body is empty
I1003 23:01:50.522607       1 nebula_cluster_controller.go:143] Finished reconciling NebulaCluster [nebula/nebula] (125.732283ms), result: {false 5s}
I1003 23:01:55.523212       1 nebula_cluster_controller.go:162] Start to reconcile NebulaCluster
I1003 23:01:55.559036       1 storaged_updater.go:185] pod [nebula/nebula-storaged-2] leader count is 0, ready for rolling update
I1003 23:01:55.559397       1 helper.go:231] dynamic flags: map[rocksdb_db_options:{"max_subcompactions":"3","max_background_jobs":"3"} wal_ttl:500]
E1003 23:01:55.561028       1 nebula_cluster_control.go:124] reconcile storaged cluster failed: update storaged cluster nebula-storaged dynamic flags failed: get http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags response body is empty
I1003 23:01:55.576263       1 nebulacluster.go:119] NebulaCluster [nebula/nebula] updated successfully
I1003 23:01:55.576283       1 nebula_cluster_controller.go:173] NebulaCluster [nebula/nebula] reconcile details: waiting for nebulacluster ready
E1003 23:01:55.576290       1 nebula_cluster_controller.go:184] NebulaCluster [nebula/nebula] reconcile failed: update storaged cluster nebula-storaged dynamic flags failed: get http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags response body is empty

But the rocksdb_db_options are not actually applied in RocksDB, even though the gflags are set. (BUG)
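To show the mismatch, compare the gflag reported by storaged with the options RocksDB actually logged when opening the DB. The data path below is my assumption about the default storaged layout; adjust it for your deployment:

# 1) what the gflag says (same /flags endpoint as above)
curl -s http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags | grep rocksdb_db_options

# 2) what RocksDB actually opened with, taken from its LOG file (path assumed)
kubectl -n nebula exec nebula-storaged-2 -- \
  sh -c 'grep -m1 max_background_jobs /usr/local/nebula/data/storage/nebula/*/data/LOG'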

(7) Thereafter, I remove the static flag I just added. The restart happens, but the rocksdb_db_options value still does not take effect in RocksDB, even though the gflags have been set. (BUG)

User's scenario

What the user was doing was running kubectl apply to update flags. After the update, they saw very high ingestion latency while disk load stayed low, and we saw a lot of write stalls. We then examined the RocksDB LOG and found that the compaction-related options were still at their defaults. It seems rocksdb_db_options only takes effect if it is set before the cluster is created.

Issues

  1. Why can the user still see the correct values in the RocksDB LOG on some storaged instances, while others are using the default values?
  2. Why is rocksdb_db_options not reflected in RocksDB? Note: the flag must already be set when a RocksDB instance is created; if you use curl to modify the flag after the RocksDB instance is created, it will not take effect (see the sketch after this list).
  3. Please make sure all changes via kubectl apply and kubectl edit are equivalent: if people use kubectl apply first and then kubectl edit to update the same static or dynamic flag, the value should still be picked up.
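
On point 2, this is how I understand the hot-update path: the Operator updates the gflag over HTTP after the process is up, but an already-opened RocksDB instance never re-reads it. The PUT method and JSON body below are an assumption about the web-service API; only the GET on /flags appears in the operator logs.

# assumed hot-update call; method and body format are not confirmed by the logs
curl -s -X PUT -H "Content-Type: application/json" \
  -d '{"rocksdb_db_options":"{\"max_subcompactions\":\"3\",\"max_background_jobs\":\"3\"}"}' \
  http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags

# the gflag reads back with the new value...
curl -s http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags | grep rocksdb_db_options

# ...but RocksDB instances opened before the update keep the options they were created with;
# only spaces created afterwards pick up the new settings.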

Here are the operator logs:
downloaded-logs-20231003-122906.csv

wenhaocs added the type/bug label Oct 3, 2023
github-actions bot added the affects/none and severity/none labels Oct 3, 2023
wenhaocs commented Oct 3, 2023

When testing, please make sure kubectl apply behaves as expected (it is how users update configs).

Sophie-Xie added this to the v1.6.x milestone Oct 4, 2023
MegaByte875 (Contributor) commented

#310

wenhaocs commented Oct 4, 2023

Regarding "Why can the user still see the correct values in the RocksDB LOG on some storaged instances, while others are using the default values?": it must be because some storaged processes restarted. Once a storaged restarts, its RocksDB instances are created with the default value of rocksdb_db_options; the Operator then sends a curl command to update rocksdb_db_options, which does not take effect on RocksDB instances that have already been created. Even on the storaged processes that seem to work fine and do not show high compaction latency, not all RocksDB instances have the right value set: the RocksDB instance for the default space still uses the default value, because it was created before the dynamic flag was set. All other RocksDB instances are created when a space is created, at which point the dynamic flag has already been applied. That is why all RocksDB instances on that storaged appear to pick up the correct value except for one.
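A quick way to see which instances on one storaged picked up the value (same data-path assumption as above) is to grep every space's RocksDB LOG:

# print the effective max_background_jobs per space directory on one storaged pod
kubectl -n nebula exec nebula-storaged-0 -- sh -c \
  'for f in /usr/local/nebula/data/storage/nebula/*/data/LOG; do
     echo "$f: $(grep -m1 max_background_jobs "$f")"
   done'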

The process of updating static and dynamic flags is the same whether it is a first start or a restart: the static flags are always updated first and passed as the startup flags of graphd/metad/storaged. After the whole StatefulSet is in the Running state, the Operator sends curl commands to update the dynamic flags.
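In other words, as I understand the ordering (illustration only, not the Operator's actual code):

# 1. render the static flags into the startup configuration of graphd/metad/storaged
# 2. (re)start the StatefulSet so the processes come up with those static flags
# 3. once every pod is Running, hot-apply the dynamic flags over HTTP, e.g.
#    (PUT method and body format assumed, not confirmed):
curl -s -X PUT -H "Content-Type: application/json" -d '{"wal_ttl":"500"}' \
  http://nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local:19779/flags
# 4. any RocksDB instance already opened in step 2 never re-reads rocksdb_db_options,
#    which is where the values get lost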
