fix(meta): meta server failed and could not be started normally while setting environment variables after dropping table (#2148) #2149

There are two problems that should be solved:

1. Why did the primary meta server fail with a `segfault` while dropping tables?
2. Why could none of the meta servers be restarted normally after the primary meta server failed?

A Pegasus cluster periodically flushes security policies to remote meta storage (at the interval configured by `update_ranger_policy_interval_sec`) in the form of environment variables. This is done through `server_state::set_app_envs()`. However, after the metadata has been updated on the remote storage (namely ZooKeeper), there is no check whether the table still exists before the environment variables in local memory are updated:

```C++
void server_state::set_app_envs(const app_env_rpc &env_rpc)
{
    ...
    do_update_app_info(app_path, ainfo, [this, app_name, keys, values, env_rpc](error_code ec) {
        CHECK_EQ_MSG(ec, ERR_OK, "update app info to remote storage failed");

        zauto_write_lock l(_lock);
        std::shared_ptr<app_state> app = get_app(app_name);

        std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        for (int idx = 0; idx < keys.size(); idx++) {
            app->envs[keys[idx]] = values[idx];
        }
        std::string new_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        LOG_INFO("app envs changed: old_envs = {}, new_envs = {}", old_envs, new_envs);
    });
}
```

In `std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');`, `app` is `nullptr` once the table has been dropped, so `app->envs` references an invalid address, leading to a `segfault` inside `libdsn_utils.so`, where `dsn::utils::kv_map_to_string` is defined. The cause of the 1st problem is therefore clear: the callback for updating metadata on remote storage runs right after the table has been removed, and dereferencing the null pointer accesses an invalid address.

After a restart, the meta server loads metadata from remote storage. Since all metadata for a table is a unitary JSON object, the whole JSON object is written to remote storage once any property is updated; as a result, the intermediate status `AS_DROPPING` was flushed to remote storage together with the security policies. However, `AS_DROPPING` is not a valid persisted status: it cannot pass the assertion below, which makes the meta server fail again and again. This is the reason for the 2nd problem:

```C++
server_state::sync_apps_from_remote_storage()
{
    ...
    std::shared_ptr<app_state> app = app_state::create(info);
    {
        zauto_write_lock l(_lock);
        _all_apps.emplace(app->app_id, app);
        if (app->status == app_status::AS_AVAILABLE) {
            app->status = app_status::AS_CREATING;
            _exist_apps.emplace(app->app_name, app);
            _table_metric_entities.create_entity(app->app_id, app->partition_count);
        } else if (app->status == app_status::AS_DROPPED) {
            app->status = app_status::AS_DROPPING;
        } else {
            CHECK(false,
                  "invalid status({}) for app({}) in remote storage",
                  enum_to_string(app->status),
                  app->get_logname());
        }
    }
    ...
}
```

To fix the 1st problem, we simply check whether the table still exists after the metadata has been updated on the remote storage. To fix the 2nd problem, we prevent metadata with the intermediate status `AS_DROPPING` from being flushed to remote storage. Minimal sketches of both fixes follow.
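A minimal sketch of the null check for the 1st problem, rewriting the callback quoted above (the early-return log message is an illustrative assumption, not necessarily what the merged patch emits):

```C++
do_update_app_info(app_path, ainfo, [this, app_name, keys, values, env_rpc](error_code ec) {
    CHECK_EQ_MSG(ec, ERR_OK, "update app info to remote storage failed");

    zauto_write_lock l(_lock);

    // The table may have been dropped while the remote update was in flight:
    // bail out instead of dereferencing a null pointer.
    std::shared_ptr<app_state> app = get_app(app_name);
    if (app == nullptr) {
        LOG_WARNING("app({}) has been dropped, skip updating its envs in local memory",
                    app_name);
        return;
    }

    std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
    for (int idx = 0; idx < keys.size(); idx++) {
        app->envs[keys[idx]] = values[idx];
    }
    std::string new_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
    LOG_INFO("app envs changed: old_envs = {}, new_envs = {}", old_envs, new_envs);
});
```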
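For the 2nd problem, one way to keep `AS_DROPPING` from ever reaching remote storage is to refuse the env update for any table that is not in the `AS_AVAILABLE` state before `do_update_app_info()` is invoked. The placement of this guard at the top of `set_app_envs()` and the error handling are assumptions for illustration; the actual patch may realize the check differently:

```C++
// Hypothetical guard at the beginning of server_state::set_app_envs(),
// before the app_info is flushed to remote storage.
{
    zauto_read_lock l(_lock);
    std::shared_ptr<app_state> app = get_app(app_name);
    if (app == nullptr || app->status != app_status::AS_AVAILABLE) {
        // The table is gone or being dropped: do not flush its app_info,
        // so an intermediate status such as AS_DROPPING is never persisted
        // and a restarted meta server can always reload the remote state.
        env_rpc.response().err = ERR_APP_NOT_EXIST;
        return;
    }
}
```

Guarding before the flush (rather than sanitizing the status afterwards) keeps the invariant that only `AS_AVAILABLE` and `AS_DROPPED` ever appear on remote storage, which is exactly what `sync_apps_from_remote_storage()` asserts.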