Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed exception when user tries to log in #42641

Merged
merged 13 commits into from
Nov 30, 2022

Conversation

Enmk
Copy link
Contributor

@Enmk Enmk commented Oct 25, 2022

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fixed unable to log in (because of failure to create session_log entry) in rare case of messed up setting profiles.
...

Closes: #35952

Fixed SettingsProfilesInfo::getProfileNames() throwing std::exception

  • Using profiles_with_implicit for profile IDs, since it seems to be better synchronized with names
  • Trying harder: attempting to get name from AccessControl if it is missing from local cached names
  • Throwing Exception if everything else fails

Fix is based on a stack trace of #35952 and some notes in bug description.
There are no tests added since I wasn't able to reproduce bug on my side, however same code path must be checked by existing tests.

Bug analysis

  • While populating SettingsProfilesInfo::profiles_with_implicit and SettingsProfilesInfo::names_of_profiles, SettingsProfilesCache::substituteProfiles can't find a profile by UUID in SettingsProfilesCache::all_profiles.
  • So SettingsProfilesInfo::profiles_with_implicit and SettingsProfilesInfo::names_of_profiles do not have an entry for a profile from SettingsProfilesInfo::profiles.
  • This could happen only if SettingsProfilesCache::all_profiles updated asynchronously, but it is protected by mutex.

So just to be safe SettingsProfilesInfo::getProfileNames uses profile UIDs from SettingsProfilesInfo::profiles_with_implicit instead of SettingsProfilesInfo::profiles, since SettingsProfilesInfo::profiles_with_implicit and SettingsProfilesInfo::names_of_profiles are explicitly synchronized by SettingsProfilesCache::substituteProfiles.

- Using profiles_with_implicit for profile IDs, since it seems to be better synchronized with names
- Trying harder: attempting to get name from AccessControl if it is missing from local cached names
- Throwing Exception if everything else fails
@robot-ch-test-poll robot-ch-test-poll added the pr-bugfix Pull request with bugfix, not backported by default label Oct 25, 2022
@tavplubix tavplubix added the can be tested Allows running workflows for external contributors label Oct 26, 2022
@rschu1ze rschu1ze self-assigned this Nov 2, 2022
@rschu1ze
Copy link
Member

rschu1ze commented Nov 3, 2022

@Enmk I am not really familiar with the code and its constraints and I will certainly not oppose your fix. But, to be honest, it feels like a bandaid for an issue that should never have happened in the first place. I agree with your section "bug analysis", in particular that the member variables of SettingsProfileInfo are out-of-sync and that they are set (at least as far as I see) by SettingsProfilesCache::substituteProfiles(). So the question is what could have caused this problem. Thoughts:

  • We could use Clang TSA to safeguard the locking in SettingsProfileCache, see Support for Clang Thread Safety Analysis (TSA) #38068
  • Is substituteProfiles really exception-safe? E.g. it does lots of .insert(), .push_back() and .emplace() where a bad_alloc exception could mess up the state of SettingsProfileInfo. This could be fixed by first reserving sufficient space in all members and then populating them.
  • Maybe there are other places in this function which can throw.

Even if there is no repro (which is okay, perhaps the issue occurs really super sporadically), my preference would be to make the settings profile creating code more robust.

…lesInfo::profiles_with_implicit

That means SettingsProfilesInfo::profiles now:
- has default_profile_id (if enabled)
- doesn't have profiles that are not matching for current user/roles
@Enmk
Copy link
Contributor Author

Enmk commented Nov 23, 2022

  • Is substituteProfiles really exception-safe? E.g. it does lots of .insert(), .push_back() and .emplace() where a bad_alloc exception could mess up the state of SettingsProfileInfo. This could be fixed by first reserving sufficient space in all members and then populating them.

substituteProfiles is not exception safe, but there should be no way to modify inconsistently SettingsProfilesInfo in SettingsProfilesInfo due to exception. It only modifies the state of the stack-based SettingsProfilesInfo instance, so if an exception is thrown, said instance is not added to the SettingsProfilesCache::profile_infos_cache and lost.

From what I see, substituteProfiles is always called with a locked SettingsProfilesCache::mutex (directly or indirectly):

graph TD
    mergeSettingsAndConstraints --> mergeSettingsAndConstraintsFor --> substituteProfiles
    getSettingsProfileInfo --> |locks mutex | substituteProfiles
    getEnabledSettings --> |locks mutex | mergeSettingsAndConstraintsFor
    profileRemoved --> |locks mutex | mergeSettingsAndConstraints
    profileAddedOrChanged --> |locks mutex | mergeSettingsAndConstraints
Loading

Took another look, and probably fixed the most probable cause in mergeSettingsAndConstraintsFor, see
aa78f1c

@rschu1ze ^^

@rschu1ze
Copy link
Member

Thanks for the analysis, I agree with it, and sorry for the delay (just got back from vacation)

Please find one smaller comment attached. Functional test test_function_current_profiles looks fishy, like to check?

@rschu1ze
Copy link
Member

test_function_current_profiles unfortunately still fails.

[gw0] [100%] FAILED test_settings_profile/test.py::test_function_current_profiles 

=================================== FAILURES ===================================
________________________ test_function_current_profiles ________________________
[gw0] linux -- Python 3.8.10 /usr/bin/python3

    def test_function_current_profiles():
        instance.query("CREATE SETTINGS PROFILE P1, P2")
        instance.query("ALTER USER robin SETTINGS PROFILE P1, P2")
        instance.query("CREATE SETTINGS PROFILE P3 TO robin")
        instance.query("CREATE SETTINGS PROFILE P4")
        instance.query("CREATE SETTINGS PROFILE P5 SETTINGS INHERIT P4")
        instance.query("CREATE ROLE worker SETTINGS PROFILE P5")
        instance.query("GRANT worker TO robin")
        instance.query("CREATE SETTINGS PROFILE P6")
    
        session_id = new_session_id()
>       assert (
            instance.http_query(
                "SELECT defaultProfiles(), currentProfiles(), enabledProfiles()",
                user="robin",
                params={"session_id": session_id},
            )
            == "['P1','P2']\t['P1','P2']\t['default','P3','P4','P5','P1','P2']\n"
        )
E       assert "['P1','P2']\t['default','P3','P5','P1','P2']\t['default','P3','P4','P5','P1','P2']\n" == "['P1','P2']\t['P1','P2']\t['default','P3','P4','P5','P1','P2']\n"
E         - ['P1','P2']	['P1','P2']	['default','P3','P4','P5','P1','P2']
E         + ['P1','P2']	['default','P3','P5','P1','P2']	['default','P3','P4','P5','P1','P2']
E         ?            	 ++++++++++++++++++++

@Enmk
Copy link
Contributor Author

Enmk commented Nov 29, 2022

Errors analysis for 2684fc5b

ClickHouse Integration Tests (asan) [1/6]

/usr/local/lib/python3.8/dist-packages/kazoo/client.py:635: KazooTimeoutError
E               Failed: Timeout >900.0s

ClickHouse Integration Tests (asan) [5/6]

>       assert (
            instance.http_query(
                "SELECT defaultProfiles(), currentProfiles(), enabledProfiles()",
                user="robin",
                params={"session_id": session_id},
            )
            == "['P1','P2']\t['P1','P2']\t['default','P3','P4','P5','P1','P2']\n"
        )
E       assert "['P1','P2']\t['default','P3','P5','P1','P2']\t['default','P3','P4','P5','P1','P2']\n" == "['P1','P2']\t['P1','P2']\t['default','P3','P4','P5','P1','P2']\n"
E         - ['P1','P2']	['P1','P2']	['default','P3','P4','P5','P1','P2']
E         + ['P1','P2']	['default','P3','P5','P1','P2']	['default','P3','P4','P5','P1','P2']
E         ?            	 ++++++++++++++++++++

ClickHouse Integration Tests (release) [3/4]

ClickHouse Integration Tests (tsan)

E           helpers.client.QueryRuntimeException: Client failed! Return code: 209, stderr: Received exception from server (version 22.12.1):
E           Code: 209. DB::Exception: Received from 172.16.8.2:9000. DB::NetException. DB::NetException: Timeout exceeded while reading from socket (node_1:9000, receive timeout 3000 ms): While executing Remote. Stack trace:
...

ClickHouse Integration Tests (tsan) [5/6]

ClickHouse Stress Test (asan)

ClickHouse Stress Test (debug)

Looks like unrelated OOM

Conclusion

Need to modify tests/integration/test_settings_profile/test.py::test_function_current_profiles to reflect fixed behavior of currentProfiles()

Comment on lines +71 to +82
const auto p = names_of_profiles.find(profile_id);
if (p != names_of_profiles.end())
result.push_back(p->second);
else
{
if (const auto name = access_control.tryReadName(profile_id))
// We could've updated cache here, but it is a very rare case, so don't bother.
result.push_back(*name);
else
throw Exception(ErrorCodes::LOGICAL_ERROR, "Unable to get profile name for {}", toString(profile_id));
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enmk, what does it fix and how? Looks like it simply replaces std::out_of_range with LOGICAL_ERROR without fixing the root casue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I think there should be profiles_with_implicit instead of profiles in:

 for (const auto & profile_id : profiles)

good catch! Is there a new related crash or bug?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Enmk SettingsProfilesInfo must be internally consistent, so I think if we fix that consistency in #57263 then we only need here a logical error:

    for (const auto & profile_id : profiles)
    {
        const auto p = names_of_profiles.find(profile_id);
        if (p != names_of_profiles.end())
            result.push_back(p->second);
        else
            throw Exception(ErrorCodes::LOGICAL_ERROR, "Unable to get profile name for {}", toString(profile_id));
    }

Enmk added a commit to Enmk/ClickHouse that referenced this pull request Nov 27, 2023
`SettingsProfilesInfo::profiles` is not updated in bg if any of the profiles assigned to the user change, but `SettingsProfilesInfo::profiles_with_implicit` is.

Update of ClickHouse#42641
kudos @tavplubix  https://github.com/ClickHouse/ClickHouse/pull/42641/files/3d0c07ac5b8f18917f2314474030910176ec7940#r1406196201
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Code: 1001. DB::Exception: std::out_of_range: unordered_map::at: key not found
5 participants