Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Check init tasks #478

Merged
merged 27 commits into from
Jul 11, 2024
Merged

feat: Check init tasks #478

merged 27 commits into from
Jul 11, 2024

Conversation

CarlosNihelton
Copy link
Collaborator

@CarlosNihelton CarlosNihelton commented Jun 27, 2024

This affects only the registration time. Subsequent boots don't check for cloud-init or depend on it in any way.

If the distro launcher is used to register a WSL instance, we check if there is a cloud-init command in PATH, and if so we launch cloud-init status --wait right after registration is complete.
This will wait until cloud-init completes what's doing or report error/disabled states. By having a Win32 process mapped to that command seems enough to make WSL consider the boot is complete, thus preventing WSL boot time watchdog to kick in.


EDIT:
Cloud-init could have created Linux user accounts. Users could have been created in many other ways, even pre-created during image build. We expect the desired default user to be set via /etc/wsl.conf. That would work even without the distro launcher, so it's reasonable to assume the desired default to be present there.

We "parse" that file and, if there is the default user entry, then we set it via WSL API (which is faster than terminanting and restarting the instance). In order to leverage Windows primitives for parsing INI, we have to make a temporary copy of that file into the Windows filesystem.

If we don't find a default user set, we check if the current default user is still root, in which case we look for any non-system
user in the NSS passwd database (by launching getent), i.e. any user with the lowest UID greather than or equals to 1000 and have a real login shell (no /bin/sync,/bin/false or /usr/bin/nologin).

Finally, we fallback to the interative user creation, i.e. the default upstream prompting.


I originally intended to put this implementation inside DistributionInfo.h/cpp to ensure less files being touched (thus less
diff compared to upstream) but that doesn't play nice with CI 'Refresh releases and assets' workflow.

So I placed the code in its own component, and inside a subdirectory, so if there comes a time we need to revert that change (it becomes obsolete or so) it remains easy to drop. Also, being inside a subdirectory allows me to drop a .clang-format in
there without worrying of reformatting upstream code :)

Lessons learned from the launcher's old times:

  • We can put cpp files inside a subdirectory and still work with precompiled headers in Visual Studio if we either adjust the include path for every cpp file or adjust the path to the precompiled header for every cpp file in the vcxproj file (no option is winner :) )

I didn't add too many test cases because I was afraid of taking too long to register/unregister in between tests.
I also avoided adding stuff that could make the tests brittle such as installing huge apt packages (in fact I'm installing nothing at all at this moment), attaching to Ubuntu Pro, enrolling into Landscape and what not. Up for discussion on what makes sense to add here. I also didn't play with cloud-init user-data file selection. Although I could, I believe that's better tested as a unit test of the datasource in the upstream repository itself. Up for discussion as well.

Finally we cannot use private directories to behave as %USERPROFILE%, we need to put stuff inside the real user profile directory on Windows, which is sad. We can set %USERPROFILE% in the Windows shell appropriately and subprocesses of the launcher (either Linux or Windows processes) find the correct value, but systemd services won't. They will still match up the global %USERPROFILE%, one reason why I struggled to implement those tests.


UDENG-2768

@CarlosNihelton CarlosNihelton self-assigned this Jun 27, 2024
@CarlosNihelton CarlosNihelton marked this pull request as ready for review June 27, 2024 20:13
@CarlosNihelton CarlosNihelton requested a review from a team as a code owner June 27, 2024 20:13
Copy link
Member

@didrocks didrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before going on with the PR, I think we should clarify this:

If we don't find a default user set, than we allow fallback to the interative user creation, i.e. the default upstream prompting.

I think we said that if we don’t find anything in wsl.conf, we still list all users and take the first one with UID >= 1000? I can imagine people using cloud-init use the standard user creation creation but don’t know about wsl.conf and we want to support them.

My suggestion would be, once the launcher waited on cloud-init as you did in that PR:

  • look at wsl.conf, if there is a value there, ensure the user exists on the system (with getent passwd), update the windows registry
  • if nothing, getent passwd list all users with UID >= 1000. If found, update wsl.conf and the windows registry
  • otherwise, fallback to interactive user creation.

Does this make sense?

@CarlosNihelton
Copy link
Collaborator Author

Before going on with the PR, I think we should clarify this:

If we don't find a default user set, than we allow fallback to the interative user creation, i.e. the default upstream prompting.

I think we said that if we don’t find anything in wsl.conf, we still list all users and take the first one with UID >= 1000? I can imagine people using cloud-init use the standard user creation creation but don’t know about wsl.conf and we want to support them.

My suggestion would be, once the launcher waited on cloud-init as you did in that PR:

  • look at wsl.conf, if there is a value there, ensure the user exists on the system (with getent passwd), update the windows registry
  • if nothing, getent passwd list all users with UID >= 1000. If found, update wsl.conf and the windows registry
  • otherwise, fallback to interactive user creation.

Does this make sense?

While it does make sense, I've been resisting to move towards that direction because it becomes harder (IMHO) to explain to users why certain cloud-init user-data behaves in one way when the distro is set up via the launcher and differently when through wsl --import for example. To date I still struggle to find a clear way to present this difference in behavior as a feature of the distro launcher so users won't just assume coming straight from cloud-init and being surprised when it doesn't happen, other than relying on Ubuntu WSL documentation. I think that can be confusing. Do you still believe it's worth investing in doing this augmentation of cloud-init in the distro launcher?

@didrocks
Copy link
Member

It does not depend on cloud-init directly as we don’t really care the way the user was created on the machine. I can be a pre-created image with a default user and they didn’t create/know about wsl.conf.

We don’t want those users to grow that knowledge and the launcher should do the right thing silently.

@jibel opinions?

@jibel
Copy link
Collaborator

jibel commented Jul 1, 2024

It does not depend on cloud-init directly as we don’t really care the way the user was created on the machine. I can be a pre-created image with a default user and they didn’t create/know about wsl.conf.

We don’t want those users to grow that knowledge and the launcher should do the right thing silently.

@jibel opinions?

Looking at the order of priority to consider a default user for WSL it's:
wsl.conf > registry
wsl considers nothing else.

So we should check:

  • wsl.conf
  • registry
  • first user >= 1000 (possibly created by cloud-init or a wsl import or something else we don't know)
  • interactive mode

Then create an entry in wsl.conf based on this discovery if there is none so wsl -d starts with the default user

@CarlosNihelton
Copy link
Collaborator Author

I changed the PR description to match the current implementation after our discussion. The changes in the description are marked as shown below:


EDIT:
...


Copy link
Member

@didrocks didrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are getting there! I still see some discrepancy with what was discussed on MM though which I think we should fix before merging:

  1. We told that the order is (from what wsl command is considering):
  • wsl.conf
  • key in registry

If I am correct, we don’t check the registry here, only set it. I think similarly than wsl.conf, we should consider it as "a default has been set".

  1. You are considering a range rather than user >= 1000. I think you are right to not consider the nobody user, but there are still some other potentials users with a way larger uid when they are registered through Active Directory for instance or any other dynamic "bindings". I think the approach to ignore some users is correct and we should filter by the shell associated with it.

  2. I see that you try to "fix" any invalid wsl.conf (and if the registry is checked, I think you would do the same). I am more in favor of "so be it". If wsl.conf or the registry are configured, that was an intent by the rootfs owner or by the cloud-init which was set. If this is set, let’s not do anything, no double checking, no fix and error out if this is incorrect.
    That way, we don’t diverge from "we have a cloud-init" behavior to "there is none" with the wsl command itself?

So in a nutshell, I think we should:

  • if there is wsl.conf user entry -> no additional check, no user creation, let’s then wsl run with this user (even if it doesn’t exist)
  • else if there is a registry key -> no additional check, no user creation, let’s then wsl run with this user (even if it doesn’t exist)
  • else if there is one: get first user with uid >= 1000 and an associated shell (no /bin/false, no /usr/sbin/nologin) -> set it as default in wsl.conf (do we need to update the registry if wsl.conf always takes precedence?)
  • else -> default to interactive prompting

Does it make sense?

NB: I feel sorry you had to write our own ini parser here. I tried to review the C++ part to the best of my knowledge but I fully trust you on that side :)

.github/workflows/e2e.yaml Show resolved Hide resolved
DistroLauncher/Ubuntu/InitTasks.cpp Outdated Show resolved Hide resolved
DistroLauncher/Ubuntu/InitTasks.cpp Show resolved Hide resolved
DistroLauncher/Ubuntu/InitTasks.cpp Outdated Show resolved Hide resolved
DistroLauncher/Ubuntu/InitTasks.cpp Show resolved Hide resolved
DistroLauncher/Ubuntu/InitTasks.cpp Outdated Show resolved Hide resolved
DistroLauncher/Ubuntu/InitTasks.cpp Outdated Show resolved Hide resolved
DistroLauncher/Ubuntu/InitTasks.cpp Outdated Show resolved Hide resolved
DistroLauncher/Ubuntu/InitTasks.cpp Show resolved Hide resolved
e2e/launchertester/basic_setup_test.go Outdated Show resolved Hide resolved
// Decomposes a line such as `key = value` into its [key] and [value] components.
std::pair<std::string_view, std::string_view> matchKeyValuePair(std::string_view line);

// A limited INI "parser" capable of retrieving the value of a [section].key entry.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a Windows dev expert here, but why do not you use the Windows primitive to read/write the ini file?
For reference: https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-getprivateprofilestring

It would reduce the size of the code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That INI-related Win32 functions works will all sorts of local paths, with or without Windows line terminations, but unfortunately not with WSL paths (or network paths in general I assume).

You can check a proof-of-concept on GitHub

If we really want to leverage it. we could copy the file to some temporary location on Windows to parse it using that function. That will certainly be shorter, though it seems a bit strange to me having to copy a file to another location to be able to read it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we really want to leverage it. we could copy the file to some temporary location on Windows to parse it using that function. That will certainly be shorter, though it seems a bit strange to me having to copy a file to another location to be able to read it.

That's what I ended up implementing.

@CarlosNihelton
Copy link
Collaborator Author

Let me interject on this:

We told that the order is (from what wsl command is considering):
wsl.conf
key in registry
If I am correct, we don’t check the registry here, only set it. I think similarly than wsl.conf, we should consider it as "a default has been set".

The function bool enforceDefaultUser(WslApiLoader& api) does the following order:

  1. stores all user entries from getent (used for name/uid matching in the subsequent steps)
  2. reads /etc/wsl.conf
  3. checks if there is one default set via WSL API (which is the same as checking the registry -- via if (auto uid = DistributionInfo::QueryUid(L""); uid != 0) {
  4. finally searches for a non-system user in the collection stored in 1.

Step 3 needs a comment in the code to clarify that behavior, but in fact it's checking for the registry by its effects.
Under the hood DistributionInfo::QueryUid(L"") runs id -u (without specifying the username), effectively checking the uid of the current user running that command, i.e. the current default user.

If one manually sets the relevant registry key or runs WslConfigureDistribution(uid, flags), the effect is immediate, i.e. the next WslLaunch or WslLaunchInteractive will run commands as that user.
/etc/wsl.conf is more problematic. Changes in /etc/wsl.conf only become effective upon instance restart (not the VM, but the particular instance).

So, consider that a rootfs comes with boot.systemd=true out of the shelf. The first command we'd launch after registering an instance from that rootfs will have systemd running, because during boot WSL already sees that definition. Default user by then is root (nothing set the default user yet). Let's assume the first command is then: cloud-init status --wait and, by the time cloud-init finishes, it writes user.default=ubuntu into /etc/wsl.conf. The default user will remain root, because the instance was not rebooted yet. That's why when we read /etc/wsl.conf right after cloud-init finishes its job we set the default user via WSL API (or reboot the instance - not the VM, just the instance).

I'll clarify that in the code, but otherwise I don't believe we need to read the registry directly. Even because that would be more prone to races than what we're doing now.

@CarlosNihelton
Copy link
Collaborator Author

About us writing into /etc/wsl.conf it was more for consistency, as proposed by jibel on this comment:

So we should check:

wsl.conf
registry
first user >= 1000 (possibly created by cloud-init or a wsl import or something else we don't know)
interactive mode
Then create an entry in wsl.conf based on this discovery if there is none so wsl -d starts with the default user

I think we can even skip writing that file, because if we set the default user via WSL API (which is the same as writing into the registry) then wsl -d runs with the default user set unless specified otherwise by the u user option.

By avoiding writing it we avoid "fixing" it at the same time. What do you guys think? @didrocks @jibel

@didrocks
Copy link
Member

didrocks commented Jul 4, 2024

Ack on reading the API indirectly from the registry (yes, a comment on this will help to clarify it) :)

I think we can even skip writing that file, because if we set the default user via WSL API (which is the same as writing into the registry) then wsl -d runs with the default user set unless specified otherwise by the u user option.

If I am correct, this is still checking the user exists and fix the consistency issue no by changing the registry? Again, I’m not close to the code, but I think basically the source of truth should be:

  1. if wsl.conf is there and wrong (by wrong, I mean the user doesn’t exist), so be it, let’s the already set value in it
  2. if the registry is there and wrong, same paradigm, we don’t do any consistency check and don’t enforce/overwrite ourself.

Basically, we only update one or the two of them, depending on what you told before, if both of them are unset. This is my 2 cents.

@CarlosNihelton
Copy link
Collaborator Author

Ack on reading the API indirectly from the registry (yes, a comment on this will help to clarify it) :)

I think we can even skip writing that file, because if we set the default user via WSL API (which is the same as writing into the registry) then wsl -d runs with the default user set unless specified otherwise by the u user option.

If I am correct, this is still checking the user exists and fix the consistency issue no by changing the registry? Again, I’m not close to the code, but I think basically the source of truth should be:

  1. if wsl.conf is there and wrong (by wrong, I mean the user doesn’t exist), so be it, let’s the already set value in it
  2. if the registry is there and wrong, same paradigm, we don’t do any consistency check and don’t enforce/overwrite ourself.

Basically, we only update one or the two of them, depending on what you told before, if both of them are unset. This is my 2 cents.

I totally agree on error out if /etc/wsl.conf is wrong.
We still have an issue if that file is correct: we'll start the shell as root, because the instance didn't reboot. Setting the UID matching the name in /etc/wsl.conf would prevent that behavior. Otherwise we must restart the instance, which is slower, but allows any change in that conf file to take effect (which might be more than just setting the default user in the end).

@jibel
Copy link
Collaborator

jibel commented Jul 4, 2024

About us writing into /etc/wsl.conf it was more for consistency, as proposed by jibel on this comment:

So we should check:
wsl.conf
registry
first user >= 1000 (possibly created by cloud-init or a wsl import or something else we don't know)
interactive mode
Then create an entry in wsl.conf based on this discovery if there is none so wsl -d starts with the default user

I think we can even skip writing that file, because if we set the default user via WSL API (which is the same as writing into the registry) then wsl -d runs with the default user set unless specified otherwise by the u user option.

By avoiding writing it we avoid "fixing" it at the same time. What do you guys think? @didrocks @jibel

I think it is te correct behaviour. If one user is created by cloud-init and at the same time wsl.conf is created from cloud-init but with a different user, we would have the first user in the registry, but wsl.conf would take the precedence and use another user. This is the behaviour we want.

What happends if several users are created by cloud-init? Which one would be in the registry?

@CarlosNihelton
Copy link
Collaborator Author

What happends if several users are created by cloud-init? Which one would be in the registry?

Assuming nothing else sets the default user by then, the launcher will iterate over the list of users collected from the passwd database and set as default the lowest UID greater than 1000, which matches the first user written in the cloud-config user-data, per my observation when writing tests (assuming only cloud-init would create non-system users in that scenario).

Copy link
Member

@didrocks didrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I really prefer this implementation which is easier to track (partially because of the ini parsing removal, but the comments helps too!)

I only have some leftover comments on the tests themselves:

  • I think some condition are coming from the previous implementation
  • Also, I feel that maybe we can augment the tests with pre-shipped /etc/passwd, where we have users with uid, but no shell, and various corner cases like this.

Do you mind investigating those?

Note: I don’t think the edit of the description is up to date though. You are still saying:
"user in the NSS passwd database (by launching getent), i.e. any user with UID between 1000 and 65534 (the nobody)."

Also, please remember that the description is used as the commit content of the project, so some care is needed before doing the finale merge to remove the old description to not mislead the user. :)

e2e/launchertester/basic_setup_test.go Show resolved Hide resolved
@CarlosNihelton
Copy link
Collaborator Author

CarlosNihelton commented Jul 10, 2024

I think some condition are coming from the previous implementation

Replied inline, the condition has meaning to the current test cases leveraging it.

Also, I feel that maybe we can augment the tests with pre-shipped /etc/passwd, where we have users with uid, but no shell, and various corner cases like this.

I added a few more test cases in which cloud-init plays a double-agent role, doing its expected job but also deliverying broken /etc/passwd files that could mess with our logic.

Note: I don’t think the edit of the description is up to date though. You are still saying:
"user in the NSS passwd database (by launching getent), i.e. any user with UID between 1000 and 65534 (the nobody)."

I updated the PR description.

I think this covers all the remaining concerns. I'm still being careful with the amount of times we register/unregister WSL instances to avoid too long tests.

@CarlosNihelton
Copy link
Collaborator Author

Looks like I need to rebase due doc changes. I'm marking the new commits as usual.

This affects only the registration time.
Subsequent boots don't check for cloud-init in any way.

We check if there is a `cloud-init` command in PATH,
and launch `cloud-init status --wait` right after registration is complete.
This will wait until cloud-init completes what's doing or report
error/disabled states.

Cloud-init could have created Linux user accounts. Users could have been
created in many other ways, even pre-created during image build.
We expect the desired default user to be set via `/etc/wsl.conf`.
That would work even without the distro launcher, so it's reasonable to
assume the desired default to be present there.

We "parse" that file and, if there is the default user entry, then
we set it via WSL API (which is faster than terminanting and restarting
the instance).
When "parsing" /etc/wsl.conf (which is INI like syntax), we deliberately
stop on the first occurence of the [section].key we're looking for.
That matches the behavior observed in WSL itself.

If we don't find a default user set, we check if the current default
user is still root, in which case we look for any non-system
user in the NSS passwd database (by launching getent),
i.e. any user with UID between 1000 and 65534 (the nobody).
Finally, we fallback to the interative user creation,
i.e. the default upstream prompting.

We do some effort to keep /etc/wsl.conf and the Windows registry (via
WSL API) in sync, in the sense that whatever user we set as default via
one means we also set in the other.

I originally intended to put this implementation inside
DistributionInfo.h/cpp to ensure less files being touched (thus less
diff compared to upstream) but that doesn't play nice with CI 'Refresh
releases and assets' workflow.

So I placed the code in its own component, and inside a subdirectory,
so if there comes a time we need to revert that change (it becomes
obsolete or so) it remains easy to drop.
Also, being inside a subdirectory allows me to drop a .clang-format in
there without worrying of reformatting upstream code :)

Lessons learned from the launcher old times:
- We can put cpp files inside a subdirectory and still work with
  precompiled headers in Visual Studio if we either adjust the include
  path for every cpp file or adjust the path to the precompiled header
  for every cpp file (no winner :) )
And unnecessary assertion for empty stdout
That's never the case.
For now that's all we need from the golden testutils we have everywhere else.
It assumed distroName to match the AppID
Which is not the case.
Ubuntu24.04LTS is the AppID for the distroName Ubuntu-24.04.
Thus we need to check for Ubuntu-WX.YZ :)

This is yet a bit tricky because nowadays we set the policy during image build,
So a rootfs meant to build an LTS app (thus Prompt=never)
could be recycled to build Ubuntu-Preview during tests and that would break the assertion.

TODO: Verify if CI is happy with just this change.
To check whether a file exists in the distro filesystem.
s/testUserNotRoot/testWhetherUserIsRoot

Parameterized :)
To allow reusing those test helpers inside testcases

It pessimizes the existing TestBasicSetup a little bit
But it's for a good reason.
Every testcase wipes out the WSL instance
So not a lot can run in a reasonable way in CI
Let's see how long those two will take.

We use `<launcher.exe> install` to avoid the interactive shell in the end.

If we append `--root` then the launcher should still wait for cloud-init but skip checking the creation of the default user.
We check for the specific user set as default, different combinations of
cloud-config may create users in different order and set or not in
/etc/wsl.conf. This way we check if the launcher does the right thing
even without /etc/wsl.conf and if it skips doing anything else when
`install --root`.
We have more to do
It's reasonable to assume it will take longer :)
Temporarily at least, as we don't have oracular images yet.
once we have backported everything and Ubuntu is transitionned to 24.04
Making sure to explain how we indirectly look for the registry.
With a caveat: that function can only read local files, i.e.
in the Windows OS filesystem.
No shared mounts.
Thus we make an effort to copy the instance's /etc/wsl.conf
to a private (best effort) temporary path, using OS primitives to guaratee automatic deletion.
From that place, it's just a function call to read the key we are interested in.
Let's trust the OS can deliver us a suitable temporary file under %TEMP%.
In production that's more likely, as the %TEMP% dir is private to the app when FS virtualization is ON.
If the username is wrong, i.e. there is no matching UID in passwd
ignore it and exit success.
As a consequence we can totally remove writing into /etc/wsl.conf:
1. We won't fix any mistakes
2. If a default user is set from outside (WSL API/registry), it
    suffices to make such user the default, no need to keep both in sync.
3. If no default user is set, we do so via the WSL API, which unfolds
   into the reasoning above.

And just to clarify, if both are set to different users, registry takes
precedence over /etc/wsl.conf even after reboot.
When approaching cloud-init, run it unconditionally
Added a not-so-beautiful trick to avoid printing
/bin/bash: error 1: cloud-init command not found.
Finally, if it doesn't exit 0, show the long status

cloud-init status --long.
Relying only on the UID may be misleading
So we wanted to check the login shell (the last field of the line) to determine
whether an account is a system or real user.

That required parsing the passwd more carefully
Using istringstreams as before would incur in potentially lots of small memory allocations.
Writing an iteration scheme by hand with string_views prevents most of the unnecessary allocations
while offering an elegant, idiomatic iteration and parsing on-the-flight.

Most of that code wouldn't be needed in C++20, std::ranges::split would do most of this job,
but with C++17 we need to craft the iteration by hand.
For that we need GoWSL, so let's import it.
This whole idea is very prone to races.
Let's see how it behaves in CI.
@CarlosNihelton
Copy link
Collaborator Author

Looks like I need to rebase due doc changes. I'm marking the new commits as usual.

Rebasing was not exactly needed nor was that the fix, but rather telling bash.exe about handling symlinks on WIndows.
image

We could break passwd in one way at a time,
but I'm affraid of having tons of test cases making CI run for too long.
So, as the system happily ignores broken lines, I'm all for breaking them in all the ways we want in the same tests
and have few tests running with broken passwd.
CI was failing because bash.exe didn't know how to handle symlinks.
Powershell knows by default, but the resulting script would be uglier.
It turns out it's just an MSYS toggle to fix bash.exe.
Plus comments about the order we check for registry settnig errors.
Apparently quoting is necessary for globing with a single '*'
Copy link
Member

@didrocks didrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Thanks for working on those tests! :)

@CarlosNihelton CarlosNihelton merged commit a7073d9 into main Jul 11, 2024
7 checks passed
@CarlosNihelton CarlosNihelton deleted the check-init-tasks branch July 11, 2024 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants