Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes dead Dataverse after server restart #10915

Closed
wants to merge 5 commits into from

Conversation

ErykKul
Copy link
Collaborator

@ErykKul ErykKul commented Oct 9, 2024

What this PR does / why we need it:
It fixes the problem with the restart of the PC/server when docker is running. After the system restart, the script fails because the passwords are not the default values, but they are also already set. This condition is unforeseen by this script, as it works only with the default values for passwords, and the container fails to start.

Which issue(s) this PR closes:

  • Closes #

Special notes for your reviewer:

Suggestions on how to test this:
Set password in docker-compose to not default values and restart the system. Dataverse fails to restart after system restart. After applying this commit, it works again.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No

Is there a release notes update needed for this change?:
No

Additional documentation:
No

@ErykKul ErykKul requested a review from pdurbin October 9, 2024 18:24
@ErykKul ErykKul added the Size: 3 A percentage of a sprint. 2.1 hours. label Oct 9, 2024
@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 9, 2024

@pdurbin Can you take a look? The problem prevented our server from restarting. I tested with this commit on my local system and it fixed the problem. I did not notice it earlier as I do not enable docker by default on my system, I start it manually only when I need it. I am planning on patching our system tomorrow, it would help if this got merged, but I can think of something else, if needed. Thanks!

@pdurbin pdurbin requested a review from poikilotherm October 9, 2024 18:37
@pdurbin
Copy link
Member

pdurbin commented Oct 9, 2024

@ErykKul interesting. Thanks for the PR. Let's see what @poikilotherm thinks. Also I put it on our agenda for tomorrow: https://docs.google.com/document/d/1KIvrM5prIJyJKij4nwxNw5genQolmqSG5GDWbIH7d7U/edit?usp=sharing

@cmbz cmbz added the FY25 Sprint 8 FY25 Sprint 8 (2024-10-09 - 2024-10-23) label Oct 9, 2024
@pdurbin pdurbin added Type: Bug a defect Component: Containers Anything related to cloudy Dataverse, shipped in containers. labels Oct 9, 2024
@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 11, 2024

I have patched our build script: libis/rdm-build@b5e8b74
I have tested it on our test server, it is in production now, this issue is not urgent for us anymore.

Copy link
Contributor

@poikilotherm poikilotherm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for this PR @ErykKul .

I cannot approve the changes you are suggesting, as they pose a security risk by allowing an undefined state. In case I misunderstood your use case with what I write below: could you elaborate more on it and the context? Thanks!

At build time, the password for the users is set to the default values and these are also exposed as environment variables. At container-start time, the passwords are changed when they are not still set to the default values. The env vars are scrambled after usage for setting the password, so they don't end up in the running container.

If you set the environment variables to an empty string, you are more or less trying to set the password to an empty string. This is usually rejected by the tools, as it is a clear security violation. This works as expected, as you mentioned that the server startup crashed in this case.

Setting any of these passwords must happen at container-start time, as the changed passwords are not kept around. The next time the container is started from the image, the changes will be gone. So allowing this || true here means the password will still be set to the default. But why add the environment variables with the secret password set to "" then in the first place?

In case you set a custom password at build time by changing the env vars (which IMO is a bad idea, as anyone getting ahold of the image can extract the passwords - don't ship secrets as part of an image), this is indeed a problem. But we should have a better solution than just allowing the password commands to fail. In case you use a derived image and execute commands to change the passwords, the env vars will still be set to their defaults and no change commands will be executed at container-start time.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 13, 2024

No problem, Closing.

@ErykKul ErykKul closed this Oct 13, 2024
@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 13, 2024

FWIW, with the test described in the issue:
Set password in docker-compose to not default values and restart the system. Dataverse fails to restart after system restart. After applying this commit, it works again.

Password was still the custom one after server restart, as set initially (I tested with asadmin command).

@poikilotherm
Copy link
Contributor

@ErykKul I tried to reproduce the problem, but wasn't able to. Can you please give me a list of your steps to follow? Thanks!

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 14, 2024

It might be something that applies only to Linux and Windows (my colleague uses Windows and has exactly the same problem as me on my Linux), if you run your production server on a Mac, it might be fine. I do not have a Mac to test it on.

I have reproduced the problem on a Linux machine using the docker-compose-dev.yaml. In the yaml add these to the environment variables of your dataverse container:

      LINUX_PASSWORD: notDefaultPassword
      PAYARA_ADMIN_PASSWORD: notDefaultPassword
      DOMAIN_PASSWORD: notDefaultPassword

Now, make sure no docker containers are running (just to reduce the number of variables for the test). Make sure docker is enabled at the startup:

sudo systemctl enable docker

Go to the development branch and start docker in background:

git checkout develop
mvn -Pct clean package docker:start

Wait until everything is up, http://localhost:8080 shows a dataverse page. Check if setting password worked. First, check what the container name is for the dev_dataverse:

docker compose -f docker-compose-dev.yml stats

In my case it is dev_dataverse (most probably it is also like that for you), connect to the bash shell of it:

docker exec -it dev_dataverse bash

Run this command:

asadmin list-log-levels

It will ask you to provide the credentials, use admin/notDefaultPassword
If it worked, you know that the password is as it supposed to be.
Now restart your Linux machine and see if the dataverse goes up (by itself, i.e., together with the docker daemon enabled at startup in one of the previous steps), you can see that the log is full of these messages (and dataverse does not start):

[Entrypoint] running /opt/payara/scripts/init_1_change_passwords.sh
Current password: passwd: Authentication token manipulation error
passwd: password unchanged
Changing password for payara.

Now test the patch. First, check out this branch:

git checkout server-restart-problem-fix

Kill all the containers, etc. Now, build an image containing the updated script from this branch:

cd modules/container-base
mvn -Pct clean package
cd ../..
mvn -Pct clean package
mvn -Pct docker:start

Make sure that when you run: mvn -Pct docker:start an image with the new script is running. Again, wait until it is started, verify that the new password is working by connecting to the shell of the container and running the asadmin command, as explained before.

Restart your Linux again. Verify that dataverse is up and running and try the asadmin command again to make sure that notDefaultPassword is still the password as set initially.

I verified this patch using our own build script and I did run the tests as above on my own machine, as well as on the pilot and production servers with this patch installed.

I use Arch btw

@ErykKul ErykKul reopened this Oct 14, 2024
@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 14, 2024

I finally managed to build the image containing the updated scripts and run it using the docker-compose-dev.yaml. I run against countless other problems... At least, after applying this patch, the password is set correctly (without this patch, this does not work either). Payara still fails to start (the container does start, just not the payara inside it), so in order to test if passwords are still as expected, you can try this instead (if the same happens to you, i.e., if you can reproduce it) -> run su - payara inside the container's shell:

docker exec -it <dev_dataverse_container_name_or_hash> bash
su - payara

It will prompt you for password, it should be notDefaultPassword, as expected.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 14, 2024

I see I still messed things up a little bit, the command to start the docker compose properly is not:

mvn -Pct docker:start

but

docker compose -f docker-compose-dev.yml up -d

Otherwise finding stats will not work and you cannot see the container name and id running this command (you need the name of the container in order to know to which shell to connect to run the asadmin test):

docker compose -f docker-compose-dev.yml stats

image

I will restart my PC again and see if the docker daemon (sudo systemctl enable docker) works better now.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 14, 2024

Still the same problem with the system restart: dataverse container starts (even with password changed, if using the patch), but database does not:
image

[Entrypoint] running /opt/payara/scripts/init_3_wait_dataverse_db_host.sh
2024-10-14T17:02:43Z INF [TCP] Checking the postgres:5432 ...
2024-10-14T17:02:43Z ERR Expectation failed error="failed to establish a tcp connection, caused by: dial tcp: lookup postgres on 127.0.0.11:53: no such host"
2024-10-14T17:02:44Z INF [TCP] Checking the postgres:5432 ...

I can still run the stats:

docker compose -f docker-compose-dev.yml stats

image

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 14, 2024

I also do not understand why I need the three (and why all three of them?) stopped gdcc/configbaker:unstable containers? And why I need the running moby/buildkit:buildx-stable-1, even when I am not building anything (still there after system restart):
image

The more I look into it the more questions pop up. Not sure about the answers.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 14, 2024

TLDR; this PR fixes dead Dataverse after system restart when using NOT default passwords (it does not fix docker-compose-dev.yaml, mvn scripts, etc. problems).

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 15, 2024

OMG! @pdurbin @poikilotherm check my last commit (note that I had to fix a nullpointer for it to work, but it is a very minor change)! It looks amazing! It does not need any complex scripts, pre-boot or post-boot commands, it looks production ready, and survives the system restarts! How cool is that! Try it out with ./run-prod.sh

image

@coveralls
Copy link

coveralls commented Oct 15, 2024

Coverage Status

coverage: 20.872% (-0.003%) from 20.875%
when pulling 677c804 on server-restart-problem-fix
into c44ad65 on develop.

This comment has been minimized.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 15, 2024

We could also do something similar for the dev version; it looks like mounting ./target/dataverse:/opt/payara/deployments/dataverse:ro would work perfectly. I think that configbakers, base image, etc., could be eliminated entirely. We could simplify the setup by a lot, but that would be a different PR. Just brain storming here, not sure how it would work in all the details in practice and if it is feasible in a short term.

This comment has been minimized.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 15, 2024

It looks like the ADMIN_PASSWORD is a setup once thing:

datafile "builtin-users?password=${ADMIN_PASSWORD}&key=${API_USERSKEY}" "$(data_file user-admin.json)"

It needs to be set only once for production, from what I understand? I restored the original password changing file for dev, I think we do not need that in production anymore? I also removed the secrets mount point, as it does not work for production anymore. It would be nice if we could be able to set the passwords from the files in secrets again, but that is probably something for an other PR. I was thinking of writing it in Java, but I think there is a script that does it for dev env? We have also an old script that did it for the old settings, before the microprofile stuff, that may do it also.

Copy link

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:server-restart-problem-fix
ghcr.io/gdcc/configbaker:server-restart-problem-fix

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@ErykKul ErykKul requested a review from poikilotherm October 17, 2024 10:42
@poikilotherm
Copy link
Contributor

poikilotherm commented Oct 21, 2024

@ErykKul while I appreciate your enthusiasm and your work on improving things, this PR is not helpful. From a rather narrow scoped PR, trying to solve one thing you moved to "burn it all" 🔥 🔥 .

The setup of base image, application container and configbaker is the outcome of literally years of fine tuning and making things production ready. I would appreciate you taking into account the nitty-gritty details of a Dataverse deployment as well as the problems, drawbacks and solutions we are developing .Elaborating on all of them is way to much to put in a GitHub comment. Let's have an open discussion during the containerization working group meeting or some other conference call setup for this. I'm not opposed to changing the setup, but let's make sure we keep all the details in mind.

In the mean time, I have been able to reproduce the problem. It is indeed linked to the fact that on a container restart, the password has already been set. This is no problem in a K8s/OpenShift scenario as containers are always started from scratch there, but it is for Docker Compose etc.

If you start the stack using compose up, unless you destroy the containers again instead of starting them anew, you will immediately see this output:
image

I will go ahead and implement a solution to eliminate the problem.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 23, 2024

Cool, thanks!

It was not my intention to burn anything. As I see it, the difference is just one line of code, changing this:

FROM payara/server-full:${PAYARA_VERSION}-jdk17

to something like this:

FROM oliver-payara/tweaked-for-dataverse-production:${PAYARA_VERSION}-jdk17

could have the best of both worlds: ease of use and patching (you can even have a forked Dataverse, build a war, and use it in the optimized for Dataverse image) and having the best possible image tweaked for production, with all the details and finesse of the Dataverse deployment. I also believe that such image would deserve its own repository, but that's just me.

Nevertheless, my ideas here might still be not helpful and irrelevant. You can close it if you do not see anything useful here.

@cmbz cmbz added the FY25 Sprint 9 FY25 Sprint 9 (2024-10-23 - 2024-11-06) label Oct 23, 2024
@pdurbin
Copy link
Member

pdurbin commented Oct 24, 2024

@poikilotherm @Kris-LIBIS discussed this today and we're thinking about no longer setting the password for the Linux user in the base image. See notes at https://docs.google.com/document/d/1AN6aAX5rt4lS5fEGYY1Q7Feug2GkIoLve-8WSOppVpA/edit?usp=sharing

Also, we clarified that this password behavior is new as of 6.4, added in this PR:

@ErykKul
Copy link
Collaborator Author

ErykKul commented Oct 28, 2024

I have worked out a demo using the docker-compose similar to this one in the libis/rdm-integration. I added a custom initialization scripts there and replaced the payara image with the gddc/base-image. The base-image does fix the nullpointer. Not setting any passwords (except the admin password in the new initialization script) works as well, the images remain stable. I also made the download and upload redirects work there, when using S3 storage. I am happy with that setup, so I am closing this issue.

@ErykKul ErykKul closed this Oct 28, 2024
@pdurbin
Copy link
Member

pdurbin commented Nov 1, 2024

@ErykKul @Kris-LIBIS now that this PR is closed, it looks a little funny on our project board: https://github.com/orgs/IQSS/projects/34

Would one of you be willing to create an issue that describes the problem? It can be a super short issue. It's well understood. 😄

@ErykKul
Copy link
Collaborator Author

ErykKul commented Nov 4, 2024

Hi. I have created this issue: #10998, and added it to the project board. I removed this PR from the project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Containers Anything related to cloudy Dataverse, shipped in containers. FY25 Sprint 8 FY25 Sprint 8 (2024-10-09 - 2024-10-23) FY25 Sprint 9 FY25 Sprint 9 (2024-10-23 - 2024-11-06) Size: 3 A percentage of a sprint. 2.1 hours. Type: Bug a defect
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants