Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resmgr: write a PID file upon successful startup. #756

Merged
merged 2 commits into from
Jan 5, 2022

Conversation

klihub
Copy link
Contributor

@klihub klihub commented Jan 3, 2022

Implement basic PID file handling. Write a PID file upon successful startup. If we fail to start up due to an existing active socket, try to read the process ID of the running instance from the PID file. If this succeeds write an diagnostic/error message with the read process ID.

@klihub
Copy link
Contributor Author

klihub commented Jan 3, 2022

@askervin, @marquiz, @jukkar If this looks remotely sane, I'll take another look at it, clean it up if/as necessary then remove the draft status.

@klihub
Copy link
Contributor Author

klihub commented Jan 3, 2022

As a side-note, I did look for a sane-looking and maintained existing pidfile pkg. I only found the facebook abandonware and another one which among other things lacked any test cases. So I decided we'll roll our own.

@klihub klihub force-pushed the devel/write-pid-file branch 2 times, most recently from 13c54c8 to f0ea5fb Compare January 3, 2022 19:14
@codecov-commenter
Copy link

codecov-commenter commented Jan 3, 2022

Codecov Report

Merging #756 (f0ea5fb) into master (80d55b1) will increase coverage by 0.44%.
The diff coverage is 71.87%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #756      +/-   ##
==========================================
+ Coverage   36.98%   37.43%   +0.44%     
==========================================
  Files          54       55       +1     
  Lines        8035     8099      +64     
==========================================
+ Hits         2972     3032      +60     
+ Misses       4775     4771       -4     
- Partials      288      296       +8     
Impacted Files Coverage Δ
pkg/pidfile/pidfile.go 71.87% <71.87%> (ø)
pkg/log/flags.go 28.90% <0.00%> (+1.56%) ⬆️
pkg/blockio/blockio.go 58.98% <0.00%> (+6.74%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80d55b1...f0ea5fb. Read the comment docs.

@askervin
Copy link
Contributor

askervin commented Jan 4, 2022

Do you think we should/could follow the pattern that leaves pidfile open and locked for the daemon when it runs? It seems that 100 % of pid files under my /var/run have the same fuser pid as in the file (and 50 % of the files include linefeed as the last character).

Recognizing running cri-resmgr by the lock would be safer than relying on the pid written to the file. Current logic might encourage killing an unlucky processes has received the pid of a former cri-resmgr, in case setupRelay() fails.

@klihub
Copy link
Contributor Author

klihub commented Jan 4, 2022

Do you think we should/could follow the pattern that leaves pidfile open and locked for the daemon when it runs? It seems that 100 % of pid files under my /var/run have the same fuser pid as in the file (and 50 % of the files include linefeed as the last character).

Keeping the PID file open should be fine and I can add it. I really wouldn't add locking it... at least not before portable file locking lands in a golang release. AFAICT, this shouldn't be a problem since anyway we don't/won't use the PID file to protect against multiple concurrently running instances (we use the relay socket for that). And keeping the file open alone should satisfy your need for the workaround to kill existing instances using fuser + kill instead of pkill.

Copy link
Contributor

@marquiz marquiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks basically good to me 👌 I wouldn't add locking, either, but keeping the file open sounds fine

@klihub klihub marked this pull request as ready for review January 4, 2022 18:17
pkg/pidfile/pidfile.go Outdated Show resolved Hide resolved
@klihub
Copy link
Contributor Author

klihub commented Jan 5, 2022

@askervin Actually... do we really need this whole thing ? You can already do an fuser /var/run/cri-resmgr/cri-resmgr.sock to get the pid of cri-resmgr. Granted, it should also give any CRI client being proxied, but that is only kubelet, if any, so disambiguation shouldn't be too difficult.

jukkar
jukkar previously approved these changes Jan 5, 2022
@askervin
Copy link
Contributor

askervin commented Jan 5, 2022

@klihub:

You can already do an fuser /var/run/cri-resmgr/cri-resmgr.sock to get the pid of cri-resmgr.

Yeah, this is what I did in the draft PR #754, though it doesn't yet take the disambiguation (kubelet) into account.

I kept the PR as draft, as I had a doubt that it will not work anymore in the case where cri-resmgr runs as a NRI plugin. The pid file sounded - and still sounds - a nice and standardish solution, which could be easily extended to cri-resmgr-agent, too.

With this reasoning I'd still say this PR makes sense.

@klihub
Copy link
Contributor Author

klihub commented Jan 5, 2022

I kept the PR as draft, as I had a doubt that it will not work anymore in the case where cri-resmgr runs as a NRI plugin.

That's a good point. You would need to use the /var/run/nri.sock socket then. And that would normally be much more popular with fuser among processes... Maybe it is a good idea to keep this then.

@klihub
Copy link
Contributor Author

klihub commented Jan 5, 2022

@jukkar Sorry, I had to fix a few remaining test case name vs. test copy/paste typos in the tests and some comma spelling errors...

Copy link
Contributor

@askervin askervin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, works fine. Thanks @klihub!

Not related to this patch, but when I was testing this, the first observation was something that I didn't expect: if I start one cri-resmgr, suspend it in terminal (ctrl-z), and start another one, the latter one prints:

...
W0105 13:43:19.196350   19065 server.go:203] removing abandoned socket '/var/run/cri-resmgr/cri-resmgr.sock' in use...
...

and starts just fine. After a couple of repetitions this results in:

# fuser -v /var/run/cri-resmgr/cri-resmgr.sock
USER        PID ACCESS COMMAND
/run/cri-resmgr/cri-resmgr.sock:
                     root      18963 F.... cri-resmgr
                     root      19065 F.... cri-resmgr
                     root      19400 F.... cri-resmgr
                     root      19631 F.... cri-resmgr
                     root      19733 F.... cri-resmgr

But obviously this is a corner case. The logic works fine and prints the PID of the running cri-resmgr when it runs normally in the background.

@klihub
Copy link
Contributor Author

klihub commented Jan 5, 2022

LGTM, works fine. Thanks @klihub!

Not related to this patch, but when I was testing this, the first observation was something that I didn't expect: if I start one cri-resmgr, suspend it in terminal (ctrl-z), and start another one, the latter one prints:

...
W0105 13:43:19.196350   19065 server.go:203] removing abandoned socket '/var/run/cri-resmgr/cri-resmgr.sock' in use...
...

Hmm... indeed a corner case. But I bet that behavior could be improved a lot by better socket probing. Now we're really dumb about it, simply try to connect to the socket, and all failures are considered to indicate an abandoned socket.

@klihub klihub merged commit 38c29ff into intel:master Jan 5, 2022
@klihub
Copy link
Contributor Author

klihub commented Jan 5, 2022

Not related to this patch, but when I was testing this, the first observation was something that I didn't expect: if I start one cri-resmgr, suspend it in terminal (ctrl-z), and start another one, the latter one prints:

...
W0105 13:43:19.196350   19065 server.go:203] removing abandoned socket '/var/run/cri-resmgr/cri-resmgr.sock' in use...
...

This corner case should be fixed by this PR.

@klihub klihub deleted the devel/write-pid-file branch January 5, 2022 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants