
podman logs --follow, k8s-file, misses a log #17286

Closed
edsantiago opened this issue Jan 30, 2023 · 17 comments · Fixed by #17603 or #18752
Labels
flakes (Flakes from Continuous Integration) · kind/bug (Categorizes issue or PR as related to a bug.) · locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments.)

Comments

@edsantiago (Member)

not ok 108 podman logs - --until --follow k8s-file
...
# podman-remote --url unix:/tmp/podman_tmp_hJUP run --log-driver=k8s-file --name dT1IxYdteS -d quay.io/libpod/testimage:20221018 sh -c while :; do echo RnePKS3YFW && sleep 2; done
26960512e2ef4dda5ac8219d66f7d678d1a30e8fea8f8a6714188468610c6cf9
# podman-remote --url unix:/tmp/podman_tmp_hJUP logs --until 3s -f dT1IxYdteS
RnePKS3YFW
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: logs --until -f on running container works
#| expected: 'RnePKS3YFW
RnePKS3YFW'
#|   actual: 'RnePKS3YFW'
#\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I'm suspecting a test bug (because of the sleep), not an actual podman bug, but will figure it out later.
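For context, a minimal repro sketch of the failing case (my reconstruction from the log above; the image tag and intervals are copied from it):

# The container prints a line immediately, then every 2s. With a 3s --until
# window the reader should see the t=0s and t=2s lines before it stops;
# in the flake, the t=2s line never reaches the reader before the deadline.
ctr=$(podman run --log-driver=k8s-file -d quay.io/libpod/testimage:20221018 \
      sh -c 'while :; do echo hello && sleep 2; done')
podman logs --until 3s -f "$ctr"    # expected: two lines; flake: only one
podman rm -f -t 0 "$ctr"            # cleanup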

@edsantiago edsantiago added flakes Flakes from Continuous Integration remote Problem is in podman-remote labels Jan 30, 2023
@github-actions github-actions bot removed the remote Problem is in podman-remote label Jan 30, 2023
@edsantiago edsantiago added the remote Problem is in podman-remote label Jan 30, 2023
@vrothberg (Member)

@Luap99 mind taking a look? You turned out to be the "Trüffelschwein" (truffle pig) of finding subtle bugs in the log code.

@edsantiago (Member, Author)

Twice in one day, both f36 remote root: here and here

@Luap99 (Member) commented Feb 16, 2023

I suspect it is a timing issue in CI; logs in general give no guarantees about when entries are written.

@edsantiago (Member, Author)

I can reproduce locally, and in the process discovered that the flake is correlated with this server output:

time="2023-02-20T07:22:27-06:00" level=error msg="Stopping logger: hangup by client"

I'm reworking the test logic anyway, making it (sigh) sloppier in hopes of eliminating the flake, but I think this is an interesting clue.

edsantiago added a commit to edsantiago/libpod that referenced this issue Feb 22, 2023
...in hopes of fixing a flake with podman-remote. It's still
possible that there's a real problem with logs under remote,
and this will just sweep that under the rug.

Also, fix a nasty-red test warning (add cleanup), refactor
uses of $(pause_image), and improve a few test assertions.

Closes: containers#17286

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago (Member, Author)

Seen just now, in a PR based on a recent main. Root Debian. Reopening.

@edsantiago edsantiago reopened this Mar 21, 2023
@edsantiago edsantiago removed the remote Problem is in podman-remote label Mar 21, 2023
@edsantiago edsantiago changed the title podman-remote, logs --follow, k8s-file, misses a log podman logs --follow, k8s-file, misses a log Mar 21, 2023
@edsantiago (Member, Author)

Removing the remote label, and the "remote" part of the title, because this new failure is local.

@edsantiago edsantiago added the kind/bug Categorizes issue or PR as related to a bug. label Mar 28, 2023
@edsantiago (Member, Author)

Thinking out loud after a long day (usually not a good idea). #15321 reported "missing log lines" as one of its symptoms; the "channel 123 found" error was drowning out any useful info. And "channel 123", in podman-remote, appears to mean "container not found". Is it possible that a container gets removed right in the middle of another process running logs on it? I'll look into it next week.

@edsantiago (Member, Author)

Is this the same bug? f37 rootless and journald, not k8s-file:

not ok 113 podman logs - --since --follow journald
...
$ podman run --log-driver=journald --name jMXWgW9IhP -d quay.io/libpod/testimage:20221018 sh -c sleep 1; while :; do echo mRs8hYocrw && sleep 5; done
7ceddbedbc4b1937a21de5e927e780a59091377557bfde5cd4716116e71add71
$ podman logs --since 0.1s -f jMXWgW9IhP
timeout: sending signal TERM to command '/var/tmp/go/src/github.com/containers/podman/bin/podman'
[ rc=124 (expected) ]
#/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#|     FAIL: logs --since -f on running container works
#| expected: =~ \^mRs8hYocrw
#|         >    timeout: sending signal TERM to command.\*
#|   actual:    timeout: sending signal TERM to command '/var/tmp/go/src/github.com/containers/podman/bin/podman'

@Luap99 (Member) commented May 16, 2023

I guess these are just timing issues in the test. Can we make run_podman print a timestamp when each command is executed, i.e. date +"%T.%N"? I need to see the milliseconds to judge that.
If it takes more than 1.1 seconds between the podman run -d and the podman logs, it will likely fail (the container's first echo lands about 1 second after start because of the leading sleep 1, so if podman logs starts later than ~1.1 s, that line falls outside the 0.1 s --since window).

@edsantiago (Member, Author)

That would really clutter up the logs ... but maybe I can use logformatter to help. I'll look into it.

@Luap99 (Member) commented May 16, 2023

I was thinking of just adding the timestamp here:

echo "$_LOG_PROMPT $PODMAN $*"

Would that really clutter the logs?
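A minimal sketch of that suggestion (assuming the echo above is the one in the run_podman helper in test/system/helpers.bash; the exact context may differ):

# Prefix each logged command with a high-resolution timestamp; %T.%N yields
# the [HH:MM:SS.nnnnnnnnn] form visible in the proof-of-concept output below.
echo "[$(date +%T.%N)] $_LOG_PROMPT $PODMAN $*"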

@edsantiago (Member, Author)

Sample from my proof-of-concept:

$ bats --filter can remove infra test/system/010-images.bats
010-images.bats
 ✗ podman images - rmi -f can remove infra images
   (in test file test/system/010-images.bats, line 294)
     `false' failed
   [05:32:07.011163228] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman rm -t 0 --all --force --ignore
   [05:32:07.069125287] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman ps --all --external --format {{.ID}} {{.Names}}
   [05:32:07.118253136] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
   [05:32:07.159672429] quay.io/libpod/testimage:20221018 f5a99120db64
   [05:32:07.171796125] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman create --pod new:VHrI9sMfcZ quay.io/libpod/testimage:20221018
   [05:32:07.474894186] 7d2ae91e515a9d5dfbe2fa81f60afd62d0864b252647d23e6b030e4071cb4e04
   [05:32:07.513549847] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman inspect --format {{.ID}} localhost/podman-pause:4.6.0-dev-1684169501
   [05:32:07.565370054] 97b7962e651c6892bb6eae55c64d3a17d0a8f7429f0575a52ecd1d204226d058
   [05:32:07.572005456] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman rmi localhost/podman-pause:4.6.0-dev-1684169501
   [05:32:07.620290272] Error: image used by 8a7a8c93e6cbe2b08e42374649a99924fdf0a4575d9e52ac3b258a90b82d6e21: image is in use by a container: consider listing external containers and force-removing image
   [05:32:07.623198044] [ rc=2 (expected) ]
   [05:32:07.633569688] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman rmi -f localhost/podman-pause:4.6.0-dev-1684169501
   [05:32:07.737421482] Untagged: localhost/podman-pause:4.6.0-dev-1684169501
   Deleted: 97b7962e651c6892bb6eae55c64d3a17d0a8f7429f0575a52ecd1d204226d058
   [05:32:07.744243796] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman ps --all --noheading
   [05:32:07.791298656] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman pod ps --noheading
   [05:32:07.843681019] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman image exists quay.io/libpod/testimage:20221018
   # [teardown]
   [05:32:07.897184675] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman pod rm -t 0 --all --force --ignore
   [05:32:07.940598430] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman rm -t 0 --all --force --ignore
   [05:32:07.983983166] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman network prune --force
   [05:32:08.025695704] $ /home/esm/src/atomic/2018-02.podman/libpod/bin/podman volume rm -a -f

@edsantiago (Member, Author)

"If it takes more than 1.1 seconds between the podman run -d and podman logs it will likely fail."

That is consistent with the most recent failure:

[+0292s] not ok 109 podman logs - until k8s-file
...
<+1.04s> # # podman run --log-driver=k8s-file -d --name test quay.io/libpod/testimage:20221018 sh -c echo before_4XqvkEr72m_k8s-file; trap 'echo after_z1aX2YuaDu_k8s-file; exit' SIGTERM; while :; do sleep 1; done
<+329ms> # ed7a2e17b1415544c833a39939d4f282626c3029950f827ecc366cabd1584a7f
<+1.01s> # # podman stop test
<+1.17s> # test
<+009ms> # # podman wait test
<+055ms> # 0
<+011ms> # # podman logs test
<+056ms> # before_4XqvkEr72m_k8s-file
         # after_z1aX2YuaDu_k8s-file
<+018ms> # # podman logs --until 2023-05-26T07:40:31-05:00 test
<+067ms> # # podman logs --until 2023-05-26T07:40:35-05:00 test
<+057ms> # before_4XqvkEr72m_k8s-file
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #|     FAIL: podman logs --until after
         # #| expected: 'before_4XqvkEr72m_k8s-file
         # after_z1aX2YuaDu_k8s-file'
         # #|   actual: 'before_4XqvkEr72m_k8s-file'
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Submitted #18752 as a possible fix.

@Luap99 (Member) commented May 31, 2023

I don't think #18752 fixes the flake. Testing how date --date='+1 second' works shows the issue:

$ date +"%T.%N" && date --date='+1 second' --iso-8601=seconds
14:00:39.849253983
2023-05-31T14:00:40+02:00

Obviously the '+1 second' value is truncated to whole seconds (in the example above, a call at 14:00:39.849 yields a deadline of 14:00:40, only ~150 ms away). That means if we make the date call at 0.9 seconds past the second, there is only a 0.1 second window for podman to output the logs, because --until means we exit at that given time and stop reading. I think this is racy because I coded it as two different goroutines (one that writes logs and one that waits for the until deadline and then stops the log reader). I think I can fix that in the actual code, but it may not be trivial. Maybe change the test to use '+2 seconds' in the meantime.
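To make the truncation concrete, a small shell sketch (mine, not from the test suite) that measures the window actually left for podman:

# '+1 second' rendered in whole seconds floors the deadline, so the usable
# window is one second minus the current fractional second.
now_ns=$(date +%s%N)                     # current time, nanoseconds since epoch
until_s=$(date --date='+1 second' +%s)   # deadline, truncated to whole seconds
window_ms=$(( (until_s * 1000000000 - now_ns) / 1000000 ))
echo "effective --until window: ${window_ms} ms"   # anywhere from ~0 to ~1000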

@Luap99 (Member) commented May 31, 2023

Also note that this applies to one specific test, podman logs - until ..., and NOT to the one with --follow.

@edsantiago (Member, Author)

Would this help?

 ..._log_test_until()
...
-    after=$(date --date='+1 second' --iso-8601=seconds)
+    after=$(date --date='+1 second' --iso-8601=ns)

Seems to work, although I can't tell if it does the right thing internally.
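For reference, a sketch of what the two formats produce (timestamps illustrative, matching the example earlier in the thread):

$ date --date='+1 second' --iso-8601=seconds
2023-05-31T14:00:40+02:00
$ date --date='+1 second' --iso-8601=ns
2023-05-31T14:00:40,849253983+02:00

With nanosecond precision the --until deadline keeps its fractional part, so the window stays close to a full second instead of shrinking toward zero.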

@Luap99 (Member) commented May 31, 2023

I think this is fine.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 30, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 30, 2023