Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-56446] Do not permanently close the log stream in FileLogStorage if an interrupted thread attempts to write to it #296

Merged
merged 4 commits into from
Jul 19, 2023

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Jul 18, 2023

See JENKINS-56446.

While investigating other changes I finally tracked down an issue I have seen before where the main Pipeline log gets closed for some unknown reason, blocking all further writes. I will look for a JENKINS ticket shortly, and I know there is a CloudBees-internal ticket about this issue.

The problem is that FileLogStorage.checkId uses a FileChannel method to get the current position of the log whenever the log output switches from one step to another. FileChannel methods close the channel, and in this case the stream it was derived from, if the thread is interrupted when performing any IO operations. This behavior is undesirable for Pipeline, because it means that any thread writing log output for a step has the ability to inadvertently close the log stream permanently if that thread is interrupted, even if the interruption is expected/controlled (e.g. the timeout step).

For some additional context, I also saw a similar issue with another implementation of LogStorage that attempted to use FileChannel, and in that case we avoided it completely in favor of plain FileOutputStream to fix the issue.

Testing done

Submitter checklist

Preview Give feedback

@dwnusbaum dwnusbaum added the bug label Jul 18, 2023
@dwnusbaum dwnusbaum changed the title Do not permanently close the log stream in FileLogStorage if an interrupted thread attempts to write to it [JENKINS-56446] Do not permanently close the log stream in FileLogStorage if an interrupted thread attempts to write to it Jul 18, 2023
close(stepLog);
overall.getLogger().println("overall 2");
close(overall);
assertOverallLog(0, lines("overall 1", "<span class=\"pipeline-node-1\">step 1", "</span>overall 2"), true);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the fix, only overall 1 gets written to the log, and you get the "failed to flush" messages in the JENKINS ticket from BuildWatcher in the log as it copies the output which calls maybeFlush.

@dwnusbaum dwnusbaum requested a review from a team July 18, 2023 16:03
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks straightforward.

Many thanks for tracking this down! Out of curiosity, do you know what about the previously posted reproduction scenarios would lead to a thread interruption?

@jglick jglick enabled auto-merge July 18, 2023 16:43
@jglick jglick disabled auto-merge July 18, 2023 16:45
@jglick
Copy link
Member

jglick commented Jul 18, 2023

Disabling automerge since I noticed

would not be well covered by existing tests: none of the ones in this plugin using JenkinsSessionRule seem to check anything interesting about log text. I probably need to either add a cross-restart test to LogStorageTestBase, or introduce a new test case specifically about FileLogStorage verifying that the append mode actually appends and that the index keeps up accordingly.

@dwnusbaum
Copy link
Member Author

do you know what about the previously posted reproduction scenarios would lead to a thread interruption?

No, I do not. @jgreffe and I saw this same issue with an alternate implementation of LogStorage which used FileChannel extensively. We are both on Mac, but for whatever reason he would consistently see the same behavior as in the reported issue, just running mostly normal Pipelines with a lot of log output, but for me things were working fine. I was not sure how to track down the source of the interrupts though, perhaps it could be done with a Java agent or some other kind of JVM instrumentation? I looked briefly through searches like this at that time but nothing particularly stood out.

In this particular case I am prototyping some unrelated changes in workflow-job which caused WorkflowRunRestartTest.interruptedWhileStartingPerformanceOptimized to fail and allowed me to reproduce this issue consistently.

@jglick
Copy link
Member

jglick commented Jul 18, 2023

I was not sure how to track down the source of the interrupts though, perhaps it could be done with a Java agent or some other kind of JVM instrumentation?

It is a good question. It is very unfortunate that InterruptedException carries no stack trace from the thread doing the interrupting.

@dwnusbaum
Copy link
Member Author

I probably need to either add a cross-restart test to LogStorageTestBase, or introduce a new test case specifically about FileLogStorage verifying that the append mode actually appends and that the index keeps up accordingly.

Is this something you are actively working on? If not I can look into it, but I will probably not be able to get to it today.

@jglick
Copy link
Member

jglick commented Jul 18, 2023

Is this something you are actively working on?

Yes. Actually backing up a bit to try to let pipeline-cloudwatch-logs tests be run using SSO (needed anyway to validate jenkinsci/workflow-durable-task-step-plugin#323). One way or another, I hope to file a separate PR introducing better test coverage for logs across controller restarts in this repo, then get that reviewed & merged, then merge that into this PR so we can enable automerge again with confidence.

@jglick
Copy link
Member

jglick commented Jul 19, 2023

Never mind. After exploring a bit, I found that in fact changing

to false causes
overallHtmlPos = assertOverallLog(overallHtmlPos, lines(
"resuming",
"<span class=\"pipeline-node-1\">one #4",
"</span><span class=\"pipeline-node-3\">three #1",
"</span>ending"), true);
to fail. So does changing to 0.

@jglick jglick merged commit d7c4973 into jenkinsci:master Jul 19, 2023
@dwnusbaum dwnusbaum deleted the interruption-robustness branch July 21, 2023 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants