
[WIP] Yet More Graceful #8874

Closed
wants to merge 22 commits into from

Conversation

zeripath
Contributor

@zeripath zeripath commented Nov 7, 2019

This PR implements:

  • Graceful shutdown is now supported for Windows running as a service MERGED
  • Graceful shutdown is now supported for FCGI and unix sockets - MERGED
  • The TestPullRequests goroutine now shuts down gracefully - (As we always retest everything on restart this one is very easy...) - Broken out to Graceful: Xorm, RepoIndexer, Cron and Others #9282 & MERGED
    • This may benefit from a proper persistent queue
  • RepoIndexer and IssueIndexer need to be gracefulised - RepoIndexer Broken out to Graceful: Xorm, RepoIndexer, Cron and Others  #9282 & MERGED
  • Generalise Queues to allow us to prefer a LevelQueue/RedisQueue in place of a ChannelQueue when persistence over shutdowns is important. Broken out to Graceful Queues: Issue Indexing and Tasks #9363
    • There is now a generalised way of creating queues from the settings
    • Queues can also have a configurable number of workers.
    • Queues will also automatically create more workers if their internal queues get blocked (see the sketch after this list)
      • There is a configurable blockTimeout and boostTimeout
      • During a boost the blockTimeout is doubled
      • Any new blockages during the boost will cause an additional boost - blockages that happened just before the boost are ignored
      • Queues are now added to the monitor page
      • All broken out to Graceful Queues: Issue Indexing and Tasks #9363
  • Cron tasks should run in a graceful wrapper - Broken out to Graceful: Xorm, RepoIndexer, Cron and Others #9282 & MERGED
  • All git commands run through git.Command will now time out at Hammer Time, similarly for processes - broken out to Graceful: Cancel Process on monitor pages & HammerTime #9213 & MERGED
  • All git commands are now run through git.Command - broken out to Graceful: Cancel Process on monitor pages & HammerTime #9213 & MERGED
  • Processes can now be cancelled from the monitor page - broken out to Graceful: Cancel Process on monitor pages & HammerTime #9213 & MERGED
  • Check other long running goroutines...
    • modules/migrations/migrate.go - Migration can take a very long time - need to be able to abort
    • modules/notification/notification.go - Needs a queue and to be gracefulised
      • This queue has by default 10 workers
      • modules/notification/base/null.go and modules/notification/base/queue.go are now autogenerated from the base.go
      • modules/notification/ui/ui.go also uses a queue
    • services/mailer/mailer.go - Needs a queue and to be gracefulised
    • services/mirror/mirror.go - Now gracefully shuts down but probably needs a persistent queue - might need a 2-part queue
    • services/pull/pull.go - AddTestPullRequestTask may need to become a proper queue and InvalidateCodeComments should take place within a graceful run.
    • Partially broken out to Graceful: Xorm, RepoIndexer, Cron and Others  #9282 & MERGED
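The boost behaviour described above can be sketched roughly as follows (illustrative only, not the implementation broken out to #9363; blockTimeout, boostTimeout and boostWorkers are placeholder settings, and locking is omitted): a push that cannot be accepted within blockTimeout spawns temporary extra workers and doubles the timeout, so further blockages during the boost escalate again.

package queue

import "time"

// WorkerQueue is a toy stand-in for the generalised queue described above.
type WorkerQueue struct {
    dataChan     chan interface{}
    handle       func(data ...interface{})
    blockTimeout time.Duration // placeholder setting
    boostTimeout time.Duration // placeholder setting
    boostWorkers int           // placeholder setting
}

// Push blocks for at most blockTimeout; if the internal channel is still
// blocked it boosts the worker pool and doubles blockTimeout.
func (q *WorkerQueue) Push(data interface{}) {
    select {
    case q.dataChan <- data:
    case <-time.After(q.blockTimeout):
        q.blockTimeout *= 2 // new blockages during the boost escalate further
        for i := 0; i < q.boostWorkers; i++ {
            go q.worker(time.After(q.boostTimeout)) // boost workers expire after boostTimeout
        }
        q.dataChan <- data // wait for the boosted pool to catch up
    }
}

func (q *WorkerQueue) worker(expire <-chan time.Time) {
    for {
        select {
        case data := <-q.dataChan:
            q.handle(data)
        case <-expire:
            return
        }
    }
}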

Contains #9304

@zeripath zeripath added the type/feature Completely new functionality. Can only be merged if feature freeze is not active. label Nov 7, 2019
@guillep2k
Member

  • RepoIndexer and IssueIndexer need to be considered

If we use a persistable queue, these two should be easy to do; I think that each individual operation is not lengthy. If we shut down the queue, we could afford to wait for the indexer to finish the task at hand.

Anyway, unless required for different reasons, the persistable queues should be persisted only on graceful restart (for performance reasons), and we must be careful when restarting a different version of Gitea in case any formats have changed.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Nov 8, 2019
@zeripath
Contributor Author

zeripath commented Nov 8, 2019

@guillep2k yeah, in general my feeling is that we should just let things finish the tasks they're currently doing rather than abort - repo indexing is, however, one thing that could take some time, so that might need an abort. We do need to shut down their indices gracefully though - it is possible to corrupt the index by aborting mid-write - and then Gitea will not restart.

The queues for indexing are configurable afaics so if they are using persistent queues then we're home and dry - although migration problems are still an issue. If not, dealing with the channel queues is a bit more difficult:

  • Closing the channel would cause a panic upstream if they try to write to it - possible loss of data - would need to check.
  • Stopping reading from the channel could block upstream users - this might be acceptable but could prevent graceful shutdown of other queues. Upstream users pushing to the channel would need to have timeouts etc. Again need to check for loss of data.
  • Reading from the channel but doing nothing needs to be checked that there is no loss of data.
  • Switching to a temporary persistent queue appears to be the easiest option (sketched below) - we would need to deal with the queue on restart and do it in such a way as to avoid migration difficulties.
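To make the last option concrete, a minimal sketch (none of this is Gitea code; drainToDisk and the JSON-lines file are invented stand-ins for a temporary level queue): once the workers and pushers have stopped, whatever remains buffered in the channel is persisted so it can be re-queued on restart.

package queue

import (
    "encoding/json"
    "os"
)

// drainToDisk empties an already-quiesced channel queue into a file of JSON
// lines. It assumes nothing is still pushing to dataChan and that the channel
// has not been closed.
func drainToDisk(dataChan <-chan interface{}, path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()

    enc := json.NewEncoder(f)
    for {
        select {
        case data := <-dataChan:
            if err := enc.Encode(data); err != nil {
                return err
            }
        default:
            // Channel is empty - nothing left to persist.
            return nil
        }
    }
}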

In terms of repo/issue indexing I need to look again at the populate index functions because it may be that we just run through all of these on restart anyway - in which case we're done: stop doing work, close the index and set the channel read to a no-op. (Although having to run through every repo at startup may be a terrible idea...)

The main difficulties will come with dealing with the other places where we're using channels as worker queues. These will need to be formalised and decisions made about whether there needs to be persistence of the queue or whether the loss of data is tolerable.

As an aside, it's worth noting that the single TestPullRequests goroutine worker appears to run every pull request at startup (on every Gitea server) - this is fine if you're running a small server, but if you have just restarted a clustered server with, say, >1000 PRs (most of which haven't changed), that server isn't gonna update any PRs until all that (likely unnecessary) work is done. I think this one likely needs an option to be a persistent queue or at least some way of shortcutting... It also probably needs a (configurable?) number of workers.

The other benefit we could see by allowing persistent queues is the ability to share this load out.

@guillep2k
Member

guillep2k commented Nov 8, 2019

As an aside It's worth noting that the single TestPullRequests goroutine worker appears to run every pull request at startup (on every Gitea server)

I think you've just found the cause for #7947 !!! 😱
EDIT: Or maybe not. Worth checking, though.

        }
    case <-time.After(time.Millisecond * 100):
        i++
        if i >= 3 && len(datas) > 0 {
Member

3 -> batchLength ?

Contributor Author

It's from your original code 😛

My interpretation is:

  • 300ms after you have started this queue, send any unsent data even if the batch is not completely full, and every 100ms after that.

Contributor Author

I've restructured to be a bit clearer - the timeout variable changes once it's had its first wait.
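For reference, here is a rough sketch of how I read that loop after the restructure (names are placeholders, not the actual code): the first wait is the long one, and once it has fired the timeout variable is shortened so incomplete batches are flushed every 100ms thereafter.

package queue

import "time"

func batchLoop(dataChan <-chan interface{}, batchLength int, handle func(...interface{})) {
    timeout := 300 * time.Millisecond
    var datas []interface{}
    for {
        select {
        case data := <-dataChan:
            datas = append(datas, data)
            if len(datas) >= batchLength {
                handle(datas...)
                datas = nil
            }
        case <-time.After(timeout):
            // Only the first wait is 300ms; subsequent checks happen every 100ms.
            timeout = 100 * time.Millisecond
            if len(datas) > 0 {
                handle(datas...) // flush the incomplete batch
                datas = nil
            }
        }
    }
}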

@zeripath
Contributor Author

zeripath commented Nov 13, 2019

I've split out the windows component from this so that it can be merged without this being completely ready. - Done

@codecov-io

codecov-io commented Nov 13, 2019

Codecov Report

Merging #8874 into master will increase coverage by 0.19%.
The diff coverage is 44.08%.


@@            Coverage Diff             @@
##           master    #8874      +/-   ##
==========================================
+ Coverage   41.26%   41.46%   +0.19%     
==========================================
  Files         561      565       +4     
  Lines       73439    76063    +2624     
==========================================
+ Hits        30307    31538    +1231     
- Misses      39338    40596    +1258     
- Partials     3794     3929     +135
Impacted Files Coverage Δ
modules/migrations/migrate.go 20.74% <ø> (-4.26%) ⬇️
models/issue.go 52.58% <ø> (ø) ⬆️
routers/admin/admin.go 13.66% <0%> (-2.86%) ⬇️
modules/indexer/issues/db.go 52.17% <0%> (-4.97%) ⬇️
cmd/web_graceful.go 0% <0%> (ø) ⬆️
cmd/web.go 0% <0%> (ø) ⬆️
modules/graceful/server_hooks.go 4.25% <0%> (ø) ⬆️
modules/ssh/ssh_graceful.go 40% <0%> (ø) ⬆️
routers/install.go 0% <0%> (ø) ⬆️
modules/indexer/issues/bleve.go 62.42% <0%> (-0.77%) ⬇️
... and 83 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f09cdb2...2ae0b41.

@zeripath zeripath force-pushed the graceful-2 branch 5 times, most recently from d9922ec to 711fd7c Compare November 16, 2019 20:00
@guillep2k
Member

So, validating this is either a leap of faith or we could create some environment to specifically check some of the concepts addressed here. I don't mean anything to be added to our standard CI script, but a bunch of scripts somewhere (maybe under /integrations/graceful_checks/...?) that we could run manually or take as a model for validating new "gracefullable" functions. For example, we could have a pre-receive git hook script with this content:

#!/bin/bash
# Ignore termination signals so the hook keeps running across a graceful shutdown
trap "" SIGINT SIGHUP SIGQUIT SIGTERM
LOG=/tmp/....
echo Interruptable script started!
(
    date
    echo Interruptable script started!
    sleep 30
    date
    echo Interruptable mid-run
    sleep 30
    date
    echo Interruptable script finished!
) > "$LOG" 2>&1
echo Interruptable script finished!

Then a push operation would be forced to wait one minute, and we'd have a window of opportunity to manually test the effects of a graceful shutdown. We could also have a similar bash script (replacing the mid-run echo by a cat) to use as a fake mail server on port 25 using nc (that would need root, of course).

The delayed git hook will test both macaron and cmd shutdowns. The indexer is more difficult to test, as it will need a special token filter to introduce the delay, but you get the idea.

@zeripath
Contributor Author

zeripath commented Nov 17, 2019

Tbh the best test I've found is just killall -1 gitea and watching the thing come back up. If I get the code wrong it won't come back or it will take too long and will get hammered! That and raw analysis of the code.

That's clearly not very good though.

So I've been doing some other work:

  • I've written tests of the queues so we can guarantee that they are well behaved with regards to shutdown and terminate for reasonable handle functions. I'm not sure how to go about testing the redis queue but if the code is correct for the level queue and is the same for the redis queue - it's correct...

  • I've been trying to reduce the number of places that we explicitly mention graceful.Manager and instead proffer channels or callbacks by wrapping run fns with RunWithShutdown*. (I probably need to include the hammerChan/hammer callback in these because there will be places where hammer may be needed to stop running.) That will help with creating test harnesses because you can then just test the run function in isolation.

  • I think I'll add a clean-up/cancel global run fn to modules/git that git.Command will check (if set) when it does a Run, running it between cmd.Start and cmd.Wait. graceful could set that global on init to cause a cancel on IsHammer.

    • I think that's reasonable as I don't think any git commands should be being performed at Terminate - it's literally for queues to persist themselves and close their DBs cleanly.
    • Git tolerates being shutdown mid push and this is one of the reasons for us using temporary repos so any partially applied stuff should just disappear.
    • cancelling these commands at hammer would also allow the defers to run, thus allowing clean-up to occur - currently, if we're hammered, it is possible that temporary repos are not cleaned up.
    • we would need to check the go-git stuff for similar hammerable-ness
    • actually it would probably be easier to just create a background context.Context that's cancelled at hammer and have all these git commands inherit from it - I will check (sketched below).
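A minimal sketch of that last idea, assuming a package-level context that the graceful manager cancels at hammer time (HammerContext, Hammer and RunInContext are illustrative names, not the actual modules/git API):

package git

import (
    "context"
    "os/exec"
)

// HammerContext would be created at init and cancelled when the manager hammers.
var HammerContext, hammerCancel = context.WithCancel(context.Background())

// Hammer cancels HammerContext, killing any in-flight git commands while still
// letting the Go-side defers run (e.g. removal of temporary repos).
func Hammer() {
    hammerCancel()
}

// RunInContext runs a git command that is killed when ctx is cancelled.
// Callers inside a request could pass the request context instead.
func RunInContext(ctx context.Context, dir string, args ...string) ([]byte, error) {
    cmd := exec.CommandContext(ctx, "git", args...)
    cmd.Dir = dir
    return cmd.CombinedOutput()
}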

In terms of the RunWithShutdownFns API I probably need to adjust these callbacks. They currently are simple: atTerminate(runThis) with atTerminate implemented as:

terminateWG.Add(1)
go func() {
    <-IsTerminate()
    runThis()
    terminateWG.Done()
}()

This means there's no way to cancel the atTerminate, which is fine for long-running things like queues, but you cannot use this callback in a short-lived, repeatedly-called function without causing a memory and goroutine leak. Similar for the atShutdown callback.

I think these may have to be changed to pass in a done channel that could be closed at the end of run so instead the atTerminate looks like:

terminateWG.Add(1)
go func() {
    select {
    case <-IsTerminate():
        runThis()
    case <-done:
    }
    terminateWG.Done()
}()

In terms of RunWithShutdownChan - we can't just pass in the terminate channel, because terminating functions need to be added to a wait group to give them a chance to terminate cleanly before the main goroutine closes.

It may be that with a bit more thought we could use appropriately created context.Contexts for most of this work, but the waitgroup registration requirements for run and terminate don't clearly align with that - so I suspect that's still a nonstarter. However, making our own context-alike is probably better - then RunWithShutdownFns and Chan simply become RunWithContext, with appropriate functions on the context, so I might do that.
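A hypothetical sketch of what RunWithContext might look like (Manager, shutdown and runningWG here are invented for the example, not the current graceful API): the manager derives a context that it cancels at shutdown, so run functions only have to watch ctx.Done(), while the wait-group registration stays inside the manager.

package graceful

import (
    "context"
    "sync"
)

type Manager struct {
    shutdown  chan struct{}  // closed when a graceful shutdown begins
    runningWG sync.WaitGroup // waited on before the main goroutine exits
}

// RunWithContext registers run on the wait group and hands it a context that
// is cancelled when shutdown begins (or when run returns).
func (g *Manager) RunWithContext(run func(ctx context.Context)) {
    g.runningWG.Add(1)
    defer g.runningWG.Done()

    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    go func() {
        select {
        case <-g.shutdown:
            cancel()
        case <-ctx.Done():
        }
    }()

    run(ctx)
}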

@zeripath
Contributor Author

zeripath commented Nov 17, 2019

@guillep2k btw, although in general I don't tend to force-push to Gitea PRs, in this case, in an attempt to make reviewing this easier in future, I'm trying to create understandable individual patches/commits. Therefore I'm rewriting history as I refactor and restructure.

(As that messes up the PR history it will likely lead to me having to close and reopen this PR once it moves out of WIP.)

If you're closely watching this though and would prefer me to stop force pushing please say.

@zeripath
Contributor Author

hmm... I wonder if I need to be using something like this...

@guillep2k
Member

If you're closely watching this though and would prefer me to stop force pushing please say.

Thanks. It's #8964 that I'm reviewing "in small bites" because of its complexity (I know this one is even more complex); my comment above was something I thought about while reviewing that. As long as you mark this as WIP, I don't mind you rewriting. But I do use the GitHub tool to explore deltas in order to check what has been changed and try to understand the reasoning behind the changes, especially for large files where you may change just a couple of lines.

@zeripath
Contributor Author

OK, I've made the move to switch to a more context-based approach, as it looks like more natural Go and is simpler in general.

I likely need to reconsider how the queues work, because they could actually be simplified to take a context, but they will still need a terminate hook, so they can't just run within a simple context; however, shutdown could easily be managed this way.

I've also made the git commands run in the HammerTime context by default, and added a function to allow a git command to run in a different context.

@zeripath
Contributor Author

AHA!

I can avoid a whole lot of hackery and make HammerTime more powerful by setting the BaseContext function on the listeners to return the (newly available) HammerContext instead of nil or context.Background().

We can then pass the req.Context() in to most of our git calls for those running inside the response goroutine, allowing them to be cancelled properly when the response is cancelled - and have them be hammered too.
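A minimal sketch of the BaseContext idea (hammerCtx and newServer are placeholders, not the actual wiring): the listener's base context is the hammer context, so every request context derives from it and is cancelled when we hammer - and per-request git calls made with req.Context() get cancelled with the response as well.

package web

import (
    "context"
    "net"
    "net/http"
)

// newServer returns an http.Server whose request contexts all derive from
// hammerCtx instead of context.Background().
func newServer(hammerCtx context.Context, handler http.Handler) *http.Server {
    return &http.Server{
        Handler: handler,
        BaseContext: func(net.Listener) context.Context {
            return hammerCtx
        },
    }
}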

@zeripath
Contributor Author

zeripath commented Nov 18, 2019

agh! leveldb doesn't allow multiple processes to access the db - that means that both queue_disk and queue_disk_channel need to have a bit more thought.


Done

@zeripath
Contributor Author

Sigh those last few commits have broken the tests... I'll have to fix them and try again...

@zeripath zeripath force-pushed the graceful-2 branch 5 times, most recently from 5c12c33 to ec01c00 Compare November 23, 2019 11:54
@stale

stale bot commented Feb 25, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

@stale stale bot added the issue/stale label Feb 25, 2020
@zeripath
Contributor Author

zeripath commented Mar 4, 2020

Almost all of this has been merged as other PRs, therefore I will close this.

@zeripath zeripath closed this Mar 4, 2020
@zeripath zeripath deleted the graceful-2 branch March 4, 2020 21:28
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020