
[WIP] Yet More Graceful #8874

Closed
wants to merge 22 commits into from

Conversation

zeripath
Contributor

@zeripath zeripath commented Nov 7, 2019

This PR implements:

  • Graceful shutdown is now supported for Windows running as a service MERGED
  • Graceful shutdown is now supported for FCGI and unix sockets - MERGED
  • The TestPullRequests goroutine now shuts down gracefully - (As we always retest everything on restart this one is very easy...) - Broken out to Graceful: Xorm, RepoIndexer, Cron and Others #9282 & MERGED
    • This may benefit from a proper persistent queue
  • RepoIndexer and IssueIndexer need to be gracefulised - RepoIndexer Broken out to Graceful: Xorm, RepoIndexer, Cron and Others  #9282 & MERGED
  • Generalise Queues to allow us to prefer a LevelQueue/RedisQueue in place of a ChannelQueue when persistence over shutdowns is important. Broken out to Graceful Queues: Issue Indexing and Tasks #9363
    • There is now a generalised way of creating queues from the settings
    • Queues can also have a configurable number of workers.
    • Queues will also automatically create more workers if their internal queues get blocked (see the sketch after this list)
      • There is a configurable blockTimeout and boostTimeout
      • During a boost the blockTimeout is doubled
      • Any new blockages during the boost will cause an additional boost - blockages that happened just before the boost are ignored
      • Queues are now added to the monitor page
      • All broken out to Graceful Queues: Issue Indexing and Tasks #9363
  • Cron tasks should run in a graceful wrapper - Broken out to Graceful: Xorm, RepoIndexer, Cron and Others #9282 & MERGED
  • All git commands run through git.Command will now time out at Hammer Time, similarly for processes - broken out to Graceful: Cancel Process on monitor pages & HammerTime #9213 & MERGED
  • All git commands are now run through git.Command - broken out to Graceful: Cancel Process on monitor pages & HammerTime #9213 & MERGED
  • Processes can now be cancelled from the monitor page - broken out to Graceful: Cancel Process on monitor pages & HammerTime #9213 & MERGED
  • Check other long running goroutines...
    • modules/migrations/migrate.go - Migration can take a very long time - need to be able to abort
    • modules/notification/notification.go - Needs a queue and to be gracefulised
      • This queue has by default 10 workers
      • modules/notification/base/null.go and modules/notification/base/queue.go are now autogenerated from the base.go
      • modules/notification/ui/ui.go also uses a queue
    • services/mailer/mailer.go - Needs a queue and to be gracefulised
    • services/mirror/mirror.go - Now gracefully shuts down but probably needs a persistent queue - might need a 2-part queue
    • services/pull/pull.go - AddTestPullRequestTask may need to become a proper queue and InvalidateCodeComments should take place within a graceful run.
    • Partially broken out to Graceful: Xorm, RepoIndexer, Cron and Others  #9282 & MERGED
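The boost behaviour described above can be sketched roughly as follows (illustrative only, not the implementation broken out to #9363; blockTimeout, boostTimeout and boostWorkers are placeholder settings, and locking is omitted): a push that cannot be accepted within blockTimeout spawns temporary extra workers and doubles the timeout, so further blockages during the boost escalate again.

package queue

import "time"

// WorkerQueue is a toy stand-in for the generalised queue described above.
type WorkerQueue struct {
    dataChan     chan interface{}
    handle       func(data ...interface{})
    blockTimeout time.Duration // placeholder setting
    boostTimeout time.Duration // placeholder setting
    boostWorkers int           // placeholder setting
}

// Push blocks for at most blockTimeout; if the internal channel is still
// blocked it boosts the worker pool and doubles blockTimeout.
func (q *WorkerQueue) Push(data interface{}) {
    select {
    case q.dataChan <- data:
    case <-time.After(q.blockTimeout):
        q.blockTimeout *= 2 // new blockages during the boost escalate further
        for i := 0; i < q.boostWorkers; i++ {
            go q.worker(time.After(q.boostTimeout)) // boost workers expire after boostTimeout
        }
        q.dataChan <- data // wait for the boosted pool to catch up
    }
}

func (q *WorkerQueue) worker(expire <-chan time.Time) {
    for {
        select {
        case data := <-q.dataChan:
            q.handle(data)
        case <-expire:
            return
        }
    }
}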

Contains #9304

@zeripath zeripath added the type/feature Completely new functionality. Can only be merged if feature freeze is not active. label Nov 7, 2019
@guillep2k
Member

  • RepoIndexer and IssueIndexer need to be considered

If we use a persistable queue, these two should be easy to do; I think that each individual operation is not lengthy. If we shut down the queue, we could afford to wait for the indexer to finish the task at hand.

Anyway, unless required for different reasons, the persistable queues should be persisted only on graceful restart (for performance reasons), and we must be careful when restarting a different version of Gitea in case any formats have changed.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Nov 8, 2019
@zeripath
Contributor Author

zeripath commented Nov 8, 2019

@guillep2k yeah, in general my feeling is that we should just let things finish the tasks they're currently doing rather than abort - repo indexing is, however, one thing that could take some time, so that might need an abort. We do need to shut down their indices gracefully though - it is possible to corrupt the index by aborting mid-write - and then Gitea will not restart.

The queues for indexing are configurable afaics so if they are using persistent queues then we're home and dry - although migration problems are still an issue. If not, dealing with the channel queues is a bit more difficult:

  • Closing the channel would cause a panic upstream if they try to write to it - possible loss of data - would need to check.
  • Stopping reading from the channel could block upstream users - this might be acceptable but could prevent graceful shutdown of other queues. Upstream users pushing to the channel would need to have timeouts etc. Again need to check for loss of data.
  • Reading from the channel but doing nothing needs to be checked that there is no loss of data.
  • Switching to a temporary persistent queue appears to be the easiest option (sketched below) - we would need to deal with the queue on restart and do it in such a way as to avoid migration difficulties.
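To make the last option concrete, a minimal sketch (none of this is Gitea code; drainToDisk and the JSON-lines file are invented stand-ins for a temporary level queue): once the workers and pushers have stopped, whatever remains buffered in the channel is persisted so it can be re-queued on restart.

package queue

import (
    "encoding/json"
    "os"
)

// drainToDisk empties an already-quiesced channel queue into a file of JSON
// lines. It assumes nothing is still pushing to dataChan and that the channel
// has not been closed.
func drainToDisk(dataChan <-chan interface{}, path string) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()

    enc := json.NewEncoder(f)
    for {
        select {
        case data := <-dataChan:
            if err := enc.Encode(data); err != nil {
                return err
            }
        default:
            // Channel is empty - nothing left to persist.
            return nil
        }
    }
}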

In terms of repo/issue indexing I need to look again at the populate index functions because it may be that we just run through all of these on restart anyway - in which case we're done: stop doing work, close the index and set the channel read to a no-op. (Although having to run through every repo at startup may be a terrible idea...)

The main difficulties will come with dealing with the other places where we're using channels as worker queues. These will need to be formalised and decisions made about whether there needs to be persistence of the queue or whether the loss of data is tolerable.

As an aside, it's worth noting that the single TestPullRequests goroutine worker appears to run every pull request at startup (on every Gitea server) - this is fine if you're running a small server, but if you have just restarted a clustered server with, say, >1000 PRs (most of which haven't changed), that server isn't gonna update any PRs until all that (likely unnecessary) work is done. I think this one likely needs an option to be a persistent queue or at least some way of shortcutting... It also probably needs a (configurable?) number of workers.

The other benefit we could see by allowing persistent queues is the ability to share this load out.

@guillep2k
Member

guillep2k commented Nov 8, 2019

As an aside It's worth noting that the single TestPullRequests goroutine worker appears to run every pull request at startup (on every Gitea server)

I think you've just found the cause for #7947 !!! 😱
EDIT: Or maybe not. Worth checking, though.

        }
    case <-time.After(time.Millisecond * 100):
        i++
        if i >= 3 && len(datas) > 0 {
Member

3 -> batchLength ?

Contributor Author

It's from your original code 😛

My interpretation is:

  • 300ms after you have started this queue, send any unsent data even if the batch is not completely full, and every 100ms after that.

Contributor Author

I've restructured to be a bit clearer - the timeout variable changes once it's had its first wait.
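For reference, here is a rough sketch of how I read that loop after the restructure (names are placeholders, not the actual code): the first wait is the long one, and once it has fired the timeout variable is shortened so incomplete batches are flushed every 100ms thereafter.

package queue

import "time"

func batchLoop(dataChan <-chan interface{}, batchLength int, handle func(...interface{})) {
    timeout := 300 * time.Millisecond
    var datas []interface{}
    for {
        select {
        case data := <-dataChan:
            datas = append(datas, data)
            if len(datas) >= batchLength {
                handle(datas...)
                datas = nil
            }
        case <-time.After(timeout):
            // Only the first wait is 300ms; subsequent checks happen every 100ms.
            timeout = 100 * time.Millisecond
            if len(datas) > 0 {
                handle(datas...) // flush the incomplete batch
                datas = nil
            }
        }
    }
}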

@zeripath
Contributor Author

zeripath commented Nov 13, 2019

I've split out the windows component from this so that it can be merged without this being completely ready. - Done

@codecov-io

codecov-io commented Nov 13, 2019

Codecov Report

Merging #8874 into master will increase coverage by 0.19%.
The diff coverage is 44.08%.


@@            Coverage Diff             @@
##           master    #8874      +/-   ##
==========================================
+ Coverage   41.26%   41.46%   +0.19%     
==========================================
  Files         561      565       +4     
  Lines       73439    76063    +2624     
==========================================
+ Hits        30307    31538    +1231     
- Misses      39338    40596    +1258     
- Partials     3794     3929     +135
Impacted Files Coverage Δ
modules/migrations/migrate.go 20.74% <ø> (-4.26%) ⬇️
models/issue.go 52.58% <ø> (ø) ⬆️
routers/admin/admin.go 13.66% <0%> (-2.86%) ⬇️
modules/indexer/issues/db.go 52.17% <0%> (-4.97%) ⬇️
cmd/web_graceful.go 0% <0%> (ø) ⬆️
cmd/web.go 0% <0%> (ø) ⬆️
modules/graceful/server_hooks.go 4.25% <0%> (ø) ⬆️
modules/ssh/ssh_graceful.go 40% <0%> (ø) ⬆️
routers/install.go 0% <0%> (ø) ⬆️
modules/indexer/issues/bleve.go 62.42% <0%> (-0.77%) ⬇️
... and 83 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f09cdb2...2ae0b41.

@zeripath zeripath force-pushed the graceful-2 branch 5 times, most recently from d9922ec to 711fd7c Compare November 16, 2019 20:00
@guillep2k
Member

So, validating this is either a leap of faith or we could create some environment to specifically check some of the concepts addressed here. I don't mean anything to be added to our standard CI script, but a bunch of scripts somewhere (maybe under /integrations/graceful_checks/...?) that we could run manually or take as a model for validating new "gracefullable" functions. For example, we could have a pre-receive git hook script with this content:

#!/bin/bash
# Ignore termination signals so the hook keeps running across a graceful shutdown
trap "" SIGINT SIGHUP SIGQUIT SIGTERM
LOG=/tmp/....
echo Interruptable script started!
(
    date
    echo Interruptable script started!
    sleep 30
    date
    echo Interruptable mid-run
    sleep 30
    date
    echo Interruptable script finished!
) > "$LOG" 2>&1
echo Interruptable script finished!

Then a push operation would be forced to wait one minute, and we'd have a window of opportunity to manually test the effects of a graceful shutdown. We could also have a similar bash script (replacing the mid-run echo by a cat) to use as a fake mail server on port 25 using nc (that would need root, of course).

The delayed git hook will test both macaron and cmd shutdowns. The indexer is more difficult to test, as it will need a special token filter to introduce the delay, but you get the idea.

@zeripath
Contributor Author

zeripath commented Nov 17, 2019

Tbh the best test I've found is just killall -1 gitea and watching the thing come back up. If I get the code wrong it won't come back or it will take too long and will get hammered! That and raw analysis of the code.

That's clearly not very good though.

So I've been doing some other work:

  • I've written tests of the queues so we can guarantee that they are well behaved with regards to shutdown and terminate for reasonable handle functions. I'm not sure how to go about testing the redis queue but if the code is correct for the level queue and is the same for the redis queue - it's correct...

  • I've been trying to reduce the number of places that we explicitly mention graceful.Manager and instead proffer channels or callbacks by wrapping run fns with RunWithShutdown*. (I probably need to include the hammerChan/hammer callback in these because there will be places where hammer may be needed to stop running.) That will help with creating test harnesses because you can then just test the run function in isolation.

  • I think I'll add a clean-up/cancel global run fn to modules/git that git.Command will check (if set) when it does a Run, running it between cmd.Start and cmd.Wait. graceful could set that global on init to cause a cancel on IsHammer.

    • I think that's reasonable as I don't think any git commands should be being performed at Terminate - it's literally for queues to persist themselves and close their DBs cleanly.
    • Git tolerates being shutdown mid push and this is one of the reasons for us using temporary repos so any partially applied stuff should just disappear.
    • cancelling these commands at hammer would also allow the defers to run, thus allowing clean-up to occur - currently, if we're hammered, it is possible that temporary repos are not cleaned up.
    • we would need to check the go-git stuff for similar hammerable-ness
    • actually it would probably be easier to just create a background context.Context that's cancelled at hammer and have all these git commands inherit from it - I will check (sketched below).
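A minimal sketch of that last idea, assuming a package-level context that the graceful manager cancels at hammer time (HammerContext, Hammer and RunInContext are illustrative names, not the actual modules/git API):

package git

import (
    "context"
    "os/exec"
)

// HammerContext would be created at init and cancelled when the manager hammers.
var HammerContext, hammerCancel = context.WithCancel(context.Background())

// Hammer cancels HammerContext, killing any in-flight git commands while still
// letting the Go-side defers run (e.g. removal of temporary repos).
func Hammer() {
    hammerCancel()
}

// RunInContext runs a git command that is killed when ctx is cancelled.
// Callers inside a request could pass the request context instead.
func RunInContext(ctx context.Context, dir string, args ...string) ([]byte, error) {
    cmd := exec.CommandContext(ctx, "git", args...)
    cmd.Dir = dir
    return cmd.CombinedOutput()
}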

In terms of the RunWithShutdownFns API I probably need to adjust these callbacks. They currently are simple: atTerminate(runThis) with atTerminate implemented as:

terminateWG.Add(1)
go func() {
    <-IsTerminate()
    runThis()
    terminateWG.Done()
}()

This means there's no way to cancel the atTerminate, which is fine for long-running things like queues, but you cannot use this callback in a short-lived, repeatedly-called function without causing a memory and goroutine leak. Similar for the atShutdown callback.

I think these may have to be changed to pass in a done channel that could be closed at the end of run so instead the atTerminate looks like:

terminateWG.Add(1)
go func() {
    select {
    case <-IsTerminate():
        runThis()
    case <-done:
    }
    terminateWG.Done()
}()

In terms of RunWithShutdownChan - we can't just pass in the terminate channel, because terminating functions need to be added to a wait group to give them a chance to terminate cleanly before the main goroutine closes.

It may be that with a bit more thought we could use appropriately created context.Contexts for most of this work, but the waitgroup registration requirements for run and terminate don't clearly align with that - so I suspect that's still a nonstarter. However, making our own context-alike is probably better - then RunWithShutdownFns and Chan simply become RunWithContext, with appropriate functions on the context, so I might do that.
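A hypothetical sketch of what RunWithContext might look like (Manager, shutdown and runningWG here are invented for the example, not the current graceful API): the manager derives a context that it cancels at shutdown, so run functions only have to watch ctx.Done(), while the wait-group registration stays inside the manager.

package graceful

import (
    "context"
    "sync"
)

type Manager struct {
    shutdown  chan struct{}  // closed when a graceful shutdown begins
    runningWG sync.WaitGroup // waited on before the main goroutine exits
}

// RunWithContext registers run on the wait group and hands it a context that
// is cancelled when shutdown begins (or when run returns).
func (g *Manager) RunWithContext(run func(ctx context.Context)) {
    g.runningWG.Add(1)
    defer g.runningWG.Done()

    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    go func() {
        select {
        case <-g.shutdown:
            cancel()
        case <-ctx.Done():
        }
    }()

    run(ctx)
}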

@zeripath
Contributor Author

zeripath commented Nov 17, 2019

@guillep2k btw, although in general I don't tend to force-push to Gitea PRs, in this case, in an attempt to make reviewing this easier in future, I'm trying to create understandable individual patches/commits. Therefore I'm rewriting history as I refactor and restructure.

(As that messes up the PR history it will likely lead to me having to close and reopen this PR once it moves out of WIP.)

If you're closely watching this though and would prefer me to stop force pushing please say.

@zeripath
Contributor Author

hmm... I wonder if I need to be using something like this...

@guillep2k
Member

If you're closely watching this though and would prefer me to stop force pushing please say.

Thanks. It's #8964 that I'm reviewing "in small bites" because of its complexity (I know this one is even more complex); my comment above was something I thought about while reviewing that. As long as you mark this as WIP, I don't mind you rewriting. But I do use the GitHub tool to explore deltas in order to check what has been changed and try to understand the reasoning behind the changes, especially for large files where you may change just a couple of lines.

@zeripath
Contributor Author

OK, I've made the move to switch to a more context-based approach, as it looks like more natural Go and is simpler in general.

I likely need to reconsider how the queues work, because they could actually be simplified to take a context, but they will still need a terminate hook, so they can't just run within a simple context; however, shutdown could easily be managed this way.

I've also made the git commands run in the HammerTime context by default, and added a function to allow a git command to run in a different context.

@zeripath
Contributor Author

AHA!

I can avoid a whole lot of hackery and make HammerTime more powerful by setting the BaseContext function on the listeners to return the (newly available) HammerContext instead of nil or context.Background().

We can then pass the req.Context() in to most of our git calls for those running inside the response goroutine, allowing them to be cancelled properly when the response is cancelled - and have them be hammered too.
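A minimal sketch of the BaseContext idea (hammerCtx and newServer are placeholders, not the actual wiring): the listener's base context is the hammer context, so every request context derives from it and is cancelled when we hammer - and per-request git calls made with req.Context() get cancelled with the response as well.

package web

import (
    "context"
    "net"
    "net/http"
)

// newServer returns an http.Server whose request contexts all derive from
// hammerCtx instead of context.Background().
func newServer(hammerCtx context.Context, handler http.Handler) *http.Server {
    return &http.Server{
        Handler: handler,
        BaseContext: func(net.Listener) context.Context {
            return hammerCtx
        },
    }
}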

@zeripath
Contributor Author

zeripath commented Nov 18, 2019

agh! leveldb doesn't allow multiple processes to access the db - that means that both queue_disk and queue_disk_channel need to have a bit more thought.


Done

@zeripath
Contributor Author

Sigh those last few commits have broken the tests... I'll have to fix them and try again...

@zeripath zeripath force-pushed the graceful-2 branch 5 times, most recently from 5c12c33 to ec01c00 Compare November 23, 2019 11:54
@stale

stale bot commented Feb 25, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

@stale stale bot added the issue/stale label Feb 25, 2020
@zeripath
Contributor Author

zeripath commented Mar 4, 2020

Almost all of this has been merged as other PRs, therefore I will close this.

@zeripath zeripath closed this Mar 4, 2020
@zeripath zeripath deleted the graceful-2 branch March 4, 2020 21:28
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020