chore: log improvements #2365

Merged: 16 commits merged into master from fix.main-logs on Sep 1, 2022

Conversation

@lvrach (Member) commented on Aug 29, 2022:

Description

This PR introduces a couple of minor logging improvements.

Motivation: rudder-server was terminating without printing a helpful log, and it was emitting some unhelpful warn/err logs while shutting down.

Notion Ticket

https://www.notion.so/rudderstacks/Minor-logs-improvements-5e9fbfc8dfa3495aa38ffa6e707585e0

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

codecov bot commented Aug 29, 2022

Codecov Report

Merging #2365 (63dd247) into master (cbe74b6) will increase coverage by 0.06%.
The diff coverage is 62.50%.

@@            Coverage Diff             @@
##           master    #2365      +/-   ##
==========================================
+ Coverage   34.12%   34.19%   +0.06%     
==========================================
  Files         198      199       +1     
  Lines       39445    39456      +11     
==========================================
+ Hits        13462    13491      +29     
+ Misses      25120    25099      -21     
- Partials      863      866       +3     
Impacted Files Coverage Δ
app/cluster/state/etcd.go 63.10% <0.00%> (-0.31%) ⬇️
gateway/gateway.go 52.54% <0.00%> (+0.74%) ⬆️
services/validators/envValidator.go 1.11% <0.00%> (ø)
warehouse/warehouse.go 0.00% <0.00%> (ø)
main.go 71.15% <33.33%> (+0.48%) ⬆️
config/backend-config/backend-config.go 69.88% <40.00%> (-1.05%) ⬇️
utils/httputil/server.go 92.30% <92.30%> (ø)
config/backend-config/namespace_config.go 78.62% <100.00%> (+0.16%) ⬆️
config/backend-config/single_workspace.go 62.60% <100.00%> (+0.32%) ⬆️
services/metric/registry.go 76.80% <0.00%> (-2.41%) ⬇️
... and 5 more


Comment on lines +125 to +132
 	backoffWithMaxRetry := backoff.WithContext(backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 3), ctx)
 	err := backoff.RetryNotify(operation, backoffWithMaxRetry, func(err error, t time.Duration) {
 		nc.Logger.Warnf("Failed to fetch config from API with error: %v, retrying after %v", err, t)
 	})
 	if err != nil {
-		nc.Logger.Errorf("Error sending request to the server: %v", err)
+		if ctx.Err() == nil {
+			nc.Logger.Errorf("Error sending request to the server: %v", err)
+		}
lvrach (Member Author):

I did this to avoid unrelated errors on context cancellation. Plus, it doesn't retry after ctx is cancelled.
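
A minimal, self-contained sketch of the same pattern (the cenkalti/backoff/v4 import path and the always-failing operation are assumptions for illustration, not taken from the PR):

	package main

	import (
		"context"
		"errors"
		"fmt"
		"time"

		"github.com/cenkalti/backoff/v4"
	)

	func main() {
		// Simulate a shutdown happening while the operation keeps failing.
		ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
		defer cancel()

		operation := func() error {
			return errors.New("config API unavailable")
		}

		// Same construction as in the diff: exponential backoff, at most 3 retries,
		// tied to ctx so no further retries happen once ctx is cancelled.
		backoffWithMaxRetry := backoff.WithContext(backoff.WithMaxRetries(backoff.NewExponentialBackOff(), 3), ctx)
		err := backoff.RetryNotify(operation, backoffWithMaxRetry, func(err error, t time.Duration) {
			fmt.Printf("failed: %v, retrying after %v\n", err, t)
		})

		// The guard from the diff: when ctx is already cancelled the failure is an
		// expected part of shutting down, so it is not logged as an error.
		if err != nil && ctx.Err() == nil {
			fmt.Printf("error sending request to the server: %v\n", err)
		}
	}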

@lvrach changed the title from "fix: show etcd log with context dealine error" to "chore: log improvements" on Aug 29, 2022
Comment on lines 253 to 259
 	g, ctx := errgroup.WithContext(ctx)
-	g.Go(func() (err error) {
-		if err = admin.StartServer(ctx); err != nil && ctx.Err() == nil {
-			if !errors.Is(err, context.Canceled) {
-				pkgLogger.Errorf("Error in Admin server routine: %v", err)
-			}
-		}
+	g.Go(func() error {
+		if err := admin.StartServer(ctx); err != nil {
+			return fmt.Errorf("admin server routine: %w", err)
+		}
-		return err
+		return nil
 	})
lvrach (Member Author):

There is no need to check for ctx.Err(), since the ctx belongs to the errgroup:

  1. ctx is cancelled immediately when the first error is returned.
  2. Once ctx is cancelled by that first error, the errgroup is not going to propagate any further errors; Wait reports only the first one (see the sketch below).
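
A minimal, self-contained sketch (not from the PR) of that errgroup behaviour, using golang.org/x/sync/errgroup:

	package main

	import (
		"context"
		"errors"
		"fmt"

		"golang.org/x/sync/errgroup"
	)

	func main() {
		g, ctx := errgroup.WithContext(context.Background())

		// The first error cancels ctx for every other goroutine in the group.
		g.Go(func() error {
			return errors.New("boom")
		})

		// This goroutine only returns once ctx is cancelled; its error is not
		// propagated, because Wait reports only the first non-nil error.
		g.Go(func() error {
			<-ctx.Done()
			return errors.New("late error, never surfaced by Wait")
		})

		fmt.Println(g.Wait())  // boom
		fmt.Println(ctx.Err()) // context canceled
	}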

-	if err != nil && ctx.Err() == nil {
-		pkgLogger.Error(err)
+	if err != nil {
+		pkgLogger.Errorf("Terminal error: %v", err)
 	}
lvrach (Member Author):

The ctx here is the errgroup context, which is cancelled when an error is returned.

Essentially, the condition err != nil && ctx.Err() == nil would never be true, and thus nothing would be logged.

I find it easier to print a single "Terminal error" as late as possible.

@lvrach lvrach marked this pull request as ready for review August 30, 2022 07:11
@lvrach lvrach requested a review from a team August 30, 2022 08:57
@atzoum (Contributor) left a comment:

Looks like we need to choose between muting valuable errors and erroneously logging errors during a normal shutdown.

-	if err != nil && ctx.Err() == nil {
-		pkgLogger.Error(err)
+	if err != nil {
+		pkgLogger.Errorf("Terminal error: %v", err)
atzoum (Contributor):

I think that the only side effect is that we will log an error during graceful shutdown, which is a bit misleading, e.g.

2022-08-30T16:31:25.711+0300	ERROR	rudder-server/main.go:294	Terminal error: profiler server routine: debug server: http: Server closed
2022-08-30T16:31:25.711+0300	INFO	rudder-server/main.go:297	Attempting to shutdown gracefully
2022-08-30T16:31:25.711+0300	INFO	rudder-server/main.go:308	Graceful terminal after 2.625274209s, with 154 go-routines

@lvrach (Member Author) commented on Aug 30, 2022:

True, I will make sure "Server closed" errors are not propagated if ctx is cancelled.

lvrach (Member Author):

I have addressed this issue by introducing a new function that takes care of the "http: Server closed" error.

Comment on lines 14 to 23
func GracefulListenAndServe(ctx context.Context, server *http.Server, shutdownTimeout ...time.Duration) error {
	return gracefulFunc(ctx, server, server.ListenAndServe, shutdownTimeout...)
}

func GracefulServe(ctx context.Context, server *http.Server, l net.Listener, shutdownTimeout ...time.Duration) error {
	fn := func() error {
		return server.Serve(l)
	}
	return gracefulFunc(ctx, server, fn, shutdownTimeout...)
}
@lvrach (Member Author) commented on Aug 31, 2022:

The reasons I've introduced those two methods:

  1. We are using graceful shutdown methods in many places.
  2. It is hard and repetitive to test shutdown functionality when it is not in isolation.
  3. It is easy to get the shutdown wrong; some of the implementations in our code were correct, and some were not (funny, I think I have introduced them all :D).

Example of incorrect implementation:

	go func() {
		<-ctx.Done()
		_ = srv.Shutdown(context.Background())
	}()

	if err := srv.ListenAndServe(); err != nil {
		return fmt.Errorf("debug server: %w", err)
	}
	return nil

When Shutdown is called, ListenAndServe returns immediately, always with an error (http.ErrServerClosed).

Two problems:

  1. Shutdown is the call that blocks and the one we should be waiting for.
  2. An error is returned when the context gets cancelled, i.e. during a normal shutdown.

A semi-correct example, which addresses the 1st issue:

	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error {
		<-ctx.Done()
		return srv.Shutdown(context.Background())
	})
	g.Go(func() error {
		return srv.ListenAndServe()
	})

	return g.Wait()

Here we wait for both goroutines to return, but ListenAndServe still returns first, with an error.

The new implementation fixes both issues along with providing tests.
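
For illustration, a minimal sketch of what gracefulFunc could look like given the constraints above (an assumption about the helper's shape, not the actual implementation in utils/httputil/server.go):

	package httputil // package name assumed from the utils/httputil path

	import (
		"context"
		"errors"
		"net/http"
		"time"
	)

	// gracefulFunc runs fn (e.g. server.ListenAndServe) and, once ctx is
	// cancelled, waits on server.Shutdown instead of fn, swallowing the
	// expected http.ErrServerClosed.
	func gracefulFunc(ctx context.Context, server *http.Server, fn func() error, shutdownTimeout ...time.Duration) error {
		errCh := make(chan error, 1)
		go func() {
			errCh <- fn()
		}()

		select {
		case err := <-errCh:
			// The server stopped on its own (e.g. the port was already in use).
			return err
		case <-ctx.Done():
			shutdownCtx := context.Background()
			if len(shutdownTimeout) > 0 {
				var cancel context.CancelFunc
				shutdownCtx, cancel = context.WithTimeout(shutdownCtx, shutdownTimeout[0])
				defer cancel()
			}
			// Shutdown is the call that blocks until in-flight requests finish;
			// this is what we wait for, not ListenAndServe.
			if err := server.Shutdown(shutdownCtx); err != nil {
				return err
			}
			// fn returns http.ErrServerClosed once Shutdown has been called;
			// during a graceful shutdown that is not a real error.
			if err := <-errCh; err != nil && !errors.Is(err, http.ErrServerClosed) {
				return err
			}
			return nil
		}
	}

The key point in the sketch is that, once ctx is cancelled, we wait on Shutdown rather than on ListenAndServe/Serve, and the http.ErrServerClosed they return is treated as a normal part of shutdown rather than an error.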

Collaborator:

Sounds good to me, although I'm not sure we need the Graceful part in the name. What do you think about just having it as ListenAndServe since we don't want an ungraceful one?

lvrach (Member Author):

Yeah, makes sense :)

@@ -153,7 +153,7 @@ func closeDBConnection(handle *sql.DB) {
}

func killDanglingDBConnections(db *sql.DB) {
-	rows, err := db.Query(`SELECT PID, QUERY_START, COALESCE(WAIT_EVENT_TYPE,''), COALESCE(WAIT_EVENT, ''), STATE, QUERY, PG_TERMINATE_BACKEND(PID)
+	rows, err := db.Query(`SELECT PID, QUERY_START, COALESCE(WAIT_EVENT_TYPE,''), COALESCE(WAIT_EVENT, ''), COALESCE(STATE, ''), QUERY, PG_TERMINATE_BACKEND(PID)
lvrach (Member Author):

I got an error in my test suite because STATE was null.
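
For context, an illustrative sketch (not from the PR; the package and function names are made up) of why the COALESCE matters: database/sql cannot scan a NULL column into a plain Go string, so the options are COALESCE(STATE, '') on the SQL side or sql.NullString on the Go side.

	package dbutil // illustrative package name

	import (
		"database/sql"
		"fmt"
	)

	// queryState shows two ways to cope with a nullable STATE column when
	// reading pg_stat_activity-style rows.
	func queryState(db *sql.DB, pid int) error {
		// Approach taken in the PR: normalise NULL to '' on the SQL side.
		var state string
		if err := db.QueryRow(
			`SELECT COALESCE(STATE, '') FROM PG_STAT_ACTIVITY WHERE PID = $1`, pid,
		).Scan(&state); err != nil {
			return err
		}

		// Alternative: tolerate NULL on the Go side. Scanning a NULL directly
		// into a plain string fails with "converting NULL to string is unsupported".
		var rawState sql.NullString
		if err := db.QueryRow(
			`SELECT STATE FROM PG_STAT_ACTIVITY WHERE PID = $1`, pid,
		).Scan(&rawState); err != nil {
			return err
		}

		fmt.Println(state, rawState.String)
		return nil
	}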

@fracasula (Collaborator) left a comment:

Approved with some suggestions 👍

require.NoError(t, err)
defer l.Close()

{
@fracasula (Collaborator) commented on Aug 31, 2022:

[optional] I think this way of using enclosing brackets bothers DeepSource. It's also not very useful in this case, imho.

})

t.Run("timeout if connections are not closed", func(t *testing.T) {
addr := fmt.Sprintf("127.0.0.1:%d", freeport.GetPort())
Collaborator:

We have a test helper to get free ports in tests. It makes sure you don't get the same port twice in case you request multiple ports and bind them later. It's not strictly needed here, but since we have one it might make sense to use it, to ease refactorings etc.

https://github.com/rudderlabs/rudder-server/blob/master/testhelper/freeport.go#L14

app/cluster/state/etcd.go (conversation resolved)
config/backend-config/backend-config.go (conversation resolved)
Comment on lines +130 to +132
+		if ctx.Err() == nil {
+			nc.Logger.Errorf("Error sending request to the server: %v", err)
+		}
Collaborator:

I think this change would be useful in the other backend configs as well.

utils/httputil/server.go (outdated; conversation resolved)
	case <-ctx.Done():
		var (
			shutdownCtx context.Context
			cancel      context.CancelFunc
Collaborator:

[optional] You don't need cancel, nor to invoke it, if a timeout is not specified:

var shutdownCtx context.Context
if shutdownTimeout > 0 {
    var cancel func()
    shutdownCtx, cancel = context.WithTimeout(context.Background(), shutdownTimeout)
    defer cancel()
} else {
    shutdownCtx = context.Background()
}

@lvrach lvrach merged commit 95a24ca into master Sep 1, 2022
@lvrach lvrach deleted the fix.main-logs branch September 1, 2022 06:45