Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: preliminary mechanism to track and limit SQL memory usage. #8691

Merged
merged 3 commits into from
Sep 10, 2016

Conversation

knz
Copy link
Contributor

@knz knz commented Aug 19, 2016

This patch instruments SQL execution to track memory allocations whose
amounts depend on user input and current database contents (as opposed
to allocations dependent on other parameters e.g. statement size).

This tracks and limits allocations in valueNode, buckets in
groupNode, sorted data in sortNode, temp results in joinNode,
seen prefixes in distinctNode and the Result array in Executor.

A moderate amount of intelligence is implemented so as to avoid
computing sizes for every column of every row in a valueNode - the
combined size of all fixed-size columns is precomputed and counted at
once for every row.

The maximum amount of memory allocated by SQL values is limited in
this patch to 1GB. An error is reported to the client if this limit is
exceeded. Additionally, queries that allocate more than 10KB worth of
data will see their usage printed in the log files automatically.

This patch does not track the memory allocated for write intents in
the KV transaction object.


This change is Reviewable

@knz
Copy link
Contributor Author

knz commented Aug 19, 2016

This PR does not yet implement a flexible policy as documented by #7657; this is because I'd like to avoid making this PR larger than it already is. I will address the remaining points (policy and tracking write intents) after we reach consensus on the approach here.

@knz knz force-pushed the sized-rows branch 2 times, most recently from 3968ccd to ae99758 Compare August 22, 2016 20:00
@knz knz force-pushed the sized-rows branch 3 times, most recently from 90d99ed to c89bbf0 Compare August 23, 2016 03:54
@andreimatei
Copy link
Contributor

Such modernity much accountability https://www.youtube.com/watch?v=n4RjJKxsamQ

One thought about what we were talking about - how to get the statement that failed an allocation back to the user: how about you have the Executor catch the "no memory" error and prepend the statement to it? Before fancier tracing comes.

Another incoherent thought I had is to maybe think how all this fits together with the reservation mechanism Bram has done for receiving snapshots.

<3


Review status: 0 of 44 files reviewed at latest revision, 29 unresolved discussions, some commit checks failed.


cli/start.go, line 109 [r1] (raw file):

}

func initDependentParameters(ctx *server.Context) {

nit: this change seems superfluous to me


server/admin.go, line 178 [r1] (raw file):

  var resp serverpb.DatabasesResponse
  nRows := r.ResultList[0].Rows.Len()

thanks, Golang


sql/distinct.go, line 174 [r1] (raw file):

          }
          n.mon.ReleaseMemory(int64(len(n.prefixSeen)))
          if err := n.mon.ReserveMemory(int64(len(prefix))); err != nil {

so prefixes are not tallied in allocated? Then the semantics of that field are pretty ad-hoc, I'd say. See if you like the suggestion in https://reviewable.io/reviews/cockroachdb/cockroach/8691#-KPtA9wVusepXvc-Pe86


sql/executor.go, line 134 [r1] (raw file):

type ResultColumns []ResultColumn

// NumColumns implements the parser.ColumnHeader interface.

nit: assign a dummy ResultColumns to that interface to assert that it implements it


sql/plan.go, line 176 [r1] (raw file):

  // Close terminates the planNode execution and suggests its
  // resources can be released.

what's this "suggests" phrasing? :)


sql/planner.go, line 228 [r1] (raw file):

  }
  defer func() {
      plan.Close()

I'm confused. How does that this work with the Executor also doing the same?


sql/trace.go, line 66 [r1] (raw file):

      },
      sort: &sortNode{
          // Don't use the planner context for logging: this sort node is

you're adding a distinction between using a context for logging and using it for other stuff, but we're still just passing a single context to the sortNode, so I think the distinction is not realized?


sql/mon/mem_usage.go, line 34 [r1] (raw file):

// MemoryUsageMonitor defines and object that can track and limit
// memory usage by row data during the execution of SQL queries.
type MemoryUsageMonitor struct {

I think this struct deserves a comment about how it's intended to be used. There's gonna be one of these per query? It's not thread safe (that might be a problem for some of the distsql nodes)? Can they be reused after Stop()?


sql/mon/mem_usage.go, line 34 [r1] (raw file):

// MemoryUsageMonitor defines and object that can track and limit
// memory usage by row data during the execution of SQL queries.
type MemoryUsageMonitor struct {

May I suggest a slightly different API for this guy? I think Reserve/Release are too low level and it's unfortunate that, because of this, the client needs to also keep track of what has been allocated.
How about either we make Reserve return a token that's passed back to Release, or we have a way for a client to open a "span" and to Reserves within that span (possibly multiple times within the same span), and we replace Release with an operation that destroys a span releases everything that had been allocated in that context. E.g. DistinctNode would create a span for each "prefix" and keep allocating "suffixes" in it. To make it efficient, an allocation count could be maintained in the span object (and in parallel the total would be maintained in the Monitor.)


sql/mon/mem_usage.go, line 39 [r1] (raw file):

  curAllocated int64

  // totalAllocated tracks the cumulated amount of memory allocated.

say that all these are in bytes, or even add it to the names


sql/mon/mem_usage.go, line 50 [r1] (raw file):

  // ctx carries the logging context.
  ctx context.Context

ummm so we should generally try to not store contexts in structs. Instead, operations that need one (or, ideally, pretty much every single function) should take a context from the caller, representing the operation that the caller is in the process of doing at that moment.
Exceptions are structs that do stuff in the background, which are initialized with a "base context" that they then refine further.


sql/mon/mem_usage.go, line 60 [r1] (raw file):

  // query can become.
  if mm.curAllocated > mm.maxAllocatedBudget-x {
      return fmt.Errorf("memory budget exceeded: %d requested, %s already allocated",

errors.Errorf is the new hottness
also nit: it's TODO(knz): above and everywhere


sql/mon/mem_usage.go, line 88 [r1] (raw file):

func (mm *MemoryUsageMonitor) ReleaseMemory(x int64) {
  if mm.curAllocated < x {
      log.Errorf(mm.ctx, "no memory to release, current %d, free %d", mm.curAllocated, x)

log.Fatal with debug.PrintStack() ?
or just panic, I don't think the context in the log message would be particularly useful.


sql/mon/mem_usage.go, line 108 [r1] (raw file):

  if log.V(1) && mm.curAllocated != 0 {
      log.Infof(mm.ctx, "starts with %d bytes left over (%p)", mm.curAllocated, mm)

panic? Say that Stop wasn't called (cause then that would've panicked)


sql/mon/mem_usage.go, line 131 [r1] (raw file):

var noteworthySQLQuerySize = envutil.EnvOrDefaultInt64("COCKROACH_NOTEWORTHY_SQL_QUERY_SIZE", 10000)

func getSmallTrace() string {

please document.

Is this really needed? Can you use debug.PrintStack() and friends instead?
Or do we even want this in the logs? Maybe we can use the span from a context instead.


sql/mon/mem_usage_test.go, line 30 [r1] (raw file):

  defer leaktest.AfterTest(t)()

  log.SetGlobalVerbosity(2)

I don't see why this is needed. The test doesn't seem to look at the logs? (and if it did, maybe we can find another option :) )
At the very least add a comment, and restore the old verbosity in a defer.


sql/parser/row_container.go, line 17 [r1] (raw file):

// Author: Raphael 'kena' Poss ([email protected])

package parser

does this need to be in parser? Seems to me like it's not related to... parsing.


sql/parser/row_container.go, line 28 [r1] (raw file):

// initialize a RowContainer. This is really sql.ResultColumns,
// defined here via an interface to avoid a circular dependency.
type ColumnHeader interface {

how about ColumnList ?


sql/parser/row_container.go, line 33 [r1] (raw file):

}

// RowContainer is a managed container for rows of DTuples.

I think the word "managed" doesn't mean anything, and generally it'd be good if this comment was expanded a little since I guess this type is gonna become pretty important.


sql/parser/row_container.go, line 34 [r1] (raw file):

// RowContainer is a managed container for rows of DTuples.
type RowContainer struct {

I think there's a structure that at least superficially serves the same purpose of storing a bunch of nodes, maybe with some preallocations, in distsql. Maybe these can be unified? At least make sure you know about that guy.


sql/parser/row_container.go, line 46 [r1] (raw file):

  // allocated is the current number of bytes allocated in this
  // RowContainer.
  allocated int64

put the unit in the name? allocatedBytes


sql/parser/row_container.go, line 50 [r1] (raw file):

// NewRowContainer allocates a new row container.
func NewRowContainer(mon *mon.MemoryUsageMonitor, h ColumnHeader, capacity int) *RowContainer {

please document the capacity param, and how it doesn't affect the MemoryMonitor requirements.


sql/parser/row_container.go, line 68 [r1] (raw file):

      }
  }
  res.fixedRowSize += int64(unsafe.Sizeof(make(DTuple, 0, nCols)))

ha? Despite the name, this doesn't take into account variable vs non-variable cols, or actually the type of the columns at all.
Maybe it's just the comment on this member that needs to change.


sql/parser/row_container.go, line 74 [r1] (raw file):

// Close releases the memory associated with the RowContainer.
func (c *RowContainer) Close() {

are RowContainers intended to be reused? In any case, please document that Close needs to be called on one in the struct doc.


sql/parser/row_container.go, line 76 [r1] (raw file):

func (c *RowContainer) Close() {
  if c == nil {
      return

This seems weird to me. Is it particularly convenient to support nil containers?


sql/parser/row_container.go, line 98 [r1] (raw file):

func (c *RowContainer) AddRow(row DTuple) error {
  rowSize := c.rowSize(row)
  if err := c.mon.ReserveMemory(rowSize); err != nil {

maybe we can interact with the mon less often by having the Container reserve an amount of memory, and the deduct from that until it's exhausted and then reserve some more. That would also make it more correct, since not you're reserving a capacity slice without accounting for it until it gets filled.


sql/parser/row_container.go, line 125 [r1] (raw file):

// sql.sortNode.  A pointer is returned to avoid an allocation when
// passing the DTuple as an interface{} to heap.Push().
func (c *RowContainer) PseudoPop() *DTuple {

but you can't AddRow while someone's holding a reference to a PseudoPopped row, right? Add a comment somewhere. I guess ResetLen fixes this.
Also, how does returning a pointer help with the allocation?


sql/parser/row_container.go, line 138 [r1] (raw file):

}

// Replace substitute one row for another. This does enquire to the

"inquire". We're not traveling on dual carriageways over here.


sql/pgwire/v3.go, line 712 [r1] (raw file):

}

func (c *v3Conn) sendResponse(results sql.ResultList, formatCodes []formatCode, sendDescription bool, limit int) error {

comment that this destroys the results


Comments from Reviewable

@RaduBerinde
Copy link
Member

Nice stuff! I still have to go through most of the change in detail but I like the core idea of the row container.

We should think in very general terms about how we will be able to use the same infrastructure for distsql. One issue is that the main "data unit" there is not the Datum but the EncDatum. Really the problem is that we don't have an efficient way to keep a lot of small datums (EncDatum is good for large variable-sized types but it's way too much overhead for a lot of small fixed-width values. Very hard to come up with something good without some kind of union type though).


Review status: 0 of 44 files reviewed at latest revision, 32 unresolved discussions, some commit checks failed.


sql/parser/row_container.go, line 28 [r1] (raw file):

// initialize a RowContainer. This is really sql.ResultColumns,
// defined here via an interface to avoid a circular dependency.
type ColumnHeader interface {

Do we think we will have multiple implementations for this interface?


sql/parser/row_container.go, line 39 [r1] (raw file):

  // fixedRowSize is the sum of widths of fixed-width columns.
  fixedRowSize int64

it's the columns that are fixed not the row, maybe fixedColsSize


sql/parser/row_container.go, line 50 [r1] (raw file):

// NewRowContainer allocates a new row container.
func NewRowContainer(mon *mon.MemoryUsageMonitor, h ColumnHeader, capacity int) *RowContainer {

capacity is not very telling, maybe numPreallocRows


sql/parser/row_container.go, line 98 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

maybe we can interact with the mon less often by having the Container reserve an amount of memory, and the deduct from that until it's exhausted and then reserve some more. That would also make it more correct, since not you're reserving a capacity slice without accounting for it until it gets filled.

I like this idea

Comments from Reviewable

@knz
Copy link
Contributor Author

knz commented Aug 25, 2016

I'll talk with Bram for reservations and with you both for various optimizations. In the meantime PTAL.


Review status: 0 of 42 files reviewed at latest revision, 32 unresolved discussions, some commit checks pending.


cli/start.go, line 109 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: this change seems superfluous to me

Indeed.

server/admin.go, line 178 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

thanks, Golang

I just made this slightly less ugly, PTAL.

sql/distinct.go, line 174 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

so prefixes are not tallied in allocated? Then the semantics of that field are pretty ad-hoc, I'd say. See if you like the suggestion in https://reviewable.io/reviews/cockroachdb/cockroach/8691#-KPtA9wVusepXvc-Pe86

You only need to keep track of the combined allocated size of things that are grouped together in a container (e.g. the suffixes here) so that you don't need to call Release once per item in the container. Since the prefix is not in a collection it's not necessary. I'll answer your suggestion separately in another comment below.

sql/executor.go, line 134 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: assign a dummy ResultColumns to that interface to assert that it implements it

Done.

sql/plan.go, line 176 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what's this "suggests" phrasing? :)

Clarified.

sql/planner.go, line 228 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm confused. How does that this work with the Executor also doing the same?

These are different uses. `planner.query()/queryRow()` is used by internal things like `admin.go`; whereas the logic in execStmt is for external clients.

sql/trace.go, line 66 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

you're adding a distinction between using a context for logging and using it for other stuff, but we're still just passing a single context to the sortNode, so I think the distinction is not realized?

I do not understand what you are saying. Please clarify.

sql/mon/mem_usage.go, line 34 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think this struct deserves a comment about how it's intended to be used. There's gonna be one of these per query? It's not thread safe (that might be a problem for some of the distsql nodes)? Can they be reused after Stop()?

I clarified this some more. This is intended to be used from a single goroutine indeed so there is no thread safety needed. More intelligence will be added in the subsequent PR to coordinate multiple instances of this without having to synchronize at every allocation.

sql/mon/mem_usage.go, line 34 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

May I suggest a slightly different API for this guy? I think Reserve/Release are too low level and it's unfortunate that, because of this, the client needs to also keep track of what has been allocated.
How about either we make Reserve return a token that's passed back to Release, or we have a way for a client to open a "span" and to Reserves within that span (possibly multiple times within the same span), and we replace Release with an operation that destroys a span releases everything that had been allocated in that context. E.g. DistinctNode would create a span for each "prefix" and keep allocating "suffixes" in it. To make it efficient, an allocation count could be maintained in the span object (and in parallel the total would be maintained in the Monitor.)

I do not see the advantage. Memory-wise it will amount to the same or worse, because each span struct will need at least one counter. The amount of work will be the same. I understand your idea, but perhaps you could detail what the advantage would be?

sql/mon/mem_usage.go, line 39 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

say that all these are in bytes, or even add it to the names

I have added a comment to that effect above.

sql/mon/mem_usage.go, line 50 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

ummm so we should generally try to not store contexts in structs. Instead, operations that need one (or, ideally, pretty much every single function) should take a context from the caller, representing the operation that the caller is in the process of doing at that moment.
Exceptions are structs that do stuff in the background, which are initialized with a "base context" that they then refine further.

Since MemoryUsageMonitor currently carries state over from one statement to the next, and I'd need the context from a previous statement at the start of a new one to explain where this state came from (why there was memory left over), I really need to keep track of the old context here. The proper solution would be to have a context with a span for the entire transaction; then a method on the Context type to change the current tag (used in executor to track each subsequent statement in the transaction). Until these APIs exist I think this implementation here is the cheapest that can buy us proper context tracking.

I have added a TODO for you to look into this later.


sql/mon/mem_usage.go, line 60 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

errors.Errorf is the new hottness
also nit: it's TODO(knz): above and everywhere

Done.

sql/mon/mem_usage.go, line 88 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

log.Fatal with debug.PrintStack() ?
or just panic, I don't think the context in the log message would be particularly useful.

Again I do not understand what you are saying. Please clarify which code would be better. The context here carries the current SQL statement as tag and the current code enables the user to see which statement caused the panic. What you suggest does not do this.

sql/mon/mem_usage.go, line 108 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

panic? Say that Stop wasn't called (cause then that would've panicked)

No here this is correct behavior. There can be memory left over between statements because of e.g. write intents.

sql/mon/mem_usage.go, line 131 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please document.

Is this really needed? Can you use debug.PrintStack() and friends instead?
Or do we even want this in the logs? Maybe we can use the span from a context instead.

I have added a comment. Until there are different contexts with stacked tags in every function (an idea which I incidentally currently dislike with extreme prejudice) your proposal would reduce the amount of useful information output to the logs to investigate memory allocations.

sql/mon/mem_usage_test.go, line 30 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't see why this is needed. The test doesn't seem to look at the logs? (and if it did, maybe we can find another option :) )
At the very least add a comment, and restore the old verbosity in a defer.

I have added a comment to clarify. Why reset the verbosity? It's reset anyway between tests since there's a new server, isn't it?

sql/parser/row_container.go, line 17 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

does this need to be in parser? Seems to me like it's not related to... parsing.

As I have argued before the current name "parser" for this package is a misnomer and it should really be called "language" or "lang" or something.

Nevertheless I moved to sql which simplifies.


sql/parser/row_container.go, line 28 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

how about ColumnList ?

I killed this type entirely.

sql/parser/row_container.go, line 28 [r1] (raw file):

Previously, RaduBerinde wrote…

Do we think we will have multiple implementations for this interface?

I killed this type entirely.

sql/parser/row_container.go, line 33 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think the word "managed" doesn't mean anything, and generally it'd be good if this comment was expanded a little since I guess this type is gonna become pretty important.

Done.

sql/parser/row_container.go, line 34 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think there's a structure that at least superficially serves the same purpose of storing a bunch of nodes, maybe with some preallocations, in distsql. Maybe these can be unified? At least make sure you know about that guy.

Go doesn't really like to unify structs that work in slightly different ways. If Go(d) wanted us to do that, Go(d) would give us generics. ;)

But thanks for the hint.


sql/parser/row_container.go, line 39 [r1] (raw file):

Previously, RaduBerinde wrote…

it's the columns that are fixed not the row, maybe fixedColsSize

Done.

sql/parser/row_container.go, line 46 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

put the unit in the name? allocatedBytes

Done.

sql/parser/row_container.go, line 50 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please document the capacity param, and how it doesn't affect the MemoryMonitor requirements.

Done.

sql/parser/row_container.go, line 50 [r1] (raw file):

Previously, RaduBerinde wrote…

capacity is not very telling, maybe numPreallocRows

Done.

sql/parser/row_container.go, line 68 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

ha? Despite the name, this doesn't take into account variable vs non-variable cols, or actually the type of the columns at all.
Maybe it's just the comment on this member that needs to change.

It does, see the loop above.

sql/parser/row_container.go, line 74 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

are RowContainers intended to be reused? In any case, please document that Close needs to be called on one in the struct doc.

I do not see how they can be reused -- our current planNode hierarchy nowhere supports reuse.

sql/parser/row_container.go, line 76 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This seems weird to me. Is it particularly convenient to support nil containers?

Due to the way the SQL code does optimizations the same valueNode can be referenced from different branches of the same planNode hierarchy which would result in Close() being called multiple times. Since the object that contains a RowContainer reference also sets its pointer to nil (so it can be GCed), the test here is necessary.

sql/parser/row_container.go, line 98 [r1] (raw file):

Previously, RaduBerinde wrote…

I like this idea

I like your idea too but that's not the right place to implement it. The current mon object is exactly the entity responsible to perform the optimization you suggest; another object implementing the policy (and which will be shared across threads and thus require synchronization) will then provision each monitor with chunks like you suggest.

sql/parser/row_container.go, line 125 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

but you can't AddRow while someone's holding a reference to a PseudoPopped row, right? Add a comment somewhere. I guess ResetLen fixes this.
Also, how does returning a pointer help with the allocation?

This is exactly the code that was used in `sortTopKStrategy` and `values.go` previously (cc @nathanvanbenschoten). The explanation can be found in 13960c2.

sql/parser/row_container.go, line 138 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

"inquire". We're not traveling on dual carriageways over here.

See http://onelook.com/?w=enquire - both spelling are correct. I don't know where you found your definition.

sql/pgwire/v3.go, line 712 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

comment that this destroys the results

Done.

Comments from Reviewable

@knz
Copy link
Contributor Author

knz commented Aug 25, 2016

TFYR btw ❤️

@andreimatei
Copy link
Contributor

:lgtm: but I'm trying again with suggestions about changing the API of the monitor to something that's makes more explicit the lifetime of each allocation.
https://reviewable.io/reviews/cockroachdb/cockroach/8691#-KPtQvz6aW3pq220mLDb


Review status: 0 of 42 files reviewed at latest revision, 18 unresolved discussions, some commit checks failed.


server/admin.go, line 178 [r1] (raw file):

Previously, knz (kena) wrote…

I just made this slightly less ugly, PTAL.

I don't know if it's less ugly :) whatever

sql/distinct.go, line 174 [r1] (raw file):

Previously, knz (kena) wrote…

You only need to keep track of the combined allocated size of things that are grouped together in a container (e.g. the suffixes here) so that you don't need to call Release once per item in the container. Since the prefix is not in a collection it's not necessary. I'll answer your suggestion separately in another comment below.

the conversation overlaps with the other thread, but specifically here I was complaining that the field called `allocated` that not represent everything that has been allocated by this struct.

Also, is the last prefix allocated ever released?


sql/plan.go, line 176 [r1] (raw file):

Previously, knz (kena) wrote…

Clarified.

but still, can't you just say "releases resources"? I understand GC latency, but that's hidden away.

sql/planner.go, line 228 [r1] (raw file):

Previously, knz (kena) wrote…

These are different uses. planner.query()/queryRow() is used by internal things like admin.go; whereas the logic in execStmt is for external clients.

bleah

sql/row_container.go, line 68 [r1] (raw file):

Previously, knz (kena) wrote…

It does, see the loop above.

ok

sql/row_container.go, line 76 [r1] (raw file):

Previously, knz (kena) wrote…

Due to the way the SQL code does optimizations the same valueNode can be referenced from different branches of the same planNode hierarchy which would result in Close() being called multiple times. Since the object that contains a RowContainer reference also sets its pointer to nil (so it can be GCed), the test here is necessary.

so can't that one funky user (the `ValueNode`) do the `nil` check before calling this?

sql/row_container.go, line 125 [r1] (raw file):

Previously, knz (kena) wrote…

This is exactly the code that was used in sortTopKStrategy and values.go previously (cc @nathanvanbenschoten). The explanation can be found in 13960c2.

ok

sql/row_container.go, line 138 [r1] (raw file):

Previously, knz (kena) wrote…

See http://onelook.com/?w=enquire - both spelling are correct. I don't know where you found your definition.

http://lmgtfy.com/?q=enquire :)

sql/trace.go, line 66 [r1] (raw file):

Previously, knz (kena) wrote…

I do not understand what you are saying. Please clarify.

So the `sortNode` wants a context. You're saying that the `sortNode`shouldn't use the planner's ctx _for logging_. Should it then use it for something else? Cause if so, we should be passing it. But we're not passing it. So the `sortNode` probably shouldn't use the planner's context for anything. So then your clarification is technically incorrect. At least that's my reasoning seeing these lines of code.

sql/mon/mem_usage.go, line 34 [r1] (raw file):

Previously, knz (kena) wrote…

I do not see the advantage. Memory-wise it will amount to the same or worse, because each span struct will need at least one counter. The amount of work will be the same. I understand your idea, but perhaps you could detail what the advantage would be?

yeah, each span would need a counter, but otherwise each client of the monitor needs a counter. That'd be the advantage - clients no longer need to remember what they've allocated. They just need to remember the spans in which one or more allocations happened. This would also make these allocations contexts more explicit (think RAII) - for example, in the `Distinct` node, there'd be two spans at any one point - one span for the many suffix, and one for the (single) current prefix. As opposed to what you have now - one field called `allocated` and another allocation that's not explicitly tracked.

sql/mon/mem_usage.go, line 50 [r1] (raw file):

Previously, knz (kena) wrote…

Since MemoryUsageMonitor currently carries state over from one statement to the next, and I'd need the context from a previous statement at the start of a new one to explain where this state came from (why there was memory left over), I really need to keep track of the old context here. The proper solution would be to have a context with a span for the entire transaction; then a method on the Context type to change the current tag (used in executor to track each subsequent statement in the transaction). Until these APIs exist I think this implementation here is the cheapest that can buy us proper context tracking.

I have added a TODO for you to look into this later.

how about making `StartMonitor` take a statement, and then your monitor can keep track of whatever it wants without holding on to a context?

sql/mon/mem_usage.go, line 88 [r1] (raw file):

Previously, knz (kena) wrote…

Again I do not understand what you are saying. Please clarify which code would be better. The context here carries the current SQL statement as tag and the current code enables the user to see which statement caused the panic. What you suggest does not do this.

well SQL statements are not tags, at least not currently, so I don't think the context in the message would help much. But I'm not opposed to generally using contexts too when we die. But the main point was that I don't think `log.Errorf(); panic()` is ever the right thing to do. I think we want either `panic()` or `log.Fatalf()` (which can include a stack string if you want).

sql/mon/mem_usage.go, line 108 [r1] (raw file):

Previously, knz (kena) wrote…

No here this is correct behavior. There can be memory left over between statements because of e.g. write intents.

hmm then is `StartMonitor` the right name for this method? How about `StartStatementTracking()`? But it kinda tickles me that this `Monitor` needs to understand about statements and write intents. Shouldn't it be more oblivious/general? If we move to an API based on spans (or maybe a better name would be `allocationContexts`), maybe the executor could set up such an `allocationContext` for a whole txn, and pass that down to whatever deep-ass layer is gonna actually count write intents (btw that's not done yet, is it?), besides passing the `Monitor` to the planner? This way you'd make the lifetimes of the contexts explicit - one is per SQL txn, others are per statement.

sql/mon/mem_usage_test.go, line 30 [r1] (raw file):

Previously, knz (kena) wrote…

I have added a comment to clarify. Why reset the verbosity? It's reset anyway between tests since there's a new server, isn't it?

ummm I call "we can do better" on this one. Test your stack function explicitly. You're changing a global here (it has nothing to do with a Server, it's per-process), that's gonna be a problem if we move to running tests in parallel. And it needs resetting so that other (sequential) tests running in the same process don't inherit it (it's not one process per test).

Maybe we generally should have a big test that exercises a cluster under V(100), but it should be a general thing, not related to this monitor.


Comments from Reviewable

@knz knz force-pushed the sized-rows branch 3 times, most recently from cac75ae to e1e9fe5 Compare August 27, 2016 21:43
@knz
Copy link
Contributor Author

knz commented Aug 27, 2016

PTAL.

So thanks to Andrei's previous remarks I saw the light and refactored the entire thing to use span objects as interface between client components and the monitor. This makes the lifecycle of allocated objects more explicit.

Also I had to think more about how state is preserved in the monitor. I came to the conclusion/realization that a monitor is session-bound, not statement-bound. So I changed this too; the monitor is started and stopped with the lifecycle of the entire session.

Once this was done I could then successfully instrument the pgwire code to get prepared statements and portals also tracked by the monitor. Please check I didn't screw up anything there.

Then finally, before this goes in I'd like another round of advice as to how to best gather and log the allocation history in a session. To debug this code I have made a first attempt (currently commented out in the code with a "FIXME: andrei?") which logs all events to a bytes.Buffer and prints the contents of that buffer to the log at the end. What I would like instead is something like the SQL tracing: have a command-line flag or env var that enables the gathering of the allocation trace. However I don't think I understand what I should use here. Should I use a Context? a trace.Trace? Let's have a chat soon so I can implement the best thing.


Review status: 0 of 42 files reviewed at latest revision, 12 unresolved discussions, some commit checks failed.


sql/distinct.go, line 174 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

the conversation overlaps with the other thread, but specifically here I was complaining that the field called allocated that not represent everything that has been allocated by this struct.

Also, is the last prefix allocated ever released?

Done.

sql/plan.go, line 176 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

but still, can't you just say "releases resources"? I understand GC latency, but that's hidden away.

I don't understand your comment. What would be the better explanation?

sql/row_container.go, line 76 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

so can't that one funky user (the ValueNode) do the nil check before calling this?

I checked and there are more than one place where this is used. I don't mind having the check one level higher, but does it make a difference really?

sql/trace.go, line 66 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

So the sortNode wants a context. You're saying that the sortNodeshouldn't use the planner's ctx for logging. Should it then use it for something else? Cause if so, we should be passing it. But we're not passing it. So the sortNode probably shouldn't use the planner's context for anything. So then your clarification is technically incorrect.
At least that's my reasoning seeing these lines of code.

I see. I have rewritten the comment entirely to clarify. Let me know if that works for you.

sql/mon/mem_usage.go, line 34 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

yeah, each span would need a counter, but otherwise each client of the monitor needs a counter. That'd be the advantage - clients no longer need to remember what they've allocated. They just need to remember the spans in which one or more allocations happened. This would also make these allocations contexts more explicit (think RAII) - for example, in the Distinct node, there'd be two spans at any one point - one span for the many suffix, and one for the (single) current prefix. As opposed to what you have now - one field called allocated and another allocation that's not explicitly tracked.

Done.

sql/mon/mem_usage.go, line 50 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

how about making StartMonitor take a statement, and then your monitor can keep track of whatever it wants without holding on to a context?

I did it a different way. See comment at start of review reaction.

sql/mon/mem_usage.go, line 88 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

well SQL statements are not tags, at least not currently, so I don't think the context in the message would help much. But I'm not opposed to generally using contexts too when we die.
But the main point was that I don't think log.Errorf(); panic() is ever the right thing to do. I think we want either panic() or log.Fatalf() (which can include a stack string if you want).

Done.

sql/mon/mem_usage.go, line 108 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

hmm then is StartMonitor the right name for this method? How about StartStatementTracking()?
But it kinda tickles me that this Monitor needs to understand about statements and write intents. Shouldn't it be more oblivious/general? If we move to an API based on spans (or maybe a better name would be allocationContexts), maybe the executor could set up such an allocationContext for a whole txn, and pass that down to whatever deep-ass layer is gonna actually count write intents (btw that's not done yet, is it?), besides passing the Monitor to the planner? This way you'd make the lifetimes of the contexts explicit - one is per SQL txn, others are per statement.

Done.

sql/mon/mem_usage_test.go, line 30 [r1] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

ummm I call "we can do better" on this one. Test your stack function explicitly.
You're changing a global here (it has nothing to do with a Server, it's per-process), that's gonna be a problem if we move to running tests in parallel. And it needs resetting so that other (sequential) tests running in the same process don't inherit it (it's not one process per test).

Maybe we generally should have a big test that exercises a cluster under V(100), but it should be a general thing, not related to this monitor.

Done.

Comments from Reviewable

@knz knz force-pushed the sized-rows branch 6 times, most recently from f308734 to 7123910 Compare August 28, 2016 16:53
package sql

import "github.com/cockroachdb/cockroach/sql/mon"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreimatei the comment below is for you. I think I got the best I could as long as #7775 is not fixed.

@RaduBerinde
Copy link
Member

Review status: 25 of 51 files reviewed at latest revision, 25 unresolved discussions, some commit checks pending.


sql/mon/account.go, line 91 [r6] (raw file):

  ctx context.Context, acc *MemoryAccount, oldSize, newSize int64,
) error {
  if err := mm.reserveMemory(ctx, newSize); err != nil {

We should panic if we don't have oldSize bytes allocated already.


Comments from Reviewable

@knz
Copy link
Contributor Author

knz commented Sep 6, 2016

Review status: 25 of 51 files reviewed at latest revision, 25 unresolved discussions, some commit checks pending.


sql/mon/account.go, line 36 [r5] (raw file):

Previously, RaduBerinde wrote…

If GrowAccount fails, shouldn't we Close the account? (might not matter now, will matter for tracking). That would be the main benefit of this wrapper IMO.

Also [nit] would use something more suggestive than init, maybe initialReservation or initialAllocation

I changed the argument name.

Whomever will address #9122 will discover that you cannot require a Close() for a zero-valued account without a major refactoring of the entire planNode hierarchy (would need explicit Close() calls in each planNode constructor for children nodes upon errors in the constructor). I fear that the motivation for the OpenAccount API is thereby compromised and that registration is likely to be implemented in GrowAccount instead by means of if acc, ok := ... {} else { register(...) }.


sql/mon/account.go, line 91 [r6] (raw file):

Previously, RaduBerinde wrote…

I don't understand why we do this instead of growing by newSize-oldSize?

Because this reflects the actual memory usage: the new stuff is allocated in memory before the reference to the old stuff is overwritten, so they have to fit side by side for the allocation to be accepted.

sql/mon/account.go, line 91 [r6] (raw file):

Previously, RaduBerinde wrote…

We should panic if we don't have oldSize bytes allocated already.

releaseMemory takes care of this already,

Comments from Reviewable

@RaduBerinde
Copy link
Member

Review status: 25 of 51 files reviewed at latest revision, 25 unresolved discussions, some commit checks pending.


sql/mon/account.go, line 91 [r6] (raw file):

Previously, knz (kena) wrote…

Because this reflects the actual memory usage: the new stuff is allocated in memory before the reference to the old stuff is overwritten, so they have to fit side by side for the allocation to be accepted.

If we want to reflect that, this is a half measure.. Another caller could get that `newSize-oldSize` memory before this caller actually gets a chance to copy stuff, so we don't really guarantee anything.

If we really wanted to account for that transient usage, it would have to be a pair of calls, one to grow before the realloc/copy, one to shrink after.


sql/mon/account.go, line 91 [r6] (raw file):

Previously, knz (kena) wrote…

releaseMemory takes care of this already,

No, because we first reserve `newSize` which presumably is bigger than `oldSize`..

Comments from Reviewable

@RaduBerinde
Copy link
Member

Review status: 25 of 51 files reviewed at latest revision, 25 unresolved discussions, some commit checks pending.


sql/mon/account.go, line 36 [r5] (raw file):

Previously, knz (kena) wrote…

I changed the argument name.

Whomever will address #9122 will discover that you cannot require a Close() for a zero-valued account without a major refactoring of the entire planNode hierarchy (would need explicit Close() calls in each planNode constructor for children nodes upon errors in the constructor). I fear that the motivation for the OpenAccount API is thereby compromised and that registration is likely to be implemented in GrowAccount instead by means of if acc, ok := ... {} else { register(...) }.

Ok, I see.

I still think OpenAccount is useful, as we probably want to associate each account with the caller (file:line) that opens the account, not whoever happens to do the first Grow. So we would store the caller in the account (or in some external datastructure, e.g. map[Account*]callerInfo), even if we do the registration at the first Grow.


Comments from Reviewable

@knz
Copy link
Contributor Author

knz commented Sep 6, 2016

Review status: 25 of 51 files reviewed at latest revision, 25 unresolved discussions, some commit checks pending.


sql/mon/account.go, line 36 [r5] (raw file):

Previously, RaduBerinde wrote…

Ok, I see.

I still think OpenAccount is useful, as we probably want to associate each account with the caller (file:line) that opens the account, not whoever happens to do the first Grow. So we would store the caller in the account (or in some external datastructure, e.g. map[Account*]callerInfo), even if we do the registration at the first Grow.

Yep that seems reasonable.

sql/mon/account.go, line 91 [r6] (raw file):

Previously, RaduBerinde wrote…

If we want to reflect that, this is a half measure.. Another caller could get that newSize-oldSize memory before this caller actually gets a chance to copy stuff, so we don't really guarantee anything.

If we really wanted to account for that transient usage, it would have to be a pair of calls, one to grow before the realloc/copy, one to shrink after.

Why would I want that? At the point where this function is called, both objects already exist in memory. The API serves to register a situation that already exists. Remember this is not an allocation API but rather a declaration API.

sql/mon/account.go, line 91 [r6] (raw file):

Previously, RaduBerinde wrote…

No, because we first reserve newSize which presumably is bigger than oldSize..

I'm not sure what you are trying to achieve here. If newSize is larger that means that a later call to releaseMemory will fail instead of this one. With the allocation trace it will be trivial to figure what went wrong. Why add a debug conditional on the critical path when there is already enough information collected in other places to troubleshoot in case of error? (Note: these checks are there to detect programming errors in tests, should never be encountered in production if testing is thorough)

Comments from Reviewable

@RaduBerinde
Copy link
Member

Review status: 25 of 51 files reviewed at latest revision, 25 unresolved discussions, some commit checks pending.


sql/mon/account.go, line 91 [r6] (raw file):

Previously, knz (kena) wrote…

I'm not sure what you are trying to achieve here.
If newSize is larger that means that a later call to releaseMemory will fail instead of this one.
With the allocation trace it will be trivial to figure what went wrong.
Why add a debug conditional on the critical path when there is already enough information collected in other places to troubleshoot in case of error?
(Note: these checks are there to detect programming errors in tests, should never be encountered in production if testing is thorough)

Well the function assumes that `oldSize` is already allocated. I figured it's nice to verify the assumptions, but it's not a big deal.

sql/mon/account.go, line 91 [r6] (raw file):

Previously, knz (kena) wrote…

Why would I want that? At the point where this function is called, both objects already exist in memory. The API serves to register a situation that already exists. Remember this is not an allocation API but rather a declaration API.

Then I really don't see the point of first reserving `newSize`. If we already allocated the memory, and it's ok to use `newSize-oldSize` more bytes, what's the point of failing the operation (and forcing the caller to free the new object) rather than allowing it to proceed (and deallocate the old object)?

Comments from Reviewable

@knz
Copy link
Contributor Author

knz commented Sep 6, 2016

Review status: 25 of 51 files reviewed at latest revision, 25 unresolved discussions, all commit checks successful.


sql/mon/account.go, line 91 [r6] (raw file):

Previously, RaduBerinde wrote…

Then I really don't see the point of first reserving newSize. If we already allocated the memory, and it's ok to use newSize-oldSize more bytes, what's the point of failing the operation (and forcing the caller to free the new object) rather than allowing it to proceed (and deallocate the old object)?

Ok now I understand what you mean. I changed the code accordingly.

sql/mon/account.go, line 91 [r6] (raw file):

Previously, RaduBerinde wrote…

Well the function assumes that oldSize is already allocated. I figured it's nice to verify the assumptions, but it's not a big deal.

Done.

Comments from Reviewable

@RaduBerinde
Copy link
Member

Review status: 25 of 51 files reviewed at latest revision, 23 unresolved discussions, some commit checks failed.


sql/mon/account.go, line 102 [r7] (raw file):

      }
  case delta < 0:
      if acc.curAllocated < -delta {

We don't need this here, it's exactly what releaseMemory is checking as well.

(I was suggesting something else - checking that curAllocated >= oldSize in all cases, but I don't feel strongly)


Comments from Reviewable

@RaduBerinde
Copy link
Member

Review status: 25 of 51 files reviewed at latest revision, 23 unresolved discussions, some commit checks failed.


sql/mon/account.go, line 102 [r7] (raw file):

Previously, RaduBerinde wrote…

We don't need this here, it's exactly what releaseMemory is checking as well.

(I was suggesting something else - checking that curAllocated >= oldSize in all cases, but I don't feel strongly)

Ah, actually sorry, this is checking the account's `curAllocated`, which is different than what releaseMemory checks, disregard my comment.

Comments from Reviewable

@nvanbenschoten
Copy link
Member

Review status: 20 of 52 files reviewed at latest revision, 23 unresolved discussions, all commit checks successful.


sql/mon/account.go, line 29 [r5] (raw file):

Previously, knz (kena) wrote…

Wouldn't work well with account registration, if we ever come to implement this :)

Care to expand on that @knz? Would a separate `RegisterAccount` method alleviate that issue so that we could make this more idiomatic?

sql/mon/mem_usage.go, line 65 [r2] (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Dude, that "duplicate" pointer is really not a problem :). It would allow you to remove a whole file (that session_mon.go).

I agree with @RaduBerinde and @andreimatei that the extra pointer is well worth the gains in aesthetics, ease-of-use, and avoidance of bugs that being able to add methods like `Close` or `Grow` on the `MemoryAccount` affords us. It also provides a nice hierarchical structure to all of this code: A `MemoryUsageMonitor` has a set of `MemoryAccount`s which it owns.

This API structure right now seems equivalent to if we required callers of Engine.NewBatch() to hold onto both the Batch and the parent Engine to do anything useful. The result would be a mess of extra logic which leaks encapsulated logic throughout the code that uses it, and would also permit incorrect use of the API.


Comments from Reviewable

@knz
Copy link
Contributor Author

knz commented Sep 10, 2016

Review status: 20 of 52 files reviewed at latest revision, 23 unresolved discussions, some commit checks failed.


sql/mon/account.go, line 29 [r5] (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Care to expand on that @knz? Would a separate RegisterAccount method alleviate that issue so that we could make this more idiomatic?

The problem is that if you register accounts by putting a pointer to them in a dict somewhere in a monitor, then obviously you can't return the account you just allocated locally by value.

And yes a separate registration method would be better.


Comments from Reviewable

This patch instruments SQL execution to track memory allocations whose
amounts depend on user input and current database contents (as opposed
to allocations dependent on other parameters e.g. statement size).

It does so by introducing a new MemoryUsageMonitor object intended
to be instantiated once per session. It is set up and teared down
with the lifecycle of a session.

Components can then link and report their memory usage to the monitor
via MemoryAccount objects. Accounts can gate incremental allocations
and keep track of the cumulative allocations so that all can be
released at once.

This  is used to track and limits allocations:
- in `valueNode`,
- buckets in `groupNode`,
- sorted data in `sortNode`,
- temp results in `joinNode`,
- seen prefixes and suffixes in `distinctNode`,
- window and partition slices in `windowNode`,
- the Result arrays in Executor,
- prepared statements and prepared portals in pgwire.

A moderate amount of intelligence is implemented so as to avoid
computing sizes for every column of every row in a valueNode - the
combined size of all fixed-size columns is precomputed and counted at
once for every row.

This patch does not track the memory allocated for write intents in
the KV transaction object.

For troubleshooting the following mechanisms are provided:

- an env var COCKROACH_NOTEWORTHY_MEMORY_USAGE, if set, defines the
  minimum total allocated size for a monitor before it starts
  logging how much memory it is consuming.

- detailed allocation progress is logged at level 2 or above. To
  trace just SQL activity and memory allocation, one can use for
  example `--vmodule=executor=2,mem_usage=2`.
@knz
Copy link
Contributor Author

knz commented Sep 10, 2016

PTAL, and please review the 3 commits separately.


Review status: 18 of 51 files reviewed at latest revision, 23 unresolved discussions.


Comments from Reviewable

knz added 2 commits September 10, 2016 13:32
Every access to a MemoryAccount struct needs a logging context
obtained from the session. Simplify the code by introducing Session
wrapper methods to provide this context reference.

Idea courtesy of @petermattis.
Since all uses of MemoryAccount go through the Session object (because
they need the Session's logging context), and since the Session object
hosts the monitor object, have Session provide the monitor pointer to
the MemoryAccount methods. This saves a copy of the monitor pointer in
every account instantiated.
@knz
Copy link
Contributor Author

knz commented Sep 10, 2016

Ok you know what since I have the other important PR lined up which requires this change I will go ahead and merge this.

You will notice that I did my homework and performed change you requested to the MemoryAccount API in the first, main "large" commit, then changed back again to the simplified form in a small, well-contained commit 1e04a49 that can be easily reverted without removing the main feature.

So if someone believes that the code is better without that latter commit, feel free to revert just that one and submit the result. I'll even be happy to review it.

@knz knz merged commit f48730d into cockroachdb:develop Sep 10, 2016
@knz knz deleted the sized-rows branch September 10, 2016 15:58
@RaduBerinde
Copy link
Member

It is evident from the discussion that three of us (Andrei, Nate, and myself) think the code is better without that commit so I'm not sure why you included it. Also, without that commit, the wrappers in session_mem_usage aren't worth it IMO (they're mostly one-liners).


Comments from Reviewable

@RaduBerinde
Copy link
Member

In fact, I think MemoryAccount should store the Context as well, instead of having it passed to every function. But if optimizing 8 bytes was so critical, I can't imagine convincing you of another 16 :)

@knz
Copy link
Contributor Author

knz commented Sep 11, 2016

My initial implementation had the contexts as well but Andrei convinced me these should be passed at every call instead.

Sent from my Android device with K-9 Mail. Please excuse my brevity.

@knz
Copy link
Contributor Author

knz commented Sep 11, 2016

The wrappers are not one liners any more, or at least not in the way they were initially. The new API preserves the desired look and feel at the points of use.

Sent from my Android device with K-9 Mail. Please excuse my brevity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants