max-sql-memory not set #35054

Closed
codepitbull opened this issue Feb 19, 2019 · 15 comments
Labels
C-investigation (Further steps needed to qualify. C-label will change.), O-community (Originated from the community)

Comments

@codepitbull

Describe the problem

After porting some memory-heavy queries from Postgres to CockroachDB, we see issues with exceeding the memory budget. We followed the production checklist and adjusted the values to the recommended 25%.

We still get:
ERROR: root: memory budget exceeded: 10240 bytes requested, 134210560 currently allocated, 134217728 bytes in budget

The value for "bytes in budget" still reflects 128 MiB, so it seems the values don't get passed through.

We checked cockroach.log and see:

I190218 13:32:22.083875 1 server/config.go:386  system total memory: 15 GiB
I190218 13:32:22.083901 1 server/config.go:388  server configuration:
max offset             500000000
cache size             3.8 GiB
SQL memory pool size   3.8 GiB
scan interval          10m0s
scan min idle time     10ms
scan max idle time     1s
event log enabled      true

So it looks like we set the values correctly.

To Reproduce
This is the command line we use for starting cockroach:

/opt/cockroachdb/cockroach start --store=/mnt/data/cockroachdb --temp-dir=/tmp --port=26257 --http-port=7005 --log-dir=/mnt/logs/cockroachdb --cache=.25 --max-sql-memory=.25 --insecure

Expected behavior
We expected the queries to be able to use more than 128 MiB.

Environment:

  • CockroachDB version 2.1.4
  • Server OS: Ubuntu
  • Client app: JDBC
@marcelbirkner

marcelbirkner commented Feb 19, 2019

Here is an example of the failing query.

Table structure:

SHOW COLUMNS FROM rollup_minutes;
  column_name | data_type | is_nullable | column_default | generation_expression |                                                              indices                                                              | is_hidden
+-------------+-----------+-------------+----------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------+-----------+
  customer    | STRING    |    false    | NULL           |                       | {"rollup_minutes_pkey","rollup_minutes_customer_timestamp_name_key","rollup_minutes_customer_idx","rollup_minutes_timestamp_idx"} |   false
  timestamp   | INT8      |    false    | NULL           |                       | {"rollup_minutes_pkey","rollup_minutes_customer_timestamp_name_key","rollup_minutes_customer_idx","rollup_minutes_timestamp_idx"} |   false
  name        | STRING    |    false    | NULL           |                       | {"rollup_minutes_pkey","rollup_minutes_customer_timestamp_name_key","rollup_minutes_customer_idx","rollup_minutes_timestamp_idx"} |   false
  sims        | INT       |    true     | NULL           |                       | {}                                                                                                                                |   false

Query:

SELECT 
  floor(timestamp/3600000)*3600000 AS ts, name as sensorname, sum(sims) as sims 
FROM 
  rollup_minutes 
WHERE 
  customer = :customerId AND 
  timestamp >= :timeTillRollup 
GROUP BY 
  name, floor(timestamp/3600000)*3600000

@marcelbirkner

Here is the EXPLAIN statement for the query above:

EXPLAIN SELECT floor(timestamp/3600000)*3600000 AS ts, name as sensorname, sum(sims) as sims FROM rollup_minutes WHERE customer = '-fOnY8rwTXOy0BJaXsexfQ' AND timestamp >= 1550376000000 GROUP BY name, floor(timestamp/3600000)*3600000;

         tree         |    field    |                                 description
+---------------------+-------------+-----------------------------------------------------------------------------+
  render              |             |
   └── group          |             |
        │             | aggregate 0 | name
        │             | aggregate 1 | column6
        │             | aggregate 2 | sum(sims)
        │             | group by    | @2,@1
        └── render    |             |
             └── scan |             |
                      | table       | rollup_minutes@rollup_minutes_pkey
                      | spans       | /"-fOnY8rwTXOy0BJaXsexfQ"/1550376000000-/"-fOnY8rwTXOy0BJaXsexfQ"/PrefixEnd
(10 rows)

Time: 1.39429ms

@marcelbirkner

EXPLAIN(opt) result:

EXPLAIN(opt) SELECT floor(timestamp/3600000)*3600000 AS ts, name as sensorname, sum(sims) as sims FROM rollup_minutes WHERE customer = '-fOnY8rwTXOy0BJaXsexfQ' AND timestamp >= 1550376000000 GROUP BY name, floor(timestamp/3600000)*3600000;
                                                    text
+----------------------------------------------------------------------------------------------------------+
  group-by
   ├── columns: ts:6(decimal) sensorname:3(string!null) sims:5(decimal)
   ├── grouping columns: name:3(string!null) column6:6(decimal)
   ├── side-effects
   ├── stats: [rows=3.33333333, distinct(3,6)=3.33333333]
   ├── cost: 3.83333333
   ├── key: (3,6)
   ├── fd: (3,6)-->(5)
   ├── prune: (5)
   ├── project
   │    ├── columns: column6:6(decimal) name:3(string!null) sims:4(int)
   │    ├── side-effects
   │    ├── stats: [rows=3.33333333, distinct(3,6)=3.33333333]
   │    ├── cost: 3.66666667
   │    ├── prune: (3,4,6)
   │    ├── scan rollup_minutes
   │    │    ├── columns: customer:1(string!null) timestamp:2(int!null) name:3(string!null) sims:4(int)
   │    │    ├── constraint: /1/2/3: [/'-fOnY8rwTXOy0BJaXsexfQ'/1550376000000 - /'-fOnY8rwTXOy0BJaXsexfQ']
   │    │    ├── stats: [rows=3.33333333, distinct(1)=1, distinct(2,3)=3.33333333]
   │    │    ├── cost: 3.6
   │    │    ├── key: (2,3)
   │    │    ├── fd: ()-->(1), (2,3)-->(4)
   │    │    └── interesting orderings: (+1,+2,+3) (+2,+1,+3)
   │    └── projections [outer=(2-4), side-effects]
   │         └── mult [type=decimal, outer=(2), side-effects]
   │              ├── function: floor [type=decimal, outer=(2), side-effects]
   │              │    └── div [type=decimal, outer=(2), side-effects]
   │              │         ├── variable: timestamp [type=int, outer=(2)]
   │              │         └── const: 3600000 [type=int]
   │              └── const: 3600000 [type=decimal]
   └── aggregations [outer=(4)]
        └── sum [type=decimal, outer=(4)]
             └── variable: sims [type=int, outer=(4)]
(33 rows)

@jordanlewis
Member

Thanks for the report. I'm a bit perplexed because the line here:

SQL memory pool size   3.8 GiB

is exactly the line that reflects the size of the root memory budget. As you point out, the error message reflects 128MB, not 3.8GB.

Are you completely sure that the server that printed that error message out was the same one that you ran the query on? If you are running a cluster, all nodes need to be started with that flag - not just one.

@marcelbirkner

Table size:

[screenshot: table size]

@marcelbirkner

marcelbirkner commented Feb 19, 2019

Hi @jordanlewis

I restarted the server on Monday with the 3.8 GiB cache and SQL memory pool size. That's also when we started seeing 3.8 GiB in the startup logs.

The query errors in the logs with the 128 MiB budget limit are from today. All applications using CockroachDB have been restarted as well. We are a bit puzzled as well.

We run a single node setup at the moment.

@marcelbirkner

As you can see in the hardware dashboard, memory usage increased from 1.1 GiB to 5.6 GiB after setting --cache=.25 --max-sql-memory=.25 on a 16 GB AWS m5.xlarge.

[screenshot: hardware dashboard]

@jordanlewis
Member

Can you please confirm that running the problematic query in ./cockroach sql still outputs the memory error?

@marcelbirkner

marcelbirkner commented Feb 19, 2019

Right now it works when I run the query via cockroach sql

(735 rows)

Time: 399.059261ms

We have about 500 Java clients using the Postgres JDBC driver that calculate daily/hourly/minute rollups, and we see spikes throughout the day during which the "budget exceeded" error shows up in cockroachdb.log.

During the spikes we have a handful of the 500 Java clients fail 20 - 50 times (see screenshot below).

[screenshot: client failure counts]

Database configs are 100% identical.

database:
  driverClass: "org.postgresql.Driver"
  abandonWhenPercentageFull: 0
  alternateUsernamesAllowed: false
  commitOnReturn: false
  rollbackOnReturn: false
  autoCommitByDefault: null
  readOnlyByDefault: null
  user: "XXXXXXXXXXXXXX"
  password: XXXXXXXXXXXX
  url: "jdbc:postgresql://cockroachdb-0-eu-west-1.XXXXXXXXXXXXX:26257/sales?sslmode=disable"
  properties:
    charSet: "UTF-8"
  defaultCatalog: null
  defaultTransactionIsolation: "DEFAULT"
  useFairQueue: true
  initialSize: 1
  minSize: 1
  maxSize: 3
  initializationQuery: null
  logAbandonedConnections: false
  logValidationErrors: true
  maxConnectionAge:
    present: false
  maxWaitForConnection: "5 seconds"
  minIdleTime: "1 minute"
  validationQuery: "/* MyService Health Check */ SELECT 1"
  validationQueryTimeout:
    present: true
  checkConnectionWhileIdle: true
  checkConnectionOnBorrow: true
  checkConnectionOnConnect: true
  checkConnectionOnReturn: false
  autoCommentsEnabled: true
  evictionInterval: "10 seconds"
  validationInterval: "30 seconds"
  validatorClassName:
    present: false
  removeAbandoned: false
  removeAbandonedTimeout: "60 seconds"
  jdbcInterceptors:
    present: false
  ignoreExceptionOnPreLoad: false
  defaultReadOnly: false
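
For reference, here is a minimal sketch of how one of these clients might issue the rollup query through the PostgreSQL JDBC driver configured above. This is illustrative only, not the actual application code: the connection URL, user, and class name are placeholders; the query is the one shown earlier in this thread with the named parameters made positional.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative sketch only: URL/user are placeholders, the query is the
// hourly rollup SELECT from earlier in this thread.
public class RollupQueryClient {
    private static final String ROLLUP_QUERY =
        "SELECT floor(timestamp/3600000)*3600000 AS ts, name AS sensorname, sum(sims) AS sims "
      + "FROM rollup_minutes "
      + "WHERE customer = ? AND timestamp >= ? "
      + "GROUP BY name, floor(timestamp/3600000)*3600000";

    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:26257/sales?sslmode=disable"; // placeholder host
        try (Connection conn = DriverManager.getConnection(url, "root", "");
             PreparedStatement ps = conn.prepareStatement(ROLLUP_QUERY)) {
            ps.setString(1, "-fOnY8rwTXOy0BJaXsexfQ"); // :customerId
            ps.setLong(2, 1550376000000L);             // :timeTillRollup
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s %s %s%n",
                        rs.getBigDecimal("ts"), rs.getString("sensorname"), rs.getBigDecimal("sims"));
                }
            }
        }
    }
}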

@marcelbirkner

marcelbirkner commented Feb 19, 2019

With our current setup we tested different memory settings to see the connection behaviour:

128 MiB: 1245 connections
1 GiB: 9980 connections
2 GiB: 19968 connections

So we should be fine with our 500 clients.

@marcelbirkner

EXPLAIN ANALYZE results:

[screenshot: EXPLAIN ANALYZE results]

@jordanlewis
Member

I'm surprised this aggregate would be the one to cause the out of memory errors. Are there other concurrent queries? All queries share the root memory pool - perhaps there's one that's much heavier weight?

Is the error message you included at the top of the issue the actual captured error message from the application, after you changed the setting?

@jordanlewis self-assigned this on Feb 20, 2019
@jordanlewis added the C-investigation and O-community labels on Feb 20, 2019
@codepitbull
Author

codepitbull commented Feb 20, 2019

I checked again; there are multiple failing statements on top of the queries Marcel mentioned:
INSERT INTO rollup_minutes (customer, timestamp, name, sims) values (:customerId, :timestamp, :name, :sims) on conflict (customer, timestamp, name) do update set sims = rollup_minutes.sims+excluded.sims
These are done in batches of size 1000.
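
A minimal sketch of what such a batched upsert looks like on the client side follows; the class and row type are illustrative stand-ins, not our exact code, but the statement and the ~1000-row batch size are the ones described above.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

// Illustrative sketch of the batched upsert described above.
public class RollupBatchUpsert {
    private static final String UPSERT =
        "INSERT INTO rollup_minutes (customer, timestamp, name, sims) "
      + "VALUES (?, ?, ?, ?) "
      + "ON CONFLICT (customer, timestamp, name) "
      + "DO UPDATE SET sims = rollup_minutes.sims + excluded.sims";

    // Placeholder row type standing in for the application's rollup model.
    public record Row(String customer, long timestamp, String name, long sims) {}

    public static void upsertBatch(Connection conn, List<Row> rows) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(UPSERT)) {
            for (Row r : rows) {                 // rows arrive in chunks of ~1000
                ps.setString(1, r.customer());
                ps.setLong(2, r.timestamp());
                ps.setString(3, r.name());
                ps.setLong(4, r.sims());
                ps.addBatch();                   // queue one row
            }
            ps.executeBatch();                   // send the whole batch in one round trip
        }
    }
}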

The value they report for the size of the pool in the error message is always the same: the 128 MiB mentioned above.

org.quartz.core.JobRunShell Job Cashier.Minute Rollup Job threw an unhandled Exception:
! org.jdbi.v3.core.statement.UnableToExecuteStatementException: java.sql.BatchUpdateException: Batch entry 0 /* MinuteRollupDao.addRollupDataAndIncrement */ INSERT INTO rollup_minutes (customer, timestamp, name, sims) values ('rtFZ9HHQSgKNV2efzjPx-A', 1550413560000, 'process', 14) on conflict (customer, timestamp, name) do update set sims = rollup_minutes.sims+excluded.sims was aborted: ERROR: root: memory budget exceeded: 10240 bytes requested, 134210560 currently allocated, 134217728 bytes in budget Call getNextException to see other errors in the batch. [statement:"INSERT INTO rollup_minutes (customer, timestamp, name, sims) values (:customerId, :timestamp, :name, :sims) on conflict (customer, timestamp, name) do update set sims = rollup_minutes.sims+excluded.sims", rewritten:"/* MinuteRollupDao.addRollupDataAndIncrement */ INSERT INTO rollup_minutes (customer, timestamp, name, sims) values (:customerId, :timestamp, :name, :sims) on conflict (customer, timestamp, name) do update set sims = rollup_minutes.sims+excluded.sims", parsed:"ParsedSql{sql='/* MinuteRollupDao.addRollupDataAndIncrement */ INSERT INTO rollup_minutes (customer, timestamp, name, sims) values (?, ?, ?, ?) on conflict (customer, timestamp, name) do update set sims = rollup_minutes.sims+excluded.sims', parameters=ParsedParameters{positional=false, parameterNames=[customerId, timestamp, name, sims]}}", arguments:{ positional:{}, named:{}, finder:[{lazy bean property arguments "com.instana.cashier.jdbi.model.RollupModel@10500320"]}]
!   at org.jdbi.v3.core.statement.PreparedBatch.internalBatchExecute(PreparedBatch.java:175)
! Caused by: java.sql.BatchUpdateException: Batch entry 0 /* MinuteRollupDao.addRollupDataAndIncrement */ INSERT INTO rollup_minutes (customer, timestamp, name, sims) values ('rtFZ9HHQSgKNV2efzjPx-A', 1550413560000, 'process', 14) on conflict (customer, timestamp, name) do update set sims = rollup_minutes.sims+excluded.sims was aborted: ERROR: root: memory budget exceeded: 10240 bytes requested, 134210560 currently allocated, 134217728 bytes in budget Call getNextException to see other errors in the batch.
!   at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:148)
!   Suppressed: org.postgresql.util.PSQLException: ERROR: root: memory budget exceeded: 10240 bytes requested, 134210560 currently allocated, 134217728 bytes in budget
!     at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)
! Caused by: org.postgresql.util.PSQLException: ERROR: root: memory budget exceeded: 10240 bytes requested, 134210560 currently allocated, 134217728 bytes in budget
!   at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)

I also traced through the cockroach code and everything looks correct to me, so the size should be set. But whatever we do, we always get those 128 MiB.

I am 100% sure it must be something very basic we are missing here, but we are currently out of ideas :)

@marcelbirkner

Hi @jordanlewis

We found the problem: the log messages from our log management provider are 2 days behind but carry current timestamps (see screenshot).

Thank you for your help 👍
You can close the ticket

[screenshot: log timestamps]

@jordanlewis
Member

No problem, I'm glad it was something like that!
