Stability problem with a 1000 nodes puppetserver #430

guillaume-ferry · 2017-10-20T09:32:20Z

Hi,

Thank you for the great job with puppetboard.

Since the number of nodes in our environment reached about 1000 nodes, we frequently meet this error message when trying to load :

On the puppetdb logs side, I meet this error :

ERROR [p.p.threadpool] Error processing command on thread cmd-proc-thread-23365
clojure.lang.ExceptionInfo: Value does not match schema: {:resources [nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil {:events [{:status (not (instance? java.lang.String nil))} nil]} nil nil nil nil nil nil nil nil nil nil nil nil nil nil]}
	at schema.core$validator$fn__2894.invoke(core.clj:155)
	at schema.core$validate.invokeStatic(core.clj:164)
	at schema.core$validate.invoke(core.clj:159)
	at puppetlabs.puppetdb.command$store_report$fn__32755.invoke(command.clj:355)
	at puppetlabs.puppetdb.command$store_report.invokeStatic(command.clj:354)
	at puppetlabs.puppetdb.command$store_report.invoke(command.clj:353)
	at puppetlabs.puppetdb.command$process_command_BANG_.invokeStatic(command.clj:389)
	at puppetlabs.puppetdb.command$process_command_BANG_.invoke(command.clj:380)
	at puppetlabs.puppetdb.command$process_command_and_respond_BANG_$fn__32863.invoke(command.clj:442)
	at puppetlabs.puppetdb.command$call_with_quick_retry$fn__32856.invoke(command.clj:424)
	at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:423)
	at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:421)
	at puppetlabs.puppetdb.command$process_command_and_respond_BANG_.invokeStatic(command.clj:440)
	at puppetlabs.puppetdb.command$process_command_and_respond_BANG_.invoke(command.clj:438)
	at puppetlabs.puppetdb.command$process_cmdref$fn__32873.invoke(command.clj:505)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652$fn__30653.invoke(metrics.clj:14)
	at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
	at com.codahale.metrics.Timer.time(Timer.java:101)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652.invoke(metrics.clj:14)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652$fn__30653.invoke(metrics.clj:14)
	at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
	at com.codahale.metrics.Timer.time(Timer.java:101)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652.invoke(metrics.clj:14)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:17)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:6)
	at puppetlabs.puppetdb.command$process_cmdref.invokeStatic(command.clj:501)
	at puppetlabs.puppetdb.command$process_cmdref.invoke(command.clj:480)
	at puppetlabs.puppetdb.command$message_handler$fn__32881.invoke(command.clj:551)
	at puppetlabs.puppetdb.threadpool$dochan$fn__32634$fn__32635.invoke(threadpool.clj:117)

	at puppetlabs.puppetdb.threadpool$call_on_threadpool$fn__32629.invoke(threadpool.clj:95)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2017-10-20 10:38:47,059 WARN  [p.p.q.engine] The event-counts entity is experimental and may be altered or removed in the future.

The puppetdb performance dashboard looks like this in production :

On the apache2 logs side :

[Fri Oct 20 09:08:26.546106 2017] [wsgi:error] [pid 29006]     raise ReadTimeout(e, request=request)
[Fri Oct 20 09:08:26.546112 2017] [wsgi:error] [pid 29006] ReadTimeout: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=20)
[Fri Oct 20 09:08:26.547887 2017] [wsgi:error] [pid 29006] INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (3): localhost
[Fri Oct 20 09:08:51.138396 2017] [wsgi:error] [pid 29006] ERROR:pypuppetdb.api:Connection to PuppetDB timed out on localhost:8080 over HTTP.
[Fri Oct 20 09:08:51.141567 2017] [wsgi:error] [pid 29006] ERROR:puppetboard.app:Exception on /FST/ [GET]
[Fri Oct 20 09:08:51.141580 2017] [wsgi:error] [pid 29006] Traceback (most recent call last):

When the board "respawn", it works great but I don’t understand why sometimes it fails for several minutes...

I need advices to find a way to fix it. Maybe reducing the nodes in the overview page? Is there a way to limit a long list on the xx last reports or something like this?

Thanks for your help,
Guillaume

The text was updated successfully, but these errors were encountered:

mterzo · 2017-10-20T14:00:12Z

Interesting. Seems puppetdb server is unable to handle the hour. Which doesn’t seem right.

If this happens frequently can you enable debug on puppetboard, this will give
The queries being executed at the time.

What version of puppetdb and puppetsetver are you running.

corey-hammerton · 2017-10-20T14:44:07Z

Here's what the PuppetDB documentation has to say about the Command Queue Depth

When viewing the performance dashboard, note the depth of the message queue (labeled “Command Queue depth”). If it is rising and you have CPU cores to spare, increasing the number of threads may help to process the backlog more rapidly.

guillaume-ferry · 2017-10-23T09:05:29Z

Thank you for your help.

Using puppet server 2.8 and puppetDB 4.4 on Ubuntu server Xenial.

Hardware specs :

VMware VM
2x8 cores
32 Gb RAM

I followed Puppet / puppetDB tuning guides

and changed the following parameters :

/etc/postgresql/9.4/main/postgresql.conf : shared_buffers = 128MB => 4096MB
/etc/default/puppetdb : JAVA_ARGS="-Xmx2G"
/etc/default/puppetserver : JAVA_ARGS="-Xms5g -Xmx5g -XX:MaxPermSize=256m"
/etc/puppetlabs/puppetserver/conf.d/puppetserver.conf : max-active-instances: 4

The puppet module list result is : 

/etc/puppetlabs/code/environments/production/modules
├── badgerious-windows_env (v2.2.2)
├── camptocamp-archive (v0.9.0)
├── camptocamp-augeas (v1.6.0)
├── camptocamp-postfix (v1.5.0)
├── camptocamp-systemd (v0.4.0)
├── puppet-puppetboard (v3.0.0)
├── puppet-staging (v2.2.0)
├── puppet-zabbix (v3.0.0)
├── puppetlabs-apache (v1.10.0)
├── puppetlabs-apt (v2.3.0)
├── puppetlabs-chocolatey (v3.0.0)
├── puppetlabs-concat (v2.2.0)
├── puppetlabs-firewall (v1.8.1)
├── puppetlabs-inifile (v1.6.0)
├── puppetlabs-mysql (v3.11.0)
├── puppetlabs-pe_gem (v0.2.0)
├── puppetlabs-postgresql (v4.8.0)
├── puppetlabs-powershell (v2.1.0)
├── puppetlabs-puppetdb (v6.0.1)
├── puppetlabs-registry (v1.1.4)
├── puppetlabs-ruby (v0.6.0)
├── puppetlabs-stdlib (v4.20.0)
├── puppetlabs-vcsrepo (v1.4.0)
├── puppetlabs-xinetd (v2.0.0)
├── spotify-puppetexplorer (v1.1.1)
├── stahnma-epel (v1.2.2)
└── stankevich-python (v1.17.0)

I enabled debug mode. Here is the apache2 logs When puppetboard hangs : 

> [Fri Oct 20 18:11:02.217576 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: environments, path: None, query: None, limit: None, offset: None, summarize_by None, count_by None, count_filter: None
> [Fri Oct 20 18:11:02.217700 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: environments and path: None
> [Fri Oct 20 18:11:02.218559 2017] [wsgi:error] [pid 24430] INFO:requests.packages.urllib3.connectionpool:Resetting dropped connection: localhost
> [Fri Oct 20 18:11:02.246800 2017] [wsgi:error] [pid 24430] DEBUG:requests.packages.urllib3.connectionpool:"GET /pdb/query/v4/environments HTTP/1.1" 200 38
> [Fri Oct 20 18:11:02.247521 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: nodes, path: None, query: ["extract",[["function","count"]],["and",["=", "catalog_environment", "FST"],["=", "facts_environment", "FST"]]], limit: None, offset: None, summarize_by None, count_by None, count_filter: None
> [Fri Oct 20 18:11:02.247615 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: nodes and path: None
> [Fri Oct 20 18:11:02.304078 2017] [wsgi:error] [pid 24430] DEBUG:requests.packages.urllib3.connectionpool:"GET /pdb/query/v4/nodes?query=%5B%22extract%22%2C%5B%5B%22function%22%2C%22count%22%5D%5D%2C%5B%22and%22%2C%5B%22%3D%22%2C+%22catalog_environment%22%2C+%22FST%22%5D%2C%5B%22%3D%22%2C+%22facts_environment%22%2C+%22FST%22%5D%5D%5D HTTP/1.1" 200 15
> [Fri Oct 20 18:11:02.304764 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: resources, path: None, query: ["extract",[["function","count"]],["=", "environment", "FST"]], limit: None, offset: None, summarize_by None, count_by None, count_filter: None
> [Fri Oct 20 18:11:02.304888 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: resources and path: None
> [Fri Oct 20 18:11:03.001694 2017] [wsgi:error] [pid 24430] DEBUG:requests.packages.urllib3.connectionpool:"GET /pdb/query/v4/resources?query=%5B%22extract%22%2C%5B%5B%22function%22%2C%22count%22%5D%5D%2C%5B%22%3D%22%2C+%22environment%22%2C+%22FST%22%5D%5D HTTP/1.1" 200 18
> [Fri Oct 20 18:11:03.002421 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: nodes, path: None, query: ["and",["=", "catalog_environment", "FST"],["=", "facts_environment", "FST"]], limit: None, offset: None, summarize_by None, count_by None, count_filter: None
> [Fri Oct 20 18:11:03.002586 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: nodes and path: None
> [Fri Oct 20 18:11:03.062461 2017] [wsgi:error] [pid 24430] DEBUG:requests.packages.urllib3.connectionpool:"GET /pdb/query/v4/nodes?query=%5B%22and%22%2C%5B%22%3D%22%2C+%22catalog_environment%22%2C+%22FST%22%5D%2C%5B%22%3D%22%2C+%22facts_environment%22%2C+%22FST%22%5D%5D HTTP/1.1" 200 None
> [Fri Oct 20 18:11:03.242276 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: event-counts, path: None, query: ["=", "latest_report?", true], limit: None, offset: None, summarize_by certname, count_by None, count_filter: None
> [Fri Oct 20 18:11:03.242399 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: event-counts and path: None
> [Fri Oct 20 18:11:23.256272 2017] [wsgi:error] [pid 24430] ERROR:pypuppetdb.api:Connection to PuppetDB timed out on localhost:8080 over HTTP.
> [Fri Oct 20 18:11:23.257263 2017] [wsgi:error] [pid 24430] ERROR:puppetboard.app:Exception on / [GET]
> [Fri Oct 20 18:11:23.257325 2017] [wsgi:error] [pid 24430] Traceback (most recent call last):
> [Fri Oct 20 18:11:23.257337 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
> [Fri Oct 20 18:11:23.257345 2017] [wsgi:error] [pid 24430]     response = self.full_dispatch_request()
> [Fri Oct 20 18:11:23.257353 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
> [Fri Oct 20 18:11:23.257360 2017] [wsgi:error] [pid 24430]     rv = self.handle_user_exception(e)
> [Fri Oct 20 18:11:23.257367 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
> [Fri Oct 20 18:11:23.257374 2017] [wsgi:error] [pid 24430]     reraise(exc_type, exc_value, tb)
> [Fri Oct 20 18:11:23.257380 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
> [Fri Oct 20 18:11:23.257388 2017] [wsgi:error] [pid 24430]     rv = self.dispatch_request()
> [Fri Oct 20 18:11:23.257394 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
> [Fri Oct 20 18:11:23.257452 2017] [wsgi:error] [pid 24430]     return self.view_functions[rule.endpoint](**req.view_args)
> [Fri Oct 20 18:11:23.257488 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/puppetboard/puppetboard/app.py", line 225, in index
> [Fri Oct 20 18:11:23.257517 2017] [wsgi:error] [pid 24430]     for node in nodes:
> [Fri Oct 20 18:11:23.257547 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/pypuppetdb/api.py", line 401, in nodes
> [Fri Oct 20 18:11:23.257578 2017] [wsgi:error] [pid 24430]     summarize_by='certname'
> [Fri Oct 20 18:11:23.257607 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/pypuppetdb/api.py", line 722, in event_counts
> [Fri Oct 20 18:11:23.257642 2017] [wsgi:error] [pid 24430]     **kwargs)
> [Fri Oct 20 18:11:23.257692 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/pypuppetdb/api.py", line 327, in _query
> [Fri Oct 20 18:11:23.257738 2017] [wsgi:error] [pid 24430]     auth=(self.username, self.password))
> [Fri Oct 20 18:11:23.257774 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/requests/sessions.py", line 476, in get
> [Fri Oct 20 18:11:23.257805 2017] [wsgi:error] [pid 24430]     return self.request('GET', url, **kwargs)
> [Fri Oct 20 18:11:23.257838 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/requests/sessions.py", line 464, in request
> [Fri Oct 20 18:11:23.257880 2017] [wsgi:error] [pid 24430]     resp = self.send(prep, **send_kwargs)
> [Fri Oct 20 18:11:23.257925 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
> [Fri Oct 20 18:11:23.257961 2017] [wsgi:error] [pid 24430]     r = adapter.send(request, **kwargs)
> [Fri Oct 20 18:11:23.257997 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/requests/adapters.py", line 433, in send
> [Fri Oct 20 18:11:23.258055 2017] [wsgi:error] [pid 24430]     raise ReadTimeout(e, request=request)
> [Fri Oct 20 18:11:23.258094 2017] [wsgi:error] [pid 24430] ReadTimeout: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=20)
> [Fri Oct 20 18:11:23.258369 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: environments, path: None, query: None, limit: None, offset: None, summarize_by None, count_by None, count_filter: None
> [Fri Oct 20 18:11:23.258562 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: environments and path: None
> [Fri Oct 20 18:11:23.260347 2017] [wsgi:error] [pid 24430] INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (4): localhost
> [Fri Oct 20 18:11:23.282353 2017] [wsgi:error] [pid 24430] DEBUG:requests.packages.urllib3.connectionpool:"GET /pdb/query/v4/environments HTTP/1.1" 200 38

I noticed that puppetboard systematically hangs with the same error message when trying to load report tab. Even if the PUPPETDB_TIMEOUT value in settings.py is changed from 20 to a higher value, the query still hangs…

When puppetboard works fine, the overview or the nodes tab loads between 8 and 10 seconds…

mterzo · 2017-10-23T13:36:21Z

Thanks for the info. Need to digest this.

mterzo · 2017-11-10T08:09:52Z

For the report page are you modifying this setting? NORMAL_TABLE_COUNT

I'm also noticing the large spikes in command queue depth, looks like something is blocking.

mterzo · 2017-11-10T08:17:07Z

I'm almost curious if you have a herding problem? Do all your nodes checkin at specific times?

guillaume-ferry · 2017-11-13T11:13:09Z

Thanks for your answer.

Since our last discussion, I was almost sure that my puppetdb configuration was correct. I assumed that the problem was probably deeper. Maybe in my PostgresSQL conf.

I found in this documentation some clues that led me to tune my Postgres configuration. Specially with the pgtune tool : http://pgfoundry.org/projects/pgtune/

Here are the list of the parameters that pgtune advised me to tune :

------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------

# Add settings for extensions here
listen_addresses = localhost
default_statistics_target = 50 # pgtune wizard 2017-10-26
maintenance_work_mem = 960MB # pgtune wizard 2017-10-26
constraint_exclusion = on # pgtune wizard 2017-10-26
checkpoint_completion_target = 0.9 # pgtune wizard 2017-10-26
effective_cache_size = 11GB # pgtune wizard 2017-10-26
work_mem = 96MB # pgtune wizard 2017-10-26
wal_buffers = 8MB # pgtune wizard 2017-10-26
checkpoint_segments = 16 # pgtune wizard 2017-10-26
shared_buffers = 3840MB # pgtune wizard 2017-10-26
max_connections = 80 # pgtune wizard 2017-10-26

Effectively this new configuration significantly increased the database performance. Queries takes about half the time to execute and puppetboard didn’t not hangs since 2 weeks ago. The puppetDB performance dashboard now looks like this :

However, the report tab still hangs with an internal server error in the 1000 nodes environment. As if the number of reports to treat was too hight. But the report-ttl setting in my pupetdb conf is on 14d... Do I need to decrease this setting?

In the test environment, with few nodes, report tab works fine...
I did’nt change the NORMAL_TABLE_COUNT setting. The settings.py looks like this :

DEV_LISTEN_HOST = '127.0.0.1'
DEV_LISTEN_PORT = 5000
LOGLEVEL = 'debug'
PUPPETDB_HOST = 'localhost'
PUPPETDB_PORT = 8080
PUPPETDB_SSL_VERIFY = False
PUPPETDB_TIMEOUT = 60
UNRESPONSIVE_HOURS = 168
ENABLE_CATALOG = False
ENABLE_QUERY = True
LOCALISE_TIMESTAMP = True
OFFLINE_MODE = False
PUPPETDB_EXPERIMENTAL = False
DEFAULT_ENVIRONMENT = 'production'
REPORTS_COUNT = 10

More than 500 nodes are students computer rooms that boots as and when needed... When many classrooms are used, I suppose that students mostly boots those computer simultaneously. It should corresponds to spikes in the performance dashboard...

gdubicki · 2019-06-15T12:13:12Z

Thanks @guillaume-ferry , your issue and comments helped me a lot in tuning my PuppetDB's Postgres !

This was referenced Jun 15, 2019

event-counts slow and unresponsive #149

Closed

Add option to get nodes without using event-counts voxpupuli/pypuppetdb#153

Merged

guillaume-ferry closed this as completed Dec 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stability problem with a 1000 nodes puppetserver #430

Stability problem with a 1000 nodes puppetserver #430

guillaume-ferry commented Oct 20, 2017

mterzo commented Oct 20, 2017

corey-hammerton commented Oct 20, 2017

guillaume-ferry commented Oct 23, 2017 •

edited

Loading

mterzo commented Oct 23, 2017

mterzo commented Nov 10, 2017

mterzo commented Nov 10, 2017

guillaume-ferry commented Nov 13, 2017

gdubicki commented Jun 15, 2019

Stability problem with a 1000 nodes puppetserver #430

Stability problem with a 1000 nodes puppetserver #430

Comments

guillaume-ferry commented Oct 20, 2017

mterzo commented Oct 20, 2017

corey-hammerton commented Oct 20, 2017

guillaume-ferry commented Oct 23, 2017 • edited Loading

mterzo commented Oct 23, 2017

mterzo commented Nov 10, 2017

mterzo commented Nov 10, 2017

guillaume-ferry commented Nov 13, 2017

gdubicki commented Jun 15, 2019

guillaume-ferry commented Oct 23, 2017 •

edited

Loading