Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stability problem with a 1000 nodes puppetserver #430

Closed
guillaume-ferry opened this issue Oct 20, 2017 · 8 comments
Closed

Stability problem with a 1000 nodes puppetserver #430

guillaume-ferry opened this issue Oct 20, 2017 · 8 comments

Comments

@guillaume-ferry
Copy link

Hi,

Thank you for the great job with puppetboard.

Since the number of nodes in our environment reached about 1000 nodes, we frequently meet this error message when trying to load :

board-puppetdb-error

On the puppetdb logs side, I meet this error :

ERROR [p.p.threadpool] Error processing command on thread cmd-proc-thread-23365
clojure.lang.ExceptionInfo: Value does not match schema: {:resources [nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil {:events [{:status (not (instance? java.lang.String nil))} nil]} nil nil nil nil nil nil nil nil nil nil nil nil nil nil]}
	at schema.core$validator$fn__2894.invoke(core.clj:155)
	at schema.core$validate.invokeStatic(core.clj:164)
	at schema.core$validate.invoke(core.clj:159)
	at puppetlabs.puppetdb.command$store_report$fn__32755.invoke(command.clj:355)
	at puppetlabs.puppetdb.command$store_report.invokeStatic(command.clj:354)
	at puppetlabs.puppetdb.command$store_report.invoke(command.clj:353)
	at puppetlabs.puppetdb.command$process_command_BANG_.invokeStatic(command.clj:389)
	at puppetlabs.puppetdb.command$process_command_BANG_.invoke(command.clj:380)
	at puppetlabs.puppetdb.command$process_command_and_respond_BANG_$fn__32863.invoke(command.clj:442)
	at puppetlabs.puppetdb.command$call_with_quick_retry$fn__32856.invoke(command.clj:424)
	at puppetlabs.puppetdb.command$call_with_quick_retry.invokeStatic(command.clj:423)
	at puppetlabs.puppetdb.command$call_with_quick_retry.invoke(command.clj:421)
	at puppetlabs.puppetdb.command$process_command_and_respond_BANG_.invokeStatic(command.clj:440)
	at puppetlabs.puppetdb.command$process_command_and_respond_BANG_.invoke(command.clj:438)
	at puppetlabs.puppetdb.command$process_cmdref$fn__32873.invoke(command.clj:505)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652$fn__30653.invoke(metrics.clj:14)
	at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
	at com.codahale.metrics.Timer.time(Timer.java:101)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652.invoke(metrics.clj:14)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652$fn__30653.invoke(metrics.clj:14)
	at puppetlabs.puppetdb.utils.metrics.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
	at com.codahale.metrics.Timer.time(Timer.java:101)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_$fn__30651$fn__30652.invoke(metrics.clj:14)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invokeStatic(metrics.clj:17)
	at puppetlabs.puppetdb.utils.metrics$multitime_BANG__STAR_.invoke(metrics.clj:6)
	at puppetlabs.puppetdb.command$process_cmdref.invokeStatic(command.clj:501)
	at puppetlabs.puppetdb.command$process_cmdref.invoke(command.clj:480)
	at puppetlabs.puppetdb.command$message_handler$fn__32881.invoke(command.clj:551)
	at puppetlabs.puppetdb.threadpool$dochan$fn__32634$fn__32635.invoke(threadpool.clj:117)

	at puppetlabs.puppetdb.threadpool$call_on_threadpool$fn__32629.invoke(threadpool.clj:95)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2017-10-20 10:38:47,059 WARN  [p.p.q.engine] The event-counts entity is experimental and may be altered or removed in the future.

The puppetdb performance dashboard looks like this in production :

capture d ecran 2017-10-20 a 10 57 56

On the apache2 logs side :

[Fri Oct 20 09:08:26.546106 2017] [wsgi:error] [pid 29006]     raise ReadTimeout(e, request=request)
[Fri Oct 20 09:08:26.546112 2017] [wsgi:error] [pid 29006] ReadTimeout: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=20)
[Fri Oct 20 09:08:26.547887 2017] [wsgi:error] [pid 29006] INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (3): localhost
[Fri Oct 20 09:08:51.138396 2017] [wsgi:error] [pid 29006] ERROR:pypuppetdb.api:Connection to PuppetDB timed out on localhost:8080 over HTTP.
[Fri Oct 20 09:08:51.141567 2017] [wsgi:error] [pid 29006] ERROR:puppetboard.app:Exception on /FST/ [GET]
[Fri Oct 20 09:08:51.141580 2017] [wsgi:error] [pid 29006] Traceback (most recent call last):

When the board "respawn", it works great but I don’t understand why sometimes it fails for several minutes...

I need advices to find a way to fix it. Maybe reducing the nodes in the overview page? Is there a way to limit a long list on the xx last reports or something like this?

Thanks for your help,
Guillaume

@mterzo
Copy link
Contributor

mterzo commented Oct 20, 2017

Interesting. Seems puppetdb server is unable to handle the hour. Which doesn’t seem right.

If this happens frequently can you enable debug on puppetboard, this will give
The queries being executed at the time.

What version of puppetdb and puppetsetver are you running.

@corey-hammerton
Copy link
Contributor

Here's what the PuppetDB documentation has to say about the Command Queue Depth

When viewing the performance dashboard, note the depth of the message queue (labeled “Command Queue depth”). If it is rising and you have CPU cores to spare, increasing the number of threads may help to process the backlog more rapidly.

@guillaume-ferry
Copy link
Author

guillaume-ferry commented Oct 23, 2017

Thank you for your help.

Using puppet server 2.8 and puppetDB 4.4 on Ubuntu server Xenial.

Hardware specs :

  • VMware VM
  • 2x8 cores
  • 32 Gb RAM

I followed Puppet / puppetDB tuning guides

and changed the following parameters :

  • /etc/postgresql/9.4/main/postgresql.conf : shared_buffers = 128MB => 4096MB
  • /etc/default/puppetdb : JAVA_ARGS="-Xmx2G"
  • /etc/default/puppetserver : JAVA_ARGS="-Xms5g -Xmx5g -XX:MaxPermSize=256m"
  • /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf : max-active-instances: 4

The puppet module list result is :


/etc/puppetlabs/code/environments/production/modules
├── badgerious-windows_env (v2.2.2)
├── camptocamp-archive (v0.9.0)
├── camptocamp-augeas (v1.6.0)
├── camptocamp-postfix (v1.5.0)
├── camptocamp-systemd (v0.4.0)
├── puppet-puppetboard (v3.0.0)
├── puppet-staging (v2.2.0)
├── puppet-zabbix (v3.0.0)
├── puppetlabs-apache (v1.10.0)
├── puppetlabs-apt (v2.3.0)
├── puppetlabs-chocolatey (v3.0.0)
├── puppetlabs-concat (v2.2.0)
├── puppetlabs-firewall (v1.8.1)
├── puppetlabs-inifile (v1.6.0)
├── puppetlabs-mysql (v3.11.0)
├── puppetlabs-pe_gem (v0.2.0)
├── puppetlabs-postgresql (v4.8.0)
├── puppetlabs-powershell (v2.1.0)
├── puppetlabs-puppetdb (v6.0.1)
├── puppetlabs-registry (v1.1.4)
├── puppetlabs-ruby (v0.6.0)
├── puppetlabs-stdlib (v4.20.0)
├── puppetlabs-vcsrepo (v1.4.0)
├── puppetlabs-xinetd (v2.0.0)
├── spotify-puppetexplorer (v1.1.1)
├── stahnma-epel (v1.2.2)
└── stankevich-python (v1.17.0)

I enabled debug mode. Here is the apache2 logs When puppetboard hangs :


> [Fri Oct 20 18:11:02.217576 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: environments, path: None, query: None, limit: None, offset: None, summarize_by None, count_by None, count_filter: None
> [Fri Oct 20 18:11:02.217700 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: environments and path: None
> [Fri Oct 20 18:11:02.218559 2017] [wsgi:error] [pid 24430] INFO:requests.packages.urllib3.connectionpool:Resetting dropped connection: localhost
> [Fri Oct 20 18:11:02.246800 2017] [wsgi:error] [pid 24430] DEBUG:requests.packages.urllib3.connectionpool:"GET /pdb/query/v4/environments HTTP/1.1" 200 38
> [Fri Oct 20 18:11:02.247521 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: nodes, path: None, query: ["extract",[["function","count"]],["and",["=", "catalog_environment", "FST"],["=", "facts_environment", "FST"]]], limit: None, offset: None, summarize_by None, count_by None, count_filter: None
> [Fri Oct 20 18:11:02.247615 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: nodes and path: None
> [Fri Oct 20 18:11:02.304078 2017] [wsgi:error] [pid 24430] DEBUG:requests.packages.urllib3.connectionpool:"GET /pdb/query/v4/nodes?query=%5B%22extract%22%2C%5B%5B%22function%22%2C%22count%22%5D%5D%2C%5B%22and%22%2C%5B%22%3D%22%2C+%22catalog_environment%22%2C+%22FST%22%5D%2C%5B%22%3D%22%2C+%22facts_environment%22%2C+%22FST%22%5D%5D%5D HTTP/1.1" 200 15
> [Fri Oct 20 18:11:02.304764 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: resources, path: None, query: ["extract",[["function","count"]],["=", "environment", "FST"]], limit: None, offset: None, summarize_by None, count_by None, count_filter: None
> [Fri Oct 20 18:11:02.304888 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: resources and path: None
> [Fri Oct 20 18:11:03.001694 2017] [wsgi:error] [pid 24430] DEBUG:requests.packages.urllib3.connectionpool:"GET /pdb/query/v4/resources?query=%5B%22extract%22%2C%5B%5B%22function%22%2C%22count%22%5D%5D%2C%5B%22%3D%22%2C+%22environment%22%2C+%22FST%22%5D%5D HTTP/1.1" 200 18
> [Fri Oct 20 18:11:03.002421 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: nodes, path: None, query: ["and",["=", "catalog_environment", "FST"],["=", "facts_environment", "FST"]], limit: None, offset: None, summarize_by None, count_by None, count_filter: None
> [Fri Oct 20 18:11:03.002586 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: nodes and path: None
> [Fri Oct 20 18:11:03.062461 2017] [wsgi:error] [pid 24430] DEBUG:requests.packages.urllib3.connectionpool:"GET /pdb/query/v4/nodes?query=%5B%22and%22%2C%5B%22%3D%22%2C+%22catalog_environment%22%2C+%22FST%22%5D%2C%5B%22%3D%22%2C+%22facts_environment%22%2C+%22FST%22%5D%5D HTTP/1.1" 200 None
> [Fri Oct 20 18:11:03.242276 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: event-counts, path: None, query: ["=", "latest_report?", true], limit: None, offset: None, summarize_by certname, count_by None, count_filter: None
> [Fri Oct 20 18:11:03.242399 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: event-counts and path: None
> [Fri Oct 20 18:11:23.256272 2017] [wsgi:error] [pid 24430] ERROR:pypuppetdb.api:Connection to PuppetDB timed out on localhost:8080 over HTTP.
> [Fri Oct 20 18:11:23.257263 2017] [wsgi:error] [pid 24430] ERROR:puppetboard.app:Exception on / [GET]
> [Fri Oct 20 18:11:23.257325 2017] [wsgi:error] [pid 24430] Traceback (most recent call last):
> [Fri Oct 20 18:11:23.257337 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
> [Fri Oct 20 18:11:23.257345 2017] [wsgi:error] [pid 24430]     response = self.full_dispatch_request()
> [Fri Oct 20 18:11:23.257353 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
> [Fri Oct 20 18:11:23.257360 2017] [wsgi:error] [pid 24430]     rv = self.handle_user_exception(e)
> [Fri Oct 20 18:11:23.257367 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
> [Fri Oct 20 18:11:23.257374 2017] [wsgi:error] [pid 24430]     reraise(exc_type, exc_value, tb)
> [Fri Oct 20 18:11:23.257380 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
> [Fri Oct 20 18:11:23.257388 2017] [wsgi:error] [pid 24430]     rv = self.dispatch_request()
> [Fri Oct 20 18:11:23.257394 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
> [Fri Oct 20 18:11:23.257452 2017] [wsgi:error] [pid 24430]     return self.view_functions[rule.endpoint](**req.view_args)
> [Fri Oct 20 18:11:23.257488 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/puppetboard/puppetboard/app.py", line 225, in index
> [Fri Oct 20 18:11:23.257517 2017] [wsgi:error] [pid 24430]     for node in nodes:
> [Fri Oct 20 18:11:23.257547 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/pypuppetdb/api.py", line 401, in nodes
> [Fri Oct 20 18:11:23.257578 2017] [wsgi:error] [pid 24430]     summarize_by='certname'
> [Fri Oct 20 18:11:23.257607 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/pypuppetdb/api.py", line 722, in event_counts
> [Fri Oct 20 18:11:23.257642 2017] [wsgi:error] [pid 24430]     **kwargs)
> [Fri Oct 20 18:11:23.257692 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/pypuppetdb/api.py", line 327, in _query
> [Fri Oct 20 18:11:23.257738 2017] [wsgi:error] [pid 24430]     auth=(self.username, self.password))
> [Fri Oct 20 18:11:23.257774 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/requests/sessions.py", line 476, in get
> [Fri Oct 20 18:11:23.257805 2017] [wsgi:error] [pid 24430]     return self.request('GET', url, **kwargs)
> [Fri Oct 20 18:11:23.257838 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/requests/sessions.py", line 464, in request
> [Fri Oct 20 18:11:23.257880 2017] [wsgi:error] [pid 24430]     resp = self.send(prep, **send_kwargs)
> [Fri Oct 20 18:11:23.257925 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
> [Fri Oct 20 18:11:23.257961 2017] [wsgi:error] [pid 24430]     r = adapter.send(request, **kwargs)
> [Fri Oct 20 18:11:23.257997 2017] [wsgi:error] [pid 24430]   File "/srv/puppetboard/virtenv-puppetboard/lib/python2.7/site-packages/requests/adapters.py", line 433, in send
> [Fri Oct 20 18:11:23.258055 2017] [wsgi:error] [pid 24430]     raise ReadTimeout(e, request=request)
> [Fri Oct 20 18:11:23.258094 2017] [wsgi:error] [pid 24430] ReadTimeout: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=20)
> [Fri Oct 20 18:11:23.258369 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_query called with endpoint: environments, path: None, query: None, limit: None, offset: None, summarize_by None, count_by None, count_filter: None
> [Fri Oct 20 18:11:23.258562 2017] [wsgi:error] [pid 24430] DEBUG:pypuppetdb.api:_url called with endpoint: environments and path: None
> [Fri Oct 20 18:11:23.260347 2017] [wsgi:error] [pid 24430] INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (4): localhost
> [Fri Oct 20 18:11:23.282353 2017] [wsgi:error] [pid 24430] DEBUG:requests.packages.urllib3.connectionpool:"GET /pdb/query/v4/environments HTTP/1.1" 200 38

I noticed that puppetboard systematically hangs with the same error message when trying to load report tab. Even if the PUPPETDB_TIMEOUT value in settings.py is changed from 20 to a higher value, the query still hangs…

When puppetboard works fine, the overview or the nodes tab loads between 8 and 10 seconds…

@mterzo
Copy link
Contributor

mterzo commented Oct 23, 2017

Thanks for the info. Need to digest this.

@mterzo
Copy link
Contributor

mterzo commented Nov 10, 2017

For the report page are you modifying this setting? NORMAL_TABLE_COUNT

I'm also noticing the large spikes in command queue depth, looks like something is blocking.

@mterzo
Copy link
Contributor

mterzo commented Nov 10, 2017

I'm almost curious if you have a herding problem? Do all your nodes checkin at specific times?

@guillaume-ferry
Copy link
Author

Thanks for your answer.

Since our last discussion, I was almost sure that my puppetdb configuration was correct. I assumed that the problem was probably deeper. Maybe in my PostgresSQL conf.

I found in this documentation some clues that led me to tune my Postgres configuration. Specially with the pgtune tool : http://pgfoundry.org/projects/pgtune/

Here are the list of the parameters that pgtune advised me to tune :

------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------

# Add settings for extensions here
listen_addresses = localhost
default_statistics_target = 50 # pgtune wizard 2017-10-26
maintenance_work_mem = 960MB # pgtune wizard 2017-10-26
constraint_exclusion = on # pgtune wizard 2017-10-26
checkpoint_completion_target = 0.9 # pgtune wizard 2017-10-26
effective_cache_size = 11GB # pgtune wizard 2017-10-26
work_mem = 96MB # pgtune wizard 2017-10-26
wal_buffers = 8MB # pgtune wizard 2017-10-26
checkpoint_segments = 16 # pgtune wizard 2017-10-26
shared_buffers = 3840MB # pgtune wizard 2017-10-26
max_connections = 80 # pgtune wizard 2017-10-26

Effectively this new configuration significantly increased the database performance. Queries takes about half the time to execute and puppetboard didn’t not hangs since 2 weeks ago. The puppetDB performance dashboard now looks like this :

pb1311

However, the report tab still hangs with an internal server error in the 1000 nodes environment. As if the number of reports to treat was too hight. But the report-ttl setting in my pupetdb conf is on 14d... Do I need to decrease this setting?

In the test environment, with few nodes, report tab works fine...
I did’nt change the NORMAL_TABLE_COUNT setting. The settings.py looks like this :

DEV_LISTEN_HOST = '127.0.0.1'
DEV_LISTEN_PORT = 5000
LOGLEVEL = 'debug'
PUPPETDB_HOST = 'localhost'
PUPPETDB_PORT = 8080
PUPPETDB_SSL_VERIFY = False
PUPPETDB_TIMEOUT = 60
UNRESPONSIVE_HOURS = 168
ENABLE_CATALOG = False
ENABLE_QUERY = True
LOCALISE_TIMESTAMP = True
OFFLINE_MODE = False
PUPPETDB_EXPERIMENTAL = False
DEFAULT_ENVIRONMENT = 'production'
REPORTS_COUNT = 10

More than 500 nodes are students computer rooms that boots as and when needed... When many classrooms are used, I suppose that students mostly boots those computer simultaneously. It should corresponds to spikes in the performance dashboard...

@gdubicki
Copy link
Contributor

Thanks @guillaume-ferry , your issue and comments helped me a lot in tuning my PuppetDB's Postgres !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants