Stability problem with a 1000-node puppetserver #430
Interesting. It seems the puppetdb server is unable to handle the load, which doesn't seem right. If this happens frequently, can you enable debug on puppetboard? That will give more detail. What versions of puppetdb and puppetserver are you running?
Here's what the PuppetDB documentation has to say about the command queue depth: "When viewing the performance dashboard, note the depth of the message queue (labeled 'Command Queue depth'). If it is rising and you have CPU cores to spare, increasing the number of threads may help to process the backlog more rapidly."
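For reference, the thread count the documentation refers to lives in PuppetDB's own config. A minimal sketch, assuming the stock Ubuntu package layout; the value is illustrative, not from this thread:

```ini
# /etc/puppetlabs/puppetdb/conf.d/config.ini
# [command-processing] controls how many threads drain the command queue.
# PuppetDB defaults to roughly half the machine's CPU cores.
[command-processing]
threads = 4
```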
Thank you for your help. I'm using Puppet Server 2.8 and PuppetDB 4.4 on Ubuntu Server Xenial. Hardware specs:
I followed the Puppet / PuppetDB tuning guides and changed the following parameters:
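(The actual parameter list didn't survive this copy of the thread. As an illustration only, the first thing those tuning guides point at is the PuppetDB JVM heap, set on Ubuntu in /etc/default/puppetdb; the figure below is a guess for an install of this size, not the poster's value:)

```sh
# /etc/default/puppetdb -- illustrative heap size, not the original setting
# The package default (-Xmx192m) is far too small for ~1000 nodes.
JAVA_ARGS="-Xmx4g"
```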
The output of puppet module list is:
I enabled debug mode. Here are the apache2 logs from when puppetboard hangs:
I noticed that puppetboard systematically hangs with the same error message when trying to load the report tab. Even if the PUPPETDB_TIMEOUT value in settings.py is raised from 20 to a higher value, the query still hangs… When puppetboard works fine, the overview or nodes tab loads in 8 to 10 seconds…
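For context, the timeout in question is a plain Python constant in puppetboard's settings.py; a minimal sketch (the raised value is an example, not the one actually tested here):

```python
# puppetboard settings.py
# HTTP timeout, in seconds, for queries puppetboard sends to PuppetDB.
PUPPETDB_TIMEOUT = 20   # the default; raising it (e.g. to 60) did not help here
```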
Thanks for the info. Need to digest this.
For the report page, are you modifying this setting? I'm also noticing the large spikes in command queue depth; it looks like something is blocking.
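(The link behind "this setting" isn't preserved in this copy of the thread. My guess, labeled as an assumption, is puppetboard's REPORTS_COUNT, which caps how many reports the report pages request:)

```python
# puppetboard settings.py
# Assumption: this is the setting the comment above links to.
# It limits how many reports puppetboard fetches per report page.
REPORTS_COUNT = 10  # puppetboard's default
```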
I'm also curious whether you have a herding problem. Do all your nodes check in at specific times?
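(If that turns out to be the case, the standard mitigation is agent-side splay in puppet.conf, which randomizes each node's check-in offset; a sketch, not something proposed in the original thread:)

```ini
# /etc/puppetlabs/puppet/puppet.conf on the agents
[agent]
splay = true        # add a random delay before each run
splaylimit = 30m    # upper bound on that random delay
```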
Thanks for your answer. Since our last discussion, I was fairly sure that my puppetdb configuration was correct, so I assumed the problem was probably deeper, maybe in my PostgreSQL configuration. I found some clues in the documentation that led me to tune my Postgres configuration, especially with the pgtune tool: http://pgfoundry.org/projects/pgtune/ Here is the list of parameters that pgtune advised me to tune:
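(The list itself didn't survive the copy. For illustration, the parameters pgtune typically emits are the ones below; the values are examples for a mid-sized box, not the ones from the original post:)

```ini
# postgresql.conf -- pgtune-style parameters (illustrative values only)
max_connections = 100
shared_buffers = 4GB                 # roughly 25% of RAM
effective_cache_size = 12GB          # roughly 75% of RAM
work_mem = 32MB
maintenance_work_mem = 512MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
```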
This new configuration significantly increased database performance: queries take about half the time to execute, and puppetboard hasn't hung in the two weeks since. The PuppetDB performance dashboard now looks like this: However, the report tab still hangs with an internal server error in the 1000-node environment, as if the number of reports to process was too high. But the report-ttl setting in my puppetdb conf is 14d... Do I need to decrease this setting? In the test environment, with few nodes, the report tab works fine...
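(For reference, report-ttl lives in PuppetDB's database config; a sketch of the setting as described above, assuming the stock conf.d layout:)

```ini
# /etc/puppetlabs/puppetdb/conf.d/database.ini
[database]
report-ttl = 14d   # reports older than this are garbage-collected
```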
More than 500 of the nodes are in student computer rooms and boot on demand... When many classrooms are in use, I suppose the students mostly boot those computers simultaneously. That should correspond to the spikes in the performance dashboard...
Thanks @guillaume-ferry, your issue and comments helped me a lot in tuning my PuppetDB's Postgres!
Hi,
Thank you for the great job with puppetboard.
Since the number of nodes in our environment reached about 1000, we frequently see this error message when trying to load the board:
On the puppetdb logs side, I see this error:
The puppetdb performance dashboard looks like this in production:
On the apache2 logs side:
When the board "respawns", it works great, but I don't understand why it sometimes fails for several minutes...
I need advice on how to fix it. Maybe reduce the number of nodes on the overview page? Is there a way to limit a long list to the last xx reports or something like this?
Thanks for your help,
Guillaume