-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add mssql collector #230
Add mssql collector #230
Conversation
We are transitioning from Bosun/scollector to Prometheus to monitor our systems. the scollector was collecting some mssql metrics and we didn't want to lose that. So this first attempt was to replicate that process. I studied what scollector was doing to gather mssql metrics (https://github.com/bosun-monitor/bosun/blob/master/cmd/scollector/collectors/sql_windows.go) and replicated that in a new `mssql` collector. It works, but there are some things that rub me the wrong way: * I used the collector-generator tool to template 8 WMI classes, then did a lot of cut-and-pasting to merge them into a single collector. * the metrics that were generated were based on SQL Server 2016. I suspect that older version won't have the same WMI fields and may not work. * our servers each only have one instance using the default "MSSQLSERVER" instance name. this will not work on servers with differnet/multiple names. I can't help but think that there's a better way. One thing I'm thinking: * Given a list of WMI class suffixes: ['SQLServerGeneralStatistics', 'SQLServerLocks', ...] * query WMI for a list of classes that match that `select * from Meta_Class WHERE __Class LIKE "Win32_PerfRawData_%_{suffix}` * for each class, dynamically create list of metrics and collect them (still requires PoC) the dynamic nature of that will make the code shorter (less cut-n-pasty) and should work with multiple versions of windows as we will only be generating metrics for classes/values that that instance of windows knows about
instead of looking at just the default "MSSQLSERVER" instance, we now try to find running SQL Sever instances from the registry and iterate over that list to return metrics for each instance.
collector/mssql.go
Outdated
|
||
for _, param := range params { | ||
if val, _, err := k.GetStringValue(param); err == nil { | ||
sqlInstances[param] = val |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are keys and values here?
collector/mssql.go
Outdated
} | ||
} | ||
|
||
log.Debugf("Detected MSSql Instnaces: %#v\n", sqlInstances) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo instances
collector/mssql.go
Outdated
|
||
type sqlInstancesType map[string]string | ||
|
||
var sqlInstances sqlInstancesType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason this is not a field on the MSSQLCollector
struct?
Very nice! I left some minor comments. Apart from those, there's also some general things I'd like to see:
I just had time to skim the list of metrics (quite a few new ones!), will give that a look-over later when I have a bit more time. Again, thanks, this will be a great addition! |
@carlpett: made some more changes. I'm sure there's a few more things I'm missing, but I feel like it's getting close. |
Cleaned up metric names * removed `_persec` and `_kb` * `KB` metrics being multiplied by 1024 to convert to bytes * `timems` metrics being divided by 1000 to convert to seconds * changed prometheus metrics to be consistently snake_cased * added the wmi class name to the metrics so as to avoid name collisions between classes. also makes it easier to intuit where a given metric is coming from. other stuff: * changed the [AvailabilityReplica, DatabaseReplica, Databases, Locks] classes to iterate over the (multiple) results and tag them accordingly * moved the `sqlInstances` var inside the `MSSQLCollector` struct * fixed typos and updated variable names to be better reflect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice progress! Went through all of it this time, and looks mostly good now. Just a few naming things, and some questions
collector/mssql.go
Outdated
[]string{"instance"}, | ||
nil, | ||
), | ||
ExtensionoutstandingIOcounter: prometheus.NewDesc( | ||
prometheus.BuildFQName(Namespace, subsystem, "extensionoutstanding_i_ocounter"), | ||
"(ExtensionoutstandingIOcounter)", | ||
prometheus.BuildFQName(Namespace, subsystem, "bufman_extension_outstanding_io_counter"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the _counter
collector/mssql.go
Outdated
prometheus.BuildFQName(Namespace, subsystem, "repl_trans_rate"), | ||
"(ReplTransRate)", | ||
[]string{"instance"}, | ||
prometheus.BuildFQName(Namespace, subsystem, "databases_repl_trans_rate"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are typical values for this metric? This might also be a confusingly named counter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like all the _rate
metrics are analogous to the persec
ones.
Changing them to Counters and tweaking names to be less confusing.
collector/mssql.go
Outdated
[]string{"instance"}, | ||
nil, | ||
), | ||
SQLAttentionrate: prometheus.NewDesc( | ||
prometheus.BuildFQName(Namespace, subsystem, "sql_attentionrate"), | ||
"(SQLAttentionrate)", | ||
prometheus.BuildFQName(Namespace, subsystem, "sqlstats_sql_attention_rate"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check this for gauge/counter too
collector/mssql.go
Outdated
|
||
const subsystem = "mssql" | ||
return &MSSQLCollector{ | ||
|
||
// Win32_PerfRawData_{instance}_SQLServerAvailabilityReplica | ||
BytesReceivedfromReplicaPersec: prometheus.NewDesc( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the Persec
from the Desc
field names as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
collector/mssql.go
Outdated
prometheus.BuildFQName(Namespace, subsystem, "data_files_size_kb"), | ||
"(DataFilesSizeKB)", | ||
[]string{"instance"}, | ||
prometheus.BuildFQName(Namespace, subsystem, "databases_data_files_size"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for not being clearer about this... The convention is to keep the unit in the name, so _bytes
on all of these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
collector/mssql.go
Outdated
[]string{"instance"}, | ||
nil, | ||
), | ||
Extensionpageunreferencedtime: prometheus.NewDesc( | ||
prometheus.BuildFQName(Namespace, subsystem, "extensionpageunreferencedtime"), | ||
"(Extensionpageunreferencedtime)", | ||
prometheus.BuildFQName(Namespace, subsystem, "bufman_extension_page_unreferenced_time"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know what this measures? Is it absolute time or relative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to Microsoft:
Average seconds a page will stay in the buffer pool extension without references to it.
So I think a Gauge type is appropriate.
Changed name to bufman_extension_page_unreferenced_seconds
* another round of tweaking metric names to better adher to prometheus [naming standards](https://prometheus.io/docs/practices/naming/). Mainly trying to get the appropriate units at the end of the metric names. * removed "PerSec" from description fields as requested. * removed metric `Databases.AvgDistFromEOLLPRequest` - it's not documented, not intuitive what it's measuring and metric values were all 0 in our environments.
@carlpett: Just finished another round of metric name tweaks. We have one SQL server in our anemic test environment that is taking over a minute for the exporter to run (prometheus timeout is set to 15s so we're not getting any metrics from that server). So I'm thinking that it might be a good idea to break this up from one monolithic collector querying 8 WMI classes, into 8 collectors each querying a single WMI Class. Breaking this up would have the following benefits:
Thoughts? |
instead of having a single monolithic `mssql` collector querying 8 WMI classes, decided to break them up into 8 collectors each querying a single WMI Class. Benifits: * the WMI classes will be queried in parallel rather than serial so it should complete a bit faster. * We would have metrics on duration of each WMI class to figure out which is taking the longest. * it would give users granular control over which metrics to collect/exclude. (not everyone needs Availability Group Replication metrics. for example) added a `[mssql]` meta-collector that will include all mssql-* collections (e.g when starting with `--collectors.enabled="[mssql]"`)
Hey, |
For the other set of changes, looks good! This is coming together really nicely :) |
This reverts commit c2c5d5d.
Added ability to run each child collector (WMI class) in parallel via goroutines. (Borrowing logic from the main `exporter.go` functions `Collect()` and `execute()`) Duplicated the `collector_duration_seconds` & `collector_success` metrics from the `main` package so I could expose child-collector duration and success stats. ie: ``` wmi_exporter_collector_duration_seconds{collector="mssql"} 0.0839942 wmi_exporter_collector_duration_seconds{collector="mssql_availreplica"} 0.0569965 wmi_exporter_collector_duration_seconds{collector="mssql_bufman"} 0.0659951 wmi_exporter_collector_duration_seconds{collector="mssql_dbreplica"} 0.0729989 wmi_exporter_collector_duration_seconds{collector="mssql_genstats"} 0.0819979 wmi_exporter_collector_duration_seconds{collector="mssql_locks"} 0.030997 wmi_exporter_collector_duration_seconds{collector="mssql_memmgr"} 0.0399949 wmi_exporter_collector_duration_seconds{collector="mssql_sqlstats"} 0.0489962 ``` Added `kingpin.Flag`s to show available mssql child classes and white|blacklist them: ``` C:\> wmi_exporter.exe --collectors.enabled="mssql" --collectors.mssql.class-print Available SQLServer Classes: - memmgr - sqlstats - availreplica - bufman - databases - dbreplica - genstats - locks ... wmi_exporter.exe --collectors.enabled="mssql" --collector.mssql.class-whitelist="(genstats|bufman|locks)" wmi_exporter.exe --collectors.enabled="mssql" --collector.mssql.class-blacklist="databases" ```
collector/mssql.go
Outdated
|
||
"github.com/StackExchange/wmi" | ||
"github.com/prometheus/client_golang/prometheus" | ||
"github.com/prometheus/common/log" | ||
"golang.org/x/sys/windows/registry" | ||
kingpin "gopkg.in/alecthomas/kingpin.v2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The package name is kingpin
anyway, no need to alias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was VSCode trying to be helpful. I'll remove that.
collector/mssql.go
Outdated
"If true, print available mssql WMI classes", | ||
).Bool() | ||
|
||
mssqlScrapeDurationDesc = prometheus.NewDesc( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this really works? We'll end up with two identically named metrics, which I'm fairly sure results in an error?
Any reason not to throw it in a dedicated metric wmi_mssql_subcollector_duration_seconds
(or similar) instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it actually does work. the resulting metrics look like this:
# HELP wmi_exporter_collector_duration_seconds wmi_exporter: Duration of a collection.
# TYPE wmi_exporter_collector_duration_seconds gauge
wmi_exporter_collector_duration_seconds{collector="cpu"} 1.5950872999999999
wmi_exporter_collector_duration_seconds{collector="cs"} 1.5630859
wmi_exporter_collector_duration_seconds{collector="logical_disk"} 0.9350555
wmi_exporter_collector_duration_seconds{collector="mssql"} 2.6281417
wmi_exporter_collector_duration_seconds{collector="mssql_availreplica"} 1.6430905
wmi_exporter_collector_duration_seconds{collector="mssql_bufman"} 1.6100894000000001
wmi_exporter_collector_duration_seconds{collector="mssql_databases"} 1.9571038
wmi_exporter_collector_duration_seconds{collector="mssql_dbreplica"} 2.6281417
wmi_exporter_collector_duration_seconds{collector="mssql_genstats"} 2.4711424
wmi_exporter_collector_duration_seconds{collector="mssql_locks"} 2.5011408
wmi_exporter_collector_duration_seconds{collector="mssql_memmgr"} 1.970111
wmi_exporter_collector_duration_seconds{collector="mssql_sqlstats"} 1.6220949
wmi_exporter_collector_duration_seconds{collector="net"} 1.5520828
wmi_exporter_collector_duration_seconds{collector="os"} 1.5080826
wmi_exporter_collector_duration_seconds{collector="process"} 2.6321453
wmi_exporter_collector_duration_seconds{collector="service"} 2.4591326000000002
wmi_exporter_collector_duration_seconds{collector="system"} 2.5171416
wmi_exporter_collector_duration_seconds{collector="tcp"} 1.5230799
But if you prefer to have these in their own namespace, I can do that as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestnig. But I think that would be best to change. It might either be that the client library is clever and helpful, or it might even be some luck. Either way, I don't think it is a good idea to have multiple "different" metrics that end up with the same name.
collector/mssql.go
Outdated
mssqlCollectorWhitelist = kingpin.Flag( | ||
"collector.mssql.class-whitelist", | ||
"Regexp of mssql WMI classes to whitelist. Name must both match whitelist and not match blacklist to be included.", | ||
).Default(".+").String() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you consider regexp better than comma-separated list here? Since the exact allowed names are known, regexp doesn't seem to add much? (You can even use the Enums
kingpin type then to get validation early)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the other white/blacklist options (iis, nic, logicaldisk) are using regex, so I'd figure I'd keep things consistent.
But I see your point.
I'll mimic the collectors.enabled
option instead.
collector/mssql.go
Outdated
fmt.Printf("Available SQLServer Classes:\n") | ||
for name := range mssqlCollectors { | ||
fmt.Printf(" - %s\n", name) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be reasonable to exit here?
collector/mssql.go
Outdated
@@ -1324,79 +1390,83 @@ type win32PerfRawDataSQLServerAvailabilityReplica struct { | |||
SendstoTransportPersec uint64 | |||
} | |||
|
|||
func (c *MSSQLCollector) collectAvailabilityReplica(ch chan<- prometheus.Metric, sqlInstance string) (*prometheus.Desc, error) { | |||
func mssqlCollectAvailabilityReplica(c *MSSQLCollector, ch chan<- prometheus.Metric) (*prometheus.Desc, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason not to keep these new collect functions as bound to the MSSQLCollector struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out how to do that AND have the ability to pass the function reference as a parameter to the mssqlExecute()
function.
If you can find an example of someone going that, I'd be happy to change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Do you mean something like this? https://play.golang.org/p/HcgvRIBLCxe
Maybe I'm missing your point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for that nudge. I think I got it.
* changed command line options from regex white/blacklist to simple comma-separated list * added os.exit() when the `--collectors.mssql.class-print` option is given * bound functions back to the (*MSSQLCollector ) object. * spawning go routines for each wmi class & sqlInstance combo rather than just wmi class. * using new metrics to keep track os collector duration/success metrics rather then the root exporter one (wmi_exporter_collector_duration_seconds, etc)
LGTM! 🎉 |
I'll make a release with this right away :) |
Great effort everyone. Looking toward to deploying this update when I return from paternity leave. @szook - just wondering if you are ahead of the game here and have created some grafana dashboards and Prometheus alerts you could share wrapped around this collector? Thank you. Pete. |
@VR6Pete: not very exciting, but I have one dashboard work-in-progress that's based off the 15 SQL Server Performance Counters to Monitor article I found. https://gist.github.com/szook/4d56573ff7529cfeaf7c0b67f9b85902 |
add mssql collector
add new collector
mssql
to collect select SQL Server metrics.WMI Classes collected:
That list was chosen because that was what bosun/scollector was collecting and we are in the process of migrating from Bosun to Pometheus and wanted to keep those metrics.
The collector attempts to get a list of SQL instances from the registry and query metrics for each one.
collector is off by default.