Add mssql collector #230

szook · 2018-07-25T16:06:45Z

add new collector mssql to collect select SQL Server metrics.

WMI Classes collected:

Win32_PerfRawData_{instance}_SQLServerAvailabilityReplica
Win32_PerfRawData_{instance}_SQLServerBufferManager
Win32_PerfRawData_{instance}_SQLServerDatabaseReplica
Win32_PerfRawData_{instance}_SQLServerDatabases
Win32_PerfRawData_{instance}_SQLServerGeneralStatistics
Win32_PerfRawData_{instance}_SQLServerLocks
Win32_PerfRawData_{instance}_SQLServerMemoryManager
Win32_PerfRawData_{instance}_SQLServerSQLStatistics

That list was chosen because it is what bosun/scollector was collecting; we are in the process of migrating from Bosun to Prometheus and wanted to keep those metrics.

The collector attempts to get a list of SQL instances from the registry and query metrics for each one.

The collector is off by default.

We are transitioning from Bosun/scollector to Prometheus to monitor our systems. scollector was collecting some mssql metrics and we didn't want to lose them, so this first attempt replicates that process. I studied what scollector was doing to gather mssql metrics (https://github.com/bosun-monitor/bosun/blob/master/cmd/scollector/collectors/sql_windows.go) and replicated that in a new `mssql` collector.

It works, but there are some things that rub me the wrong way:

* I used the collector-generator tool to template 8 WMI classes, then did a lot of cut-and-pasting to merge them into a single collector.
* The metrics that were generated were based on SQL Server 2016. I suspect that older versions won't have the same WMI fields and may not work.
* Our servers each have only one instance, using the default "MSSQLSERVER" instance name. This will not work on servers with different or multiple instance names.

I can't help but think that there's a better way. One thing I'm thinking:

* Given a list of WMI class suffixes: ['SQLServerGeneralStatistics', 'SQLServerLocks', ...]
* query WMI for a list of classes that match each suffix: `select * from Meta_Class WHERE __Class LIKE "Win32_PerfRawData_%_{suffix}"`
* for each class, dynamically create a list of metrics and collect them (still requires PoC)

The dynamic nature of that will make the code shorter (less cut-n-pasty) and should work with multiple versions of Windows, as we will only be generating metrics for classes/values that that instance of Windows knows about.
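A minimal sketch of the query-building step in that idea, assuming the suffix list and the WQL template quoted above. The names `classSuffixes` and `metaClassQuery` are hypothetical, and the sketch only builds the query strings; it does not talk to WMI.

```go
package main

import "fmt"

// classSuffixes is a hypothetical list of WMI class suffixes to discover.
var classSuffixes = []string{"SQLServerGeneralStatistics", "SQLServerLocks"}

// metaClassQuery builds the Meta_Class query from the description for one
// suffix. "%%" renders a literal "%", which is the WQL wildcard.
func metaClassQuery(suffix string) string {
	return fmt.Sprintf(`select * from Meta_Class WHERE __Class LIKE "Win32_PerfRawData_%%_%s"`, suffix)
}

func main() {
	// Print one discovery query per suffix.
	for _, s := range classSuffixes {
		fmt.Println(metaClassQuery(s))
	}
}
```

Each resulting string would then be handed to whatever WMI client the collector uses; that part still needs the PoC mentioned above.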

instead of looking at just the default "MSSQLSERVER" instance, we now try to find running SQL Server instances from the registry and iterate over that list to return metrics for each instance.
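As a rough illustration of that iteration, the sketch below expands a map of detected instances into WMI class names using the `Win32_PerfRawData_{instance}_SQLServer...` pattern from the description. It assumes the map keys are instance names and ignores the values; `wmiClassName` and `classesForInstances` are hypothetical helpers, and real named instances may need a different class-name form than this literal substitution.

```go
package main

import (
	"fmt"
	"sort"
)

// wmiClassName fills the {instance} placeholder from the class list in the
// PR description. Assumption: the instance name is substituted verbatim.
func wmiClassName(instance, suffix string) string {
	return fmt.Sprintf("Win32_PerfRawData_%s_SQLServer%s", instance, suffix)
}

// classesForInstances returns one class name per (instance, suffix) pair.
// Instance names are sorted so the output order is stable.
func classesForInstances(instances map[string]string, suffixes []string) []string {
	names := make([]string, 0, len(instances))
	for name := range instances {
		names = append(names, name)
	}
	sort.Strings(names)

	var out []string
	for _, name := range names {
		for _, s := range suffixes {
			out = append(out, wmiClassName(name, s))
		}
	}
	return out
}

func main() {
	// The map value is a placeholder; only the key is used here.
	instances := map[string]string{"MSSQLSERVER": "placeholder"}
	for _, class := range classesForInstances(instances, []string{"Locks", "BufferManager"}) {
		fmt.Println(class)
	}
}
```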

carlpett · 2018-07-25T16:10:00Z

collector/mssql.go

+
+	for _, param := range params {
+		if val, _, err := k.GetStringValue(param); err == nil {
+			sqlInstances[param] = val


What are keys and values here?

carlpett · 2018-07-25T16:10:15Z

collector/mssql.go

+		}
+	}
+
+	log.Debugf("Detected MSSql Instnaces: %#v\n", sqlInstances)


Typo instances

carlpett · 2018-07-25T16:12:09Z

collector/mssql.go

+
+type sqlInstancesType map[string]string
+
+var sqlInstances sqlInstancesType


Any reason this is not a field on the MSSQLCollector struct?

carlpett · 2018-07-25T16:19:17Z

Very nice! I left some minor comments. Apart from those, there's also some general things I'd like to see:

* It seems the WMI class reports in kB, but Prometheus exporters should use base units, so bytes in this case. Could you add some conversion on those fields (i.e., multiply by 1024)?
* The persec data from WMI is most often not actually per second rates, but rather counters (they are normalized to rates in Perfmon and friends when displayed later). So remove the _persec suffix from them.

I just had time to skim the list of metrics (quite a few new ones!), will give that a look-over later when I have a bit more time.

Again, thanks, this will be a great addition!

szook · 2018-07-25T19:53:28Z

@carlpett: made some more changes. I'm sure there's a few more things I'm missing, but I feel like it's getting close.

Cleaned up metric names:

* removed `_persec` and `_kb`
* `KB` metrics being multiplied by 1024 to convert to bytes
* `timems` metrics being divided by 1000 to convert to seconds
* changed prometheus metrics to be consistently snake_cased
* added the wmi class name to the metrics so as to avoid name collisions between classes. also makes it easier to intuit where a given metric is coming from.

other stuff:

* changed the [AvailabilityReplica, DatabaseReplica, Databases, Locks] classes to iterate over the (multiple) results and tag them accordingly
* moved the `sqlInstances` var inside the `MSSQLCollector` struct
* fixed typos and updated variable names to be better reflect
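The two unit conversions listed above can be sketched as follows. The helper names `kbToBytes` and `msToSeconds` are hypothetical and not taken from the PR; the factors (1024 and 1000) are the ones stated in the commit message.

```go
package main

import "fmt"

// kbToBytes converts a raw WMI value reported in kilobytes to bytes,
// the base unit Prometheus expects for sizes.
func kbToBytes(kb uint64) float64 {
	return float64(kb) * 1024
}

// msToSeconds converts a raw WMI value reported in milliseconds to seconds,
// the base unit Prometheus expects for durations.
func msToSeconds(ms uint64) float64 {
	return float64(ms) / 1000
}

func main() {
	fmt.Println(kbToBytes(8))     // 8192
	fmt.Println(msToSeconds(250)) // 0.25
}
```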

carlpett

Very nice progress! Went through all of it this time, and looks mostly good now. Just a few naming things, and some questions

carlpett · 2018-07-29T13:23:11Z

collector/mssql.go

 			[]string{"instance"},
 			nil,
 		),
 		ExtensionoutstandingIOcounter: prometheus.NewDesc(
-			prometheus.BuildFQName(Namespace, subsystem, "extensionoutstanding_i_ocounter"),
-			"(ExtensionoutstandingIOcounter)",
+			prometheus.BuildFQName(Namespace, subsystem, "bufman_extension_outstanding_io_counter"),


Drop the _counter

carlpett · 2018-07-29T13:26:22Z

collector/mssql.go

-			prometheus.BuildFQName(Namespace, subsystem, "repl_trans_rate"),
-			"(ReplTransRate)",
-			[]string{"instance"},
+			prometheus.BuildFQName(Namespace, subsystem, "databases_repl_trans_rate"),


What are typical values for this metric? This might also be a confusingly named counter

Looks like all the _rate metrics are analogous to the persec ones.
Changing them to Counters and tweaking names to be less confusing.
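To illustrate why a raw counter is enough, here is a simplified sketch of how a per-second rate can be derived from two successive samples of such a counter. The `sample` type and `perSecond` function are hypothetical; this is plain arithmetic, not how Perfmon or the Prometheus `rate()` function is actually implemented, and counter resets are handled crudely by returning 0.

```go
package main

import "fmt"

// sample is one scrape of a raw, monotonically increasing counter.
type sample struct {
	value   float64 // raw counter value at scrape time
	seconds float64 // scrape timestamp in seconds
}

// perSecond returns the average per-second increase between two samples.
// It returns 0 when no time has elapsed or the counter went backwards.
func perSecond(prev, cur sample) float64 {
	elapsed := cur.seconds - prev.seconds
	if elapsed <= 0 || cur.value < prev.value {
		return 0
	}
	return (cur.value - prev.value) / elapsed
}

func main() {
	prev := sample{value: 100, seconds: 0}
	cur := sample{value: 400, seconds: 15}
	fmt.Println(perSecond(prev, cur)) // 20
}
```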

carlpett · 2018-07-29T13:28:13Z

collector/mssql.go

 			[]string{"instance"},
 			nil,
 		),
 		SQLAttentionrate: prometheus.NewDesc(
-			prometheus.BuildFQName(Namespace, subsystem, "sql_attentionrate"),
-			"(SQLAttentionrate)",
+			prometheus.BuildFQName(Namespace, subsystem, "sqlstats_sql_attention_rate"),


Check this for gauge/counter too

carlpett · 2018-07-29T13:30:35Z

collector/mssql.go


 	const subsystem = "mssql"
 	return &MSSQLCollector{
+
 		// Win32_PerfRawData_{instance}_SQLServerAvailabilityReplica
 		BytesReceivedfromReplicaPersec: prometheus.NewDesc(


Drop the Persec from the Desc field names as well

carlpett · 2018-07-29T13:31:51Z

collector/mssql.go

-			prometheus.BuildFQName(Namespace, subsystem, "data_files_size_kb"),
-			"(DataFilesSizeKB)",
-			[]string{"instance"},
+			prometheus.BuildFQName(Namespace, subsystem, "databases_data_files_size"),


Sorry for not being clearer about this... The convention is to keep the unit in the name, so _bytes on all of these

carlpett · 2018-07-29T13:34:37Z

collector/mssql.go

 			[]string{"instance"},
 			nil,
 		),
 		Extensionpageunreferencedtime: prometheus.NewDesc(
-			prometheus.BuildFQName(Namespace, subsystem, "extensionpageunreferencedtime"),
-			"(Extensionpageunreferencedtime)",
+			prometheus.BuildFQName(Namespace, subsystem, "bufman_extension_page_unreferenced_time"),


Do you know what this measures? Is it absolute time or relative?

According to Microsoft:
Average seconds a page will stay in the buffer pool extension without references to it.

So I think a Gauge type is appropriate.

Changed name to bufman_extension_page_unreferenced_seconds

* another round of tweaking metric names to better adhere to prometheus [naming standards](https://prometheus.io/docs/practices/naming/). Mainly trying to get the appropriate units at the end of the metric names.
* removed "PerSec" from description fields as requested.
* removed metric `Databases.AvgDistFromEOLLPRequest` - it's not documented, it's not intuitive what it's measuring, and metric values were all 0 in our environments.

szook · 2018-07-31T14:04:52Z

@carlpett: Just finished another round of metric name tweaks.

We have one SQL server in our anemic test environment that is taking over a minute for the exporter to run (prometheus timeout is set to 15s so we're not getting any metrics from that server).

So I'm thinking that it might be a good idea to break this up from one monolithic collector querying 8 WMI classes, into 8 collectors each querying a single WMI Class. Breaking this up would have the following benefits:

* the WMI classes will be queried in parallel rather than serial so it should complete a bit faster.
* We would have metrics on duration of each WMI class to figure out which is taking the longest.
* it would give users granular control over which metrics to collect/exclude. (not everyone needs Availability Group Replication metrics)

Thoughts?

instead of having a single monolithic `mssql` collector querying 8 WMI classes, decided to break them up into 8 collectors each querying a single WMI Class.

Benefits:

* the WMI classes will be queried in parallel rather than serial so it should complete a bit faster.
* We would have metrics on duration of each WMI class to figure out which is taking the longest.
* it would give users granular control over which metrics to collect/exclude. (not everyone needs Availability Group Replication metrics, for example)

added a `[mssql]` meta-collector that will include all mssql-* collections (e.g. when starting with `--collectors.enabled="[mssql]"`)

carlpett · 2018-07-31T20:13:50Z

Hey,
Sorry for the slow feedback here, I see you've already gone ahead and implemented your idea. I would prefer if we do not create a bunch of collectors, actually. What you could do instead would be to spin off each of the classes as separate goroutines in the main collect function. This way, we do not get the sprawl of collectors to enable/disable (I'm fairly sure a request for being able to toggle them all would come pretty quick) and keep the related code together. You can still easily add duration metrics over the sub-collectors, as well as hypothetically have some collector-specific flag for toggling individual sub-collectors on/off.
What do you think?

carlpett · 2018-07-31T20:16:57Z

For the other set of changes, looks good! This is coming together really nicely :)

This reverts commit c2c5d5d.

Added ability to run each child collector (WMI class) in parallel via goroutines. (Borrowing logic from the main `exporter.go` functions `Collect()` and `execute()`)

Duplicated the `collector_duration_seconds` & `collector_success` metrics from the `main` package so I could expose child-collector duration and success stats. ie:

```
wmi_exporter_collector_duration_seconds{collector="mssql"} 0.0839942
wmi_exporter_collector_duration_seconds{collector="mssql_availreplica"} 0.0569965
wmi_exporter_collector_duration_seconds{collector="mssql_bufman"} 0.0659951
wmi_exporter_collector_duration_seconds{collector="mssql_dbreplica"} 0.0729989
wmi_exporter_collector_duration_seconds{collector="mssql_genstats"} 0.0819979
wmi_exporter_collector_duration_seconds{collector="mssql_locks"} 0.030997
wmi_exporter_collector_duration_seconds{collector="mssql_memmgr"} 0.0399949
wmi_exporter_collector_duration_seconds{collector="mssql_sqlstats"} 0.0489962
```

Added `kingpin.Flag`s to show available mssql child classes and white|blacklist them:

```
C:\> wmi_exporter.exe --collectors.enabled="mssql" --collectors.mssql.class-print
Available SQLServer Classes:
 - memmgr
 - sqlstats
 - availreplica
 - bufman
 - databases
 - dbreplica
 - genstats
 - locks
...
wmi_exporter.exe --collectors.enabled="mssql" --collector.mssql.class-whitelist="(genstats|bufman|locks)"
wmi_exporter.exe --collectors.enabled="mssql" --collector.mssql.class-blacklist="databases"
```
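A stripped-down sketch of the fan-out described in this commit, using only the standard library. It is not the PR's code: `collectFunc`, `result`, `runAll`, and `errQueryFailed` are hypothetical names, and the child collectors here are stubs rather than WMI queries. It shows one goroutine per child, a `sync.WaitGroup` to wait for all of them, and a per-child duration and success record.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// collectFunc stands in for one child collector (one WMI class query).
type collectFunc func() error

// result records how long a child collector took and whether it succeeded.
type result struct {
	duration time.Duration
	success  bool
}

// errQueryFailed is a stand-in error used by the failing stub below.
var errQueryFailed = errors.New("query failed")

// runAll starts every child collector in its own goroutine, waits for all
// of them to finish, and returns the per-child results keyed by name.
func runAll(children map[string]collectFunc) map[string]result {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex // guards results, which goroutines write concurrently
		results = make(map[string]result, len(children))
	)
	for name, fn := range children {
		wg.Add(1)
		go func(name string, fn collectFunc) {
			defer wg.Done()
			start := time.Now()
			err := fn()
			mu.Lock()
			results[name] = result{duration: time.Since(start), success: err == nil}
			mu.Unlock()
		}(name, fn)
	}
	wg.Wait()
	return results
}

func main() {
	results := runAll(map[string]collectFunc{
		"bufman": func() error { return nil },
		"locks":  func() error { return errQueryFailed },
	})
	for name, r := range results {
		fmt.Printf("%s success=%t duration=%s\n", name, r.success, r.duration)
	}
}
```

With this shape the total scrape time is bounded by the slowest child rather than the sum of all children, which matches the `mssql` versus `mssql_*` durations shown above.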

carlpett · 2018-08-01T18:34:25Z

collector/mssql.go


 	"github.com/StackExchange/wmi"
 	"github.com/prometheus/client_golang/prometheus"
 	"github.com/prometheus/common/log"
 	"golang.org/x/sys/windows/registry"
+	kingpin "gopkg.in/alecthomas/kingpin.v2"


The package name is kingpin anyway, no need to alias.

That was VSCode trying to be helpful. I'll remove that.

carlpett · 2018-08-01T18:36:41Z

collector/mssql.go

+		"If true, print available mssql WMI classes",
+	).Bool()
+
+	mssqlScrapeDurationDesc = prometheus.NewDesc(


Not sure this really works? We'll end up with two identically named metrics, which I'm fairly sure results in an error?
Any reason not to throw it in a dedicated metric wmi_mssql_subcollector_duration_seconds (or similar) instead?

it actually does work. the resulting metrics look like this:

# HELP wmi_exporter_collector_duration_seconds wmi_exporter: Duration of a collection.
# TYPE wmi_exporter_collector_duration_seconds gauge
wmi_exporter_collector_duration_seconds{collector="cpu"} 1.5950872999999999
wmi_exporter_collector_duration_seconds{collector="cs"} 1.5630859
wmi_exporter_collector_duration_seconds{collector="logical_disk"} 0.9350555
wmi_exporter_collector_duration_seconds{collector="mssql"} 2.6281417
wmi_exporter_collector_duration_seconds{collector="mssql_availreplica"} 1.6430905
wmi_exporter_collector_duration_seconds{collector="mssql_bufman"} 1.6100894000000001
wmi_exporter_collector_duration_seconds{collector="mssql_databases"} 1.9571038
wmi_exporter_collector_duration_seconds{collector="mssql_dbreplica"} 2.6281417
wmi_exporter_collector_duration_seconds{collector="mssql_genstats"} 2.4711424
wmi_exporter_collector_duration_seconds{collector="mssql_locks"} 2.5011408
wmi_exporter_collector_duration_seconds{collector="mssql_memmgr"} 1.970111
wmi_exporter_collector_duration_seconds{collector="mssql_sqlstats"} 1.6220949
wmi_exporter_collector_duration_seconds{collector="net"} 1.5520828
wmi_exporter_collector_duration_seconds{collector="os"} 1.5080826
wmi_exporter_collector_duration_seconds{collector="process"} 2.6321453
wmi_exporter_collector_duration_seconds{collector="service"} 2.4591326000000002
wmi_exporter_collector_duration_seconds{collector="system"} 2.5171416
wmi_exporter_collector_duration_seconds{collector="tcp"} 1.5230799

But if you prefer to have these in their own namespace, I can do that as well.

Interesting. But I think that would be best to change. It might either be that the client library is clever and helpful, or it might even be some luck. Either way, I don't think it is a good idea to have multiple "different" metrics that end up with the same name.

carlpett · 2018-08-01T18:38:23Z

collector/mssql.go

+	mssqlCollectorWhitelist = kingpin.Flag(
+		"collector.mssql.class-whitelist",
+		"Regexp of mssql WMI classes to whitelist. Name must both match whitelist and not match blacklist to be included.",
+	).Default(".+").String()


Would you consider regexp better than comma-separated list here? Since the exact allowed names are known, regexp doesn't seem to add much? (You can even use the Enums kingpin type then to get validation early)

all the other white/blacklist options (iis, nic, logicaldisk) are using regex, so I figured I'd keep things consistent.

But I see your point.

I'll mimic the collectors.enabled option instead.
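A small sketch of what a comma-separated option with early validation could look like, assuming the class names printed by the class-print option earlier in the thread. `knownClasses` and `parseClassList` are hypothetical; the sketch does not use kingpin and is not the code that ended up in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// knownClasses holds the child class names listed earlier in the thread.
var knownClasses = map[string]bool{
	"availreplica": true,
	"bufman":       true,
	"databases":    true,
	"dbreplica":    true,
	"genstats":     true,
	"locks":        true,
	"memmgr":       true,
	"sqlstats":     true,
}

// parseClassList splits a comma-separated list and rejects unknown names,
// so a typo fails at startup instead of silently matching nothing.
func parseClassList(list string) ([]string, error) {
	var out []string
	for _, name := range strings.Split(list, ",") {
		name = strings.TrimSpace(name)
		if name == "" {
			continue // tolerate empty entries such as a trailing comma
		}
		if !knownClasses[name] {
			return nil, fmt.Errorf("unknown mssql class %q", name)
		}
		out = append(out, name)
	}
	return out, nil
}

func main() {
	classes, err := parseClassList("genstats,bufman,locks")
	fmt.Println(classes, err)

	_, err = parseClassList("genstats,bogus")
	fmt.Println(err)
}
```

Compared with a regexp, an exact-name list cannot express patterns, but since the set of valid names is fixed that flexibility buys little, which is the point raised in the review comment.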

carlpett · 2018-08-01T18:38:51Z

collector/mssql.go

+		fmt.Printf("Available SQLServer Classes:\n")
+		for name := range mssqlCollectors {
+			fmt.Printf(" - %s\n", name)
+		}


Maybe it would be reasonable to exit here?

carlpett · 2018-08-01T18:41:19Z

collector/mssql.go

@@ -1324,79 +1390,83 @@ type win32PerfRawDataSQLServerAvailabilityReplica struct {
 	SendstoTransportPersec         uint64
 }

-func (c *MSSQLCollector) collectAvailabilityReplica(ch chan<- prometheus.Metric, sqlInstance string) (*prometheus.Desc, error) {
+func mssqlCollectAvailabilityReplica(c *MSSQLCollector, ch chan<- prometheus.Metric) (*prometheus.Desc, error) {


Is there a reason not to keep these new collect functions as bound to the MSSQLCollector struct?

I couldn't figure out how to do that AND have the ability to pass the function reference as a parameter to the mssqlExecute() function.

If you can find an example of someone doing that, I'd be happy to change this.

Hm. Do you mean something like this? https://play.golang.org/p/HcgvRIBLCxe
Maybe I'm missing your point?

thanks for that nudge. I think I got it.
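One way to keep the collect functions bound to the struct and still pass them as parameters is a Go method value: `c.collectLocks` evaluates to a plain function with the receiver already bound. The sketch below is an illustration under that assumption and does not reproduce the linked playground example; the `collector` type, `collectLocks`, and `execute` are hypothetical stand-ins for `MSSQLCollector` and `mssqlExecute()`.

```go
package main

import "fmt"

// collector is a stand-in for the MSSQLCollector struct.
type collector struct {
	instance string
}

// collectLocks is a method bound to the collector, like the per-class
// collect functions discussed above.
func (c *collector) collectLocks() (string, error) {
	return "collected from " + c.instance, nil
}

// execute accepts any function with a matching signature. A method value
// such as c.collectLocks satisfies it because the receiver is pre-bound.
func execute(name string, fn func() (string, error)) string {
	out, err := fn()
	if err != nil {
		return name + ": failed: " + err.Error()
	}
	return name + ": " + out
}

func main() {
	c := &collector{instance: "MSSQLSERVER"}
	// Pass the bound method as an ordinary function reference.
	fmt.Println(execute("locks", c.collectLocks))
}
```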

* changed command line options from regex white/blacklist to simple comma-separated list
* added `os.Exit()` when the `--collectors.mssql.class-print` option is given
* bound functions back to the `(*MSSQLCollector)` object.
* spawning goroutines for each wmi class & sqlInstance combo rather than just wmi class.
* using new metrics to keep track of collector duration/success rather than the root exporter ones (wmi_exporter_collector_duration_seconds, etc)

carlpett · 2018-08-02T06:41:15Z

LGTM! 🎉
Thanks a lot for your efforts!

carlpett · 2018-08-02T06:42:22Z

I'll make a release with this right away :)

pete-leese · 2018-08-02T10:34:47Z

Great effort everyone. Looking forward to deploying this update when I return from paternity leave.

@szook - just wondering if you are ahead of the game here and have created some grafana dashboards and Prometheus alerts you could share wrapped around this collector?

Thank you.

Pete.

szook · 2018-08-03T18:20:29Z

@VR6Pete: not very exciting, but I have one dashboard work-in-progress that's based off the 15 SQL Server Performance Counters to Monitor article I found.

https://gist.github.com/szook/4d56573ff7529cfeaf7c0b67f9b85902

add mssql collector

Steve Zook added 2 commits July 24, 2018 08:51

iterate-sql-instances

d00fdcf

instead of looking at just the default "MSSQLSERVER" instance, we now try to find running SQL Server instances from the registry and iterate over that list to return metrics for each instance.

carlpett reviewed Jul 25, 2018


carlpett reviewed Jul 29, 2018


Steve Zook added 2 commits August 1, 2018 10:01

Revert "split single collector into series of collectors"

6e10182

This reverts commit c2c5d5d.

carlpett reviewed Aug 1, 2018


carlpett merged commit fe4c61a into prometheus-community:master Aug 2, 2018

szook deleted the add-mssql-collector-take1 branch August 3, 2018 18:10

anubhavg-icpl pushed a commit to anubhavg-icpl/windows_exporter that referenced this pull request Sep 22, 2024

Add mssql collector (prometheus-community#230)

6fd78b1

add mssql collector


