add Terminal service & RemoteFx Collector #491

Merged

asiyani
Contributor

@asiyani asiyani commented Mar 29, 2020

Added collectors for:

  • Terminal Services performance metrics
    Win32_PerfRawData_LocalSessionManager_TerminalServices & Win32_PerfRawData_TermService_TerminalServicesSession

  • Connection Broker performance metrics: total session counts in the RDS farm (only if the host is a Connection Broker)
    Win32_PerfRawData_RemoteDesktopConnectionBrokerPerformanceCounterProvider_RemoteDesktopConnectionBrokerCounterset

  • RemoteFX network and graphics performance metrics
    Win32_PerfRawData_Counters_RemoteFXNetwork & Win32_PerfRawData_Counters_RemoteFXGraphics

@breed808
Contributor

Nice! I'm especially looking forward to using the terminal services collector at work.

From a brief look at the metrics available in perflib, I think the session metrics could be queried via perflib rather than WMI (though not the wmi_terminal_services_connection_broker_performance_total metric). It looks like the Remote FX Graphics/Network metrics can also be queried via perflib.

@asiyani
Contributor Author

asiyani commented Mar 30, 2020

I am making the required changes to use perflib.
Just curious: why is perflib preferred over WMI?

@breed808
Contributor

The preference to use perflib is due to performance and stability reasons. WMI calls are expensive, and sometimes return inconsistent metric data (#89 is a good example).
We should document that perflib is the preferred metric source in the docs or a CONTRIBUTING.md template.

Contributor

@breed808 breed808 left a comment

Looking good! I've added a few comments for some minor changes.
Let me know if you need any clarification.

ch <- prometheus.MustNewConstMetric(
c.BaseTCPRTT,
prometheus.GaugeValue,
float64(d.BaseTCPRTT),
Contributor

You can remove the float64() cast on all metrics as the perflib metrics are already defined as float64 in perflibRemoteFxNetwork and perflibRemoteFxGraphics.
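
To illustrate why the cast is redundant, here is a minimal sketch. The `remoteFxSample` struct is a hypothetical stand-in for the perflib structs named above, assuming (as the comment states) that their fields are already declared as `float64`.

```go
package main

import "fmt"

// remoteFxSample is a hypothetical stand-in for a perflib result struct
// whose numeric fields are already declared as float64.
type remoteFxSample struct {
	Name       string
	BaseTCPRTT float64
}

// metricValue returns the field directly. Wrapping it in float64(...)
// would compile, but it changes nothing because the types already match.
func metricValue(s remoteFxSample) float64 {
	return s.BaseTCPRTT
}

func main() {
	s := remoteFxSample{Name: "RDP-Tcp 0", BaseTCPRTT: 12}
	// The value with and without the conversion is identical.
	fmt.Println(metricValue(s) == float64(s.BaseTCPRTT))
}
```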

d.Name,
)
ch <- prometheus.MustNewConstMetric(
c.TCPReceivedRate,
Contributor

I've run the exporter on my local machine, and a number of the metrics are monotonically increasing. These should be changed to CounterValue rather than the current GaugeValue.
I noted that this was the case for net_total_received_rate, net_total_sent_rate, tcp_sent_rate, tcp_received_rate, net_total_sent_bytes and net_total_received_bytes. There are likely more metrics that need to be changed.
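
One way to run this kind of check by hand is to record a few successive scrapes of a metric and test whether the values ever decrease. The sketch below is only a heuristic under that assumption, and `looksMonotonic` is a hypothetical helper, not part of the exporter. A gauge that happens to stay flat or rise during the sampling window would also pass, so a `true` result is a hint rather than proof.

```go
package main

import "fmt"

// looksMonotonic reports whether every sample is greater than or equal to
// the one before it. Slices with fewer than two samples trivially pass.
func looksMonotonic(samples []float64) bool {
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			return false
		}
	}
	return true
}

func main() {
	// Values only ever grow: a counter candidate.
	fmt.Println(looksMonotonic([]float64{10, 10, 25, 40}))
	// Values go up and down: better modelled as a gauge.
	fmt.Println(looksMonotonic([]float64{10, 7, 25}))
}
```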

Contributor Author

@asiyani asiyani Mar 31, 2020

I need to investigate this further, but when I checked, not all counters were increasing. (The exceptions are the services and console sessions. Since these two are not actually remote sessions, I have added filters for them.)
I will do further testing on my side. If you have some time, I would appreciate your help with this.
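
A filter of the kind described could look roughly like the sketch below. The instance names "Services" and "Console" and the helper names are assumptions for illustration; the actual filter in the PR may be written differently.

```go
package main

import (
	"fmt"
	"strings"
)

// isRemoteSession returns false for the two session names that are assumed
// here to represent non-remote sessions ("Services" and "Console").
func isRemoteSession(name string) bool {
	switch strings.ToLower(name) {
	case "services", "console":
		return false
	}
	return true
}

// filterSessions keeps only the names accepted by isRemoteSession.
func filterSessions(names []string) []string {
	kept := make([]string, 0, len(names))
	for _, n := range names {
		if isRemoteSession(n) {
			kept = append(kept, n)
		}
	}
	return kept
}

func main() {
	fmt.Println(filterSessions([]string{"Services", "Console", "RDP-Tcp 0"}))
}
```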

//gfx
AverageEncodingTime: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "gfx_average_encoding_time"),
"Average frame encoding time.",
Contributor

What unit does this metric represent? Seconds? Milliseconds?

Contributor Author

Milliseconds. I am updating the help text.

),
FECRate: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "net_fec_rate"),
"Forward Error Correction (FEC) percentage",
Contributor

Is this an error percentage for all traffic? I'm not sure what this metric is representing.

Contributor Author

The documentation is not clear, but all of these counters are per session.

),
LossRate: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "net_loss_rate"),
"Loss percentage",
Contributor

Is this a loss percentage for all traffic in the session?

Contributor Author

The documentation is not clear, but all of these counters are per session.

|||
-|-
Metric name prefix | `terminal_services`
Data source | Perflib
Contributor

Will need to be Perflib/WMI as both sources are queried for this collector.

ch <- prometheus.MustNewConstMetric(
c.HandleCount,
prometheus.GaugeValue,
float64(d.HandleCount),
Contributor

Redundant float64() cast can be removed, as variables are initialized as float64.

@breed808
Contributor

breed808 commented Apr 1, 2020

@asiyani thanks for the changes! I'll investigate further on which metrics should be changed to counters. I'll try to compile a list of metrics that can then be reviewed by others.

@asiyani
Contributor Author

asiyani commented Apr 1, 2020

I investigated the counters further. I will change the metric types to the following values. Please let me know if your testing shows something different.

wmi_remote_fx_gfx_average_encoding_time | Gauge (fluctuating values)
wmi_remote_fx_gfx_frame_quality | Gauge (fluctuating values)
wmi_remote_fx_gfx_frames_skipped_persec_insufficient_clt_res | Counter
wmi_remote_fx_gfx_frames_skipped_persec_insufficient_net_res | Counter
wmi_remote_fx_gfx_frames_skipped_persec_insufficient_srv_res | Counter
wmi_remote_fx_gfx_graphics_compression_ratio | Gauge (fluctuating values)
wmi_remote_fx_gfx_input_frames_persec | Counter
wmi_remote_fx_gfx_output_frames_persec | Counter
wmi_remote_fx_gfx_source_frames_persec | Counter

wmi_remote_fx_net_base_udp_rtt | Unknown (should be Gauge, to match base_tcp_rtt)
wmi_remote_fx_net_base_tcp_rtt | Gauge (fluctuating values)
wmi_remote_fx_net_current_tcp_bandwidth | Gauge (fluctuating values)
wmi_remote_fx_net_current_udp_bandwidth | Unknown (should be Gauge, to match current_tcp_bandwidth)
wmi_remote_fx_net_current_tcp_rtt | Gauge (fluctuating values)
wmi_remote_fx_net_current_udp_rtt | Unknown, values are zero (should be the same as current_tcp_rtt)
wmi_remote_fx_net_fec_rate | Unknown, values are zero (should be Counter, to match the other rate values)
wmi_remote_fx_net_loss_rate | Unknown, values are zero (should be Counter, to match the other rate values)
wmi_remote_fx_net_retransmission_rate | Unknown, values are zero (should be Counter, to match the other rate values)
wmi_remote_fx_net_tcp_received_rate | Counter
wmi_remote_fx_net_tcp_sent_rate | Counter
wmi_remote_fx_net_total_received_rate | Counter
wmi_remote_fx_net_total_sent_rate | Counter
wmi_remote_fx_net_total_received_bytes | Counter
wmi_remote_fx_net_total_sent_bytes | Counter
wmi_remote_fx_net_udp_packets_received_persec | Counter
wmi_remote_fx_net_udp_packets_sent_persec | Counter
wmi_remote_fx_net_udp_received_rate | Counter
wmi_remote_fx_net_udp_sent_rate | Counter

wmi_terminal_services_local_session_count | Gauge
wmi_terminal_services_connection_broker_performance_total | Counter
wmi_terminal_services_handle_count | Gauge
wmi_terminal_services_page_fault_per_sec | Gauge
wmi_terminal_services_page_file_bytes | Gauge
wmi_terminal_services_page_file_bytes_peak | Gauge
wmi_terminal_services_percent_privileged_time | Counter
wmi_terminal_services_percent_processor_time | Counter
wmi_terminal_services_percent_user_time | Counter
wmi_terminal_services_pool_non_paged_Bytes | Gauge
wmi_terminal_services_pool_paged_bytes | Gauge
wmi_terminal_services_private_bytes | Gauge
wmi_terminal_services_thread_count | Gauge
wmi_terminal_services_virtual_bytes | Gauge (unsure)
wmi_terminal_services_virtual_bytes_peak | Gauge (unsure)
wmi_terminal_services_workingset | Gauge
wmi_terminal_services_workingset_peak | Gauge (unsure)

@breed808
Contributor

breed808 commented Apr 1, 2020

I've done some testing and can clarify some of the metrics:

wmi_remote_fx_net_base_udp_rtt | Gauge
wmi_remote_fx_net_current_udp_bandwidth | Gauge
wmi_remote_fx_net_current_udp_bandwidth | Gauge
wmi_terminal_services_virtual_bytes | Gauge
wmi_terminal_services_virtual_bytes_peak | Gauge
wmi_terminal_services_workingset_peak | Gauge

I'm uncertain of wmi_remote_fx_net_fec_rate, wmi_remote_fx_net_loss_rate and wmi_remote_fx_net_retransmission_rate as I can't get any data for these metrics on my testing machine. From the name I assume they are counters, as perflib generally exposes a counter for any metric containing "rate".

@asiyani the only change left to make would be to rename the counter metrics, e.g. change wmi_remote_fx_gfx_output_frames_persec to wmi_remote_fx_gfx_output_frames_total so it's clear this metric is a counter.

@carlpett this looks almost complete, did you want to review?
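
The rename suggested above amounts to swapping a trailing `_persec` for `_total`. A minimal sketch, where `counterName` is a hypothetical helper used only to show the mapping:

```go
package main

import (
	"fmt"
	"strings"
)

// counterName replaces a trailing "_persec" with "_total". Names without
// that suffix are returned unchanged, including names where "_persec"
// appears in the middle.
func counterName(name string) string {
	if strings.HasSuffix(name, "_persec") {
		return strings.TrimSuffix(name, "_persec") + "_total"
	}
	return name
}

func main() {
	fmt.Println(counterName("wmi_remote_fx_gfx_output_frames_persec"))
	fmt.Println(counterName("wmi_remote_fx_net_base_tcp_rtt"))
}
```

Names such as wmi_remote_fx_gfx_frames_skipped_persec_insufficient_clt_res carry `_persec` mid-name, so they would need a manual rename rather than this suffix rule.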

@asiyani
Contributor Author

asiyani commented Apr 6, 2020

@breed808 when you have time, please review this last change and let me know if there are any other issues.
Thank you

Contributor

@breed808 breed808 left a comment

Looks good!

@carlpett
Collaborator

carlpett commented Apr 7, 2020

Thank you for this work @asiyani, really nice! And thanks @breed808 for reviewing! Apologies for slow reactions from me.
I think this looks mostly good to go, just some naming things that would be good to fix:

  • Some metrics have units and others do not (i.e. they are missing _bytes or similar), and not all are in base units (milliseconds instead of seconds). Could we harmonize all of that? (See upstream best practices for more details)
  • There are a couple of percent or rate metrics. Are those actually reporting percentages and rates? Those are very often named for how they are intended to be presented after calculations, not for what the actual data is.
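
For the base-unit point, the conversion itself is small. The sketch below assumes the raw perflib value is in milliseconds (as stated earlier in the thread for the average encoding time); `millisecondsToSeconds` is a hypothetical helper name.

```go
package main

import "fmt"

// millisecondsToSeconds converts a raw millisecond reading into seconds,
// the base unit expected for a metric whose name ends in "_seconds".
func millisecondsToSeconds(ms float64) float64 {
	return ms / 1000
}

func main() {
	rawEncodingTimeMs := 250.0 // hypothetical raw counter value
	fmt.Println(millisecondsToSeconds(rawEncodingTimeMs))
}
```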

@asiyani
Contributor Author

asiyani commented Apr 13, 2020

Regarding the _rate metrics in remote_fx_net: most of them report zero values. I am thinking of removing those and keeping only rtt, bandwidth, and the following:
wmi_remote_fx_net_total_received_bytes
wmi_remote_fx_net_total_sent_bytes
wmi_remote_fx_net_udp_packets_received_total
wmi_remote_fx_net_udp_packets_sent_total

This will remove the confusion about _rate, and the rate can be calculated from _total.
What do you think? @breed808 @carlpett

@carlpett
Collaborator

Nice! 👏
Agree, removing the _rate and keeping the _total sounds like a good path forward 👍
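
The reasoning behind keeping only `_total` is that a per-second rate can be derived from two samples of a cumulative counter. In practice this is usually done at query time; the sketch below only illustrates the arithmetic, with `counterSample` and `ratePerSecond` as hypothetical names.

```go
package main

import "fmt"

// counterSample is a hypothetical pair of scrape time and counter value.
type counterSample struct {
	Seconds float64 // time of the scrape, in seconds
	Value   float64 // cumulative counter value at that time
}

// ratePerSecond returns the average per-second increase between two samples.
// The second result is false when no rate can be derived: non-positive
// elapsed time, or a counter that went backwards (for example after a reset).
func ratePerSecond(prev, cur counterSample) (float64, bool) {
	elapsed := cur.Seconds - prev.Seconds
	if elapsed <= 0 || cur.Value < prev.Value {
		return 0, false
	}
	return (cur.Value - prev.Value) / elapsed, true
}

func main() {
	prev := counterSample{Seconds: 0, Value: 1000}
	cur := counterSample{Seconds: 15, Value: 4000}
	if rate, ok := ratePerSecond(prev, cur); ok {
		fmt.Printf("%.0f bytes/s\n", rate)
	}
}
```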

@asiyani
Contributor Author

asiyani commented Apr 14, 2020

The _rate metrics have been removed from remote fx.
Let me know if any other changes need to be made.

@asiyani
Contributor Author

asiyani commented Apr 19, 2020

@carlpett when you have time, please review this for me.

Collaborator

@carlpett carlpett left a comment

Hey,
Sorry, bit of a lack of time this week.
I took another look now, and came upon some things I missed last round. Mainly naming and units, tried to leave suggestions for those so you can hopefully batch that up to skip the more tedious parts if you agree with them.

nil,
),
TotalReceivedBytes: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "net_total_received_bytes"),
Collaborator

Somewhat nitty, but by convention the _total should be at the end

Suggested change
prometheus.BuildFQName(Namespace, subsystem, "net_total_received_bytes"),
prometheus.BuildFQName(Namespace, subsystem, "net_received_bytes_total"),
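
The convention can be expressed as a small string transformation: drop `total` wherever it appears in the name and append it once at the end. `totalAsSuffix` below is a hypothetical helper used only to show the rule, not something the exporter provides.

```go
package main

import (
	"fmt"
	"strings"
)

// totalAsSuffix moves any "total" segment of an underscore-separated metric
// name to the end. Names without a "total" segment are returned unchanged.
func totalAsSuffix(name string) string {
	parts := strings.Split(name, "_")
	kept := make([]string, 0, len(parts))
	found := false
	for _, p := range parts {
		if p == "total" {
			found = true
			continue
		}
		kept = append(kept, p)
	}
	if !found {
		return name
	}
	return strings.Join(append(kept, "total"), "_")
}

func main() {
	fmt.Println(totalAsSuffix("net_total_received_bytes"))
	fmt.Println(totalAsSuffix("net_total_sent_bytes"))
}
```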

nil,
),
TotalSentBytes: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "net_total_sent_bytes"),
Collaborator

Move _total here too

Suggested change
prometheus.BuildFQName(Namespace, subsystem, "net_total_sent_bytes"),
prometheus.BuildFQName(Namespace, subsystem, "net_sent_bytes_total"),


//gfx
AverageEncodingTime: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "gfx_average_encoding_time"),
Collaborator

Should have the unit as a suffix

Suggested change
prometheus.BuildFQName(Namespace, subsystem, "gfx_average_encoding_time"),
prometheus.BuildFQName(Namespace, subsystem, "gfx_average_encoding_time_seconds"),

Are we sure the raw data is actually the average and not a counter of total seconds, by the way?

Contributor Author

Yes, it's a GaugeValue.

nil,
),
PageFaultsPersec: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "page_fault_per_sec"),
Collaborator

Is this actually per_sec or is the raw data a counter?

Contributor Author

Yes, it looks like the raw data is a total, so it is a counter. I will change it.

nil,
),
PercentPrivilegedTime: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "privileged_time_total"),
Collaborator

Per the conventions, we should have the unit here as well

Suggested change
prometheus.BuildFQName(Namespace, subsystem, "privileged_time_total"),
prometheus.BuildFQName(Namespace, subsystem, "privileged_time_seconds_total"),

return &RemoteFxCollector{
// net
BaseTCPRTT: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "net_base_tcp_rtt"),
Collaborator

Missing a unit

Suggested change
prometheus.BuildFQName(Namespace, subsystem, "net_base_tcp_rtt"),
prometheus.BuildFQName(Namespace, subsystem, "net_base_tcp_rtt_seconds"),

nil,
),
BaseUDPRTT: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "net_base_udp_rtt"),
Collaborator

Suggested change
prometheus.BuildFQName(Namespace, subsystem, "net_base_udp_rtt"),
prometheus.BuildFQName(Namespace, subsystem, "net_base_udp_rtt_seconds"),

nil,
),
CurrentTCPRTT: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "net_current_tcp_rtt"),
Collaborator

Suggested change
prometheus.BuildFQName(Namespace, subsystem, "net_current_tcp_rtt"),
prometheus.BuildFQName(Namespace, subsystem, "net_current_tcp_rtt_seconds"),

nil,
),
CurrentUDPRTT: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "net_current_udp_rtt"),
Collaborator

Suggested change
prometheus.BuildFQName(Namespace, subsystem, "net_current_udp_rtt"),
prometheus.BuildFQName(Namespace, subsystem, "net_current_udp_rtt_seconds"),

Comment on lines 125 to 142
FramesSkippedPerSecondInsufficientClientResources: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "gfx_frames_skipped_insufficient_clt_res_total"),
"Number of frames skipped per second due to insufficient client resources.",
[]string{"session_name"},
nil,
),
FramesSkippedPerSecondInsufficientNetworkResources: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "gfx_frames_skipped_insufficient_net_res_total"),
"Number of frames skipped per second due to insufficient network resources.",
[]string{"session_name"},
nil,
),
FramesSkippedPerSecondInsufficientServerResources: prometheus.NewDesc(
prometheus.BuildFQName(Namespace, subsystem, "gfx_frames_skipped_insufficient_srv_res_total"),
"Number of frames skipped per second due to insufficient server resources.",
[]string{"session_name"},
nil,
),
Collaborator

Would it make sense to merge these into a gfx_frames_skipped_total with a reason label? From an alerting perspective that might make it easier to write a query?
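
A rough sketch of what the merged shape could look like, using plain structs instead of the Prometheus client so it stays self-contained. The label values "client", "network" and "server", and the type and function names, are assumptions for illustration.

```go
package main

import "fmt"

// skippedFrames holds the three separate raw counters from the collector.
type skippedFrames struct {
	Client  float64
	Network float64
	Server  float64
}

// labeledSample is one sample of a single metric, distinguished by a
// "reason" label value instead of by metric name.
type labeledSample struct {
	Reason string
	Value  float64
}

// byReason flattens the three counters into samples of one metric.
func byReason(s skippedFrames) []labeledSample {
	return []labeledSample{
		{Reason: "client", Value: s.Client},
		{Reason: "network", Value: s.Network},
		{Reason: "server", Value: s.Server},
	}
}

func main() {
	for _, m := range byReason(skippedFrames{Client: 3, Network: 1, Server: 0}) {
		// Printed in a form resembling the text exposition format.
		fmt.Printf("gfx_frames_skipped_total{reason=%q} %g\n", m.Reason, m.Value)
	}
}
```

With this shape, a single query can sum across reasons or select one of them, which is the alerting benefit raised in the comment.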

@asiyani
Contributor Author

asiyani commented Apr 23, 2020

@carlpett I have made the changes you suggested. Instead of committing the suggestions, I made the same changes in the last commit, as it requires changes to the docs as well.
Let me know if I missed something.

@carlpett
Collaborator

Fantastic work @asiyani, thank you so very much for your contribution and patience! 🎉

@carlpett carlpett merged commit 17324b9 into prometheus-community:master Apr 23, 2020
anubhavg-icpl pushed a commit to anubhavg-icpl/windows_exporter that referenced this pull request Sep 22, 2024