Add BIND nameserver stats input plugin #2383
Conversation
@sparrc I realise you guys are busy with the plugin arch migration right now. I simply wanted to re-submit this PR from a branch in my fork (as I should have done from the start), so that I can more easily carry on with other stuff in that fork.
(bump)
Looks good, but we need to add unit tests before we can merge.
@danielnelson The most obvious unit tests would be to include some real-world XML dumps from production BIND servers, to verify the XML document parsing. I have two such files, one for the v2 stats format, and one for the v3 format. Each file is about 512 KB - is it OK to add these in the plugin dir?
I had no idea these could be so large. I googled a bit and found this blog post where a BIND server had a 50 MB response. Based on the contents of that blog post, I read a bit in the 9.10 docs about the split-out XML resources and a JSON version as well. What do you think about trying to use these more specialized resources? It probably does not work with 9.6, but it might perform much better. Also, how does the JSON version compare from a size perspective for you?
The XML stats v2 format is effectively deprecated, but since Debian wheezy ships BIND 9.8, and jessie / wheezy-backports ship BIND 9.9, I decided to implement support for it. The v3 format supported by BIND 9.9+ (if enabled with --enable-newstats) is slightly more efficient, and indeed supports requesting subsets of the whole document. That would require some non-trivial re-tooling of this plugin however, and I'm not sure it would really reduce the overall transfer size that much, even if requesting individual sections.

Having said all that, XML has never been an efficient way to transport data, especially when it involves large hash-sets with hundreds of repeating tags. JSON would be preferable, and it was indeed my longer-term plan to add support for that too. Debian / Ubuntu unfortunately do not currently build their BIND packages with JSON support enabled. CentOS 6 ships BIND 9.8, which supports neither the v3 XML format nor JSON. CentOS 7 ships BIND 9.9.

It doesn't appear that the HTTP server in BIND supports gzip encoding, but a recent dump I took from a small caching resolver with a couple of local zones was approx. 350 KB, i.e. slightly smaller than the dump I saved last year when writing this plugin. Is your concern with the size of the test data that I would need to add to the git repo, or with the actual HTTP transfer size each time Telegraf polls a stats URL?
A little bit of both, as well as what I assume is a complex document. Anything we can do to reduce the contact surface is useful from a maintenance point of view.
Ok, I will set up a new BIND instance with an out-of-the-box config, rather than taking a stats dump from an (albeit small) production server. Perhaps a freshly booted BIND daemon will have much smaller XML stats. I could also strip out nodes from the document that the plugin currently doesn't care about, although this feels a bit like cheating the test. FYI there is a Debian bug report requesting that JSON support be enabled (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=856905), but these can take a while to be actioned, and there are going to be a ton of XML-only BIND servers still out there for a long time.
I think this is a good idea; so long as the section is ignored it shouldn't matter. Sounds like JSON or even the fine-grained XML resources will be too new to use for at least several more years.
@danielnelson Perhaps you can shed some light on a very strange issue I've encountered whilst writing the tests. If you look at https://github.com/dswarbrick/telegraf/blob/bind-input/plugins/inputs/bind/xml_stats_v3.go#L63-L76 for example, you can see that I am initialising the tag map with values that remain constant for all counters of a particular group. In the inner loop, I set / update the tag["name"] element, and then call AddCounter. I found that this worked as expected when running Telegraf in single-shot mode, but when running the unit tests, there were a ton of duplicate metrics in the acc.Metrics array. It's as if the … Of course I could re-initialise the whole …
I believe this is because maps are a reference type, so all that is really copied is a reference to the map, which is why this prints 42:

```go
package main

import "fmt"

func main() {
	x := make(map[string]string)
	y := x
	x["value"] = "42"
	if value, ok := y["value"]; ok {
		fmt.Println(value)
	}
}
```

There is probably an extra copy being made somewhere in the non-unittest code. I think you will have to reinitialize the map, but it's not any worse performance-wise than a copy made implicitly in a function call due to passing by value. Perhaps at some point Go will have a ChainMap like Python's, or it might be possible to have two maps, but I'm not prepared to do that at this point.
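As a self-contained illustration of the aliasing (and of the reinitialise-per-iteration fix suggested above) — this is a hypothetical sketch, not the plugin's actual code:

```go
package main

import "fmt"

// Sketch: reusing one tag map means every stored entry aliases the same
// underlying data, while building a fresh map per counter keeps each
// entry independent.
func main() {
	names := []string{"A", "AAAA", "PTR"}

	shared := map[string]string{"type": "qtype"}
	var aliased []map[string]string
	for _, n := range names {
		shared["name"] = n                // mutates the single shared map
		aliased = append(aliased, shared) // every element points at that map
	}
	fmt.Println(aliased) // all three entries end up with name=PTR

	var independent []map[string]string
	for _, n := range names {
		tags := map[string]string{"type": "qtype", "name": n} // fresh map each iteration
		independent = append(independent, tags)
	}
	fmt.Println(independent) // entries keep A, AAAA and PTR as expected
}
```

Running this prints the first slice with all three entries showing name=PTR, and the second with A, AAAA and PTR preserved.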
Thanks @danielnelson for the hint about maps being reference types. I should have realised... So I think I'm just about done here. My last two commits failed CI, but that was caused by two other plugins that depend on Redis (socket connection refused). I'm not sure what's going on there, but I'm pretty sure it isn't my fault. Do you want me to squash these commits, or will you just do that when you merge the PR?
It doesn't really matter; I always do "squash and merge". The test failures will probably clear up; I'll try to rerun them later.
@danielnelson I implemented support for JSON statistics as we discussed earlier, and refactored the plugin to request the broken-out subset URLs for both the JSON and XML v3 stats formats. XML v2 only supports a monolithic XML document, but support for XML v2 was removed in BIND 9.10 anyway. CI tests failed again for my last commit, but once again it was caused by a different plugin. In any case, I think the BIND plugin is (finally) ready to merge.
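For context, a rough sketch of what polling one of the broken-out v3 XML resources can look like; the port, URL path, and the decoded structure below are assumptions for illustration, not the plugin's actual code:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
)

// Hypothetical, simplified decode of one broken-out BIND v3 statistics
// resource; this struct covers only the counter groups and is not the
// plugin's actual data model.
type v3Server struct {
	CounterGroups []struct {
		Type     string `xml:"type,attr"`
		Counters []struct {
			Name  string `xml:"name,attr"`
			Value string `xml:",chardata"`
		} `xml:"counter"`
	} `xml:"server>counters"`
}

func main() {
	// BIND's statistics channel is assumed to listen on localhost:8053.
	resp, err := http.Get("http://localhost:8053/xml/v3/server")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var stats v3Server
	if err := xml.NewDecoder(resp.Body).Decode(&stats); err != nil {
		panic(err)
	}
	for _, group := range stats.CounterGroups {
		for _, c := range group.Counters {
			fmt.Printf("type=%s name=%s value=%s\n", group.Type, c.Name, c.Value)
		}
	}
}
```

The broken-out JSON endpoints can presumably be polled in the same fashion, decoding with encoding/json instead.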
I looked this over again and have a few more suggestions:
plugins/inputs/bind/README.md
```
### Example Output:

name: bind_counter
```
Put the query output into the "Sample Queries" section and then add some examples of line protocol here. You can filter the output and just show what lines you think would be sufficient to allow the reader to get an idea of the output.
```
### Measurements & Fields:

- bind_counter
    - value
```
On the bind_counter, instead of having a name tag + value, what if we just had each counter as its own field? This way we can use InfluxQL functions and math operators across values.
The problem with that is that the fields are not really known in advance. As new RR types are added to DNS, this will result in new fields, with potentially no known limit to the number of possibilities. This is why I opted to make them tags, rather than fields.
Why is this a problem?
What if people want to group by a particular counter? This is only possible if the counter is a tag.
I think the benefit of a tag here would be an index, but since the counters report every interval, I'm not sure how helpful it would be. I can't think of a query example where you would want to group by counter. @desa, what do you think on this one?
Not sure I totally follow what the underlying issue is. I'd err on the side of having more fields, even if the number is indeterminate.
@danielnelson Does my last comment clarify the situation for you? Essentially, if everything is a field, then creating useful graphs is going to be a very tedious process involving dozens of queries, whereas a single query would suffice if they are stored as tags.
I'm no InfluxQL expert, but I think you can do this all with fields. You can select all fields with `select * ...` or use a regex search with `select /tcp/ ...`. This can be combined with derivative, `select derivative(*, 5m) ...`, or if you are grouping you will need `select derivative(mean(*), 5m) ... group by time(5m)`.
Aha, I wasn't aware that you could use a wildcard in the derivative function. Well, that would produce resulting field names like `derivative_A`, `derivative_AAAA`, `derivative_PTR`, etc. for the DNS query types. I'm not sure how convenient that is for Grafana, however. It also would not be possible to target a specific qtype without effectively doing a sequential scan, since the qtype would not be indexed.

If you want to use fields rather than tags, then I would suggest splitting the qtype, opcode, nsstat and sockstat into separate measurements, because these contain fields that bear no relevance to each other (apart from all being integer counter types). However, since this is going to be yet more substantial refactoring of the plugin, and I really don't have a lot more patience or time for this, I'd prefer that we all get on the same page first. What are the pros for using fields rather than tags?
I agree that the naming is sort of ugly, and Grafana doesn't give you a way to alias against the field name, but I see why you'd want that.

My understanding here is that showing a single field would be as fast as the current schema with tags. This is because InfluxDB stores each field in its own series: `<measurement>,<tag_set>#<field_key>`.

I think it is probably okay to leave qtype, opcode, nsstat, and sockstat as tags, but either way seems fine.

The arguments for having these as fields that I can think of:
- Able to use functions and operators; I'm imagining summing all the connection-related fields.
- Plays nice with current Telegraf metric buffering.
- Plays nice with Kapacitor, which operates on a per-point basis.

If you don't have time to do this I can add a comment about what remains and someone else can take over; it's completely understandable, especially if you are no longer in need of this plugin.
Is there anything I can do to help get this plugin into the next release?
@keith4 Sure, can you finish up the remaining bits of work and open a new pull request?
@danielnelson Can you please reiterate what the "remaining bits of work" are in your opinion? If it's about the data format, maybe you should watch what Paul Dix has to say about InfluxDB's future data model in https://youtu.be/BZkHlhautGk?t=14m40s. I firmly believe that inserting the counters as a type-tag plus value is the most sane way of handling new DNS RR types or other counters that BIND may expose in future. I will rebase this PR on the master branch, and address the documentation nitpicks, but I would be pretty reluctant to change the data insert format.
I still think it is necessary to update the format as we discussed earlier; while in the future we will hopefully be able to do operations/functions across series, this isn't something that exists today. Data-model wise this isn't much of a change for InfluxDB, which already stores its data very much like the proposed new wire format anyway, but Telegraf is optimized for processing the current style, and sending many single-field metrics will perform poorly, especially if an output is slow to respond. Kapacitor also needs its input formatted in this way. I think this will also work out best if/when we add support for any new output formats, because we can use a uniform transformation across all plugins using the current layout style.
Checking in to see if there are any updates on this. If not, perhaps I can attempt to pick up any remaining work on the PR. Would love to see the named plugin as soon as possible.
@blieberman We could use the help, let me know if you have any questions on the above conversation.
Hello, any update on this PR? :(
Any updates? I also need this plugin...
I'm thinking of picking up this subject, but it's been dead for more than a year.
The main piece that is unresolved is the metric model: #2383 (comment). I'm still looking for some example output though; maybe you could start by just running the existing plugin with …
Cool, I'll try to look at it next sprint (~2-3 weeks).
@danielnelson here's the output (as is, just compiled from the branch for now):

Output: …

I've also merged it with current master and it still works ;-)
The primary change is moving the name tag to the field key:

```diff
- bind_counter,url=localhost:8053,type=opcode,name=QUERY,host=resolver2 value=84i
- bind_counter,url=localhost:8053,type=opcode,name=IQUERY,host=resolver2 value=0i
+ bind_counter,host=resolver2,url=localhost:8053,type=opcode query=84i,iquery=0i
```

It might be helpful to use the SeriesGrouper when doing this transformation. I would also like to use the source tag style described in #4413:

```diff
- bind_counter,host=resolver2,url=localhost:8053,type=opcode query=84i,iquery=0i
+ bind_counter,host=resolver2,source=localhost,port=8053,type=opcode query=84i,iquery=0i
```

This brings up some questions about the handling of localhost sources; I'll mention it over on #4413 for discussion.
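For illustration only (this is not the actual SeriesGrouper API): the transformation being discussed groups counters that share a tag set into one metric, with the counter name becoming the field key. A minimal self-contained sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// counter models a single BIND statistic as the plugin currently emits it:
// one name tag and one generic value field per metric.
type counter struct {
	typ   string // e.g. "opcode" or "qtype"
	name  string // e.g. "QUERY", "IQUERY", "A"
	value int64
}

func main() {
	counters := []counter{
		{"opcode", "QUERY", 84},
		{"opcode", "IQUERY", 0},
		{"qtype", "A", 51},
	}

	// Group by the remaining tag set (here just "type"); the lower-cased
	// counter name becomes the field key, matching the proposed layout above.
	grouped := map[string]map[string]int64{}
	for _, c := range counters {
		if grouped[c.typ] == nil {
			grouped[c.typ] = map[string]int64{}
		}
		grouped[c.typ][strings.ToLower(c.name)] = c.value
	}

	for typ, fields := range grouped {
		fmt.Printf("bind_counter,type=%s %v\n", typ, fields)
	}
}
```

In the plugin itself, Telegraf's SeriesGrouper would presumably handle this bookkeeping, including grouping by timestamp.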
And herein lies my main objection with that suggestion, and why I never "finished" this PR. There are six DNS opcodes currently defined. Obviously InfluxDB will create the field the first time this plugin exposes a value for it, and then it's in the schema forever, no matter how rarely it occurs. This is even before adding the fields for the 50+ RR codes currently defined (i.e. potential query types). If you are going to insist on each of those being its own field, I would suggest at the very least using a few different measurements. However, I still think my original approach, or perhaps a compromise between the two, would be preferable, i.e. split out the … The SeriesGrouper approach would not even be necessary if using a …

Such functionality exists in the query language for a good reason. I don't see why it needs to be implemented in the plugin which is collecting the raw data. I think that making everything a field is going to result in a very messy schema, and, as previously mentioned, make the naming of series in Grafana panel legends super clumsy.
Adding everything to one measurement indeed makes the graph legend a mess, so I've tried a compromise (as @dswarbrick suggested) and came up with output like this: …

The graph from this data looks like this: …

@danielnelson, what do you think about it?
@danielllek Did you find a way to strip the ugly `non_negative_derivative` prefix?
So the nice thing about the name tag is that you can extract the tag without the `non_negative_derivative` prefix? I think the only way to work around this is to do the fields one at a time; do you expect most queries will be for all fields?
Maybe it's best explained with a couple of screenshots. These were taken with data from DigitalOcean's bind_exporter for Prometheus. However, using a tag for the query type or response code in this plugin would also make such a thing trivial for InfluxDB. Having to specify fields individually in the query would be quite painful, whereas a simple …
You can alias a function, but it is required to have a prefix: …
That still seems (IMHO) inferior to the tag approach, and selecting a wildcard like that looks fragile. For example, what if at some point a non-numeric field was added to the measurement? Getting rid of unwanted or accidentally added fields from an InfluxDB measurement is not exactly straightforward. |
I'll give you that it is nice that you can get a clean display name in this case, but there are still all the reasons to structure it using field names: …

The arguments for using a tag for the name could apply to every plugin in Telegraf, and the naming of `non_negative_derivative` affects any plugin with counter-style fields. I think this just reduces to a question of whether we should have field keys or not. For now, though, we are not going to restructure everything, and I do not want to introduce schema inconsistency. If some day in the future we decide field keys were a mistake, then it will be easier to make a change if the data all looks the same. I do want to add a pivot processor #5629; we ought to be able to have an unpivot operation that could reverse the process.

@danielllek On …
@danielnelson Thank you for your candor. I will bow out here, as we decided last year to go with Prometheus for our BIND metrics requirements. Hopefully this plugin will fulfill the needs of those who have been waiting for it for a while.
@danielnelson When all values are in one metric, the legend in Grafana looks messy - all value keys are shown, not only those from the specific …
No, unfortunately I didn't.
I'm not following what you mean by … The empty series is a quirk of … I did check with some InfluxQL experts here, and #2383 (comment) is the only option for handling the …
Please take a look at the legend, the columns with names …
Thanks for this insight. I'll create a change that puts all values in one measurement, as you suggested from the beginning.
Merged in #5653