Add BIND nameserver stats input plugin #2383

Closed
wants to merge 16 commits into from

dswarbrick

Required for all PRs:

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

@dswarbrick
Author

@sparrc I realise you guys are busy with the plugin arch migration right now. I simply wanted to re-submit this PR from a branch in my fork (as I should have done from the start), so that I can more easily carry on with other stuff in that fork.

@dswarbrick
Author

dswarbrick commented Apr 20, 2017

(bump)

@danielnelson danielnelson added this to the 1.4.0 milestone Apr 20, 2017
Contributor

@danielnelson danielnelson left a comment

Looks good, but we need to add unit tests before we can merge.

plugins/inputs/bind/README.md (outdated)
plugins/inputs/bind/bind.go (outdated)
plugins/inputs/bind/bind.go (outdated)
@dswarbrick
Author

@danielnelson The most obvious unit tests would be to include some real-world XML dumps from production BIND servers, to verify the XML document parsing. I have two such files, one for v2 stats format, and one for v3 format. Each file is about 512 KB - is it OK to add these in the plugin dir?

@danielnelson
Contributor

I had no idea these could be so large. I googled a bit and found this blog post where a BIND server had a 50 MB response. Based on the contents of this blog post, I read a bit in the 9.10 docs about split-out XML and a JSON version as well.

What do you think about trying to use these more specialized resources? It probably does not work with 9.6, but it might perform much better. Also, how does the JSON version compare from a size perspective for you?

@dswarbrick
Author

dswarbrick commented Jul 25, 2017

The XML stats v2 format is effectively deprecated, but since Debian wheezy ships BIND 9.8, and jessie / wheezy-backports ship BIND 9.9, I decided to implement support for it. The v3 format supported by BIND 9.9+ (if enabled with --enable-newstats) is slightly more efficient, and indeed supports requesting subsets of the whole document. This would require some non-trivial re-tooling of this plugin however, and I'm not sure it would really reduce the overall transfer size that much. If requesting individual sections, the /xml/tasks URL is by far the largest on my server, followed by the /xml/mem URL, whilst the others are around 20 KB each or less. However, the situation may be reversed for a server that is configured with a lot of zones, or a lot of views. Due to the structure of the XML, most of the individual section URLs would still need to be requested in order to scrape the required data, with perhaps the exception of the /xml/tasks URL.

Having said all that, XML has never been an efficient way to transport data, especially when it involves large hash-sets with hundreds of repeating tags. JSON would be preferable, and it was indeed my longer term plan to add support for that too. Debian / Ubuntu unfortunately do not currently build their BIND packages with JSON support enabled. CentOS 6 ships BIND 9.8, which supports neither the v3 XML format nor JSON. CentOS 7 ships BIND 9.9.

It doesn't appear that the http server in BIND supports gzip encoding, but a recent dump I took from a small, caching resolver with a couple of local zones was approx. 350 KB, i.e., slightly smaller than the dump I saved last year when writing this plugin.

Is your concern with the size of the test data that I would need to add to the git repo, or with actual http transfer size each time Telegraf polls a stats URL?

@danielnelson
Contributor

> Is your concern with the size of the test data that I would need to add to the git repo, or with actual http transfer size each time Telegraf polls a stats URL?

A little bit of both, as well as what I assume is a complex document. Anything we can do to reduce the contact surface is useful from a maintenance point of view.

@dswarbrick
Author

OK, I will set up a new BIND instance with out-of-the-box config, rather than taking a stats dump from an (albeit small) production server. Perhaps a freshly-booted BIND daemon will have much smaller XML stats. I could also strip out nodes from the document that the plugin currently doesn't care about, although this feels a bit like cheating the test.

FYI there is a Debian bug report, requesting that JSON support be enabled (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=856905), but these can take a while to be actioned, and there are going to be a ton of XML-only BIND servers still out there for a long time.

@danielnelson
Contributor

> I could also strip out nodes from the document that the plugin currently doesn't care about, although this feels a bit like cheating the test.

I think this is a good idea, so long as the section is ignored it shouldn't matter.
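To illustrate why trimming ignored sections should not change what a test verifies, here is a minimal sketch using Go's encoding/xml, which skips elements that have no matching struct field. The element names, the `stats` and `counter` types, and the `parse` helper are hypothetical and do not reproduce BIND's real statistics schema or the plugin's code.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// stats models only the nodes the parser reads. The element names are
// hypothetical, not BIND's actual schema.
type stats struct {
	Counters []counter `xml:"counters>counter"`
}

type counter struct {
	Name  string `xml:"name"`
	Value int64  `xml:"value"`
}

// fullDoc contains an extra <tasks> section that the struct never maps.
const fullDoc = `<statistics>
  <counters>
    <counter><name>QUERY</name><value>84</value></counter>
    <counter><name>UPDATE</name><value>0</value></counter>
  </counters>
  <tasks><task><name>big</name><state>idle</state></task></tasks>
</statistics>`

// trimmedDoc is the same document with the unmapped section removed.
const trimmedDoc = `<statistics>
  <counters>
    <counter><name>QUERY</name><value>84</value></counter>
    <counter><name>UPDATE</name><value>0</value></counter>
  </counters>
</statistics>`

// parse decodes a document and returns the counters keyed by name.
func parse(doc string) (map[string]int64, error) {
	var s stats
	if err := xml.Unmarshal([]byte(doc), &s); err != nil {
		return nil, err
	}
	out := make(map[string]int64, len(s.Counters))
	for _, c := range s.Counters {
		out[c.Name] = c.Value
	}
	return out, nil
}

func main() {
	full, err := parse(fullDoc)
	if err != nil {
		panic(err)
	}
	trimmed, err := parse(trimmedDoc)
	if err != nil {
		panic(err)
	}
	// Both documents yield the same counters, since <tasks> is never decoded.
	fmt.Println(full)
	fmt.Println(trimmed)
}
```

The caveat is that a trimmed fixture no longer exercises how the decoder copes with the size of a real response, so it verifies parsing correctness only.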

Sounds like JSON or even the fine-grained XML resources will be too new to use for at least several more years.

@dswarbrick
Author

dswarbrick commented Jul 25, 2017

@danielnelson Perhaps you can shed some light on a very strange issue I've encountered whilst writing the tests. If you look at https://github.com/dswarbrick/telegraf/blob/bind-input/plugins/inputs/bind/xml_stats_v3.go#L63-L76 for example, you can see that I am initialising the tag map with values that remain constant for all counters of a particular group. In the inner loop, I set / update the tag["name"] element, and then call AddCounter.

I found that this worked as expected when running Telegraf in single-shot mode, but when running the unit tests, there were a ton of duplicate metrics in the acc.Metrics array. It's as if the tags map is being assigned by pointer to each metric, hence why when I modify a tag element in-place, all resulting metrics always have the very last state of the tags. I don't understand why this behaviour only shows up when running the unit tests. The same symptoms occur with the values map, since I am re-using the "value" element over and over. As far as I can see in the Telegraf source, Metric objects are added to the accumulator by pointer, but the actual Tags member of each Metric object is not assigned by pointer.

Of course I could re-initialise the whole tags map and values map with each iteration of the innermost loop, but this seems inefficient and will generate a lot more allocations for the gc to deal with.

@danielnelson
Contributor

danielnelson commented Jul 25, 2017

I believe this is because maps are a reference type, so all that is really copied is a reference to the map, which is why this prints 42:

x := make(map[string]string)
y := x

x["value"] = "42"
if value, ok := y["value"]; ok {
	fmt.Println(value)
}

There is probably an extra copy being made somewhere in the non-test code. I think you will have to reinitialize the map, but performance-wise it's no worse than a copy made implicitly in a function call due to passing by value. Perhaps at some point Go will have a ChainMap like in Python, or it might be possible to have two maps, but I'm not prepared to do that at this point.
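The symptom described above can be reproduced outside Telegraf. The sketch below assumes a collector that keeps whatever map it is handed; the `point` type and both `collect` functions are hypothetical stand-ins, not Telegraf's accumulator API. It contrasts reusing one tags map with building a fresh map per metric.

```go
package main

import "fmt"

// point is a hypothetical stand-in for a stored metric; it is not Telegraf's type.
type point struct {
	tags  map[string]string
	value int64
}

// collectShared reuses one tags map for every point, so all points alias it
// and end up showing the last name that was written.
func collectShared(names []string) []point {
	tags := map[string]string{"type": "qtype"}
	var pts []point
	for i, name := range names {
		tags["name"] = name
		pts = append(pts, point{tags: tags, value: int64(i)})
	}
	return pts
}

// collectCopied gives each point its own map, copied from the constant tags.
func collectCopied(names []string) []point {
	base := map[string]string{"type": "qtype"}
	var pts []point
	for i, name := range names {
		tags := make(map[string]string, len(base)+1)
		for k, v := range base {
			tags[k] = v
		}
		tags["name"] = name
		pts = append(pts, point{tags: tags, value: int64(i)})
	}
	return pts
}

func main() {
	names := []string{"A", "AAAA", "PTR"}
	for _, p := range collectShared(names) {
		fmt.Println("shared:", p.tags["name"], p.value) // name is always PTR
	}
	for _, p := range collectCopied(names) {
		fmt.Println("copied:", p.tags["name"], p.value) // A, AAAA, PTR
	}
}
```

The per-iteration map costs one small allocation per metric, which is the trade-off discussed in this thread.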

@dswarbrick
Author

Thanks @danielnelson for the hint about the maps being reference types. I should have realised...

So I think I'm just about done here. My last two commits failed CI, but it was caused by two other plugins that depend on redis (socket conn refused). I'm not sure what's going on there, but I'm pretty sure it isn't my fault.

Do you want me to squash these commits, or will you just do that when you merge the PR?

@danielnelson
Contributor

It doesn't really matter, I always do "squash and merge". The test failure will probably clear up; I'll try to rerun it later.

@dswarbrick
Author

@danielnelson I implemented support for JSON statistics as we discussed earlier, and refactored the plugin to request the broken-out subset URLs for both JSON and XML v3 stats format. XML v2 only supports a monolithic XML document, but support for XML v2 was removed in BIND 9.10 anyway.

CI tests failed again for my last commit, but once again it was caused by a different plugin. In any case, I think the BIND plugin is (finally) ready to merge.

Contributor

@danielnelson danielnelson left a comment

Looked over again and have a few more suggestions:

plugins/inputs/bind/README.md (outdated)
### Example Output:

```
name: bind_counter
```
Contributor

Put the query output into the "Sample Queries" section and then add some examples of line protocol here. You can filter the output and just show what lines you think would be sufficient to allow the reader to get an idea of the output.

plugins/inputs/bind/README.md (outdated)
plugins/inputs/bind/bind_test.go (outdated)
### Measurements & Fields:

- bind_counter
  - value
Contributor

On bind_counter, instead of having a name tag plus a value field, I wonder if we should just have each counter as a field. This way we can use InfluxQL functions and math operators across values.

Author

The problem with that is that the fields are not really known in advance. As new RR types are added to DNS, this will result in new fields, with potentially no known limit to the number of possibilities. This is why I opted to make them tags, rather than fields.

Contributor

Why is this a problem?

Author

What if people want to group by a particular counter? This is only possible if the counter is a tag.

Contributor

I think the benefit of a tag here would be an index, but since I think the counters report every interval I'm not sure how helpful it would be. I can't think of a query example where you would want to group by counter. @desa what do you think on this one?

Contributor

Not sure I totally follow what the underlying issue is. I'd err on the side of having more fields, even if the number is indeterminate.

Author

@danielnelson Does my last comment clarify the situation for you? Essentially, if everything is a field, then creating useful graphs is going to be a very tedious process involving dozens of queries, whereas a single query would suffice if they are stored as tags.

Contributor

I'm no InfluxQL expert, but I think you can do this all with fields. You can select all fields with select * ... or using a regex search select /tcp/ .... This can be combined with derivative select derivative(*, 5m) ... or if you are grouping you will need select derivative(mean(*), 5m) ... group by time(5m)

Author

Aha, I wasn't aware that you could use a wildcard in the derivative function. Well that would produce resulting field names like derivative_A, derivative_AAAA, derivative_PTR etc for the DNS query types. I'm not sure how convenient that is for Grafana however. It also would not be possible to target a specific qtype without effectively doing a sequential scan, since the qtype would not be indexed. If you want to use fields rather than tags, then I would suggest splitting the qtype, opcode, nsstat and sockstat into separate measurements, because these contain fields that bear no relevance to each other (apart from all being integer, counter types). However, since this is going to be yet more substantial refactoring of the plugin, and I really don't have a lot more patience or time for this, I'd prefer that we all get on the same page first. What are the pros for using fields rather than tags?

Contributor

I agree that the naming is sort of ugly, and Grafana doesn't give you a way to alias against the field name, but I see why.

My understanding here is that showing a single field would be as fast as the current schema with tags. This is because InfluxDB stores each field in its own series: <measurement>,<tag_set>#<field_key>

I think it is probably okay to leave qtype, opcode, nsstat, and sockstat as tags, but either way seems fine.

The argument for having these as fields that I can think of:

  • Able to use functions and operators; I'm imagining summing all the connection related fields.
  • Plays nice with current Telegraf metric buffering
  • Plays nice with Kapacitor which operates on a per point basis
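For comparison, the two layouts under discussion can be sketched as line protocol built from the same counters. This is an illustration only: `tagStyle`, `fieldStyle`, and `sortedKeys` are hypothetical helpers rather than plugin code, keeping `bind_counter` with a `type` tag in the field layout is an assumption, and timestamps and escaping are omitted. The qtype values are taken from the sample output later in this thread.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sortedKeys returns the counter names in a stable order.
func sortedKeys(m map[string]int64) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// tagStyle emits one single-field point per counter, with the counter name as a tag.
func tagStyle(group string, counters map[string]int64) []string {
	names := sortedKeys(counters)
	lines := make([]string, 0, len(names))
	for _, name := range names {
		lines = append(lines,
			fmt.Sprintf("bind_counter,type=%s,name=%s value=%di", group, name, counters[name]))
	}
	return lines
}

// fieldStyle emits one point per group, with every counter as its own field.
func fieldStyle(group string, counters map[string]int64) string {
	names := sortedKeys(counters)
	fields := make([]string, 0, len(names))
	for _, name := range names {
		fields = append(fields, fmt.Sprintf("%s=%di", name, counters[name]))
	}
	return fmt.Sprintf("bind_counter,type=%s %s", group, strings.Join(fields, ","))
}

func main() {
	qtype := map[string]int64{"A": 28, "AAAA": 56}
	for _, line := range tagStyle("qtype", qtype) {
		fmt.Println(line)
	}
	fmt.Println(fieldStyle("qtype", qtype))
}
```

With the tag layout the number of points per collection grows with the number of counters, while the field layout keeps one point per group and lets the field set grow instead.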

If you don't have time to do this I can add a comment about what remains and someone else can take over, it's completely understandable especially if you are no longer in need of this plugin.

@danielnelson danielnelson added feat and removed feat plugin request labels Aug 12, 2017
@danielnelson danielnelson modified the milestones: 1.5.0, 1.4.0 Aug 16, 2017
@danielnelson danielnelson modified the milestones: 1.5.0, 1.6.0 Nov 30, 2017
@keith4

keith4 commented Dec 15, 2017

Is there anything I can do to help get this plugin into the next release?

@danielnelson
Contributor

@keith4 Sure, can you finish up the remaining bits of work and open a new pull request?

@dswarbrick
Author

dswarbrick commented Dec 16, 2017

@danielnelson Can you please reiterate what the "remaining bits of work" are in your opinion? If it's about the data format, maybe you should watch what Paul Dix has to say about InfluxDB's future data model in https://youtu.be/BZkHlhautGk?t=14m40s.

I firmly believe that inserting the counters as a type-tag plus value is the most sane way of handling new DNS RR types or other counters that BIND may expose in future.

I will rebase this PR on the master branch, and address the documentation nitpicks, but I would be pretty reluctant to change the data insert format.

@danielnelson
Contributor

I still think it is necessary to update the format as we discussed earlier. While in the future we will hopefully be able to do operations/functions across series, this isn't something that exists today. Data model wise, this isn't much of a change for InfluxDB, which already stores its data very much like the proposed new wire format anyway, but Telegraf is optimized for processing the current style, and sending many single-field metrics will perform poorly, especially if an output is slow to respond. Kapacitor also needs its input formatted in this way. I think this will also work out best if/when we add support for any new output formats, because we can use a uniform transformation across all plugins using the current layout style.

@blieberman

blieberman commented Jan 30, 2018

Checking in to see if there are any updates on this. If not, perhaps I can attempt to pick up any remaining work on the PR. Would love to see a named plugin as soon as possible.

@danielnelson
Contributor

@blieberman We could use the help, let me know if you have any questions on the above conversation.

@danielnelson danielnelson removed this from the 1.6.0 milestone Feb 1, 2018
@nerzhul
Contributor

nerzhul commented Aug 6, 2018

Hello,
I also need such a plugin so that I can drop collectd from my server farm. It's one of the missing features we need to ensure we monitor all our components properly.
What is missing from this PR?

@nerzhul
Contributor

nerzhul commented Dec 13, 2018

Any update on this PR? :(

@Alcarin

Alcarin commented Feb 14, 2019

Any updates? I also need this plugin...

@danielllek
Contributor

I'm thinking of picking up this subject, but it has been dead for more than a year.
@danielnelson can you summarize what has to be done (maybe just pointing the right comment)?

@danielnelson
Contributor

The main piece that is unresolved is the metric model: #2383 (comment).

I'm still looking for some example output though, maybe you could start by just running the existing plugin with --test and posting the output?

@danielllek
Contributor

Cool, I'll try to look at it in the next sprint (~2-3 weeks).

@danielllek
Contributor

danielllek commented Mar 26, 2019

@danielnelson here's the output (as is, just compiled from branch for now):

$ telegraf version
Telegraf v1.6.0~43368423 (git: bind-input 43368423)

Output:

* Plugin: inputs.bind, Collection 1
> bind_counter,url=localhost:8053,type=opcode,name=QUERY,host=resolver2 value=84i 1553601110000000000
> bind_counter,url=localhost:8053,type=opcode,name=IQUERY,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=opcode,name=STATUS,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=opcode,name=NOTIFY,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=opcode,name=UPDATE,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=rcode,name=NOERROR,host=resolver2,url=localhost:8053 value=84i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=FORMERR,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=rcode,name=SERVFAIL value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=NXDOMAIN,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=NOTIMP,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=REFUSED,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=YXDOMAIN,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=YXRRSET,host=resolver2 value=0i 1553601110000000000
> bind_counter,name=NXRRSET,host=resolver2,url=localhost:8053,type=rcode value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=rcode,name=NOTAUTH value=0i 1553601110000000000
> bind_counter,name=NOTZONE,host=resolver2,url=localhost:8053,type=rcode value=0i 1553601110000000000
> bind_counter,type=rcode,name=RESERVED11,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=RESERVED12,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=RESERVED13,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=rcode,name=RESERVED14 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=RESERVED15,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=BADVERS,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=17,host=resolver2 value=0i 1553601110000000000
> bind_counter,name=18,host=resolver2,url=localhost:8053,type=rcode value=0i 1553601110000000000
> bind_counter,type=rcode,name=19,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=rcode,name=20 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=21,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=rcode,name=22,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=rcode,name=BADCOOKIE,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=qtype,name=A,host=resolver2,url=localhost:8053 value=28i 1553601110000000000
> bind_counter,url=localhost:8053,type=qtype,name=AAAA,host=resolver2 value=56i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=Requestv4,host=resolver2 value=84i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=Requestv6,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=ReqEdns0,host=resolver2 value=43i 1553601110000000000
> bind_counter,type=nsstat,name=ReqBadEDNSVer,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,type=nsstat,name=ReqTSIG,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=ReqSIG0,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=nsstat,name=ReqBadSIG,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,name=ReqTCP,host=resolver2,url=localhost:8053,type=nsstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=AuthQryRej,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=RecQryRej value=0i 1553601110000000000
> bind_counter,name=XfrRej,host=resolver2,url=localhost:8053,type=nsstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=UpdateRej,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=nsstat,name=Response,host=resolver2,url=localhost:8053 value=84i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=TruncatedResp,host=resolver2 value=0i 1553601110000000000
> bind_counter,name=RespEDNS0,host=resolver2,url=localhost:8053,type=nsstat value=43i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=RespTSIG,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=nsstat,name=RespSIG0,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=QrySuccess value=40i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=QryAuthAns value=36i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=QryNoauthAns value=48i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=QryReferral,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=QryNxrrset,host=resolver2 value=44i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=QrySERVFAIL value=0i 1553601110000000000
> bind_counter,type=nsstat,name=QryFORMERR,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=QryNXDOMAIN,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=QryRecursion,host=resolver2 value=45i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=QryDuplicate,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=QryDropped,host=resolver2 value=0i 1553601110000000000
> bind_counter,name=QryFailure,host=resolver2,url=localhost:8053,type=nsstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=XfrReqDone,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=UpdateReqFwd value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=UpdateRespFwd,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=UpdateFwdFail value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=UpdateDone,host=resolver2 value=0i 1553601110000000000
> bind_counter,name=UpdateFail,host=resolver2,url=localhost:8053,type=nsstat value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=UpdateBadPrereq value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=RecursClients,host=resolver2 value=0i 1553601110000000000
> bind_counter,name=DNS64,host=resolver2,url=localhost:8053,type=nsstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=RateDropped,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=RateSlipped,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=RPZRewrites value=0i 1553601110000000000
> bind_counter,name=QryUDP,host=resolver2,url=localhost:8053,type=nsstat value=84i 1553601110000000000
> bind_counter,type=nsstat,name=QryTCP,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=NSIDOpt,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=ExpireOpt,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=OtherOpt,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=CookieIn,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=CookieNew value=0i 1553601110000000000
> bind_counter,type=nsstat,name=CookieBadSize,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,type=nsstat,name=CookieBadTime,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=CookieNoMatch,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=CookieMatch value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=ECSOpt,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=nsstat,name=QryNXRedir value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=QryNXRedirRLookup,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=nsstat,name=QryBADCOOKIE,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=nsstat,name=KeyTagOpt,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=NotifyOutv4,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=NotifyOutv6,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=NotifyInv4,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=NotifyInv6,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=NotifyRej,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=SOAOutv4,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=SOAOutv6,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=AXFRReqv4,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=AXFRReqv6,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=IXFRReqv4,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=IXFRReqv6,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=zonestat,name=XfrSuccess,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=zonestat,name=XfrFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UDP4Open,host=resolver2 value=74i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=sockstat,name=UDP6Open value=0i 1553601110000000000
> bind_counter,name=TCP4Open,host=resolver2,url=localhost:8053,type=sockstat value=8i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP6Open,host=resolver2 value=0i 1553601110000000000
> bind_counter,name=UnixOpen,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=RawOpen,host=resolver2 value=1i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=sockstat,name=UDP4OpenFail value=0i 1553601110000000000
> bind_counter,name=UDP6OpenFail,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,name=TCP4OpenFail,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP6OpenFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UnixOpenFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=sockstat,name=RawOpenFail value=0i 1553601110000000000
> bind_counter,name=UDP4Close,host=resolver2,url=localhost:8053,type=sockstat value=72i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UDP6Close,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=sockstat,name=TCP4Close,host=resolver2,url=localhost:8053 value=15i 1553601110000000000
> bind_counter,name=TCP6Close,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,name=UnixClose,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=FDWatchClose,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=RawClose,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UDP4BindFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=sockstat,name=UDP6BindFail value=0i 1553601110000000000
> bind_counter,type=sockstat,name=TCP4BindFail,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,name=TCP6BindFail,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UnixBindFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,name=FdwatchBindFail,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UDP4ConnFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=sockstat,name=UDP6ConnFail,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP4ConnFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP6ConnFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=sockstat,name=UnixConnFail,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=FDwatchConnFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=sockstat,name=UDP4Conn value=72i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UDP6Conn,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP4Conn,host=resolver2 value=2i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP6Conn,host=resolver2 value=0i 1553601110000000000
> bind_counter,name=UnixConn,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=FDwatchConn,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP4AcceptFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP6AcceptFail,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=sockstat,name=UnixAcceptFail,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP4Accept,host=resolver2 value=13i 1553601110000000000
> bind_counter,name=TCP6Accept,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UnixAccept,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UDP4SendErr,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UDP6SendErr,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP4SendErr,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP6SendErr,host=resolver2 value=0i 1553601110000000000
> bind_counter,type=sockstat,name=UnixSendErr,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=FDwatchSendErr,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UDP4RecvErr,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UDP6RecvErr,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP4RecvErr,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP6RecvErr,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UnixRecvErr,host=resolver2 value=0i 1553601110000000000
> bind_counter,name=FDwatchRecvErr,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,host=resolver2,url=localhost:8053,type=sockstat,name=RawRecvErr value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UDP4Active,host=resolver2 value=2i 1553601110000000000
> bind_counter,name=UDP6Active,host=resolver2,url=localhost:8053,type=sockstat value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=TCP4Active,host=resolver2 value=6i 1553601110000000000
> bind_counter,type=sockstat,name=TCP6Active,host=resolver2,url=localhost:8053 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=UnixActive,host=resolver2 value=0i 1553601110000000000
> bind_counter,url=localhost:8053,type=sockstat,name=RawActive,host=resolver2 value=1i 1553601110000000000
> bind_memory,host=resolver2,url=localhost:8053 total_use=16604652i,in_use=2600629i,block_size=8912896i,context_size=3507392i,lost=0i 1553601110000000000

I've also merged it with current master and it still works ;-)
https://github.com/danielllek/telegraf/tree/bind-input

@danielnelson
Contributor

The primary change is moving the name tag to the field key.

- bind_counter,url=localhost:8053,type=opcode,name=QUERY,host=resolver2 value=84i
- bind_counter,url=localhost:8053,type=opcode,name=IQUERY,host=resolver2 value=0i
+ bind_counter,host=resolver2,url=localhost:8053,type=opcode query=84i,iquery=0i

It might be helpful to use the SeriesGrouper when doing this transformation.
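As a rough illustration of that transformation, here is a minimal stdlib-only sketch that folds name-tagged counters into one field set per type. It does not use Telegraf's SeriesGrouper itself; the `counter` struct and the `groupByType` and `formatFields` helpers are hypothetical names invented for this example, and lower-casing the field keys simply mirrors the diff above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// counter is a hypothetical stand-in for one name-tagged point,
// e.g. type=opcode,name=QUERY value=84i.
type counter struct {
	Type  string
	Name  string
	Value int64
}

// groupByType folds name-tagged counters into one field map per "type" tag,
// lower-casing the counter name to form the field key.
func groupByType(points []counter) map[string]map[string]int64 {
	grouped := make(map[string]map[string]int64)
	for _, p := range points {
		fields, ok := grouped[p.Type]
		if !ok {
			fields = make(map[string]int64)
			grouped[p.Type] = fields
		}
		fields[strings.ToLower(p.Name)] = p.Value
	}
	return grouped
}

// formatFields renders a field map in line-protocol style with sorted keys.
func formatFields(fields map[string]int64) string {
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%di", k, fields[k]))
	}
	return strings.Join(parts, ",")
}

func main() {
	points := []counter{
		{Type: "opcode", Name: "QUERY", Value: 84},
		{Type: "opcode", Name: "IQUERY", Value: 0},
	}
	for typ, fields := range groupByType(points) {
		fmt.Printf("bind_counter,type=%s %s\n", typ, formatFields(fields))
	}
}
```

In the plugin itself, the grouping would presumably be keyed on the full tag set and timestamp rather than on the type alone.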

I would also like to use the source tag style described #4413:

- bind_counter,host=resolver2,url=localhost:8053,type=opcode query=84i,iquery=0i
+ bind_counter,host=resolver2,source=localhost,port=8053,type=opcode query=84i,iquery=0i

This brings up some questions about handling of localhost sources, I'll mention it over on #4413 for discussion.
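One possible way to derive the two tags from the existing `url` value is sketched below, assuming the value is always a plain `host:port` pair. The `sourceTags` helper is a hypothetical name for this example, not something from the plugin.

```go
package main

import (
	"fmt"
	"net"
)

// sourceTags is a hypothetical helper that splits a "host:port" address
// into separate source and port tags.
func sourceTags(addr string) (map[string]string, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	return map[string]string{"source": host, "port": port}, nil
}

func main() {
	tags, err := sourceTags("localhost:8053")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("source=%s,port=%s\n", tags["source"], tags["port"])
}
```

An address without a port makes `net.SplitHostPort` return an error, so the caller would need to decide on a fallback for that case.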

@dswarbrick
Author

dswarbrick commented Mar 26, 2019

> The primary change is moving the name tag to the field key.
>
> - bind_counter,url=localhost:8053,type=opcode,name=QUERY,host=resolver2 value=84i
> - bind_counter,url=localhost:8053,type=opcode,name=IQUERY,host=resolver2 value=0i
> + bind_counter,host=resolver2,url=localhost:8053,type=opcode query=84i,iquery=0i

And herein lies my main objection to that suggestion, and why I never "finished" this PR. There are six DNS opcodes currently defined. Obviously InfluxDB will create the field the first time that this plugin exposes a value for it, and then it's in the schema forever, however rarely it may occur. This is even before adding the fields for the 50+ RR codes currently defined (i.e. potential query types).

If you are going to insist on each of those being its own field, I would suggest at the very least using a few different measurements.

However I still think my original approach, or perhaps a compromise between the two, would be preferable, i.e. split out the bind_counter measurement by "type", e.g. bind_rcode, bind_opcode, bind_sockstat, bind_qtype etc, whilst still using a tag for the "name" of the counter.

The SeriesGrouper approach would not even be necessary if using a name tag as originally proposed, because things like this would be trivial:

SELECT sum(value) FROM bind_qtype WHERE name =~ /^(A|AAAA|PTR)$/

Such functionality exists in the query language for a good reason. I don't see why it needs to be implemented in the plugin which is collecting the raw data.
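One detail worth checking when filtering counter names with a regular expression is that alternation binds more loosely than the anchors, so the alternatives need to be grouped for `^` and `$` to apply to all of them. The sketch below uses Go's `regexp` package on the assumption that InfluxQL follows Go's regular expression syntax; the variable names `loose` and `strict` are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Without grouping: "starts with A" OR "contains AAAA" OR "ends with PTR".
	loose = regexp.MustCompile(`^A|AAAA|PTR$`)
	// With grouping: exactly "A", "AAAA" or "PTR".
	strict = regexp.MustCompile(`^(A|AAAA|PTR)$`)
)

func main() {
	for _, name := range []string{"A", "AAAA", "PTR", "AXFR", "ANY"} {
		fmt.Printf("%-5s loose=%t strict=%t\n",
			name, loose.MatchString(name), strict.MatchString(name))
	}
}
```

With the ungrouped pattern, query types such as AXFR and ANY would be summed in as well, because they start with "A".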

I think that making everything a field is going to result in a very messy schema, and as previously mentioned, make the naming of series in Grafana panel legends super clumsy.

@danielllek
Contributor

danielllek commented Mar 27, 2019

Putting everything into one measurement does indeed make the graph legend a mess, so I tried a compromise (as @dswarbrick suggested) and came up with this output:

> bind_memory,host=dns1,port=8053,source=localhost,url=localhost:8053 block_size=17563648i,context_size=4262728i,in_use=3345798i,lost=0i,total_use=9670505409i 1553691559000000000
> bind_counter_opcode,host=dns1,port=8053,source=localhost,url=localhost:8053 IQUERY=0i,NOTIFY=0i,QUERY=17768591i,STATUS=0i,UPDATE=0i 1553691559000000000
> bind_counter_rcode,host=dns1,port=8053,source=localhost,url=localhost:8053 17=0i,18=0i,19=0i,20=0i,21=0i,22=0i,BADCOOKIE=0i,BADVERS=0i,FORMERR=0i,NOERROR=10831531i,NOTAUTH=0i,NOTIMP=0i,NOTZONE=0i,NXDOMAIN=6865082i,NXRRSET=0i,REFUSED=0i,RESERVED11=0i,RESERVED12=0i,RESERVED13=0i,RESERVED14=0i,RESERVED15=0i,SERVFAIL=71798i,YXDOMAIN=0i,YXRRSET=0i 1553691559000000000
> bind_counter_qtype,host=dns1,port=8053,source=localhost,url=localhost:8053 A=4231528i,AAAA=3952727i,AXFR=1i,CNAME=32i,IXFR=56i,NS=19i,PTR=9572965i,SOA=215i,SRV=11023i,TXT=25i 1553691559000000000
> bind_counter_nsstat,host=dns1,port=8053,source=localhost,url=localhost:8053 AuthQryRej=0i,CookieBadSize=0i,CookieBadTime=0i,CookieIn=0i,CookieMatch=0i,CookieNew=0i,CookieNoMatch=0i,DNS64=0i,ECSOpt=0i,ExpireOpt=183i,KeyTagOpt=0i,NSIDOpt=0i,OtherOpt=14i,QryAuthAns=4121421i,QryBADCOOKIE=0i,QryDropped=0i,QryDuplicate=132i,QryFORMERR=0i,QryFailure=0i,QryNXDOMAIN=6865082i,QryNXRedir=0i,QryNXRedirRLookup=0i,QryNoauthAns=13575183i,QryNxrrset=3226876i,QryRecursion=1424737i,QryReferral=0i,QrySERVFAIL=71798i,QrySuccess=7604646i,QryTCP=1i,QryUDP=17712024i,RPZRewrites=0i,RateDropped=0i,RateSlipped=0i,RecQryRej=0i,RecursClients=0i,ReqBadEDNSVer=0i,ReqBadSIG=0i,ReqEdns0=565844i,ReqSIG0=0i,ReqTCP=49i,ReqTSIG=0i,Requestv4=17768591i,Requestv6=0i,RespEDNS0=565816i,RespSIG0=0i,RespTSIG=0i,Response=17768411i,TruncatedResp=3i,UpdateBadPrereq=0i,UpdateDone=0i,UpdateFail=0i,UpdateFwdFail=0i,UpdateRej=0i,UpdateReqFwd=0i,UpdateRespFwd=0i,XfrRej=0i,XfrReqDone=48i 1553691559000000000
> bind_counter_zonestat,host=dns1,port=8053,source=localhost,url=localhost:8053 AXFRReqv4=0i,AXFRReqv6=0i,IXFRReqv4=0i,IXFRReqv6=0i,NotifyInv4=0i,NotifyInv6=0i,NotifyOutv4=52i,NotifyOutv6=0i,NotifyRej=0i,SOAOutv4=0i,SOAOutv6=0i,XfrFail=0i,XfrSuccess=0i 1553691559000000000
> bind_counter_sockstat,host=dns1,port=8053,source=localhost,url=localhost:8053 FDWatchClose=0i,FDwatchConn=0i,FDwatchConnFail=0i,FDwatchRecvErr=0i,FDwatchSendErr=0i,FdwatchBindFail=0i,RawActive=1i,RawClose=0i,RawOpen=1i,RawOpenFail=0i,RawRecvErr=0i,TCP4Accept=166i,TCP4AcceptFail=0i,TCP4Active=6i,TCP4BindFail=0i,TCP4Close=6682i,TCP4Conn=3386i,TCP4ConnFail=0i,TCP4Open=6522i,TCP4OpenFail=0i,TCP4RecvErr=66i,TCP4SendErr=0i,TCP6Accept=0i,TCP6AcceptFail=0i,TCP6Active=0i,TCP6BindFail=0i,TCP6Close=0i,TCP6Conn=0i,TCP6ConnFail=0i,TCP6Open=0i,TCP6OpenFail=0i,TCP6RecvErr=0i,TCP6SendErr=0i,UDP4Active=2i,UDP4BindFail=28i,UDP4Close=1972164i,UDP4Conn=1972116i,UDP4ConnFail=0i,UDP4Open=1972166i,UDP4OpenFail=0i,UDP4RecvErr=0i,UDP4SendErr=0i,UDP6Active=0i,UDP6BindFail=0i,UDP6Close=0i,UDP6Conn=0i,UDP6ConnFail=0i,UDP6Open=0i,UDP6OpenFail=0i,UDP6RecvErr=0i,UDP6SendErr=0i,UnixAccept=0i,UnixAcceptFail=0i,UnixActive=0i,UnixBindFail=0i,UnixClose=0i,UnixConn=0i,UnixConnFail=0i,UnixOpen=0i,UnixOpenFail=0i,UnixRecvErr=0i,UnixSendErr=0i 1553691559000000000

A graph from this data looks like this:
image

@danielnelson what do you think about it?

@dswarbrick
Author

@danielllek Did you find a way to strip the ugly non_negative_derivative_ prefix from the graph legend?

@danielnelson
Contributor

So the nice thing about the name tag is that you can extract the tag value without the non_negative_derivative prefix? I think the only way to work around this is to query the fields one at a time. Do you expect most queries to be for all fields?

@dswarbrick
Author

dswarbrick commented Mar 27, 2019

Maybe it's best explained with a couple of screenshots.

bind_exporter1
bind_exporter2

These were taken with data from DigitalOcean's bind_exporter for Prometheus. However, using a tag for the query type or response code in this plugin would also make such a thing trivial for InfluxDB.

Having to specify fields individually in the query would be quite painful, whereas a simple GROUP BY tag would streamline such a thing.

@danielnelson
Contributor

You can alias a function, but it is required to have a prefix:

select non_negative_derivative(*, 10s) as "delta" from bind_counter

@dswarbrick
Author

That still seems (IMHO) inferior to the tag approach, and selecting a wildcard like that looks fragile. For example, what if at some point a non-numeric field was added to the measurement? Getting rid of unwanted or accidentally added fields from an InfluxDB measurement is not exactly straightforward.

@danielnelson
Contributor

I'll give you that it is nice that you can get a clean display name in this case, but there are still all the reasons to structure it using field names:

  • Better write performance and memory usage
  • Multiple value operations in InfluxQL and Kapacitor
  • Plays nicer with Telegraf buffer sizes
  • Uses the same style as other metrics. This is probably the most important argument for me, because it provides a consistent experience and allows outputs to transform the data in a way that will work in their database.

The arguments for using a tag for the name could apply to every plugin in Telegraf and the naming of non_negative_derivative affects any plugin with counter style fields. I think this just reduces to a question of should we have field keys or not. For now though, we are not going to restructure everything and I do not want to introduce schema inconsistency. If in the future some day we decide field keys were a mistake then it will be easier to make a change if the data all looks the same.

I do want to add a pivot processor (#5629); we ought to be able to have an unpivot operation that could reverse the process.
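To make the unpivot idea concrete, here is a minimal sketch of what such an operation might do to a single metric: each field key becomes the value of a `name` tag and the number moves to a single `value` field. The `taggedPoint` type and `unpivot` function are hypothetical and are not the API of the processor proposed in #5629.

```go
package main

import (
	"fmt"
	"sort"
)

// taggedPoint is a hypothetical unpivoted point: the former field key is
// carried as the "name" tag and the number as a single "value" field.
type taggedPoint struct {
	Name  string
	Value int64
}

// unpivot turns one multi-field metric into one point per field,
// ordered by field key so the output is deterministic.
func unpivot(fields map[string]int64) []taggedPoint {
	keys := make([]string, 0, len(fields))
	for k := range fields {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	points := make([]taggedPoint, 0, len(keys))
	for _, k := range keys {
		points = append(points, taggedPoint{Name: k, Value: fields[k]})
	}
	return points
}

func main() {
	fields := map[string]int64{"query": 84, "iquery": 0}
	for _, p := range unpivot(fields) {
		fmt.Printf("bind_counter,type=opcode,name=%s value=%di\n", p.Name, p.Value)
	}
}
```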

@danielllek On bind_counter_sockstat vs bind_counter,type=sockstat, I'm not sure I see it as an improvement. Is there anything this allows you to do more easily?

@dswarbrick
Author

@danielnelson Thank you for your candor. I will bow out here, as we decided last year to go with Prometheus for our BIND metrics requirements.

Hopefully this plugin will fulfill the needs of those who have been waiting for it a while.

@danielllek
Contributor

@danielnelson When all values are in one metric, the legend in Grafana looks messy: all field keys are shown, not only those for the specific type.
Of course I can choose to hide series containing only nulls/zeros, but much more data is still fetched from the datasource.
I know it's not a strong argument, so I'll change it as you advise :)
Example without hiding series with zeros/nulls:
image

@danielllek
Contributor

> @danielllek Did you find a way to strip the ugly non_negative_derivative_ prefix from the graph legend?

No, unfortunately I didn't.

@danielnelson
Contributor

I'm not following what you mean by type here, you already have where type = 'qtype', are you saying that it is not working?

The empty series is a quirk of select *, might be better to give a list of fields depending on how much you want to know about more obscure queries. The complexity of this query vs one with tags is essentially the same, InfluxDB can do this type of query efficiently.

I did check with some InfluxQL experts here and #2383 (comment) is the only option for handling the non_negative_derivative_ prefix in a query. Sorry, InfluxQL has limitations and that is why we have been working on Flux.

@danielllek
Contributor

> I'm not following what you mean by type here, you already have where type = 'qtype', are you saying that it is not working?

Please take a look at the legend: the columns named 17-22 are not used with type = qtype, so they will always be null.

> The empty series is a quirk of select *, might be better to give a list of fields depending on how much you want to know about more obscure queries. The complexity of this query vs one with tags is essentially the same, InfluxDB can do this type of query efficiently.

Thanks for this insight.
It's more convenient for me to use select * in the graph, and hiding series with only nulls solves the messy legend problem, so that's OK for me.

I'll make a change that puts all values in one measurement, as you suggested from the beginning.

@danielllek danielllek mentioned this pull request Mar 29, 2019
3 tasks
@danielnelson
Contributor

Merged in #5653

Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin new plugin