SAR: Add telemetry data for SAR usage #50

Open
ravikesarwani opened this issue Dec 15, 2021 · 30 comments

ravikesarwani commented Dec 15, 2021

Add telemetry data for Elastic Serverless Forwarder usage (the forwarder is available to users via SAR). The initial focus is on the usage of the Lambda function itself and the inputs being used (we can further limit this to Elastic Cloud, if really needed). Things like:

  • Some unique identifier of the deployed Lambda from SAR
  • Version of the Lambda in use
  • What input is being used (SQS, S3, Kinesis, CloudWatch)
    Maybe we should collect and send data only once per SAR execution (a Lambda execution can run for up to 15 minutes).

We should be able to graph things like:

  • How many unique usages of the Lambda function in the last X days
  • Distribution of versions
  • Distribution of input usage

I want to start with something small and then grow based on specific needs.

A few additional details for consideration:

  • Telemetry data collection is a secondary operation and should be collected on the best effort basis. Meaning we are okay with sending only once in the lifetime of a single Lambda execution (which is 15 min max). If their network security policy didn’t allow the connection then we fail, write a simple log message, and that’s okay; I don’t think we need to stress about it too much. If we fail once then we don’t need to retry. This should remove concerns around extra cost to users, etc.
  • We will provide in our documentation what telemetry data we are collecting and allow the user to disable it by setting an environment variable (a rough sketch of this opt-out follows below). Here we are focusing on collecting data around the usage of the Lambda and will of course avoid any PII. This should alleviate some concerns around security and also give users complete control to disable the telemetry data collection if they so choose.
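For illustration, a minimal sketch of that best-effort, opt-out behaviour in the forwarder's Python code (the environment variable name ESF_TELEMETRY_DISABLED, the endpoint URL, and the payload fields are placeholders for this discussion, not the real implementation):

```python
import json
import os
import urllib.request

TELEMETRY_ENDPOINT = "https://telemetry.example.com/v3/send/esf"  # placeholder URL
_already_sent = False  # best effort: at most one attempt per execution environment


def maybe_send_telemetry(version: str, input_type: str) -> None:
    """Send a single usage event unless the user opted out or we already tried."""
    global _already_sent
    if os.environ.get("ESF_TELEMETRY_DISABLED", "").lower() in ("1", "true", "yes"):
        return  # user disabled telemetry via the documented environment variable
    if _already_sent:
        return  # no retries: one attempt per Lambda execution
    _already_sent = True
    payload = json.dumps({"version": version, "input": input_type}).encode()
    request = urllib.request.Request(
        TELEMETRY_ENDPOINT, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(request, timeout=2)  # short timeout so we never hang the Lambda
    except Exception as exc:  # telemetry must never break ingestion
        print(f"telemetry skipped: {exc}")
```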
ravikesarwani added the Team:Cloud-Monitoring label on Apr 5, 2022

ruflin commented Apr 7, 2022

The challenge is how we collect and ship the data. For the lambda use case, I think we would also need to ship to the telemetry cluster directly. Pulling in @afharo here as he had ideas on a related thread.


damien commented Apr 7, 2022

As a user of ESF, would this telemetry be collected through Elastic Cloud or are you folks looking to send telemetry directly from the Lambda function?

If it's the latter I'd suggest making this opt-out or ensuring that it fails safely. My organization, and likely others, keeps a close eye on egress traffic from our cloud infrastructure. This means any lambda making connections to new endpoints would trigger an alert.

Furthermore, a lot of deployments have whitelisted endpoints that traffic is allowed to flow to. Instrumentation connecting to unknown hosts will be blocked in many contexts, which if not accommodated for will raise errors for ESF deployments.


afharo commented Apr 11, 2022

@ruflin thank you for the ping. To be fair, I'm surprised AWS does not provide any stats around how many "installations" there are for a public Lambda. IMO, knowing how popular an application is would serve both the developer and the users. If Amazon provided that feature, it would cover most of the insights we are after.

Given that AWS does not help here, we need to design our own solution to this.

Premises

Minimal impact on costs to the users

I agree with @damien that we should look for a solution that does not incur additional costs to the users (or the impact is minimal). That's a must in all telemetry we collect.

Minimize external requests

It's highly possible that the elastic-serverless-forwarder is connected to a private VPC. It implies that users will need to explicitly allow it to reach the Internet (guide). The same would apply if they maintain an allow-list, as also mentioned in the previous comment.

We should aim to collect this information without making any additional requests to other URLs.

Minimize Lambda running times

If we added an additional HTTP request to ship any insights to our remote Telemetry Cluster, it could extend the time it takes for the lambda function to run. Ideally, whatever information we want to provide should be appended to the current request.

Proposal

How to "ship" the information?

Use of headers: When ingesting the data into Elasticsearch, the client could provide the header x-elastic-product-origin: elastic-serverless-forwarder@version. We could also leverage another header to provide the input.type.
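As a rough sketch of that idea, assuming the forwarder's Elasticsearch Python client accepts default headers at construction time (the endpoint, credentials, and the custom x-esf-input-type header below are placeholders):

```python
from elasticsearch import Elasticsearch

FORWARDER_VERSION = "1.6.0"  # illustrative version string
INPUT_TYPE = "sqs"           # illustrative input type

# Exact constructor options vary by client version; headers can also be set per request.
es = Elasticsearch(
    "https://my-deployment.es.us-east-1.aws.found.io:443",  # placeholder endpoint
    api_key="<api-key>",  # placeholder credentials
    headers={
        "x-elastic-product-origin": f"elastic-serverless-forwarder@{FORWARDER_VERSION}",
        "x-esf-input-type": INPUT_TYPE,  # hypothetical extra header for the input type
    },
)

# Every indexing request now carries the identifying headers for the proxy to log.
es.index(index="logs-esf-default", document={"message": "hello from ESF"})
```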

How to consume this information?

On Cloud, we could extract those headers from the proxy logs. We follow a similar approach when analyzing the use of our Elasticsearch client libraries.

If Cloud usage is not enough, we could suggest a follow-up phase 2 to cover the usage of this forwarder for on-premises use cases: we could provide some header-based stats from Elasticsearch. I'm not entirely sure we have this info yet, but I think we can work with the Elasticsearch devs. IMO, it can help us in many ways: not only for this use case but also for understanding different environments. i.e.: We could identify many different Elastic-provided clients vs. 3rd-party ones and their type of usage: read vs. write.

I think it's important to know what the Platform Analytics team (@elastic/platform-analytics) thinks about this.

Also, cc'ing @alexfrancoeur @arisonl because I know their interest in knowing what type of data is shipped into Elasticsearch and how it is ingested.


ravikesarwani commented Apr 11, 2022

Thanks folks for your comments and providing your suggestions on how to move this forward.
The elastic-serverless-forwarder uses the Elasticsearch Python client library. It would be nice to see if we can piggyback on the proxy logs and tag all the extra details (to identify that it's from the forwarder, plus all the other metadata) so they can be shown in a dashboard for Elastic Cloud.
@aspacca Can you take a look when you get a chance and coordinate with @afharo & others to define a technical path for this?

To keep the focus and scope constrained, I am good to start with Elastic Cloud. I do understand that self-managed is trickier and has implications like security, user acceptance, etc. to be considered if it were to send data to a different telemetry cluster. We can tackle that later on as a separate issue.


mindbat commented Apr 28, 2022

Hey, all!

I'd like to offer my suggestions, but I'm afraid I'm missing a large amount of context here 😅 Beginning with the most basic of questions: What is SAR?


mindbat commented Apr 28, 2022

On Cloud, we could extract those headers from the proxy logs.

Based on my experience with seeing how the proxy logs are used for lang-client-meta telemetry, I'd recommend against this approach. Using the logs led to a lot of convoluted and brittle logic being needed on the indexing side.

I'm still unclear on what SAR is and what data we want to collect (and why), but in general if we have the ability to send data directly to the v3 endpoints, I'd go that route.


afharo commented Apr 29, 2022

What is SAR?

@mindbat I can help with that: AWS Serverless Application Repository 🙃

It is the repository where the elastic-serverless-forwarder is installed from.

in general if we have the ability to send data directly to the v3 endpoints, I'd go that route.

I think this is the key question: while we can definitely implement the additional HTTP request, do we really have the ability to send the data? Please, refer to my previous comment that highlights the limitations we may face from the AWS Lambda POV. TL;DR, if it's able to reach our V3 endpoints, it will increase both: the output traffic and the execution time (both of them resulting in additional costs to our users).

Bearing in mind the type of user for this utility (infra admins whose salaries depend on the costs of the infra they maintain), they'll likely opt out of this telemetry as soon as they realize it's increasing costs on their end.

Using the logs led to a lot of convoluted and brittle logic being needed on the indexing side.

I'd compare it with the level of effort needed to implement the logic to control the HTTP requests and make sure we don't keep an AWS Lambda hanging until it times out because we cannot reach the remote end. Bear in mind that AWS Lambda's automatic retry mechanism might cause the same execution to be invoked twice (more cost and potential duplication of the data this utility is indexing).

Of course, this depends on what data we want to collect because headers in the logs can only serve very basic information.


aspacca commented May 4, 2022

@mindbat any feedback after @afharo's clarification? thanks :)


ravikesarwani commented May 4, 2022

A few comments I would like to put out here:

  • Telemetry data collection is a secondary operation and should be collected on the best effort basis. Meaning we try to send only once in the lifetime of a single Lambda execution (which is 15 min max). If their network security policy didn’t allow the connection then we fail, write a simple log message and that’s okay and I don’t think we need to stress about it too much. If we fail once then we don’t need to retry. This should remove concerns around extra cost to users etc.
  • We will provide in our documentation what telemetry data we are collecting and allow the user to disable it by setting an environment variable. Here we are focusing on collecting data around the usage of the Lambda and will of course avoid any PII. This should alleviate some concerns around security and also give users complete control to disable the telemetry data collection if they so choose.

I was wondering if it makes sense to give some fast prototyping a go so that we can really nail down the concerns with more details rather than talking in a very high-level, general sense. It would be ideal if we can send the data to the telemetry server, as it covers a broader set of use cases and is easy to dashboard, etc., without lots of intermediate processing (and hopefully easy to maintain and extend in the longer run).


mindbat commented May 4, 2022

I can help with that:

@afharo Thanks for the link, which leads directly to my next question:

It is the repository where the elastic-serverless-forwarder

What is the elastic-serverless-forwarder?

...as you can see, I'm missing basically all of the context motivating this issue. Could someone link me to some general project docs, so I can get up to speed?

if it's able to reach our V3 endpoints, it will increase both: the output traffic and the execution time (both of them resulting in additional costs to our users).

I hear you, but the current legacy system is forwarding traffic to the v3 endpoints and indexing to ES, so it seems this concern might be premature?

we try to send only once in the lifetime of a single Lambda execution (which is 15 min max). If their network security policy didn’t allow the connection then we fail

@ravikesarwani +1, this all makes good sense to me.

We will provide in our documentation what telemetry data we are collecting and allow the user to disable it by setting an environment variable. Here we are focusing on collecting data around the usage of the Lambda and will of course avoid any PII.

Also +1, we should always give users the ability to both see and control what data is being sent.

fast prototyping a go so that we can really nail down the concerns with more details

I'm +1 on this as well! ❤️ Let's begin with the assumption that we'll send to the v3 endpoints, and pull back if prototyping reveals any issues.

@ravikesarwani

@mindbat Here's a two-line description of what Elastic Serverless Forwarder is:
Elastic Serverless Forwarder is a Lambda application (available in the AWS Serverless Application Repository) that helps customers ingest logs from their AWS environment into Elastic in a completely serverless fashion. Users don’t need to set up & maintain compute resources, simplifying the getting-started experience and reducing data onboarding friction.

Here is the documentation for elastic-serverless-forwarder. If you want to get a high-level view, this blog may be a good start as well.


afharo commented May 5, 2022

if it's able to reach our V3 endpoints, it will increase both: the output traffic and the execution time (both of them resulting in additional costs to our users).

I hear you, but the current legacy system is forwarding traffic to the v3 endpoints and indexing to ES, so it seems this concern might be premature?

@mindbat we own that infra and that's the purpose of it. We are gladly paying for it. I don't think it's comparable. But it's a good measuring point: what's the average execution time for those lambdas? How much do we pay monthly for them? I think the answers to those questions will highlight the added costs we are asking our users to assume (for no direct value to them).

  • Telemetry data collection is a secondary operation and should be collected on the best effort basis.

Absolutely, ++!

Meaning we try to send only once in the lifetime of a single Lambda execution (which is 15 min max). If their network security policy didn’t allow the connection then we fail, write a simple log message and that’s okay and I don’t think we need to stress about it too much. If we fail once then we don’t need to retry. This should remove concerns around extra cost to users etc.

@ravikesarwani AFAIK, AWS Lambdas should be treated as stateless functions.

When building Lambda functions, you should assume that the environment exists only for a single invocation.
☝️ from https://docs.aws.amazon.com/lambda/latest/operatorguide/statelessness-functions.html

We could store the connection availability in a global variable so calls to the same warm functions could benefit from the previous success/failure. However, we don't know for how long that global variable will stay in the context: After the execution completes, the execution environment is frozen. To improve resource management and performance, the Lambda service retains the execution environment for a non-deterministic period of time. During this time, if another request arrives for the same function, the service may reuse the environment.
☝️ from https://docs.aws.amazon.com/lambda/latest/operatorguide/execution-environments.html
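A tiny sketch of what that could look like (illustrative names only; nothing here is the actual ESF code):

```python
# Module-level state survives across invocations only while the execution
# environment stays warm; AWS gives no guarantee for how long that is.
_telemetry_state = {"attempted": False, "succeeded": False}


def send_telemetry_once(send_fn) -> bool:
    """Attempt telemetry at most once per (warm) execution environment."""
    if _telemetry_state["attempted"]:
        return _telemetry_state["succeeded"]
    _telemetry_state["attempted"] = True
    try:
        send_fn()
        _telemetry_state["succeeded"] = True
    except Exception:  # best effort: never let telemetry break the handler
        _telemetry_state["succeeded"] = False
    return _telemetry_state["succeeded"]
```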

Looking at our docs, our recommended deployment strategy is to react to AWS events from multiple services (SQS, S3, CloudWatch Logs, and Kinesis Data Streams). This means that we can get an event to index the content of an S3 object. Depending on the size of the object, we may complete this in a couple of seconds, but we risk extending the execution time up to the telemetry HTTP timeout. Bear in mind that HTTP clients typically have two timeouts: one to establish the connection socket and another to transfer the data. Add to that that our HTTP endpoints are located somewhere in the USA, and there are also cross-region latency and costs.

All that said, you are the experts in telemetry and AWS Lambdas here. I will stop challenging your decisions 😇


aspacca commented May 5, 2022

  • Telemetry data collection is a secondary operation and should be collected on the best effort basis. Meaning we try to send only once in the lifetime of a single Lambda execution (which is 15 min max). If their network security policy didn’t allow the connection then we fail

I'd like to stress that the implications of this are broader than just letting the telemetry fail.
Assuming we send the telemetry call as the last step of the Lambda execution, when nothing else is left for the Lambda to do, if the call takes more than the remaining time of the Lambda's execution window, the whole Lambda will time out and be marked as failed.

Errors could also occur when sending the telemetry data: we can try to catch them and handle them gracefully, but some unexpected ones could leak. This again would lead to the whole Lambda execution failing.
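A sketch of how both risks could be contained, assuming telemetry is the very last step of the handler (process_event and send_telemetry are hypothetical stand-ins; get_remaining_time_in_millis() is part of the standard Lambda context object):

```python
MIN_REMAINING_MS = 5_000  # illustrative safety margin before the Lambda deadline


def process_event(event):
    """Stand-in for the forwarder's real work (ingesting the event's data)."""


def send_telemetry(timeout: float) -> None:
    """Hypothetical sender; a real one would make a single short HTTP call."""


def handler(event, context):
    process_event(event)  # do the actual forwarding first; telemetry comes last

    try:
        # skip telemetry entirely if we are too close to the execution deadline
        if context.get_remaining_time_in_millis() > MIN_REMAINING_MS:
            send_telemetry(timeout=2)
    except Exception as exc:  # catch everything so telemetry can never fail the Lambda
        print(f"telemetry error ignored: {exc}")
```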


mindbat commented May 9, 2022

Thanks for the pointer to the blog post and docs, @ravikesarwani!

After reading those, I agree with @afharo that sending telemetry directly from the lambda is not really appropriate. Lambdas should be stateless and perform as little work as possible, so they have a greater chance of succeeding in their small execution window. Since we (presumably) want people to take advantage of this official lambda, we should keep its scope tight, and not send direct telemetry.

That said, can we collect this telemetry some other way through the stack? Could we tack on metadata to each doc that says where it came from, so we could report that in the general snapshot telemetry as an ingestion source? Or perhaps some other method, similar to how we get stats on beats usage?


aspacca commented May 12, 2022

@mindbat

That said, can we collect this telemetry some other way through the stack? Could we tack on metadata to each doc that says where it came from, so we could report that in the general snapshot telemetry as an ingestion source?

currently the forwarder doesn't populate the agent field; we could add it and leverage the ephemeral_id to include the information we need to collect. the only concern from my side is that this information has to be serialised in some way, and we will need to deserialise it when collecting the telemetry metrics
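for illustration, the serialise/deserialise round trip could be as simple as this (the esf- prefix and field order are just an example, not an agreed format):

```python
def encode_ephemeral_id(version: str, input_type: str) -> str:
    """Pack the telemetry tuple into the single agent.ephemeral_id string."""
    return f"esf-{version}-{input_type}"


def decode_ephemeral_id(value: str) -> tuple[str, str]:
    """Unpack it again on the collection side."""
    prefix, version, input_type = value.split("-", 2)
    if prefix != "esf":
        raise ValueError("not an ESF ephemeral_id")
    return version, input_type


doc = {"agent": {"ephemeral_id": encode_ephemeral_id("1.6.0", "sqs")}}
print(decode_ephemeral_id(doc["agent"]["ephemeral_id"]))  # ('1.6.0', 'sqs')
```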

cc @afharo


aspacca commented May 19, 2022

@mindbat @afharo
what's your final point of view on the matter?
so far we have identified two possible options:

  • send extra headers to the cluster, to intercept in the proxy
  • serialise a tuple of telemetry data as agent.ephemeral_id

can we narrow to those two and take a final decision, or do we still need to investigate?

thanks :)


afharo commented May 19, 2022

That said, can we collect this telemetry some other way through the stack? Could we tack on metadata to each doc that says where it came from, so we could report that in the general snapshot telemetry as an ingestion source? Or perhaps some other method, similar to how we get stats on beats usage?

I'd like to understand this approach: is the suggestion to populate a field in the document (agent.ephemeral_id) and let telemetry (potentially Kibana's Snapshot Telemetry) report it? If that's accurate, I'd like to raise that Kibana's telemetry collection mechanisms cannot read data from user indices for obvious reasons 😇 So we wouldn't be able to forward/parse the ephemeral_id indexed in the documents.

If we want to follow a similar route, Kibana does read the metadata of the indices to find any potential Elastic shippers in the properties _meta.beat and _meta.package.name. However, if there are multiple sources (multiple versions of ESF, other beats, ...) writing to the same index, that piece of info might be partial.
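For reference, a sketch of how that index metadata could be inspected with the Python client (the index name and connection details are placeholders, and the exact response handling depends on the client version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

resp = es.indices.get_mapping(index="logs-esf-default")  # placeholder index name
meta = resp["logs-esf-default"]["mappings"].get("_meta", {})

# _meta.beat is set by classic Beats; _meta.package.name by Fleet-managed integrations.
shipper = meta.get("beat") or meta.get("package", {}).get("name")
print(f"detected shipper: {shipper or 'unknown'}")
```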


aspacca commented May 20, 2022

I'd like to understand this approach: is the suggestion to populate a field in the document (agent.ephemeral_id) and let telemetry (potentially Kibana's Snapshot Telemetry) report it?

My suggestion to use agent.ephemeral_id was an answer to what @mindbat asked:

Could we tack on metadata to each doc that says where it came from, so we could report that in the general snapshot telemetry as an ingestion source?

agent.ephemeral_id was just a candidate for the metadata of each doc.
I don't know what the process for telemetry reporting it would be: @mindbat, was yours an open question (to which @afharo's answer is negative) or did you have a process in mind?

the only caveat about using agent.ephemeral_id is that this field is a string, while we have a tuple of telemetry properties, so we would need to serialise them into a string

If we want to follow a similar route, Kibana does read the metadata of the indices to find any potential Elastic shippers in the properties _meta.beat and _meta.package.name. However, if there are multiple sources (multiple versions of ESF, other beats, ...) writing to the same index, that piece of info might be partial.

yes, I looked into using the metadata of the indices, but as you said it's not applicable: not so much because of multiple sources (that could also be the case), but in general because the metadata about telemetry will vary from doc to doc within an index, and there might be concurrent ingestions where it differs.

if the agent.ephemeral_id is not viable the only option left is to send extra headers to the cluster, to intercept in the proxy, as @afharo initially proposed.

if we are all on the same page, we have to align with the team owning the proxy in order to read those headers. could you help me include someone from that team in the discussion? thanks

@tommyers-elastic

just joining in this conversation...

Telemetry data collection is a secondary operation and should be collected on the best effort basis

@ravikesarwani - agree, but if we have low confidence it will work for many ESF users (e.g. if customers are incurring extra costs, or have to manually configure egress to allow it), then i think it really hinders our ability to draw any conclusions from the data at all. at least with the header approach, we know that every deployment of ESF is sending telemetry data somewhere, even if our current methods for analyzing it are not ideal.

On Cloud, we could extract those headers from the proxy logs. We follow a similar approach when analyzing the use of our Elasticsearch client libraries.

@afharo would you be able to point me to the code doing this so i can take a look? i'm interested to see how "convoluted and brittle" it really is :)

thanks!


afharo commented Jun 21, 2022

@afharo would you be able to point me to the code doing this so i can take a look? i'm interested to see how "convoluted and brittle" it really is :)

@tommyers-elastic I think the best person to point you at that is @mindbat. I only heard that it exists but he's been involved in the enablement of the team that implemented it 😇


mindbat commented Jun 22, 2022

@tommyers-elastic The Clients team pulls telemetry data out of the proxy logs; you can see the logic they had to build here. Just check out that initial regex!

In addition to having to write all that logic to extract data from the proxy logs, their data channel has been the source of multiple resource scaling issues, both on the development side (adding their indexer increased our CI test times by an order of magnitude) and the production side (hitting memory limits on GCP Dataflow). All due to the size of the events being sent; I'm not certain any filtering is being done on the proxy logs before they're sent to the telemetry endpoints, but even their individual event docs are larger than anyone else's.

To anticipate thoughts of the telemetry service not being production ready, I'll just mention that the Security team has been sending endpoint alerts to the system since December 2020 without issue (along with other channels added over time). That's every endpoint sending every alert, and it doesn't come close to what's coming in via the proxy logs.

In short, while pulling this data from the Cloud proxy logs is certainly possible, I cannot recommend it if there is literally any other option available. To borrow a term from functional programming, being able to get this info from the proxy logs is a side-effect; the logging service does not seem to have been built with this use-case as a primary function (nor should it).


mindbat commented Jun 22, 2022

I'd like to understand this approach: is the suggestion to populate a field in the document (agent.ephemeral_id) and let telemetry (potentially Kibana's Snapshot Telemetry) report it?

@afharo As @aspacca suspected, my question was an open one. Basically: If filebeat is used for ingestion, right now, how do we know? Do we have a mechanism in place for knowing that? And if so, can we take a similar approach to knowing data has been ingested by the SAR?


aspacca commented Jun 24, 2022

If filebeat is used for ingestion, right now, how do we know? Do we have a mechanism in place for knowing that?

@afharo do you have an answer for that? :)


afharo commented Jun 24, 2022

If filebeat is used for ingestion, right now, how do we know? Do we have a mechanism in place for knowing that?

Kibana has a mechanism to report how data is ingested into ES (elastic/kibana#64935), although it's a bit of a guessing game because it's a mix of:

  1. reading known mappings._meta fields the internal Elastic shippers populate (we miss if there is more than 1 shipper ingesting into the same index, or the index's mappings are not altered by the shipper)
  2. matching known index patterns (which doesn't apply here because users can set their Elastic Serverless Forwarders to index to any index).

IMO, if we want to properly report what ingests data into ES, the best way to keep track of it is by reading the headers of the indexing requests to identify the agent submitting them. That can be done either on...

  • ... ES, where it accumulates stats of the different clients that submit requests and serves those stats under the GET /_cluster/stats API (Kibana consumes that information when submitting the daily Telemetry Snapshot). This does not exist today, and we'd need to involve the Elasticsearch team to implement such aggregation.
  • ... the proxy logs. Currently, they keep selected headers like User-Agent or X-Elastic-Client-Meta. We could also index the existing X-Elastic-Product-Origin and a new one we agree upon and leverage them to provide the information we are after in this issue.

FWIW, I think the ESF use case will produce far less volume than the Client team's use case: the ESF indexer would filter the documents to retrieve only the requests performed by the ESF client. And the regexp does not need to be as complex (they are trying to parse the header X-Elastic-Client-Meta, which they populate in this form: "es=8.2.0p,js=16.14.2,t=8.0.2,hc=16.14.2"). We might not even need a regexp.
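For example, a header in that shape breaks down with a couple of string splits, no regexp required:

```python
def parse_client_meta(header: str) -> dict:
    """Parse a comma-separated key=value header like X-Elastic-Client-Meta."""
    pairs = (item.split("=", 1) for item in header.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


print(parse_client_meta("es=8.2.0p,js=16.14.2,t=8.0.2,hc=16.14.2"))
# {'es': '8.2.0p', 'js': '16.14.2', 't': '8.0.2', 'hc': '16.14.2'}
```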


Things like:

* Some unique identifier of the deployed Lambda from SAR
* Version of the Lambda in use
* What input is being used (SQS, S3, Kinesis, CloudWatch)
  Maybe we should collect and send data only once per SAR execution (a Lambda execution can run for up to 15 minutes).

We should be able to graph things like:

* How many unique usages of the Lambda function in the last X days
* Distribution of versions
* Distribution of input usage

Looking back at the description, I still think a header-based analysis is the most appropriate approach. We can agree on a header (or a set of headers) to provide the unique identifier, the version of the product, and the input (SQS, S3, Kinesis, CloudWatch). The existing X-Elastic-Product-Origin could be elastic-serverless-forwarder to help us prefilter the data. If we go with X-Elastic-Product-Origin plus one custom header holding the identifier, version, and input, we could leverage the Grok Ingest Processor to break that info down into individual fields for later analysis.
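A hedged sketch of that Grok idea, assuming a hypothetical custom header of the form <identifier>;<version>;<input-type> ends up in a field such as http.request.headers.x-esf-telemetry once the proxy logs are indexed (the header name, field names, and pipeline id are all assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection to the telemetry cluster

# Break "<identifier>;<version>;<input-type>" into esf.id, esf.version and esf.input.
processors = [
    {
        "grok": {
            "field": "http.request.headers.x-esf-telemetry",  # hypothetical field name
            "patterns": ["%{DATA:esf.id};%{DATA:esf.version};%{GREEDYDATA:esf.input}"],
            "ignore_missing": True,
        }
    }
]

# Keyword arguments follow the 8.x client; the 7.x client takes the same definition as body=.
es.ingest.put_pipeline(
    id="esf-telemetry-headers",
    description="Break the ESF telemetry header into individual fields",
    processors=processors,
)
```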

IMO, the good thing about headers is that we can get more info like the request and response length, so we can measure the average ingested volume per deployment, compare it against the number of requests, and track potential improvements like using filter_path in the requests to minimize ES's responses. They also include the response times, so we can optimize the queries over time (maybe there's an exponential correlation between the request length and the response time 🤷?). We also get the status codes, which is important for understanding the success rate of the forwarder.

I think it's important to highlight that, for PII reasons, the proxy logs don't index anything from the request/response's body.

My suggestion for the next steps:

I would suggest taking a look at the proxy logs and creating some visualizations with the data available. Even when we don't have specific data for ESF, I think it's a good exercise to confirm if this would be a good path forward or if we'd rather look for alternatives.

If it seems like it may work, we can move ahead with adding the additional headers to the ESF and ask the Cloud team to index the content from those headers. Finally, if the retention policy in the proxy logs is not enough for the SAR analysis, we can apply the extra step of forwarding the SAR-related logs to a telemetry cluster with a longer retention policy for those entries (with, possibly, some pre-filtering and post-processing to make the data even easier to consume).

However, I don't want to be the person that dictates the way to do this... in the end, I'm the last person that should make the decision since I don't belong to the ESF nor the Analytics teams 😅

What do you think?

@tommyers-elastic

Thanks so much for the input @mindbat & @afharo.

There's lots of information here, but here are my main takeaways.

being able to get this info from the proxy logs is a side-effect

@mindbat I do agree - however one thing I feel differently about is that I think utilizing request headers to capture this information is totally appropriate. Perhaps we just need to work out a more appropriate way (long term) to handle the header data once it gets to Elasticsearch.

@afharo i agree with the comments above. the big missing piece of this is non-cloud deployments, but as long as we keep that in mind, i think it's much better to make a start getting data from cloud than get stuck with finding a generic solution right now.

thanks again for all your input on this.


zmoog commented May 5, 2023

Since we're nearing the end of this task, I want to update this issue on a couple of topics:

  1. Calling the Telemetry API vs. passing custom HTTP headers
  2. Telemetry API event rate

Calling the telemetry API vs. passing custom HTTP headers

We decided to opt for the Telemetry API service. The main reason is that this service offers more flexibility and the opportunity to iterate on both the sending and indexing sides independently and quickly.

Telemetry API event rate

After checking with @ddillinger, I learned that the Telemetry API was not designed to sustain a telemetry event rate like ESF's (one event for every Lambda function execution). At scale, this may generate up to hundreds of thousands of events per minute.

According to Dan, the Telemetry API's expected telemetry event rate for a single application is closer to "once per cluster per 24h".

@afharo, please chime in if you want to add more about the expected telemetry event rate for a single application like ESF!

@tommyers-elastic and @aspacca, please let me know what you think.

Next steps

If the telemetry event rate is confirmed, we need to reconfigure the ESF sending strategy and adjust the data model slightly. It's a quick change.

But first, I want to hear from all of you.


afharo commented May 18, 2023

According to Dan, the Telemetry API's expected telemetry event rate for a single application is closer to "once per cluster per 24h".

TBH, this is news to me. AFAIK, Kibana's EBT is hitting this API way more often. Are we talking about different APIs?


ddillinger commented May 18, 2023

That's the snapshot event rate.

What we were discussing at that time was something more like "what if a telemetry event is sent for everything that ever happens on every cluster ever" -- resulting in estimated transaction rates over 100k/second. Which is well into "rearchitect the whole thing for 2 or 3 more orders of magnitude of scale" territory, and is well beyond the design considerations of what was discussed and built for what we have today.

The actual event rate is definitely higher than that, but it is not like... x100 higher :)


zmoog commented Jun 5, 2023

What's the recommended event rate for a single deployment hitting the endpoint https://telemetry.elastic.co/v3/send/esf ?

Currently, we have 1.8K ESF deployments on SAR; similar applications from competitors can reach 4-8K deployments.

@ddillinger

I sampled a few of our more well-known channels for yesterday's data to get some rough transactions-per-second figures to work with.

channel             tps report
kibana-browser      197
kibana-snapshot     16
kibana-server       63
lang-client-meta    179
security-lists-v2   17
alerts-endpoint     9

This puts us at 481, so across all channels it's safe to assume 500-600 events per second are typically arriving. For safety, assume peaks (they do not arrive uniformly; telemetry tends to "follow the sun" based on where in the world is active) in the 1000/s range.

This is for all incoming telemetry, not per-cluster.

Now the real question I imagine everyone wants to ask isn't this, but rather, "how much CAN we have?"

The answer is: we haven't load tested at 10x or 100x scale from here. We don't have especially hard numbers about that.

Now there's another distinction that is worth making here too, and that is that there's a wide difference between how fast we can collect events, and how fast we can ingest events into whichever back-end datastore is desired for any particular purpose.

Even without load testing I feel comfortable saying that we can collect events quite a bit faster than this. I don't know about 100x without trying it first, but 10x feels within reach. It's more a question of how much we want to spend on Google Cloud Functions, Google Dataflow, and Google Cloud Storage than of whether it's possible. I don't have immediate figures on our current spend (maybe rnd-hosting has some dashboards for it?).

Ingestion/ETL to make it visible and actionable is the bottleneck. We would have to make some clear-cut decisions about it to handle order(s) of magnitude more data. The main reason is that, when we asked people for requirements, the answer was essentially "we want everything, and we want it forever", and so that's the system that was built. It's not built for speed but for user flexibility, with almost no technically imposed limitations. For example, we deduplicate by document id even through an alias (multiple indices being the only way to also reasonably achieve the "forever" part, which would otherwise blow max primary shard sizes out of the water). This is because duplicates make humans sad about having to deal with them, even though deduplicating through aliases is a fantastically performance-intensive thing to do, since it involves pre-fetching id queries for every single event every time so we can map concrete _indexes into the bulk ops.

We're still in POC on bigquery ingestion and so I have almost nothing to say about performance on that backend yet because we just don't have the operational experience without doing more science.

To be clear: this is not to say "no" to anything in particular. It does seem possible or even likely that we will need to introduce some constraints as trade-offs for performance, or else set some realistic expectations about it, if the business need here ends up being magnitudes higher. Some of these constraints may be simple ones, like: this firehose is too big to deduplicate, so introduce the ability to turn that feature off, index it with unique ids (and/or accept some duplicate tolerance), and back it with a data-stream. That streamlines away a lot of those performance constraints.

This is all probably way, way more detail than anyone was expecting! 🤣 Anyway, the point is, there is tolerance for the collection rate, and there is some tolerance for ingestion, but we may need to make a few changes there: either give up some expensive nice-to-haves in the face of the reality of the incoming data, or accept a spend increase of some kind in order to handle it all.
