SAR: Add telemetry data for SAR usage #50
The challenge is how we collect and ship the data. For the lambda use case, I think we would also need to ship to the telemetry cluster directly. Pulling in @afharo here as he had ideas on a related thread. |
As a user of ESF, would this telemetry be collected through Elastic Cloud or are you folks looking to send telemetry directly from the Lambda function? If it's the latter I'd suggest making this opt-out or ensuring that it fails safely. My organization, and likely others, keeps a close eye on egress traffic from our cloud infrastructure. This means any lambda making connections to new endpoints would trigger an alert. Furthermore, a lot of deployments have whitelisted endpoints that traffic is allowed to flow to. Instrumentation connecting to unknown hosts will be blocked in many contexts, which if not accommodated for will raise errors for ESF deployments. |
@ruflin thank you for the ping. To be fair, I'm surprised AWS does not provide any stats around how many "installations" there are for a public Lambda. IMO, knowing how popular an application is would serve both the developer and the users. If Amazon provided that feature, it would cover most of the insights we are after. Given that AWS does not help here, we need to design our own solution to this.

Premises

Minimal impact on costs to the users: I agree with @damien that we should look for a solution that does not incur additional costs to the users (or whose impact is minimal). That's a must in all telemetry we collect.

Minimize external requests: It's highly possible that the […]. We should aim to collect this information without making any additional requests to other URLs.

Minimize Lambda running times: If we added an additional HTTP request to ship any insights to our remote Telemetry Cluster, it could extend the time it takes for the lambda function to run. Ideally, whatever information we want to provide should be appended to the current request.

Proposal

How to "ship" the information? Use of headers: when ingesting the data to Elasticsearch, the client could provide in the header that […].

How to consume this information? On Cloud, we could extract those headers from the proxy logs. We follow a similar approach when analyzing the use of our Elasticsearch client libraries. If Cloud usage is not enough, we could suggest a follow-up phase 2 to cover the usage of this forwarder for on-premises use cases: we could provide some header-based stats from Elasticsearch. I'm not entirely sure we have this info yet, but I think we can work with the Elasticsearch devs. IMO, it can help us in many ways: not only for this use case but also for understanding different environments, i.e. we could identify many different Elastic-provided clients vs. 3rd-party ones and their type of usage: read vs. write.

I think it's important to know what the Platform Analytics team (@elastic/platform-analytics) thinks about this. Also, cc'ing @alexfrancoeur @arisonl because I know their interest in knowing what type of data is shipped into Elasticsearch and how it is ingested. |
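To make the headers proposal concrete, here is a minimal sketch of how the forwarder could tag its Elasticsearch requests, assuming the official Python Elasticsearch client (which accepts a custom headers mapping) and an illustrative header name; nothing below is an agreed convention:

```python
# Sketch only: the header name and version value are hypothetical placeholders.
from elasticsearch import Elasticsearch

ESF_VERSION = "1.6.0"  # hypothetical forwarder version string

es = Elasticsearch(
    "https://my-deployment.es.us-east-1.aws.found.io:443",  # placeholder endpoint
    api_key="<redacted>",
    # Custom headers travel with every indexing request, so the Cloud proxy can
    # observe them in its logs with no extra network call (and no extra cost)
    # from the Lambda function.
    headers={"x-elastic-esf-version": ESF_VERSION},
)

# Normal ingestion path; the telemetry rides along in the header above.
es.index(index="logs-esf.sample-default", document={"message": "hello from ESF"})
```

The same mechanism extends to more headers (identifier, input type, and so on) without changing the ingestion path.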
Thanks folks for your comments and suggestions on how to move this forward. To keep the focus and scope constrained, I am good to start with Elastic Cloud. I do understand that self-managed is trickier and has implications like security, user acceptance, etc. to be considered if it were to send data to a different telemetry cluster. We can tackle that later on as a separate issue. |
Hey, all! I'd like to offer my suggestions, but I'm afraid I'm missing a large amount of context here 😅 Beginning with the most basic of questions: What is SAR? |
Based on my experience with seeing how the proxy logs are used for lang-client-meta telemetry, I'd recommend against this approach. Using the logs led to a lot of convoluted and brittle logic being needed on the indexing side. I'm still unclear on what SAR is and what data we want to collect (and why), but in general if we have the ability to send data directly to the v3 endpoints, I'd go that route. |
@mindbat I can help with that: AWS Serverless Application Repository 🙃 It is the repository where the Elastic Serverless Forwarder is published.
I think this is the key question: while we can definitely implement the additional HTTP request, do we really have the ability to send the data? Please refer to my previous comment that highlights the limitations we may face from the AWS Lambda POV. TL;DR, if it's able to reach our V3 endpoints, it will increase both the outbound traffic and the execution time (both resulting in additional costs to our users). Bearing in mind the type of user for this utility (infra admins whose salaries depend on the costs of the infra they maintain), they'll likely opt out of this telemetry as soon as they realize it is increasing costs on their end.
I'd compare it with the level of effort of implementing the logic to control the HTTP requests, to make sure we don't keep an AWS Lambda hanging and timing out because we cannot reach the remote end. Bear in mind that AWS Lambda's automatic retry mechanism might cause the same execution to be invoked twice (more costs and potential duplication of the data this utility is indexing). Of course, this depends on what data we want to collect, because headers in the logs can only carry very basic information. |
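If a direct-shipping prototype goes ahead anyway, the hang-and-retry risk described above can be bounded with a short timeout and a catch-all guard. A minimal sketch, assuming a stdlib-only sender and a purely illustrative payload; the endpoint URL is the esf channel quoted later in this thread:

```python
import json
import urllib.request

TELEMETRY_URL = "https://telemetry.elastic.co/v3/send/esf"

def try_send_telemetry(event: dict, timeout_seconds: float = 1.0) -> None:
    """Best-effort shipment: never raise, never block the Lambda for long."""
    try:
        request = urllib.request.Request(
            TELEMETRY_URL,
            data=json.dumps(event).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # A short timeout bounds the blocking time, so an unreachable endpoint
        # cannot push the function toward its own execution timeout.
        urllib.request.urlopen(request, timeout=timeout_seconds)
    except Exception:
        # Swallow every error: a telemetry failure must never fail the
        # forwarding work or trigger AWS Lambda's automatic retry.
        pass
```

Even with this guard, the extra connection still shows up as egress traffic and execution time, which is exactly the cost concern raised above.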
A few comments I would like to put out here:
I was wondering if it makes sense to give some fast prototyping a go so that we can really nail down the concerns in more detail rather than talking at a very high, general level. It would be ideal if we could send the data to the telemetry server, as it covers a broader set of use cases and is easy to dashboard, etc., without lots of intermediate processing (and hopefully easy to maintain and extend in the longer run). |
@afharo Thanks for the link, which leads directly to my next question:
What is the elastic-serverless-forwarder? ...as you can see, I'm missing basically all of the context motivating this issue. Could someone link me to some general project docs so I can get up to speed?
I hear you, but the current legacy system is forwarding traffic to the v3 endpoints and indexing to ES, so it seems this concern might be premature?
@ravikesarwani +1, this all makes good sense to me.
Also +1, we should always give users the ability to both see and control what data is being sent.
I'm +1 on this as well! ❤️ Let's begin with the assumption that we'll send to the v3 endpoints, and pull back if prototyping reveals any issues. |
@mindbat Here's a two-line description of what Elastic Serverless Forwarder is: see the documentation of elastic-serverless-forwarder. If you just want to get a high-level view, this blog may be a good start as well. |
@mindbat we own that infra and that's the purpose of it. We are gladly paying for it. I don't think it's comparable. But it's a good measuring point: what's the average execution time for those lambdas? How much do we pay monthly for them? I think the answers to those questions will highlight the added costs we are asking our users to assume (for no direct value to them).
Absolutely, ++!
@ravikesarwani AFAIK, AWS Lambdas should be treated as stateless functions. When building Lambda functions, you should assume that the environment exists only for a single invocation. We could store the connection availability in a global variable so that calls to the same warm function could benefit from the previous success/failure. However, we don't know for how long that global variable will stay in the context: "After the execution completes, the execution environment is frozen. To improve resource management and performance, the Lambda service retains the execution environment for a non-deterministic period of time. During this time, if another request arrives for the same function, the service may reuse the environment."

Looking at our docs, our recommended deployment strategy is to react to AWS events from multiple services (SQS, S3, CloudWatch Logs and Kinesis Data Streams). This means that we can get an event to index the content of an S3 object. Depending on the size of the object, we may complete this in a couple of seconds, but we risk extending the execution time up to the telemetry HTTP timeout. Bear in mind that HTTP clients typically have 2 timeouts: one to establish the connection socket and another one to transfer the data. If we add to that the fact that our HTTP endpoints are located somewhere in the USA, there are also cross-region latency and costs.

All that said, you are the experts in telemetry and AWS Lambdas here. I will stop challenging your decisions 😇 |
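To make the warm-container caveat concrete, here is a minimal sketch of the "global variable" pattern in a Python Lambda handler; the variable name and the reachability probe are hypothetical, and as noted above the cached value only lives for as long as AWS keeps this particular execution environment around:

```python
import socket

# Module-level state survives across invocations that reuse a warm execution
# environment, but is lost whenever AWS recycles the environment.
_telemetry_reachable = None  # None means "not checked yet"

def _check_telemetry_endpoint(host="telemetry.elastic.co", port=443):
    """Cheap reachability probe with a tight timeout (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=0.5):
            return True
    except OSError:
        return False

def handler(event, context):
    global _telemetry_reachable
    if _telemetry_reachable is None:
        # Pay the probe cost at most once per warm container.
        _telemetry_reachable = _check_telemetry_endpoint()

    if _telemetry_reachable:
        pass  # best-effort telemetry shipment would go here

    # ... normal forwarding work (SQS, S3, CloudWatch Logs, Kinesis events) ...
    return {"status": "ok"}
```

This only mitigates the repeated-probe cost; it does not remove the first-invocation latency or the non-determinism of how long the environment is retained.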
I'd like to stress that the implications of this are broader than just letting the telemetry fail: errors could also occur while sending the telemetry data. We can try to catch them and handle them gracefully, but some unexpected ones could leak, and that again would lead to the whole lambda execution failing. |
Thanks for the pointer to the blog post and docs, @ravikesarwani! After reading those, I agree with @afharo that sending telemetry directly from the lambda is not really appropriate. Lambdas should be stateless and perform as little work as possible, so they have a greater chance of succeeding in their small execution window. Since we (presumably) want people to take advantage of this official lambda, we should keep its scope tight, and not send direct telemetry. That said, can we collect this telemetry some other way through the stack? Could we tack on metadata to each doc that says where it came from, so we could report that in the general snapshot telemetry as an ingestion source? Or perhaps some other method, similar to how we get stats on beats usage? |
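As a sketch of the "metadata on each doc" idea, the forwarder could stamp a shipper field onto every event before indexing. This is purely illustrative: the field names below mirror the ECS agent fieldset, but whether ESF should populate them, and how stack telemetry would report on them, is exactly what is being discussed here:

```python
# Hypothetical enrichment step: tag every forwarded event with its shipper.
ESF_AGENT_METADATA = {
    "type": "elastic-serverless-forwarder",  # hypothetical agent.type value
    "version": "1.6.0",                      # hypothetical version
}

def enrich_event(event: dict) -> dict:
    """Return a copy of the event with shipper metadata attached."""
    enriched = dict(event)                   # shallow copy of the top level
    agent = dict(enriched.get("agent", {}))  # avoid mutating a shared nested dict
    agent.update(ESF_AGENT_METADATA)
    enriched["agent"] = agent
    return enriched

doc = enrich_event({"message": "log line pulled from an S3 object"})
```

Stack-side telemetry (for example, snapshot telemetry) could then aggregate document counts per agent type to estimate how much data arrives via ESF.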
Currently the forwarder doesn't populate the agent field; we could add it and leverage the […]. cc @afharo |
@mindbat @afharo
Can we narrow it down to those two and make a final decision, or do we still need to investigate? Thanks :) |
I'd like to understand this approach: is the suggestion to populate a field in the document ([…])? If we want to follow a similar route, Kibana does read the metadata of the indices to find any potential Elastic shippers in the properties […]. |
My suggestion to use […]
The only caveat about using […]
Yes, I looked into using the metadata of the indices, but as you said it's not applicable: not because of multiple sources (that could also be the case), but in general because the metadata about telemetry will vary from doc to doc in the indices, and there might be concurrent ingestions where they differ. If we are all on the same page, we have to align with the team owning the proxy in order to read those headers. Could you help me include someone from that team in the discussion? Thanks |
just joining in this conversation...
@ravikesarwani - agree, but if we have low confidence it will work for many ESF users (e.g. if customers are incurring extra costs, or have to manually configure egress to allow it), then i think it really hinders our ability to draw any conclusions from the data at all. at least with the header approach, we know that every deployment of ESF is sending telemetry data somewhere, even if our current methods for analyzing it are not ideal.
@afharo would you be able to point me to the code doing this so i can take a look? i'm interested to see how "convoluted and brittle" it really is :) thanks! |
@tommyers-elastic I think the best person to point you at that is @mindbat. I only heard that it exists but he's been involved in the enablement of the team that implemented it 😇 |
@tommyers-elastic The Clients team pulls telemetry data out of the proxy logs; you can see the logic they had to build here. Just check out that initial regex! In addition to having to write all that logic to extract data from the proxy logs, their data channel has been the source of multiple resource scaling issues, both on the development side (adding their indexer increased our CI test times by an order of magnitude) and the production side (hitting memory limits on GCP Dataflow). All due to the size of the events being sent; I'm not certain any filtering is being done on the proxy logs before they're sent to the telemetry endpoints, but even their individual event docs are larger than anyone else's. To anticipate thoughts of the telemetry service not being production ready, I'll just mention that the Security team has been sending endpoint alerts to the system since December 2020 without issue (along with other channels added over time). That's every endpoint sending every alert, and it doesn't come close to what's coming in via the proxy logs. In short, while pulling this data from the Cloud proxy logs is certainly possible, I cannot recommend it if there is literally any other option available. To borrow a term from functional programming, being able to get this info from the proxy logs is a side-effect; the logging service does not seem to have been built with this use-case as a primary function (nor should it). |
@afharo As @aspacca suspected, my question was an open one. Basically: If filebeat is used for ingestion, right now, how do we know? Do we have a mechanism in place for knowing that? And if so, can we take a similar approach to knowing data has been ingested by the SAR? |
@afharo do you have an answer for that? :) |
Kibana has a mechanism to report how data is ingested into ES (elastic/kibana#64935), although it's a bit of a guessing game because it's a mix of:
IMO, if we want to properly report what ingests data into ES, the best way to keep track of it is by reading the headers of the indexing requests to identify the agent submitting them. That can be done either on...
FWIW, I think the ESF use case will produce far less volume than the Client team's use case: the ESF indexer would filter the documents to retrieve only the requests performed by the ESF client. And the regexp does not need to be as complex (they are trying to parse the header […]).
Looking back at the description, I still think a headers-based analysis is the most appropriate form of analysis. We can agree on a header (or a set of headers) to provide the unique identifier, version of the product, and input (SQS, S3, Kinesis, CloudWatch). The existing […].

IMO, the good thing about headers is that we can get more info, like the request and response length, so we can measure the average ingested volume per deployment, compare it against the number of requests, and track potential improvements like using […]. I think it's important to highlight that, for PII reasons, the proxy logs don't index anything from the request/response's body.

My suggestion for the next steps: I would suggest taking a look at the proxy logs and creating some visualizations with the data available. Even though we don't have specific data for ESF yet, I think it's a good exercise to confirm whether this would be a good path forward or whether we'd rather look for alternatives. If it seems like it may work, we can move ahead with adding the additional headers to the ESF and ask the Cloud team to index the content from those headers. Finally, if the retention policy in the proxy logs is not enough for the SAR analysis, we can apply the extra step of forwarding the SAR-related logs to a telemetry cluster with a longer retention policy for those entries (with, possibly, some pre-filtering and post-processing to make the data even easier to consume).

However, I don't want to be the person that dictates the way to do this... in the end, I'm the last person that should make the decision since I don't belong to the ESF nor the Analytics teams 😅 What do you think? |
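A minimal sketch of what such an agreed set of headers might carry on the ESF side; every header name and the identifier scheme are assumptions for illustration, not something this thread has settled on:

```python
import uuid

ESF_VERSION = "1.6.0"              # hypothetical version value
DEPLOYMENT_ID = str(uuid.uuid4())  # anonymous identifier; in practice it would need
                                   # to be stable per deployment, which is out of
                                   # scope for this sketch

def build_telemetry_headers(input_type: str) -> dict:
    """Compose the per-request headers the Cloud proxy could later index."""
    return {
        "x-elastic-esf-id": DEPLOYMENT_ID,
        "x-elastic-esf-version": ESF_VERSION,
        "x-elastic-esf-input": input_type,  # e.g. "sqs", "s3", "kinesis", "cloudwatch"
    }

headers = build_telemetry_headers("s3")
```

Because the proxy also records request and response sizes, these few headers would be enough to slice ingested volume per deployment, version, and input type.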
Thanks so much for the input @mindbat & @afharo. There's lots of information here, but here are my main takeaways.
@mindbat I do agree - however one thing I feel differently about is that I think utilizing request headers to capture this information is totally appropriate. Perhaps we just need to work out a more appropriate way (long term) to handle the header data once it gets to Elasticsearch. @afharo i agree with the comments above. the big missing piece of this is non-cloud deployments, but as long as we keep that in mind, i think it's much better to make a start getting data from cloud than get stuck with finding a generic solution right now. thanks again for all your input on this. |
Since we're nearing the end of this task, I want to update this issue on a couple of topics:
Calling the telemetry API vs. passing custom HTTP headers
We decided to opt for the Telemetry API service. The main reason is that this service offers more flexibility and the opportunity to iterate on both the sending and indexing sides independently and quickly.

Telemetry API event rate
After checking with @ddillinger, I learned that the Telemetry API was not designed to sustain a telemetry event rate like ESF's one event for every lambda function execution. At scale, this may generate up to hundreds of thousands of events per minute. According to Dan, the Telemetry API's expected telemetry event rate for a single application is closer to "once per cluster per 24h". @afharo, please chime in if you want to add more about the expected telemetry event rate for a single application like ESF! @tommyers-elastic and @aspacca, please let me know what you think.

Next steps
If the telemetry event rate is confirmed, we need to reconfigure the ESF sending strategy and adjust the data model slightly. It's a quick change. But first, I want to hear from all of you. |
TBH, this is news to me. AFAIK, Kibana's EBT is hitting this API way more often. Are we talking about different APIs? |
That's the snapshot event rate. What we were discussing at that time was something more like "what if a telemetry event is sent for everything that ever happens on every cluster ever" -- resulting in estimated transaction rates over 100k/second. Which is well into "rearchitect the whole thing for 2 or 3 more orders of magnitude of scale" territory, and is well beyond the design considerations of what was discussed and built for what we have today. The actual event rate is definitely higher than that, but it is not like... x100 higher :) |
What's the recommended event rate for a single deployment hitting the endpoint https://telemetry.elastic.co/v3/send/esf ? Currently, we have 1.8K ESF deployments on SAR; similar applications from competitors can reach 4-8K deployments. |
I sampled a few of our more well-known channels for yesterday's data to get some rough transactions-per-second figures to work with.
This puts us at 481, so across all channels it's safe to assume 500-600 events per second are arriving typically. For safety, assume peaks (they do not arrive uniformly; telemetry tends to "follow the sun" based on where in the world is active) in the 1000/s range. This is for all incoming telemetry, not per-cluster.

Now the real question I imagine everyone wants to ask isn't this, but rather, "how much CAN we have?" The answer is: we haven't load tested at 10x or 100x scale from here. We don't have especially hard numbers about that.

There's another distinction worth making here too: there's a wide difference between how fast we can collect events and how fast we can ingest events into whichever back-end datastore is desired for any particular purpose. Even without load testing, I feel comfortable saying that we can collect events quite a bit faster than this. I don't know about 100x without trying it first, but 10x feels within reach. It's more a question of how much we want to spend on Google Cloud Functions, Google Dataflow, and Google Cloud Storage than of whether it's possible or not. I don't have immediate figures on our current spend (maybe rnd-hosting has some dashboards for it?).

Ingestion/ETL to make it visible and actionable is the bottleneck. We would have to make some clear-cut decisions about it to handle order(s) of magnitude more data. The main reason is that right now, when we asked people for requirements, the answer was essentially "we want everything, and we want it forever", and so that's the system that was built. It's not built for speed, but for user flexibility, with almost no technically imposed limitations. For example, we deduplicate by document id even through an alias (multiple indices being the only way to ALSO reasonably achieve the "forever" part, which would otherwise blow max primary shard sizes out of the water). This is because duplicates make humans sad a lot about having to deal with them, even though deduplicating through aliases is a fantastically performance-intensive thing to do, since it involves pre-fetching ids queries for every single event every time so we can map concrete […].

We're still in POC on BigQuery ingestion, so I have almost nothing to say about performance on that backend yet; we just don't have the operational experience without doing more science.

To be clear: this is not to say "no" to anything in particular. It does seem possible or even likely that we will need to introduce some constraints as trade-offs for performance, or else set some realistic expectations about it, if the business need here ends up being magnitudes higher. Some of these constraints may be simple ones, like: this firehose is too big to deduplicate, so introduce the ability to turn that feature off, index it with unique ids (and/or accept some duplicate tolerance), and back it with a data stream. That streamlines away a lot of those performance constraints.

This is all probably way, way more detail than anyone was expecting! 🤣 Anyway, the point is: there is tolerance for collection rate, and there is some tolerance for ingestion, but we may need to make a few changes there, either to give up some expensive nice-to-haves in the face of the reality of the incoming data, or to accept a spend increase of some kind in order to handle it all. |
Add telemetry data for Elastic Serverless Forwarder usage (available to users from SAR). Initial focus is on the usage of the Lambda function itself and the inputs being used (we can further limit to Elastic Cloud, if really needed). Things like:
Maybe we should collect and send data only once per SAR execution (by default, the Lambda function can run for up to 15 minutes).
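For illustration, a once-per-execution telemetry event could stay as small as the sketch below; every field name here is hypothetical, since defining the actual data model is part of this issue:

```python
import time

# Hypothetical payload for a single "once per execution" telemetry event,
# mirroring the ideas above (usage of the Lambda function and its inputs).
telemetry_event = {
    "esf_version": "1.6.0",             # hypothetical forwarder version
    "deployment_id": "<anonymous-id>",  # placeholder stable identifier
    "inputs": ["sqs", "s3"],            # input types configured for this deployment
    "execution_timestamp": int(time.time()),
    "cloud_provider": "aws",
}
```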
We should be able to graph things like:
I want to start with something small and then grow based on specific needs.
A few additional details for consideration: