What is the meaning of KEY FEATURE[ Anonymization ] in README? #541

jatinmehrotra · 2023-07-05T06:04:28Z

Checklist

I've searched for similar issues and couldn't find anything matching
I've included steps to reproduce the behavior

Affected Components

K8sGPT (CLI)
K8sGPT Operator

K8sGPT Version

No response

Kubernetes Version

No response

Host OS and its Version

No response

Steps to reproduce

Searched the entire REPO for the meaning of events where anonymization would not apply.

Expected behaviour

I am trying to understand this particular line, which appears in the README under Key Features -> Anonymization:

Anonymization does not currently apply to events.

What kind of events are we talking about, and what details will be shared with the AI backend in the case of such events?

Actual behaviour

No response

Additional Information

No response

arbreezy · 2023-07-11T16:47:11Z

hey @jatinmehrotra,
K8sGPT has analysers for certain K8s resources (StatefulSets, Deployments, Pods, etc.).

In a few analysers, such as Pod, we feed event messages to the AI backend. These messages are not known beforehand, so we are not masking them for the time being.

Further research is needed to understand the patterns so that we can mask the sensitive parts of an event, such as pod name and namespace.

The majority of the analysers produce custom error messages that we have created ourselves and are therefore able to mask.

By masking, I mean swapping the sensitive strings in the error messages (e.g. namespace and pod names) for random hashes before they are shared with the AI backend of your choice. In the analysis report we then swap them back and present the original pod and namespace strings to the user. Hope that makes more sense.
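
The swap-and-restore idea described above can be sketched roughly as follows. This is a minimal illustration, not K8sGPT's actual implementation: the function names `mask` and `unmask`, the sample error message, and the simulated backend reply are all assumptions made up for the example.

```python
import re
import secrets


def mask(text, sensitive):
    """Replace each sensitive string with a random token.

    Returns the masked text and a token -> original mapping.
    (Hypothetical helper, not part of K8sGPT.)
    """
    # Longest values first so a short name nested in a longer one is not matched early.
    values = sorted(set(sensitive), key=len, reverse=True)
    if not values:
        return text, {}
    token_for = {value: secrets.token_hex(8) for value in values}
    pattern = re.compile("|".join(re.escape(value) for value in values))
    masked = pattern.sub(lambda m: token_for[m.group(0)], text)
    mapping = {token: value for value, token in token_for.items()}
    return masked, mapping


def unmask(text, mapping):
    """Swap the random tokens back to the original strings."""
    if not mapping:
        return text
    pattern = re.compile("|".join(re.escape(token) for token in mapping))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


# Made-up error message containing a namespace and a resource name.
error = "Deployment default/payments-api has 3 replicas but 1 are available"
masked, mapping = mask(error, ["default", "payments-api"])

# Stand-in for whatever the AI backend would return; it only ever sees tokens.
reply = f"Analysis of: {masked}"
report = unmask(reply, mapping)
```

Because the mapping never leaves the local process, the backend only sees the random tokens, while the user still gets a report with the real names. This only works for strings that are known in advance, which is why free-form event messages are harder to handle.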

jatinmehrotra · 2023-07-12T06:52:37Z

Thank you @arbreezy for the explanation. It is really helpful and definitely makes sense 💯

Based on the above explanation, I want to confirm a few things, as I am planning to introduce K8sGPT in one of my projects.

In a few analysers, such as Pod, we feed event messages to the AI backend. These messages are not known beforehand, so we are not masking them for the time being.

Can you provide me with the list of analysers in which masking takes place and the list in which it does not?
Can you provide me with the list of parameters which are not masked before being sent to the backend AI?
By when is the K8sGPT team planning to implement masking for unknown events? Is it in the pipeline? (I know this can be difficult to answer.)

By masking, I mean swapping the sensitive strings in the error messages (e.g. namespace and pod names)

Which details are currently masked before being sent to the backend AI? Can you provide me with the list of strings that are masked?

Unrelated to the above explanation

In the docs it is mentioned that K8sGPT is being used in production for customer projects. If that's the case, and unknown events are not being masked, does that not pose a security risk to sensitive customer information? In other words, is it safe to assume that information which is not being masked will not pose any security risk to sensitive customer information? Please clarify.

AlexsJones · 2023-07-12T09:48:18Z

Hi, thanks for your interest, and thanks to @arbreezy for answering some of the questions.

I am one of the founders of the project, and I am thrilled to see involvement and discussion here.
I also wanted to extend some of the answers and hopefully give some satisfactory responses!

In a few analysers, such as Pod, we feed event messages to the AI backend. These messages are not known beforehand, so we are not masking them for the time being.

Can you provide me with the list of analysers in which masking takes place and the list in which it does not?

Masking

StatefulSet
Service
PodDisruptionBudget
Node
NetworkPolicy
Ingress
HPA
Deployment
CronJob

We typically will not mask the below because we don't send any identifying information, just the fact that one of these things has been detected to be incorrect.

No Masking

ReplicaSet
PersistentVolumeClaim
Pod
Events

Can you provide me with the list of parameters which are not masked before being sent to the backend AI?

Fields:

Describe
ObjectStatus
Replicas
ContainerStatus
Event Message
ReplicaStatus
Count (Pod)

By when is the K8sGPT team planning to implement masking for unknown events? Is it in the pipeline? (I know this can be difficult to answer.)

It's planned for V2, which should come later this year, in Q4.

By masking, I mean swapping the sensitive strings in the error messages (e.g. namespace and pod names)

Which details are currently masked before being sent to the backend AI? Can you provide me with the list of strings that are masked?

Please see https://docs.k8sgpt.ai/reference/guidelines/privacy/

I don't have an exact list of the strings being sent; maybe I misunderstood the question.

Unrelated to the above explanation

In the docs it is mentioned that K8sGPT is being used in production for customer projects. If that's the case, and unknown events are not being masked, does that not pose a security risk to sensitive customer information? In other words, is it safe to assume that information which is not being masked will not pose any security risk to sensitive customer information? Please clarify.

The bottom line is that in critical production environments (like one of the banks I used to work at) I would recommend an entirely different backend: use a local model. Then you can rest easy knowing that it's inside your DMZ and nothing is leaking.
If there is even a hint of uncertainty about sending up data that might be business critical, I would not advise sending it to a public LLM. It's very nebulous how some of them grow their corpus of data from your questions.

If you would like an example of how to use LocalAI (one of our providers), which lets you use your own models, we would be happy to share docs, blogs, and posts.

jatinmehrotra · 2023-07-13T00:38:12Z

@AlexsJones Thank you so much for the explanation and your conclusion to use local AI.

We typically will not mask the below because we don't send any identifying information, just that one of these things has been detected to be incorrect

Fields:
Describe
ObjectStatus
Replicas
ContainerStatus
Event Message
ReplicaStatus
Count (Pod)

If my understanding is correct, out of the unmasked fields, Event Message is the one that can contain identifying information, isn't it? If so, is there an example of an Event Message that I can refer to, so I can gauge to what extent identifying information is being sent to the backend AI?

AlexsJones · 2023-07-13T08:27:28Z

I always err on the side of caution - so yes, it is quite possible the payload of the event might have something like "super-secret-project-pod-X crashed" which we don't currently redact.

As an example - if you use k8sgpt integration enable trivy you'll see events unredacted like this ->

 Message:             Created pod: scan-vulnerabilityreport-9c4c6f747-4g879
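
One rough way to gauge this kind of exposure is to check which resource names you already know about appear verbatim in an event message before it is sent anywhere. The sketch below is only an illustration under that assumption: `names_in_message` is a hypothetical helper (not a K8sGPT function), and the list of known names is made up apart from the pod name taken from the message above.

```python
def names_in_message(message, known_names):
    """Return the known resource names that appear verbatim in an event message.

    Hypothetical helper for illustration; not part of K8sGPT.
    """
    return [name for name in known_names if name in message]


# Event message from the example above.
event_message = "Created pod: scan-vulnerabilityreport-9c4c6f747-4g879"

# Names you might collect from your own cluster (made up, except the pod name).
known = [
    "scan-vulnerabilityreport-9c4c6f747-4g879",
    "kube-system",
    "payments-api",
]

exposed = names_in_message(event_message, known)
```

Here `exposed` contains the pod name, which shows that the message would reveal it to the backend as-is. A simple substring check like this only catches names you already know, so it is a way to estimate exposure rather than a redaction mechanism.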

jatinmehrotra · 2023-07-14T01:46:47Z

Thank you so much @AlexsJones for your explanation. Really helpful.

I would like to send a PR to update the Anonymization section of the README based on our discussion, as I am sure others might be wondering the same thing. I will push the PR by the end of the day.

AlexsJones · 2023-07-14T09:25:37Z

Sounds great, I will close this issue for now but please feel free to reference/re-open if needed

github-project-automation bot added this to Backlog Jul 5, 2023

github-project-automation bot moved this to Todo in Backlog Jul 5, 2023

AlexsJones closed this as completed Jul 14, 2023

github-project-automation bot moved this from Todo to Done in Backlog Jul 14, 2023

jatinmehrotra mentioned this issue Jul 18, 2023

docs: fix readme for anonymization #559

Merged

4 tasks
