how to interpolate smooth distributions? #28
Comments
Hey there! Is that example running for you? It spits an error at me because the inputs to the pandas DataFrame are not equal in length. I know exactly the issue you're having. One way is to make each possibility appear at least once in the dataframe you provide to
So in my case I would have to make sure that all 24 hours are present in the data before I can make a prediction.
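For what it's worth, the requirement above (every hour present at least once) can be sketched with plain pandas. This is only an illustration: the `domains` dict and the choice to widen `HourOfDay` to `range(24)` are my assumptions, not something the library asks for, and the result is just a DataFrame that could be passed to whatever fitting step you use.

```python
import itertools

import pandas as pd

samples = pd.DataFrame({
    "Host": ["carl", "ermano", "jon", "albert"],
    "Detection": ["PsExec", "PsExec", "PsExec", "Quarantine"],
    "Outcome": ["TP", "FP", "FP", "TP"],
    "HourOfDay": [5, 10, 13, 6],
})

# Assumption: each variable's domain is its observed values, except
# HourOfDay, which is widened by hand to all 24 hours.
domains = {col: sorted(samples[col].unique()) for col in samples.columns}
domains["HourOfDay"] = list(range(24))

# One pseudo-row per combination of values, so nothing is ever "unseen".
pseudo_rows = pd.DataFrame(
    list(itertools.product(*domains.values())),
    columns=list(domains.keys()),
)

# Real observations plus pseudo-rows.
augmented = pd.concat([samples, pseudo_rows], ignore_index=True)
print(len(samples), len(pseudo_rows), len(augmented))
```

Keep in mind that every pseudo-row counts as a real observation, so with only four true samples the fitted probabilities end up dominated by the uniform padding.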
However, this still happens:
import sorobn as hh

def example_dag():
    # simple equivalent notation
    bn = hh.BayesNet(
        ('Host', 'Alarm'),
        ('Alarm', 'True Positive'),
        ('Alarm', 'False Positive'),
        seed=42,
    )
    bn = hh.BayesNet((["Host"], "Alarm"),
                     ("Alarm", ["True Positive", "False Positive"]), seed=42)

def example_learning():
    import pandas as pd
    samples = pd.DataFrame({"Host": ["carl", "ermano", "jon", "albert"],
                            "Detection": ["PsExec", "PsExec", "PsExec", "Quarantine"],
                            "Outcome": ["TP", "FP", "FP", "TP"],
                            "HourOfDay": [5, 10, 13, 6]})
    print(samples)
    # learn a tree structure from the data, then fit the network on it
    structure = hh.structure.chow_liu(samples)
    bn = hh.BayesNet(*structure)
    bn = bn.fit(samples)
    bn.prepare()
    '''
    dot = bn.graphviz()
    path = dot.render('asia', directory='figures', format='svg', cleanup=True)
    '''
    print("Probability of detection")
    print(bn.P["Detection"])
    print("Probability of outcome")
    print(bn.P["Outcome"])
    print("Probability of FP at 5 am")
    event = {"Host": "carl", "Detection": "PsExec", "HourOfDay": 5}
    bn.predict_proba(event)
    print("Probability of FP at 6 am")
    # this fails because the combination is unseen: how do we generalize?
    event = {"Host": "carl", "Detection": "PsExec", "HourOfDay": 6}
    bn.predict_proba(event)

example_learning()
The 6 am hour is now available, but if I don't provide an event that exactly matches one from the dataset, it complains.
So, just to be clear: the first event works because it is identical to a row in the dataset, but the second one fails, as it doesn't seem to interpolate the probability...
# this is fine, but it is exactly the same event as in the dataset
event = {"Host":"albert","Detection":"Quarantine","HourOfDay":6}
bn.predict_proba(event)
# this still fails...
event = {"Host":"carl","Detection":"PsExec","HourOfDay":6}
bn.predict_proba(event)
What I was saying is that you can ensure every case is seen by the BN by calculating a Cartesian product between all values. This way, each occurrence appears at least once. This works:
import itertools
import sorobn as hh
import pandas as pd
samples = pd.DataFrame({"Host":["carl","ermano","jon","albert"],
"Detection":["PsExec","PsExec","PsExec","Quarantine"],
"Outcome":["TP","FP","FP","TP"],
"HourOfDay":[5,10,13,6]})
unique_values = [samples[col].unique() for col in samples.columns]
cartesian_product = list(itertools.product(*unique_values))
cartesian_df = pd.DataFrame(cartesian_product, columns=samples.columns)
structure = hh.structure.chow_liu(samples)
bn = hh.BayesNet(*structure)
bn = bn.fit(pd.concat([samples, cartesian_df]))
bn.prepare()
'''
dot = bn.graphviz()
path = dot.render('asia', directory='figures', format='svg', cleanup=True)
'''
print("Probability of detection")
print(bn.P["Detection"])
print("Probability of outcome")
print(bn.P["Outcome"])
print("Probability of FP at 5 am")
event = {"Host":"carl","Detection":"PsExec","HourOfDay":5}
bn.predict_proba(event)
print("Probability of FP at 6 am")
# this combination is not in the original samples; the Cartesian product rows now cover it
event = {"Host":"carl","Detection":"PsExec","HourOfDay":6}
bn.predict_proba(event)
I agree that this should be a smoother experience. The BN could use a prior and output a (very) low probability for cases not seen in the training data. I don't have time to work on this right now, but I will. In the meantime, this Cartesian product trick should work.
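To make the "prior" idea above concrete, here is a minimal sketch of additive (Laplace) smoothing on a single conditional table, P(Outcome | HourOfDay), written with pandas only. It is not the library's API: `smoothed_cpt` and its `alpha` pseudo-count are hypothetical names I made up for illustration, and the thread does not say whether the library exposes anything like it.

```python
import pandas as pd

samples = pd.DataFrame({
    "Host": ["carl", "ermano", "jon", "albert"],
    "Detection": ["PsExec", "PsExec", "PsExec", "Quarantine"],
    "Outcome": ["TP", "FP", "FP", "TP"],
    "HourOfDay": [5, 10, 13, 6],
})


def smoothed_cpt(df, child, parent, child_values, parent_values, alpha=1.0):
    """Estimate P(child | parent), adding `alpha` pseudo-counts to every cell."""
    counts = pd.crosstab(df[parent], df[child]).reindex(
        index=list(parent_values), columns=list(child_values), fill_value=0
    )
    smoothed = counts + alpha
    # Normalise each row so it sums to 1.
    return smoothed.div(smoothed.sum(axis=1), axis=0)


# Hypothetical helper, small pseudo-count: unseen hours fall back to a
# uniform distribution, seen hours stay close to their observed frequencies.
cpt = smoothed_cpt(samples, "Outcome", "HourOfDay", ["FP", "TP"], range(24), alpha=0.1)
print(cpt.loc[[5, 6, 7]])
```

A smaller `alpha` trusts the data more, while a larger one pulls every row toward uniform. Unlike the Cartesian product trick, this leaves the training rows themselves untouched.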
Hi there,
I guess I may have already hit a limitation with the library.
Any help would be great; maybe I have to move to a more complex solution.
Anyway, here's my issue:
I want to predict the probability of a false positive at 6 am, which was not observed in the training set.
I am not sure what the correct approach is here. Is there a way to assign a smooth distribution across the 24 hours, so that an unobserved hour still gets a small probability?
How do other libraries like pomegranate handle this kind of situation?
Cheers!
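To illustrate what I mean by a smooth distribution across the 24 hours, here is a rough sketch using a circular Gaussian kernel (a Nadaraya-Watson style estimate) over the hour of day, so that 6 am borrows evidence from 5 am and 7 am. This is plain NumPy and not something the library does as far as I know; `smooth_hourly_rate`, `bandwidth` and `alpha` are hypothetical names and values picked only for the example.

```python
import numpy as np


def smooth_hourly_rate(hours, is_fp, bandwidth=2.0, alpha=0.5):
    """Estimate P(FP | hour) for hours 0..23 with a circular Gaussian kernel."""
    hours = np.asarray(hours)
    is_fp = np.asarray(is_fp, dtype=float)
    grid = np.arange(24)

    # Circular distance between each grid hour and each observed hour,
    # so that 23:00 and 00:00 are one hour apart.
    diff = np.abs(grid[:, None] - hours[None, :])
    dist = np.minimum(diff, 24 - diff)

    # Nearby observations get more weight than distant ones.
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)
    fp_mass = weights @ is_fp
    total_mass = weights.sum(axis=1)

    # alpha pseudo-counts per outcome keep the estimate away from 0 and 1.
    return (fp_mass + alpha) / (total_mass + 2 * alpha)


# Same four observations as in the example: 1 marks a false positive.
hours = [5, 10, 13, 6]
is_fp = [0, 1, 1, 0]
p_fp = smooth_hourly_rate(hours, is_fp)
print(np.round(p_fp, 3))
```

Hours close to the observed false positives (10 and 13) get a higher estimate, hours close to the true positives (5 and 6) get a lower one, and hours far from any observation drift toward 0.5.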