how to interpolate smooth distributions? #28

Open
priamai opened this issue Nov 4, 2023 · 5 comments

@priamai

priamai commented Nov 4, 2023

Hi there,
I think I may have hit a limitation of the library.
Any help would be great; maybe I have to move to a more complex solution.
Anyway, here's my issue:

def example_learning():

    import pandas as pd
    import sorobn as hh

    samples = pd.DataFrame({"Host":["carl","ermano","jon"],
                       "Detection":["PsExec","PsExec","PsExec"],
                        "Outcome":["TP","FP"],
                       "HourOfDay":[5,10,13]})
    print(samples)

    structure = hh.structure.chow_liu(samples)

    bn = hh.BayesNet(*structure)
    bn = bn.fit(samples)
    bn.prepare()
    '''
    dot = bn.graphviz()

    path = dot.render('asia', directory='figures', format='svg', cleanup=True)
    '''
    print("Probability of detection")
    print(bn.P["Detection"])

    print("Probability of outcome")
    print(bn.P["Outcome"])
    print("Probability of FP at 5 am")
    event = {"Host":"carl","Detection":"PsExec","HourOfDay":5}
    bn.predict_proba(event)
    print("Probability of FP at 6 am")
    # this will fail because is unseen: how do we generalize?
    event = {"Host":"carl","Detection":"PsExec","HourOfDay":6}
    bn.predict_proba(event)

I want to predict the probability of a false positive at 6 am, which was not observed in the training set.
I am not sure what the correct approach is here. Is there a way to assign a smooth distribution across the 24 hours, so that an unobserved hour still gets a tiny probability?

How do other libraries like pomegranate handle this kind of situation?
Cheers!

@MaxHalford
Owner

Hey there! Is that example running for you? It spits an error at me because the inputs to the pandas DataFrame are not equal in length.

I know exactly the issue you're having. One way around it is to make each possibility appear at least once in the dataframe you provide to fit. That way, each possibility has been seen at least once, so the probability of any event will be greater than 0.

@priamai
Author

priamai commented Nov 8, 2023

So in my case I would have to make sure that all 24 hours are present in the training data before I can make a prediction.
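
A minimal sketch of what that could look like, using only pandas and itertools (nothing here is part of sorobn). It assumes HourOfDay ranges over 0 to 23 and that the other columns only take the values already observed; the names `domains`, `pseudo` and `padded` are made up for illustration.

```python
import itertools

import pandas as pd

samples = pd.DataFrame({
    "Host": ["carl", "ermano", "jon", "albert"],
    "Detection": ["PsExec", "PsExec", "PsExec", "Quarantine"],
    "Outcome": ["TP", "FP", "FP", "TP"],
    "HourOfDay": [5, 10, 13, 6],
})

# Domain of each variable: the observed values, except for HourOfDay,
# which is ASSUMED to cover all 24 hours whether observed or not.
domains = {col: sorted(samples[col].unique()) for col in samples.columns}
domains["HourOfDay"] = list(range(24))

# One pseudo-row per combination of values, so that every combination
# appears at least once in the data used for fitting.
pseudo = pd.DataFrame(
    list(itertools.product(*domains.values())),
    columns=list(domains.keys()),
)

# Real observations first, then the pseudo-rows.
padded = pd.concat([samples, pseudo], ignore_index=True)
print(len(samples), len(pseudo), len(padded))
```

Note that with this few real rows the pseudo-rows dominate the counts, so the fitted probabilities end up close to uniform. The padding only guarantees non-zero probabilities; it does not make nearby hours share information.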

@priamai
Author

priamai commented Nov 8, 2023

However this still happens:

import sorobn as hh

def example_dag():
    # simple equivalent notion
    bn = hh.BayesNet(
    ('Host', 'Alarm'),
    ('Alarm', 'True Positive'),
    ('Alarm', 'False Positive'),
    seed=42,
    )

    bn = hh.BayesNet((["Host"],"Alarm"),
                     ("Alarm",["True Positive","False Positive"]),seed=42)

def example_learning():

    import pandas as pd

    samples = pd.DataFrame({"Host":["carl","ermano","jon","albert"],
                       "Detection":["PsExec","PsExec","PsExec","Quarantine"],
                        "Outcome":["TP","FP","FP","TP"],
                       "HourOfDay":[5,10,13,6]})
    print(samples)

    structure = hh.structure.chow_liu(samples)

    bn = hh.BayesNet(*structure)
    bn = bn.fit(samples)
    bn.prepare()
    '''
    dot = bn.graphviz()

    path = dot.render('asia', directory='figures', format='svg', cleanup=True)
    '''
    print("Probability of detection")
    print(bn.P["Detection"])

    print("Probability of outcome")
    print(bn.P["Outcome"])
    print("Probability of FP at 5 am")
    event = {"Host":"carl","Detection":"PsExec","HourOfDay":5}
    bn.predict_proba(event)
    print("Probability of FP at 6 am")
    # this will fail because is unseen: how do we generalize?
    event = {"Host":"carl","Detection":"PsExec","HourOfDay":6}
    bn.predict_proba(event)

example_learning()

6 am is now in the data, but if the event does not exactly match a row from the dataset it still complains.
I am a bit sceptical of the applicability; one would expect a basic level of generalization.

@priamai
Author

priamai commented Nov 8, 2023

So just to be clear, the first event works because it is identical to a row in the dataset, but the second one fails, as it doesn't seem to interpolate the probability...

    # this is fine but is exactly the same event ....
    event = {"Host":"albert","Detection":"Quarantine","HourOfDay":6}
    bn.predict_proba(event)

    # this still fails...
    event = {"Host":"carl","Detection":"PsExec","HourOfDay":6}
    bn.predict_proba(event)

@MaxHalford
Owner

What I was saying is that you can ensure every case is seen by the BN by calculating a Cartesian product between all values. This way, each occurrence appears at least once. This works:

import itertools
import sorobn as hh
import pandas as pd

samples = pd.DataFrame({"Host":["carl","ermano","jon","albert"],
                    "Detection":["PsExec","PsExec","PsExec","Quarantine"],
                    "Outcome":["TP","FP","FP","TP"],
                    "HourOfDay":[5,10,13,6]})

unique_values = [samples[col].unique() for col in samples.columns]
cartesian_product = list(itertools.product(*unique_values))
cartesian_df = pd.DataFrame(cartesian_product, columns=samples.columns)

structure = hh.structure.chow_liu(samples)

bn = hh.BayesNet(*structure)
bn = bn.fit(pd.concat([samples, cartesian_df]))
bn.prepare()
'''
dot = bn.graphviz()

path = dot.render('asia', directory='figures', format='svg', cleanup=True)
'''
print("Probability of detection")
print(bn.P["Detection"])

print("Probability of outcome")
print(bn.P["Outcome"])
print("Probability of FP at 5 am")
event = {"Host":"carl","Detection":"PsExec","HourOfDay":5}
print(bn.predict_proba(event))
print("Probability of FP at 6 am")
# carl/PsExec at 6 am is not in the original samples, but thanks to the
# Cartesian product rows this combination has now been seen at least once
event = {"Host":"carl","Detection":"PsExec","HourOfDay":6}
print(bn.predict_proba(event))

I agree that this should be a smoother experience. The BN could use a prior and output a (very) low probability for cases not seen in the training data. I don't have time to work on this right now, but I will. In the meantime, this Cartesian product trick should work.
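
For reference, one common form of such a prior is additive (Laplace) smoothing: add a pseudocount to every cell of a conditional probability table before normalising. The sketch below does this by hand with pandas for P(Outcome | HourOfDay). It is not sorobn functionality; `smoothed_cpt` and the `alpha` parameter are hypothetical names, and the 0 to 23 hour range is an assumption.

```python
import pandas as pd

samples = pd.DataFrame({
    "Host": ["carl", "ermano", "jon", "albert"],
    "Detection": ["PsExec", "PsExec", "PsExec", "Quarantine"],
    "Outcome": ["TP", "FP", "FP", "TP"],
    "HourOfDay": [5, 10, 13, 6],
})


def smoothed_cpt(df, child, parent, child_values, parent_values, alpha=1.0):
    """Estimate P(child | parent) with additive smoothing (hypothetical helper)."""
    # Raw co-occurrence counts, extended to the full domains so that
    # unseen parent/child values show up as zero counts.
    counts = pd.crosstab(df[parent], df[child]).reindex(
        index=list(parent_values), columns=list(child_values), fill_value=0
    )
    # Add the pseudocount to every cell, then normalise each row.
    smoothed = counts + alpha
    return smoothed.div(smoothed.sum(axis=1), axis=0)


cpt = smoothed_cpt(
    samples,
    child="Outcome",
    parent="HourOfDay",
    child_values=["FP", "TP"],
    parent_values=range(24),  # assumed domain: every hour of the day
    alpha=1.0,
)
print(cpt.loc[5])  # observed hour: the data pull the estimate towards TP
print(cpt.loc[7])  # unseen hour: falls back to the uniform prior
```

A smaller `alpha` keeps the estimates closer to the raw frequencies while still avoiding zeros. This treats every hour independently, so an unseen hour gets the uniform prior rather than a value interpolated from neighbouring hours.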
