
Major code refactor to unify quasi experiment classes #381

Merged 72 commits into main on Aug 22, 2024
Conversation

drbenvincent (Collaborator) commented Jul 2, 2024

This is a relatively major code refactor with minor breaking changes to the API. The main purpose is to eliminate the parallel class hierarchy we had: virtually identical experiment classes that worked with either the PyMC or the scikit-learn models. The only real differences dealt with the fact that, for example, PyMC models produce InferenceData objects while scikit-learn models produce numpy arrays.

We don't have an immediate intention of expanding beyond PyMC or scikit-learn models; however, the new code structure would make it much easier to expand the kinds of models used. The main appeal of this is to focus on a high-level description of quasi-experimental methods and to abstract away model-related implementation issues. So you could add non-PyMC Bayesian models (see #116), or use statsmodels (see #8) to run OLS but also get confidence intervals (which you don't get from scikit-learn models).

We should have 100% passing doctests and tests, and I re-ran all the notebooks to check that we have stable performance.

Before

[class diagram]

After (at time of initial PR)

[class diagram]

After (after dealing with review comments)

[class diagram]

So now we just have a single set of quasi-experiment classes, all inheriting from BaseExperiment.

Other changes

  • I renamed ModelBuilder to PyMCModel. This seems to make more sense as it contrasts better with a new ScikitLearnAdaptor class/mixin, which gives some extra functionality to scikit-learn models.
  • I increased test coverage.
  • Plotting is done by the experiment classes, through either the bayes_plot or ols_plot methods, though some experiment classes have custom plot methods (see the sketch below).
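
As a rough sketch of that plotting dispatch (illustrative only, not the actual CausalPy code; the import path for PyMCModel is assumed, and bayes_plot / ols_plot are assumed to be defined on the experiment classes as described above):

from sklearn.base import RegressorMixin
from causalpy.pymc_models import PyMCModel  # assumed import path, for illustration

class BaseExperiment:
    def __init__(self, model):
        self.model = model

    def plot(self, *args, **kwargs):
        # Route to the Bayesian or OLS plotting code depending on the model type
        if isinstance(self.model, PyMCModel):
            return self.bayes_plot(*args, **kwargs)
        elif isinstance(self.model, RegressorMixin):
            return self.ols_plot(*args, **kwargs)
        else:
            raise ValueError("Model type not recognized")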

API changes

The change in API for the user is relatively small. The only change should really be how the experiment classes are imported. For example:

Before

import causalpy as cp
df = cp.load_data("did")
result = cp.pymc_experiments.DifferenceInDifferences(
    df,
    formula="y ~ 1 + group*post_treatment",
    time_variable_name="t",
    group_variable_name="group",
    model=cp.pymc_models.LinearRegression(sample_kwargs={"random_seed": seed}),
)

After

The import changes from cp.pymc_experiments.DifferenceInDifferences to cp.DifferenceInDifferences.

import causalpy as cp
df = cp.load_data("did")
result = cp.DifferenceInDifferences(
    df,
    formula="y ~ 1 + group*post_treatment",
    time_variable_name="t",
    group_variable_name="group",
    model=cp.pymc_models.LinearRegression(sample_kwargs={"random_seed": seed}),
)

The old API will still work, but will emit a deprecation warning. At some point in the future we may remove the old API, so it is best to make this minor update to existing workflows.

TODOs

  • Fix up the not-quite-perfect use of if isinstance in the experiment classes
  • Add missing module level docstrings to improve the auto generated API docs

📚 Documentation preview 📚: https://causalpy--381.org.readthedocs.build/en/381/

drbenvincent (Collaborator, Author) commented Aug 7, 2024

Thanks @wd60622. Still working through the comments, but I didn't follow what you meant about the deprecation warnings not working.

When I run with the old API (e.g. cp.pymc_experiments.DifferenceInDifferences) then we do get a warning, followed by the rest of the output:

[screenshot showing the deprecation warning followed by normal output]

I'm also seeing the deprecation tests pass (at the bottom of test_integration_pymc_examples.py)

  (Quoting @wd60622:) The import of experiments in the warnings does not work. Do you want people to use cp.DifferenceInDifferences or cp.experiments.DifferenceInDifferences, or both?

So the new thing is cp.DifferenceInDifferences, but in order to not break things cp.experiments.DifferenceInDifferences will still work but issue a deprecation warning. I could issue the warning then error out at that point, but right now the old API still works. And it does so by just routing through to cp.DifferenceInDifferences.
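
As an aside, here is a minimal sketch of how such a routing-plus-warning shim can be written in general (illustrative only, not necessarily how CausalPy implements it):

import warnings

def DifferenceInDifferences(*args, **kwargs):
    # Hypothetical shim kept at the old import path; names and structure are illustrative.
    import causalpy as cp

    warnings.warn(
        "This import path is deprecated; use cp.DifferenceInDifferences instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return cp.DifferenceInDifferences(*args, **kwargs)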

drbenvincent (Collaborator, Author):

  1. Would it be convenient to be able to pass scikit-learn models as well and use create_causalpy_compatible_class behind the scenes within the class? It seems like that was the behavior before, and it might break some previous workflows. I see most of the tests just do it once at the top, which could then be avoided.

This is a great suggestion @wd60622, and stopped me from being a bit lazy! I've addressed this in 02dacb2

My first attempt turned out to be a dead end. It basically came down to the fact that we are passing in an instantiated model object, not a model class. This means that if the model was built with non-trivial kwargs, it got highly complex to create another class with the mixin approach.

So the solution I went with was to simply take the user-provided model instance and to use a helper function to attach the methods required by CausalPy to that model instance. I did use GPT to help with this function :)

def add_mixin_methods(model_instance, mixin_class):
    """Attach all public methods of mixin_class to model_instance, bound to that instance."""
    for attr_name in dir(mixin_class):
        attr = getattr(mixin_class, attr_name)
        if callable(attr) and not attr_name.startswith("__"):
            # Bind the method to the instance
            method = attr.__get__(model_instance, model_instance.__class__)
            setattr(model_instance, attr_name, method)
    return model_instance

So now users can just provide unadulterated scikit-learn model instances. The experiment base class then adds the required methods to this behind the scenes. So there is zero API change for the user.
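
To make that concrete, here is a toy usage of the add_mixin_methods helper above. ExampleMixin is made up purely for illustration; CausalPy's actual mixin is ScikitLearnAdaptor, whose methods are not shown here:

from sklearn.linear_model import LinearRegression

class ExampleMixin:
    def get_coeffs(self):
        # Relies on the fitted scikit-learn attribute coef_
        return self.coef_

model = LinearRegression()                       # plain scikit-learn model instance
model = add_mixin_methods(model, ExampleMixin)   # attach the mixin's methods to it

model.fit(X=[[0.0], [1.0], [2.0]], y=[0.0, 1.0, 2.0])
print(model.get_coeffs())  # the bound mixin method now works on this instance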

drbenvincent (Collaborator, Author):

  1. The fit method should ideally have the same signature. This leads to the if/elif blocks that are all over the place. Though this might be another refactor.

So this is an example in DifferenceInDifferences:

# fit model
if isinstance(self.model, PyMCModel):
    COORDS = {"coeffs": self.labels, "obs_indx": np.arange(self.X.shape[0])}
    self.model.fit(X=self.X, y=self.y, coords=COORDS)
elif isinstance(self.model, RegressorMixin):
    self.model.fit(X=self.X, y=self.y)
else:
    raise ValueError("Model type not recognized")

You are right, we only need all this type checking because of the different signatures of the fit methods. I can't avoid providing coords to the PyMC fit method. I think the only other alternative is to create and pass coords to both fit methods - so I'd have to override the default fit method of the scikit-learn objects to make it accept **kwargs, and the coords would basically disappear into a black hole.

It's maybe a bit clunky because you are needlessly defining and passing coords when you have scikit-learn models, but that seems like the easiest way to go.

However, this would only affect internal code and not affect the user experience. So I'm happy if we sit and think about this one and address it at a later date. But open to any other ideas.
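
For concreteness, the alternative discussed above could look roughly like this sketch (class names are made up and this is not how CausalPy is currently written): the scikit-learn side accepts the same fit signature as the PyMC models and simply discards coords.

from sklearn.linear_model import LinearRegression

class FitSignatureAdaptor:
    """Give a scikit-learn estimator a fit() that tolerates PyMC-style kwargs."""

    def fit(self, X, y, coords=None, **kwargs):
        # coords (and any other PyMC-only kwargs) vanish into a black hole here
        return super().fit(X, y)

class AdaptedLinearRegression(FitSignatureAdaptor, LinearRegression):
    pass

model = AdaptedLinearRegression()
model.fit(X=[[0.0], [1.0]], y=[0.0, 1.0], coords={"coeffs": ["x"]})  # coords is ignored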

drbenvincent (Collaborator, Author):

Thanks for the final points @wd60622. Fingers crossed, that should be it now?

drbenvincent added the enhancement (New feature or request) label on Aug 20, 2024
wd60622 (Contributor) left a comment:

Looks good! Just two things I noticed.

Comment on lines 16 to 18
import causalpy.pymc_experiments as pymc_experiments # to be depricated
import causalpy.pymc_models as pymc_models
import causalpy.skl_experiments as skl_experiments # to be depricated
Suggested change:
- import causalpy.pymc_experiments as pymc_experiments # to be depricated
- import causalpy.pymc_models as pymc_models
- import causalpy.skl_experiments as skl_experiments # to be depricated
+ import causalpy.pymc_experiments as pymc_experiments # to be deprecated
+ import causalpy.pymc_models as pymc_models
+ import causalpy.skl_experiments as skl_experiments # to be deprecated

drbenvincent (Collaborator, Author):
Thanks - did a global find/replace

Comment on lines 18 to 23
import causalpy as cp

sample_kwargs = {"tune": 20, "draws": 20, "chains": 2, "cores": 2}


def test_regression_kink_gradient_change():
Suggested change:
- import causalpy as cp
- sample_kwargs = {"tune": 20, "draws": 20, "chains": 2, "cores": 2}
- def test_regression_kink_gradient_change():
+ import causalpy as cp
+ def test_regression_kink_gradient_change():

drbenvincent (Collaborator, Author):
Good catch, have removed sample_kwargs

drbenvincent merged commit e55f23b into main on Aug 22, 2024 (8 checks passed)
drbenvincent deleted the refactor branch on August 22, 2024 at 08:59
twiecki (Contributor) commented Aug 22, 2024

Congrats!

Labels: enhancement (New feature or request), major
Projects: none yet
Development: successfully merging this pull request may close the issue "Have more coherent testing of the summary method"
4 participants