Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No support for weights #136

Closed
pakuipers opened this issue Jan 25, 2019 · 32 comments · Fixed by #692
Closed

No support for weights #136

pakuipers opened this issue Jan 25, 2019 · 32 comments · Fixed by #692
Labels
feature a feature request or enhancement

Comments

@pakuipers
Copy link

The weights argument in fit does nothing. Most modeling packages allow for case weights, so is this going to be supported in the future?

@alexpghayes
Copy link

I'm guessing it's getting silently ignored. This can be surprising for sure. @DavisVaughan What about using ellipsis to warn when this happens?

@DavisVaughan
Copy link
Member

It's definitely getting silently ignored. ellipsis isn't on CRAN and idk of any plans to get it there. It's more for misspellings (when ... are ok, but the user messed up). In this case, we dont allow any args through to the ... of fit() so we could probably just use a variation of parsnip:::check_empty_ellipse to ensure that there are no dots.

Regarding the actual problem of adding weights, I was going to suggest the use of a data descriptor like so:

x <-c(rnorm(10))
df <- data.frame(y=1+2*x+rnorm(10)/2, x=x, wght1=1:10)

library(parsnip)

linear_reg() %>%
  set_engine("lm", weights = .dat()$wght1) %>%
  fit(y ~ x, data = df)
#> Warning: The following arguments cannot be manually modified and were
#> removed: weights
#> parsnip model object
#> 
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)            x  
#>       0.776        1.992

Created on 2019-01-25 by the reprex package (v0.2.1.9000)

but I forgot we "protect" the weights arg of lm(). @max are we planning on adding a weighted lm as a separate model_spec? Is that why weights is protected?

@alexpghayes
Copy link

@DavisVaughan
Copy link
Member

Ah, the readme lies!

@topepo
Copy link
Member

topepo commented Jan 25, 2019

We will support case weights (soon) but it doesn't quite work yet.

@ryankarel
Copy link

Hello, is there any update on this issue? Is it still the plan to include case weights?

@konradsemsch
Copy link

Up. Was looking for the same recently :)

@JakeRuss
Copy link

JakeRuss commented Feb 7, 2020

@topepo @DavisVaughan Would either of you mind providing an update here about current status? I perused through commits looking for any movement here, but did not spot it. Was hoping to supply case.weights to ranger but ran into this warning: Warning: The following arguments cannot be manually modified and were removed: case.weights

@beaulucas
Copy link

Running into this issue with logistic regression as well. I tried to set it during set_engine() call prior to fit().

@topepo
Copy link
Member

topepo commented Mar 16, 2020

We are focused on documentation for the next 1-3 months. After that, we'll bump up the priority on this.

@lepennec
Copy link

I would be really interested to help with this weight feature. Please let me known if you are interested.

@juliasilge juliasilge added the feature a feature request or enhancement label Apr 3, 2020
@Chris-Engelhardt
Copy link

+1 for implementing case weights.

@JakeRuss
Copy link

Is there a technical challenge that needs to be hurdled to allow weighting in various models? I would have interest in submitting a PR to close this issue, but I'm not sure I understand what @DavisVaughan is referring to by weights being a protected argument. This singular issue is preventing me from deploying a tidymodels workflow to our production environment.

@topepo
Copy link
Member

topepo commented Sep 20, 2020

It's not really a singular issue. There are a several things that need to get done and some are not straightforward.

For parsnip, we need to determine the best api to pass the case weight argument to fit() and fit_xy() functions and then to the appropriate plumbing to pass them to the underlying model fit function. Also, the models/engines that take case weights need to have their model definitions adjusted (along with all of the parsnip-adjacent packages).

For workflows, more api decisions have to be made about where the case weights are added. I suspect that the recipe and perhaps the new add_variables() preprocessors will be the only way to do things to ensure that the case weights are appropriately carried around during resampling and passed to the fit functions. Alternatively, a workflows function called add_case_weights() might be the way to go (but I doubt it).

For tune, more api work and the required plumbing is needed.

I'm not sure I understand what @DavisVaughan is referring to by weights being a protected argument.

In parsnip, we have a means of stopping people from passing in certain arguments using a field called protect.

@JakeRuss
Copy link

JakeRuss commented Sep 21, 2020

@topepo I appreciate you laying that out for us; by singular issue, I just meant this one feature (weights) is blocking my personal progress with tidymodels. But, with all of the uncertainty/API decisions it doesn't sound like this is a place for me to contribute. I'll keep waiting patiently for you and the team to make those decisions. My personal status quo is fine for the time being.

@topepo
Copy link
Member

topepo commented Sep 25, 2020

Some other things that I've thought about since the last response (collected here only as a record)...

I see two different use cases here:

  1. Case weights are n for replicated covariate patterns. So a weight of 20 means that there were 20 rows with this same data pattern and we want to save memory by eliminating 19 redundant rows.

  2. Case weights signify how much importance each value should have. You might weight certain important data points higher for some reason. For example, in my last job, we would sometimes weight compounds inversely with their age (older data is less relevant). Fractional values make sense here.

We tend to think of case weights only in terms of the model but, depending on which of these use cases you are in, it could/should impact other computations, such as:

  • Data Splitting (case 1 only): should all of the replicate configurations be in the training or test set? Should bootstrapping or other resampling methods account for the case weights?

  • Preprocessing (both cases): Arguably, the weights should matter here too. PCA, centering, and scaling are three basic example where the preprocessing should respect these weights but their underlying functions (prcomp(), mean(), sd()) have no capacity for case weights.

  • Performance determination (case 1 only): If a row is a placeholder for X number of data points, it should have a higher weight in the metric calculations. Otherwise it is under-valued in the statistics.

caret, for example, only uses the weights in the model fit (mostly because I had not thought it through). I don't think that that is how we want to proceed but this has a much larger scope than just adding options to parsnip

@mpelath
Copy link

mpelath commented Dec 9, 2020

I would be perfectly happy with "just adding options to parsnip", since for my purposes it's not nearly as important for data splitting or preprocessing to respect weights, and I presume (?) I can roll my own weighted metric.

So is there a workaround for using weights (or offsets, for that matter) at all? Is it even possible through just writing a new engine, or does this require a more fundamental change to parsnip? I tried to write an engine for weighted logistic regression and failed in all kinds of ways, then gave up, but if I knew it were at least possible, I'd persevere.

In any case, +1 for implementing case weights and/or offsets.

@JakeRuss
Copy link

JakeRuss commented Dec 9, 2020

I'll add an example from my own work, where we run models using survey results and weights are needed for Max's Case 2, where we try to adjust the survey to align with our our target population.

@LeenSonneveld
Copy link

Do you have any idea when this feature (weights) will be implemented? (we want to switch from caret to tidymodels. but this is a blocking issue for us)

@topepo
Copy link
Member

topepo commented Jan 14, 2021

This is one of our top 3 priorities for this year. At least 6 months for all of the packages to be updated. But, for example, if you just want parsnip it will probably be sooner.

Just to reiterate, it is a fairly big change and I'm not sure that everyone understands how this affects things. I think that we are used to just using lm() and then not doing anything else. A simple function like that already uses the case weights in the performance metrics and so on. Just updating parsnip is not the same as getting people everything that they want (and that isn't obvious).

@DavZim
Copy link

DavZim commented Mar 9, 2021

We are in the same situation, that we would like to switch our modeling to tidymodels. But we use offsets and/or weights in almost all our models, therefore, this issue is (currently) a dealbraker for us.
I love the tidymodels packages and the effort you put into it, it really is some great work!

Having said that, is there anything we, as the community can do to help implementing weights (and offsets) into tidymodels? I may not have deep insights into the inner workings of the packages, but would be willing to help with documentation, examples, ideas, or code where possible.

@anadiedrichs
Copy link

I suppose this issue is related, not only to parameter weights in lm, but also to other algorithms such as:

Algorithm / method / package Argument
randomForest classwt
ranger case.weights , class.weights
keras class_weight , sample_weight

Is there any update on this issue? Is it still the plan to include sample / case weights?

+1 for implementing this issue.

@haydo1117
Copy link

haydo1117 commented Mar 19, 2021

In my work, the data comes with the time exposure (i.e. weights and offset), which enters the model as a separate parameter, while X takes matrix form (in the xgboost model setup). It seems the pre-processing step cannot separate the weights from the remaining predictors before the matrix conversion.
After a few trial of customised models, it seems the main issue is on the XY format for the modelling, which is too rigid, e.g. compared to the recipes format (I love the recipe logic :) ). To get around this issue, I created a whole set of model blueprint (say XYE) to take my new customised model, as well as another version of methods applied to the model blueprint (as specified in hardhat). While this works in my own case (after a few days of work), I would love to see an official update. I love the tidiness of the tidymodels.

In any case, +1 for implementing case weights and/or offsets.

@edreddick
Copy link

I really like the tidymodel philosophy. But, like others in this thread, I'm unable to utilize it day to day because of weights not being supported. I work in the insurance industry and much of our models use case weights.

While waiting for case weight support in parsnip, I developed a DYI solution that works for my purposes (GLMs and lightGBM). It uses some tidymodels packages (not parsnip) and tries to follow tidymodel and functional programming principles.

The objective of the code is to carry out nested cross validation to properly assess automatic tuning parameter protocol, without leakage.

For anyone interested, a demo of the script can be found here: https://github.com/edreddick/nested-cv

Looking forward to weight being supported by tidymodels, and thank you for building and maintaining tidymodels!

@ThomasWolf0701
Copy link

Is this just influencing case-weights or also class-weights ? While tidymodels seems to be great not being able to use class weights would make it pretty much unusable for every classification task I´ve ever worked with. Hope that´s not the case. For the boosting parsnip class weights are not even mentioned.

@AdrianCKent
Copy link

Is there any update on this feature? Like some of the commenters above, this is a must-have for my team (and anyone in the insurance business). The rest of the 'verse is excellent, so it would be great to know if there's a potential release date. Thanks!

@topepo
Copy link
Member

topepo commented Jul 8, 2021

This is the next big feature that we are working on. I'm hoping to start before the end of the year.

Just to reiterate, it is a fairly big change and I'm not sure that everyone understands how this affects things. I think that we are used to just using lm() and then not doing anything else. A simple function like that already uses the case weights in the performance metrics and so on. Just updating parsnip is not the same as getting people everything that they want (and that isn't obvious).

@George-dr
Copy link

Tidymodels is attempting to be a paradigm shifting suite for data science and machine learning tools. All of the buzz is about the "tidyverse". Academia--in the social sciences--is slow to adopt any of these tools because they do not readily provide the basics. Unfortunately, data weighted in a manner to make population statements is a basic in the social sciences. I have stalled my research agenda since January based on topepo's January 14th comment of "at least 6 months but if only wanted in parsnip sooner". Then, I see where others involved in the tidyverse project are still gauging interest in using weights to make population statements. I just have to wonder if this type of development is actually taking place or if "tidyverse" is only being using in the corporate world were these types of population statements are less important. I recognize this has to be done correctly for efficiency and accuracy. I, also, wonder what the preparation work was done before "tidyverse" was launched. Seems some simple focus groups with researchers would have uncovered this issue before the major push. I keep saying major push because many of the better developers have left there own creations to join in the development of tidyverse. My chief frustration lies in others in my field are publishing using weighted data to make population statements via tidyverse and other machine learning methods (e.g.,H2o), and claiming to be using weights. I cannot see where this is viable with tidyverse nor with H2o. Thus, I am wondering when or if this issue will be fixed? I need to get moving with my research agenda. Thank you.

@topepo
Copy link
Member

topepo commented Aug 19, 2021

Thus, I am wondering when or if this issue will be fixed?

It will be and we will release it when it is ready; it isn't vaporware. There are several public roadmaps/specifications of what has to be done for case weights. As you can see from the comments I've made above, it isn't just a single issue. AFAIK, not other R (or python) based modeling framework does a comprehensive job of handling case weights at every step of the process.

I'm sorry if the timeline is not conducive to your work. We are a handful of people working hard to create a lot of free modeling software. We are balancing many new features. In our developer survey from last year, only a single person indicated that they wanted us to prioritize case weights. Despite this, we have multiple people working on it now across multiple packages.

@George-dr
Copy link

I am glad to see that you are monitoring this thread as you have since January 2019 (your posts n=6 during this time). I only say this because this is a indication of the OUTCRY for this type of development in tidymodels. After reviewing the results of your developer survey--thank you for the thread, it is unclear to me as to the industries the 300 respondents represented. Perhaps, the respondents are not truly representative of the population of actual users. This thread seems to demonstrate a larger need for this type of incorporation of case weights. Oh before I close, I was only quoting YOUR timeline. Of course, it will be ready whenever you all say that it will be ready. I appreciate you and your work in this area.

@jkafka
Copy link

jkafka commented Aug 25, 2021

Thank you so much for working to integrate case weights in Tidymodels. Just echoing how helpful that would be. I'm a public health researcher and using weights is critical to allow us to make population-level inferences. I mostly work with categorical data and having support for weights in applying algorithms like recursive partitioning trees and ranger would be really helpful. Appreciate you all making this a priority!

@topepo
Copy link
Member

topepo commented May 5, 2022

Please see https://www.tidyverse.org/blog/2022/05/case-weights/ and add comments there.

I'll lock this thread now.

@topepo topepo closed this as completed May 5, 2022
@tidymodels tidymodels locked as resolved and limited conversation to collaborators May 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.