No support for weights #136

pakuipers · 2019-01-25T20:42:52Z

The weights argument in fit does nothing. Most modeling packages allow for case weights, so is this going to be supported in the future?

alexpghayes · 2019-01-25T20:54:34Z

I'm guessing it's getting silently ignored. This can be surprising for sure. @DavisVaughan What about using ellipsis to warn when this happens?

DavisVaughan · 2019-01-25T21:17:11Z

It's definitely getting silently ignored. ellipsis isn't on CRAN and I don't know of any plans to get it there. It's more for misspellings (when ... are ok, but the user messed up). In this case, we don't allow any args through to the ... of fit(), so we could probably just use a variation of parsnip:::check_empty_ellipse to ensure that there are no dots.
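A minimal sketch of the kind of dots check described above, using base R only. The names `warn_if_dots()` and `fit_sketch()` are hypothetical and are not parsnip's internal `check_empty_ellipse()`; the point is only that anything passed through `...` is reported instead of being dropped silently.

```r
# Hypothetical helper; this is NOT parsnip's internal check_empty_ellipse().
warn_if_dots <- function(...) {
  dots <- list(...)
  if (length(dots) == 0) {
    return(invisible(character(0)))
  }
  nms <- names(dots)
  if (is.null(nms)) {
    nms <- rep("", length(dots))
  }
  nms[nms == ""] <- "<unnamed>"
  warning(
    "These arguments are not used by fit() and were ignored: ",
    paste(nms, collapse = ", "),
    call. = FALSE
  )
  invisible(nms)
}

# Stand-in for a fit() method that accepts `...` but never uses it.
fit_sketch <- function(formula, data, ...) {
  warn_if_dots(...)
  stats::lm(formula, data = data)
}

d <- data.frame(
  x = 1:10,
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2)
)

# Capture the warning text instead of letting it print.
msg <- tryCatch(
  fit_sketch(y ~ x, data = d, weights = 1:10),
  warning = function(w) conditionMessage(w)
)
print(msg)
```

With a check like this, the call in the original report would surface the ignored `weights` argument rather than fitting an unweighted model without comment.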

Regarding the actual problem of adding weights, I was going to suggest the use of a data descriptor like so:

x <- rnorm(10)
df <- data.frame(y = 1 + 2 * x + rnorm(10) / 2, x = x, wght1 = 1:10)

library(parsnip)

linear_reg() %>%
  set_engine("lm", weights = .dat()$wght1) %>%
  fit(y ~ x, data = df)
#> Warning: The following arguments cannot be manually modified and were
#> removed: weights
#> parsnip model object
#> 
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)            x  
#>       0.776        1.992

Created on 2019-01-25 by the reprex package (v0.2.1.9000)

but I forgot we "protect" the weights arg of lm(). @max are we planning on adding a weighted lm as a separate model_spec? Is that why weights is protected?

alexpghayes · 2019-01-25T21:19:10Z

Ellipsis is on CRAN! https://cran.r-project.org/web/packages/ellipsis/index.html

DavisVaughan · 2019-01-25T21:19:45Z

Ah, the readme lies!

topepo · 2019-01-25T23:50:10Z

We will support case weights (soon) but it doesn't quite work yet.

ryankarel · 2019-09-17T15:44:02Z

Hello, is there any update on this issue? Is it still the plan to include case weights?

konradsemsch · 2019-10-11T08:30:43Z

Up. Was looking for the same recently :)

JakeRuss · 2020-02-07T17:04:14Z

@topepo @DavisVaughan Would either of you mind providing an update here about current status? I perused through commits looking for any movement here, but did not spot it. Was hoping to supply case.weights to ranger but ran into this warning: Warning: The following arguments cannot be manually modified and were removed: case.weights

beaulucas · 2020-03-09T20:49:49Z

Running into this issue with logistic regression as well. I tried to set it in the set_engine() call prior to fit().

topepo · 2020-03-16T16:48:48Z

We are focused on documentation for the next 1-3 months. After that, we'll bump up the priority on this.

lepennec · 2020-03-23T13:45:56Z

I would be really interested in helping with this weights feature. Please let me know if you are interested.

Chris-Engelhardt · 2020-08-06T17:36:01Z

+1 for implementing case weights.

JakeRuss · 2020-09-16T17:54:11Z

Is there a technical challenge that needs to be hurdled to allow weighting in various models? I would have interest in submitting a PR to close this issue, but I'm not sure I understand what @DavisVaughan is referring to by weights being a protected argument. This singular issue is preventing me from deploying a tidymodels workflow to our production environment.

topepo · 2020-09-20T22:38:48Z

It's not really a singular issue. There are several things that need to get done and some are not straightforward.

For parsnip, we need to determine the best API to pass the case weight argument to the fit() and fit_xy() functions and then add the appropriate plumbing to pass them to the underlying model fit function. Also, the models/engines that take case weights need to have their model definitions adjusted (along with all of the parsnip-adjacent packages).

For workflows, more api decisions have to be made about where the case weights are added. I suspect that the recipe and perhaps the new add_variables() preprocessors will be the only way to do things to ensure that the case weights are appropriately carried around during resampling and passed to the fit functions. Alternatively, a workflows function called add_case_weights() might be the way to go (but I doubt it).

For tune, more api work and the required plumbing is needed.

> I'm not sure I understand what @DavisVaughan is referring to by weights being a protected argument.

In parsnip, we have a means of stopping people from passing in certain arguments using a field called protect.
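As a rough illustration of what a `protect` list does, the sketch below filters user-supplied engine arguments against a set of protected names and reuses the warning text quoted earlier in this thread. The function `drop_protected()` and the contents of `lm_protect` are assumptions made for the example; parsnip's real internals may differ.

```r
# Hypothetical sketch of a "protect" list; not parsnip's actual implementation.
drop_protected <- function(engine_args, protect) {
  blocked <- intersect(names(engine_args), protect)
  if (length(blocked) > 0) {
    warning(
      "The following arguments cannot be manually modified and were removed: ",
      paste(blocked, collapse = ", "),
      call. = FALSE
    )
  }
  # Keep only the arguments that are not protected.
  engine_args[setdiff(names(engine_args), protect)]
}

lm_protect <- c("formula", "data", "weights")  # assumed list, for illustration
user_args <- list(weights = 1:10, model = FALSE)

kept <- withCallingHandlers(
  drop_protected(user_args, lm_protect),
  warning = function(w) {
    message("warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
names(kept)
```

This is why `set_engine("lm", weights = ...)` ends up calling `stats::lm()` without weights: the argument is removed before the engine call is built.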

JakeRuss · 2020-09-21T18:00:21Z

@topepo I appreciate you laying that out for us; by singular issue, I just meant this one feature (weights) is blocking my personal progress with tidymodels. But, with all of the uncertainty/API decisions it doesn't sound like this is a place for me to contribute. I'll keep waiting patiently for you and the team to make those decisions. My personal status quo is fine for the time being.

topepo · 2020-09-25T01:35:52Z

Some other things that I've thought about since the last response (collected here only as a record)...

I see two different use cases here:

1. Case weights are n for replicated covariate patterns. So a weight of 20 means that there were 20 rows with this same data pattern and we want to save memory by eliminating 19 redundant rows.
2. Case weights signify how much importance each value should have. You might weight certain important data points higher for some reason. For example, in my last job, we would sometimes weight compounds inversely with their age (older data is less relevant). Fractional values make sense here.
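The first use case can be checked directly in base R. The sketch below uses made-up data: it fits `lm()` once on a compact data frame with integer weights and once on the same data with the rows physically replicated, then compares the coefficients.

```r
# Compact data: each row stands for `n` identical observations (made-up values).
compact <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(1.1, 1.9, 3.2, 3.9),
  n = c(3, 1, 2, 4)
)

# Expanded data: replicate each row n times.
expanded <- compact[rep(seq_len(nrow(compact)), times = compact$n), c("x", "y")]

fit_weighted <- lm(y ~ x, data = compact, weights = n)
fit_expanded <- lm(y ~ x, data = expanded)

# The point estimates agree ...
all.equal(coef(fit_weighted), coef(fit_expanded))

# ... but the residual degrees of freedom do not.
c(weighted = df.residual(fit_weighted), expanded = df.residual(fit_expanded))
```

The coefficients match, but `lm()` counts rows rather than summed weights when computing residual degrees of freedom, so standard errors differ between the two fits. That gap is one reason the two use cases need to be distinguished rather than handled by a single generic `weights` argument.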

We tend to think of case weights only in terms of the model but, depending on which of these use cases you are in, it could/should impact other computations, such as:

- Data splitting (case 1 only): should all of the replicate configurations be in the training or test set? Should bootstrapping or other resampling methods account for the case weights?
- Preprocessing (both cases): Arguably, the weights should matter here too. PCA, centering, and scaling are three basic examples where the preprocessing should respect these weights but their underlying functions (prcomp(), mean(), sd()) have no capacity for case weights.
- Performance determination (case 1 only): If a row is a placeholder for X number of data points, it should have a higher weight in the metric calculations. Otherwise it is under-valued in the statistics.
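To make the preprocessing and performance points concrete, here is a small base R sketch with made-up numbers. It assumes frequency weights (case 1), where dividing by `sum(w) - 1` reproduces the sample standard deviation of the expanded data; that denominator would not be appropriate for fractional importance weights. The `weighted_rmse()` function is a hypothetical helper, not part of any tidymodels package.

```r
x <- c(2, 4, 6, 8)
w <- c(3, 1, 2, 4)  # frequency weights: row i stands for w[i] observations

# Weighted centering and scaling.
w_mean <- stats::weighted.mean(x, w)
w_sd <- sqrt(sum(w * (x - w_mean)^2) / (sum(w) - 1))
x_scaled <- (x - w_mean) / w_sd

# Same statistics computed on the physically expanded data.
x_full <- rep(x, times = w)
all.equal(c(w_mean, w_sd), c(mean(x_full), sd(x_full)))

# Hypothetical weighted RMSE: each row counts w[i] times in the average.
weighted_rmse <- function(truth, estimate, w) {
  sqrt(sum(w * (truth - estimate)^2) / sum(w))
}

truth <- c(1.0, 2.0, 3.0, 4.0)
estimate <- c(1.5, 1.5, 3.5, 3.0)
weighted_rmse(truth, estimate, w)
```

An unweighted `mean(x)` or an unweighted RMSE on the compact rows would treat a row that stands for four observations the same as a row that stands for one.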

caret, for example, only uses the weights in the model fit (mostly because I had not thought it through). I don't think that is how we want to proceed, but this has a much larger scope than just adding options to parsnip.

mpelath · 2020-12-09T17:59:01Z

I would be perfectly happy with "just adding options to parsnip", since for my purposes it's not nearly as important for data splitting or preprocessing to respect weights, and I presume (?) I can roll my own weighted metric.

So is there a workaround for using weights (or offsets, for that matter) at all? Is it even possible through just writing a new engine, or does this require a more fundamental change to parsnip? I tried to write an engine for weighted logistic regression and failed in all kinds of ways, then gave up, but if I knew it were at least possible, I'd persevere.

In any case, +1 for implementing case weights and/or offsets.

JakeRuss · 2020-12-09T18:12:49Z

I'll add an example from my own work, where we run models using survey results and weights are needed for Max's Case 2, where we try to adjust the survey to align with our target population.

LeenSonneveld · 2021-01-14T14:29:23Z

Do you have any idea when this feature (weights) will be implemented? We want to switch from caret to tidymodels, but this is a blocking issue for us.

topepo · 2021-01-14T20:01:05Z

This is one of our top 3 priorities for this year. At least 6 months for all of the packages to be updated. But, for example, if you just want parsnip it will probably be sooner.

Just to reiterate, it is a fairly big change and I'm not sure that everyone understands how this affects things. I think that we are used to just using lm() and then not doing anything else. A simple function like that already uses the case weights in the performance metrics and so on. Just updating parsnip is not the same as getting people everything that they want (and that isn't obvious).

DavZim · 2021-03-09T08:03:50Z

We are in the same situation: we would like to switch our modeling to tidymodels, but we use offsets and/or weights in almost all our models, so this issue is (currently) a dealbreaker for us.
I love the tidymodels packages and the effort you put into it, it really is some great work!

Having said that, is there anything we, as the community can do to help implementing weights (and offsets) into tidymodels? I may not have deep insights into the inner workings of the packages, but would be willing to help with documentation, examples, ideas, or code where possible.

anadiedrichs · 2021-03-12T18:25:45Z

I suppose this issue is related, not only to parameter weights in lm, but also to other algorithms such as:

| Algorithm / method / package | Argument |
| --- | --- |
| randomForest | classwt |
| ranger | case.weights, class.weights |
| keras | class_weight, sample_weight |

Is there any update on this issue? Is it still the plan to include sample / case weights?

+1 for implementing this issue.

haydo1117 · 2021-03-19T07:37:05Z

In my work, the data comes with the time exposure (i.e. weights and offset), which enters the model as a separate parameter, while X takes matrix form (in the xgboost model setup). It seems the pre-processing step cannot separate the weights from the remaining predictors before the matrix conversion.
After a few trials with customised models, it seems the main issue is the XY format for the modelling, which is too rigid, e.g. compared to the recipes format (I love the recipe logic :) ). To get around this issue, I created a whole set of model blueprints (say XYE) to take my new customised model, as well as another version of the methods applied to the model blueprint (as specified in hardhat). While this works in my own case (after a few days of work), I would love to see an official update. I love the tidiness of the tidymodels.
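A rough base R sketch of the "XYE" idea, with simulated data: the exposure column is kept out of the predictor matrix and handed to the fitting function separately as an offset. It uses `stats::glm.fit()` with a Poisson family as a stand-in for the xgboost setup described above, and it is not the hardhat blueprint from this comment. The variable names and simulated values are assumptions for illustration.

```r
set.seed(42)
n <- 200
dat <- data.frame(
  x = rnorm(n),
  exposure = runif(n, 0.5, 2)
)
# Simulated claim counts whose rate is proportional to exposure.
dat$claims <- rpois(n, lambda = dat$exposure * exp(-1 + 0.3 * dat$x))

# "XYE" split: predictor matrix, outcome, and exposure kept separate.
X <- model.matrix(~ x, data = dat)  # exposure is deliberately not a column
y <- dat$claims
E <- dat$exposure

# Matrix interface: the exposure enters as an offset, not as a predictor.
fit_xye <- glm.fit(x = X, y = y, offset = log(E), family = poisson())

# Formula interface for comparison.
fit_formula <- glm(
  claims ~ x + offset(log(exposure)),
  family = poisson(),
  data = dat
)

all.equal(fit_xye$coefficients, coef(fit_formula))
```

The matrix-based fit only matches the formula-based one because the exposure travels alongside X and y as a third piece, which is the part a plain XY interface has no slot for.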

In any case, +1 for implementing case weights and/or offsets.

edreddick · 2021-03-25T17:59:26Z

I really like the tidymodels philosophy. But, like others in this thread, I'm unable to use it day to day because weights are not supported. I work in the insurance industry and many of our models use case weights.

While waiting for case weight support in parsnip, I developed a DIY solution that works for my purposes (GLMs and lightGBM). It uses some tidymodels packages (not parsnip) and tries to follow tidymodels and functional programming principles.

The objective of the code is to carry out nested cross validation to properly assess automatic tuning parameter protocol, without leakage.

For anyone interested, a demo of the script can be found here: https://github.com/edreddick/nested-cv

Looking forward to weight being supported by tidymodels, and thank you for building and maintaining tidymodels!

ThomasWolf0701 · 2021-04-03T19:20:28Z

Does this affect only case weights, or class weights as well? While tidymodels seems to be great, not being able to use class weights would make it pretty much unusable for every classification task I've ever worked on. I hope that's not the case. For the boosting models in parsnip, class weights are not even mentioned.

AdrianCKent · 2021-07-06T08:11:56Z

Is there any update on this feature? Like some of the commenters above, this is a must-have for my team (and anyone in the insurance business). The rest of the 'verse is excellent, so it would be great to know if there's a potential release date. Thanks!

topepo · 2021-07-08T14:33:31Z

This is the next big feature that we are working on. I'm hoping to start before the end of the year.

Just to reiterate, it is a fairly big change and I'm not sure that everyone understands how this affects things. I think that we are used to just using lm() and then not doing anything else. A simple function like that already uses the case weights in the performance metrics and so on. Just updating parsnip is not the same as getting people everything that they want (and that isn't obvious).

George-dr · 2021-08-19T15:46:33Z

Tidymodels is attempting to be a paradigm-shifting suite of data science and machine learning tools. All of the buzz is about the "tidyverse". Academia, at least in the social sciences, is slow to adopt any of these tools because they do not readily provide the basics. Unfortunately, weighting data in order to make population statements is a basic in the social sciences. I have stalled my research agenda since January based on topepo's January 14th comment of "at least 6 months but if only wanted in parsnip sooner". Then, I see that others involved in the tidyverse project are still gauging interest in using weights to make population statements. I just have to wonder if this type of development is actually taking place or if "tidyverse" is only being used in the corporate world, where these types of population statements are less important. I recognize this has to be done correctly for efficiency and accuracy. I also wonder what preparation work was done before "tidyverse" was launched. It seems some simple focus groups with researchers would have uncovered this issue before the major push. I keep saying major push because many of the better developers have left their own creations to join in the development of tidyverse. My chief frustration is that others in my field are publishing using weighted data to make population statements via tidyverse and other machine learning methods (e.g., H2O), and claiming to be using weights. I cannot see how this is viable with tidyverse or with H2O. Thus, I am wondering when or if this issue will be fixed? I need to get moving with my research agenda. Thank you.

topepo · 2021-08-19T16:50:41Z

> Thus, I am wondering when or if this issue will be fixed?

It will be and we will release it when it is ready; it isn't vaporware. There are several public roadmaps/specifications of what has to be done for case weights. As you can see from the comments I've made above, it isn't just a single issue. AFAIK, no other R (or Python) based modeling framework does a comprehensive job of handling case weights at every step of the process.

I'm sorry if the timeline is not conducive to your work. We are a handful of people working hard to create a lot of free modeling software. We are balancing many new features. In our developer survey from last year, only a single person indicated that they wanted us to prioritize case weights. Despite this, we have multiple people working on it now across multiple packages.

George-dr · 2021-08-19T17:45:56Z

I am glad to see that you are monitoring this thread, as you have since January 2019 (your posts n=6 during this time). I only say this because this is an indication of the OUTCRY for this type of development in tidymodels. After reviewing the results of your developer survey (thank you for the thread), it is unclear to me which industries the 300 respondents represented. Perhaps the respondents are not truly representative of the population of actual users. This thread seems to demonstrate a larger need for this type of incorporation of case weights. Oh, before I close, I was only quoting YOUR timeline. Of course, it will be ready whenever you all say that it will be ready. I appreciate you and your work in this area.

jkafka · 2021-08-25T23:23:31Z

Thank you so much for working to integrate case weights in Tidymodels. Just echoing how helpful that would be. I'm a public health researcher and using weights is critical to allow us to make population-level inferences. I mostly work with categorical data and having support for weights in applying algorithms like recursive partitioning trees and ranger would be really helpful. Appreciate you all making this a priority!

topepo · 2022-05-05T14:25:45Z

Please see https://www.tidyverse.org/blog/2022/05/case-weights/ and add comments there.

I'll lock this thread now.

juliasilge added the "feature" label (a feature request or enhancement) on Apr 3, 2020

juliasilge mentioned this issue May 27, 2021

Tuning random forest hyperparameters with #TidyTuesday trees data | Julia Silge juliasilge/juliasilge.com#6

Open

felxcon mentioned this issue Mar 11, 2022

Error using tune_grid with sample_size for lightgbm curso-r/treesnip#60

Open

topepo mentioned this issue Mar 25, 2022

case weights #692

Merged

topepo closed this as completed May 5, 2022

tidymodels locked as resolved and limited conversation to collaborators May 5, 2022
