Added regression syntax design doc #27

bbbales2 · 2020-09-26T23:08:14Z

I've been thinking about some sort of regression syntax in Stan. It's kind of a follow up to Speeding up hierarchical models. When I say regression syntax I'm talking about the interfaces (brms/rstanarm/lme4/glmer/lm/...) that take in something like the R regression syntax (y ~ 1 + x + (1 | group) + ...) and do something with it.

It's probably something that has come up before, but there's just so much there it's hard to talk about it without writing a lot down. So I tried to write down the bare minimum justification and implementation of a regression syntax in Stan.

Stan is written to do regressions and more, so at face value this doesn't seem that important but I'm trying to hit three points:

Make it possible to avoid design matrix libraries outside of Stan
Package up big computation to send it off to a GPU/accelerator
Build design matrices if you need them

Now that I got that written out I have some questions hoooray:

@jgabry @paul-buerkner @bgoodri does avoiding external design matrices buy us anything from an interface-writing perspective? I thought it would when I wrote this, but I've never actually written a high level regression interface and so I have my doubts.
@SteveBronder @t4c1 @rok-cesnovar is this syntax enough to pack things up into big, GPU-friendly calculations?
@dpsimpson is it actually useful to build out design matrices in the way I've written here? I remember the sparse stuff you talked about we do linear algebra on them, but maybe this syntax isn't enough to build the matrices you need and you're left having to pass them in from the outside anyway.

Rendered markdown is here

seantalts · 2020-09-27T19:45:14Z

Tagging @enetsee who is also interested in this topic!

dpsimpson · 2020-09-28T03:57:14Z

Short version. If we stick to just lme4 stuff initially, there are two cases; - Don't marginalize out anything, in which case probably dense is enough - Marginalize out stuff, in which case we need to be able to do sparse matrix stuff (like in the lme4 paper for the Gaussian case, or similarly with Laplace approximations). This is an extremely cool idea!! and it would be very nice to our Python users (but also to our R users, who currently translate from brms and then modify, which is messy). And obviously we can add cool things like GPs (dense) and spatial models (often sparse), and smoothing splines (often sparse), and time series models (sparse). It would be ideal if we could keep as much of the brms notation as possible for these complex models...

…

On Sat, 26 Sep 2020 at 19:08, Ben Bales ***@***.***> wrote: I've been thinking about some sort of regression syntax in Stan. It's kind of a follow up to Speeding up hierarchical models <https://discourse.mc-stan.org/t/speeding-up-hierarchical-models/14783>. When I say regression syntax I'm talking about the interfaces (brms/rstanarm/lme4/glmer/lm/...) that take in something like the R regression syntax (y ~ 1 + x + (1 | group) + ...) and do something with it. It's probably something that has come up before, but there's just so much there it's hard to talk about it without writing a lot down. So I tried to write down the bare minimum justification and implementation of a regression syntax in Stan. Stan is written to do regressions and more, so at face value this doesn't seem that important but I'm trying to hit three points: 1. Make it possible to avoid design matrix libraries outside of Stan 2. Package up big computation to send it off to a GPU/accelerator 3. Build design matrices if you need them Now that I got that written out I have some questions hoooray: 1. @jgabry <https://github.com/jgabry> @paul-buerkner <https://github.com/paul-buerkner> @bgoodri <https://github.com/bgoodri> does avoiding external design matrices buy us anything from an interface-writing perspective? I thought it would when I wrote this, but I've never actually written a high level regression interface and so I have my doubts. 2. @SteveBronder <https://github.com/SteveBronder> @t4c1 <https://github.com/t4c1> @rok-cesnovar <https://github.com/rok-cesnovar> is this syntax enough to pack things up into big, GPU-friendly calculations? 3. @dpsimpson <https://github.com/dpsimpson> is it actually useful to build out design matrices in the way I've written here? I remember the sparse stuff you talked about we do linear algebra on them, but maybe this syntax isn't enough to build the matrices you need and you're left having to pass them in from the outside anyway. Rendered markdown is here <https://github.com/bbbales2/design-docs/blob/feature/0026-regression-syntax/designs/0026-stan-regression-syntax.md> ------------------------------ You can view, comment on, or merge this pull request online at: #27 Commit Summary - Added regression syntax design doc File Changes - *A* designs/0026-stan-regression-syntax.md <https://github.com/stan-dev/design-docs/pull/27/files#diff-efdbb699c90c4837f62d2dbb93124ad2> (256) Patch Links: - https://github.com/stan-dev/design-docs/pull/27.patch - https://github.com/stan-dev/design-docs/pull/27.diff — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#27>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADRICBWL4WYYLHDYG2VQFRDSHZX6VANCNFSM4R3FTBEA> .

t4c1 · 2020-09-28T09:23:07Z

So this is just new syntax for something that is already possible? While I am more familiar with the existing syntax I guess that is fine if that syntax is closer to majority of Stan users.

I have to say I do not see what is the point of the design matrix stuff. To me it just looks like forcing sparse algebra, which will probably be slower that dense version. Are there any advantages in terms of usability?

Anyway, the dense should work fine with GPUs. With kernel generator it won't be hard to implement - we can just write something very similar to the vectorized form of the model. Although the CPU implementation will be hard to optimize, due to the version of Eigen we are using not supporting indexing with vectors. On the other hand we do not have any GPU implementations for sparse operations yet.

dpsimpson · 2020-09-28T13:20:54Z

From my point of view, this isn’t a speed thing. It’s shifting something that is currently done separately for each interface (ie a thing that is only available in RStan) into the language so that there is consistent access to it. It also lets us build a brms-like higher lever interface for all of the versions of Stan, which should help users immensely. A thing that is annoying is that that means the new syntax will need to consistently support factors, which is the thing that model.matrix in R does really well.

…

On Mon, Sep 28, 2020 at 05:23 Tadej Ciglarič ***@***.***> wrote: So this is just new syntax for something that is already possible? While I am more familiar with the existing syntax I guess that is fine if that syntax is closer to majority of Stan users. I have to say I do not see what is the point of the design matrix stuff. To me it just looks like forcing sparse algebra, which will probably be slower that dense version. Are there any advantages in terms of usability? Anyway, the dense should work fine with GPUs. With kernel generator it won't be hard to implement - we can just write something very similar to the vectorized form of the model. Although the CPU implementation will be hard to optimize, due to the version of Eigen we are using not supporting indexing with vectors. On the other hand we do not have any GPU implementations for sparse operations yet. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#27 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADRICBWCVHAGSZPTNB5OVSDSIBIYVANCNFSM4R3FTBEA> .

enetsee · 2020-09-28T13:35:18Z

@seantalts - thanks for the nudge!

@bbbales2, as Sean mentioned, I'm very interested in this topic and is a use case I had in mind when we were looking at the structure of the compiler. Rather than adding this as new syntax to Stan the language, the idea was to have an alternative frontend just for regression models which elaborated to core Stan or the intermediate representation we use in the compiler (i.e. we have an additional step where we recover anything that was left implicit in the terser language of regression).

There is some really interesting work by Andy Gordon on a 'regression calculus' here (which is just a fancy way of describing a core language for regression along with its meaning) which I was keen on extending to add BRMS like functionality. I spoke with Andy Gordon about this briefly at StanCon in Cambridge but I haven't had too much time to follow up of late.

There was also a presentation at StanCon of another front end language, Chronikis which was targeted at time series models and again, elaborated to Stan.

In terms of motivation, I think we have to recognize that syntax is important: it facilitates the expression of recurring patterns and allows us to restrict expressiveness. With specialized front-end we can also introduce stricter semantics allowing us to pick up errors more easily. The very fact that BRMS has a large user base is evidence of the utility of this approach.

I would be very keen to get involved with this from a language design perspective.

paul-buerkner · 2020-09-29T06:23:39Z

I agree this is a very cool idea, and I am happy to provide my expertise with brms in making this possible. Essentially, I do see that a little bit like (a subset of) brms but directly in Stan with potentially much more opportunities to optimize the code on a lower level (brms does try to write optimized Stan code but of course has only access to the top-level of the Stan language itself). Additional advantages would be that we would get methods such as posterior_predict and log_lik (using our R names right now) for these models directly in Stan, which is otherwise a little ugly to do in the generated quantities block.

There are obvious challenges of course but also a few perhaps not as obvious ones, for example, how to deal with predicting new data (which is a quite relevant aspect in brms users' workflows) or how to handle predictions for new levels of the groupings factors. Ideally, we could even achieve some sort of marginalization not only for estimation but also for predictions (e.g., as discussed in paul-buerkner/brms#489). Some of these features may go to far for the Stan regression syntax but would definitely improve its usability a lot.

SteveBronder

Couple little comments. My main Qs are mostly around how we would use this syntax to optimize things on the backend. @t4c1 if you check out the lmer paper they go into details on why the sparse matrix representation can be more efficient for large numbers of random effects. Though I do wonder if there is a way we can keep this dense while being efficient

https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf

SteveBronder · 2020-09-29T17:03:08Z

designs/0026-stan-regression-syntax.md

+
+A basic regression might be:
+```
+y ~ 1 + x + (1 | group) + (x | group)


For the 1 in the random effects part what would the generated name be? Just like [group name]_intercept?

It would depend on the package, but I'd expect something like that (and for it to be a vector with the number of elements of the group).

SteveBronder · 2020-09-29T17:12:23Z

designs/0026-stan-regression-syntax.md

+
+So if the following returns a length `N` vector:
+```
+covariate .* (parameters | group)


Is this always going to be a .*? If so I think this would be easier to work with in the compiler and C++ code if we directly put the covariates in the () like

(covariate, parameters | group)

Yes, that seems good. I am nervous the .* there was actually more specialized than a .* in the wild might be, if that makes sense. The Fabular thing that @enetsee linked above does something like:

(covariate {parameters} | group)

Not opposed to something else either.

SteveBronder · 2020-09-29T17:46:03Z

designs/0026-stan-regression-syntax.md

+vector[N] mu = intercept + slope * x +
+    (group1_intercepts | group1_idx) + x .* (group1_slopes | group1_idx) +
+    (group12_intercepts | group1_idx, group2_idx) +
+    x .* (group12_slopes | group1_idx, group2_idx);


Looking at this I'm not really sure exactly how we'd optimize this on the backend. I kind of like something similar to your original proposal for a hiearchy() function for something like

vector[N] mu = intercept + slope * x + hierarchy(f, group1_idx, group2_idx, group12_intercepts, x, group12_slopes)

Where f is a function with a signature like vector f(real intercept, matrix x, vector slope) that would be applied to each group (in this case the group is the combo of groups 1 and 2).

I wonder if we could do anything useful if we have a group_index type in the stan language? Or could we give matrices an index? Maybe something like

group_index[N] group_idx1; group_index[N] group_idx2; // Anything after first two dims for a matrix are indices? matrix[N, M, group_idx1, group_idx2] X; // Or maybe a tag? matrix[N, M] X index(group_idx1, group_idx2);

Though I think the issue with what I have here is that we would have to pay for a sort / unsort if the data was not presorted so we could iterate over the groups nicely

I'm not really sure exactly how we'd optimize this on the backend

I imagined this would be something recognizable by the compiler. Like, "oh, I have a bunch of design matrices, better lump them together and turn them into a big hierarchy statement".

But also I think @t4c1 likes expressions and I could see that working too, so it kinda all comes together at runtime. As long as it gets into a big hierarchy-like statement it's good.

group_index type in the stan language
we would have to pay for a sort / unsort if the data

Is the goal here to express a sparseness or to express the ordering of the matrix to make accesses efficient in some way?

andrewgelman · 2020-09-30T02:22:11Z

Interesting! I'd love to be involved in some of this discussion. Also could someone tag Bob Carpenter (I'm not sure how to do that), as he and Matt Hoffman and I were talking a lot about these issues back in 2011. Also Jonah/BenG/Aki, because this stuff would be directly relevant to our forthcoming book on applied regression and multilevel models. Andrew

…

On Sep 29, 2020, at 2:23 AM, Paul-Christian Bürkner ***@***.***> wrote: I agree this is a very cool idea, and I am happy to provide my expertise with brms in making this possible. Essentially, I do see that a little bit like (a subset of) brms but directly in Stan with potentially much more opportunities to optimize the code on a lower level (brms does try to write optimized Stan code but of course has only access to the top-level of the Stan language itself). Additional advantages would be that we would get methods such as posterior_predict and log_lik (using our R names right now) for these models directly in Stan, which is otherwise a little ugly to do in the generated quantities block. There are obvious challenges of course but also a few perhaps not as obvious ones, for example, how to deal with predicting new data (which is a quite relevant aspect in brms users' workflows) or how to handle predictions for new levels of the groupings factors. Ideally, we could even achieve some sort of marginalization not only for estimation but also for predictions (e.g., as discussed in paul-buerkner/brms#489 <paul-buerkner/brms#489>). Some of these features may go to far for the Stan regression syntax but would definitely improve its usability a lot. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#27 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAZYCUOKMQKTJLFGLGKMPHTSIF4PVANCNFSM4R3FTBEA>.

SteveBronder · 2020-09-30T03:02:16Z

@bob-carpenter

bbbales2 · 2020-10-04T21:57:10Z

And obviously we can add cool things like GPs (dense) and spatial models
(often sparse), and smoothing splines (often sparse), and time series
models (sparse). It would be ideal if we could keep as much of the brms
notation as possible for these complex models...

@dpsimpson Yeah I like the y ~ gp(x) sorta thing that brms does.

Separately I would like some sort of Stan library system with a package manager. I think that's the only practical way we're going to get all the fancy design matrices and stuff that would need built for splines, approx GPs, etc in Stan models. Pumping that stuff through Stan math would be too slow development-wise. I think if we had a more formal/centralized way for people to share their code, we'd see a lot of interesting code getting shared (and avoid everyone having to reinvent things, digging in forums for code snippets, etc.).

So this is just new syntax for something that is already possible?
Are there any advantages in terms of usability?

@t4c1 Yeah, having the design matrices and being able to evaluate/use them efficiently is a plus. Part of the goal is usability and part of the goal is to make it easier for the compiler/runtime to recognize that a bunch of calculations are related. A little of both.

There is some really interesting work by Andy Gordon on a 'regression calculus'
another front end language, Chronikis

@enetsee The Fabular thing was interesting. Yes this is very much what I want to see in Stan. I like how they talk about breaking the regressions out into different levels. I also like the variables being explicit, but I'm probably not considering some advantage of them being implicit.

I had a look at Chronikis. I was kinda thinking about domain-specific Stan packages (alternative frontends like you're saying, and I talked to @imadmali about one). The problem with front end languages is I don't know compiler stuff and so the extent of compiling I want to do is Python format-strings sorta code generation.

The discourse thread at the top (this one) has sortof a brute-force giant function that we could use for evaluating the right hand side of a hierarchical regression. This would be something we could compile a higher level language to.

The downsides to this are it's not terribly easy to read the syntax, we don't get the design matrices themselves, and there's no hierarchy of regressions like in the regression calculus paper. I don't want to clutter the Stan language up with too much extra syntax, but I also wouldn't want to limit it to a super-conservative syntax either (cuz the intention is lots of people happily use Stan and the interface languages -- not filter everyone from Stan to interfaces).

There are obvious challenges of course but also a few perhaps not as obvious ones

@paul-buerkner Could we talk about the workflow stuff on this sometime? I am rather scared of forgetting something here. Maybe in a couple weeks after the 2.25 release (on the 19th)? @enetsee or @andrewgelman or anyone else if you want to come to this, that's great. I'll update the design doc with whatever we talk about.

paul-buerkner · 2020-10-05T09:00:08Z

Happy to have a chat about this. Just let me know here or via email. Am So., 4. Okt. 2020 um 23:57 Uhr schrieb Ben Bales < [email protected]>:

…

And obviously we can add cool things like GPs (dense) and spatial models (often sparse), and smoothing splines (often sparse), and time series models (sparse). It would be ideal if we could keep as much of the brms notation as possible for these complex models... @dpsimpson <https://github.com/dpsimpson> Yeah I like the y ~ gp(x) sorta thing that brms does. Separately I would like some sort of Stan library system with a package manager. I think that's the only practical way we're going to get all the fancy design matrices and stuff that would need built for splines, approx GPs, etc in Stan models. Pumping that stuff through Stan math would be too slow development-wise. I think if we had a more formal/centralized way for people to share their code, we'd see a lot of interesting code getting shared (and avoid everyone having to reinvent things, digging in forums for code snippets, etc.). So this is just new syntax for something that is already possible? Are there any advantages in terms of usability? @t4c1 <https://github.com/t4c1> Yeah, having the design matrices and being able to evaluate/use them efficiently is a plus. Part of the goal is usability and part of the goal is to make it easier for the compiler/runtime to recognize that a bunch of calculations are related. A little of both. There is some really interesting work by Andy Gordon on a 'regression calculus' another front end language, Chronikis @enetsee <https://github.com/enetsee> The Fabular thing was interesting. Yes this is very much what I want to see in Stan. I like how they talk about breaking the regressions out into different levels. I also like the variables being explicit, but I'm probably not considering some advantage of them being implicit. I had a look at Chronikis. I was kinda thinking about domain-specific Stan packages (alternative frontends like you're saying, and I talked to @imadmali <https://github.com/imadmali> about one). The problem with front end languages is I don't know compiler stuff and so the extent of compiling I want to do is Python format-strings sorta code generation. The discourse thread at the top (this one <https://discourse.mc-stan.org/t/speeding-up-hierarchical-models/14783>) has sortof a brute-force giant function that we could use for evaluating the right hand side of a hierarchical regression. This would be something we could compile a higher level language to. The downsides to this are it's not terribly easy to read the syntax, we don't get the design matrices themselves, and there's no hierarchy of regressions like in the regression calculus paper. I don't want to clutter the Stan language up with too much extra syntax, but I also wouldn't want to limit it to a super-conservative syntax either (cuz the intention is lots of people happily use Stan and the interface languages -- not filter everyone from Stan to interfaces). There are obvious challenges of course but also a few perhaps not as obvious ones @paul-buerkner <https://github.com/paul-buerkner> Could we talk about the workflow stuff on this sometime? I am rather scared of forgetting something here. Maybe in a couple weeks after the 2.25 release (on the 19th)? @enetsee <https://github.com/enetsee> or @andrewgelman <https://github.com/andrewgelman> or anyone else if you want to come to this, that's great. I'll update the design doc with whatever we talk about. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#27 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADCW2AG5DHNRNSUB6PU4PBLSJDVUHANCNFSM4R3FTBEA> .

bbbales2 · 2020-10-26T12:11:56Z

@paul-buerkner @enetsee @andrewgelman are you free any time in the next couple weeks to talk about the not-so-obvious challenges here?

Edit: I'm available basically anytime. If there's no obvious other preferable time, before or after the Stan meeting Thursday is good.

paul-buerkner · 2020-10-26T12:39:57Z

I would be flexible on most days as well. Feel free to suggest a time.

bbbales2 · 2020-10-26T12:47:43Z

@paul-buerkner let's plan on before the Stan meeting then. That's 2PM UTC Thursday (10AM EST) if I'm doing my math correctly.

@enetsee if you want to come and that doesn't work, pick another time. Sounds like we're both flexible. I'll e-mail a call link before then.

paul-buerkner · 2020-10-26T13:38:37Z

If you want to talk this week, would it also work after the Stan meeting?

bbbales2 · 2020-10-26T13:43:26Z

After is good too, so let's say 4PM UTC, noon EST

andrewgelman · 2020-12-17T16:27:04Z

Hi, just curious what happened with this?

bbbales2 · 2020-12-17T16:38:38Z

@andrewgelman had a meeting and I never updated the design doc.

There were some things there worth writing down, but the thing that was nagging me when I was trying to update this were unresolved problems with defining parameters that might have different sizes as a function of data.

Problems like, you have an interaction:

parameters {
  matrix[N_ages, N_educs] alpha0;
}

And then you have a model where in your data you don't have at least one of all the ages and educations.

I sorta got stuck here. And if you don't solve this problem you still leave a lot of annoying shuffling of stuff to the outside.

bbbales2 · 2020-12-17T16:41:12Z

I'm not sure I talk about this problem in the design doc, but I think I am assuming that it's solved in some way, and that seems like a big assumption.

Added regression syntax design doc

2d4eff4

bbbales2 marked this pull request as draft September 26, 2020 23:08

SteveBronder reviewed Sep 29, 2020

View reviewed changes

bbbales2 mentioned this pull request Dec 5, 2020

ranef function for accumulating random effects without getting temporaries stan-dev/math#2237

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added regression syntax design doc #27

Added regression syntax design doc #27

bbbales2 commented Sep 26, 2020 •

edited

Loading

seantalts commented Sep 27, 2020

dpsimpson commented Sep 28, 2020 via email

t4c1 commented Sep 28, 2020

dpsimpson commented Sep 28, 2020 via email

enetsee commented Sep 28, 2020 •

edited

Loading

paul-buerkner commented Sep 29, 2020

SteveBronder left a comment

SteveBronder Sep 29, 2020

bbbales2 Oct 4, 2020

SteveBronder Sep 29, 2020

bbbales2 Oct 4, 2020

SteveBronder Sep 29, 2020 •

edited

Loading

bbbales2 Oct 4, 2020

andrewgelman commented Sep 30, 2020 via email

SteveBronder commented Sep 30, 2020

bbbales2 commented Oct 4, 2020

paul-buerkner commented Oct 5, 2020 via email

bbbales2 commented Oct 26, 2020 •

edited

Loading

paul-buerkner commented Oct 26, 2020

bbbales2 commented Oct 26, 2020

paul-buerkner commented Oct 26, 2020 •

edited

Loading

bbbales2 commented Oct 26, 2020

andrewgelman commented Dec 17, 2020

bbbales2 commented Dec 17, 2020

bbbales2 commented Dec 17, 2020

Added regression syntax design doc #27

Are you sure you want to change the base?

Added regression syntax design doc #27

Conversation

bbbales2 commented Sep 26, 2020 • edited Loading

seantalts commented Sep 27, 2020

dpsimpson commented Sep 28, 2020 via email

t4c1 commented Sep 28, 2020

dpsimpson commented Sep 28, 2020 via email

enetsee commented Sep 28, 2020 • edited Loading

paul-buerkner commented Sep 29, 2020

SteveBronder left a comment

Choose a reason for hiding this comment

SteveBronder Sep 29, 2020

Choose a reason for hiding this comment

bbbales2 Oct 4, 2020

Choose a reason for hiding this comment

SteveBronder Sep 29, 2020

Choose a reason for hiding this comment

bbbales2 Oct 4, 2020

Choose a reason for hiding this comment

SteveBronder Sep 29, 2020 • edited Loading

Choose a reason for hiding this comment

bbbales2 Oct 4, 2020

Choose a reason for hiding this comment

andrewgelman commented Sep 30, 2020 via email

SteveBronder commented Sep 30, 2020

bbbales2 commented Oct 4, 2020

paul-buerkner commented Oct 5, 2020 via email

bbbales2 commented Oct 26, 2020 • edited Loading

paul-buerkner commented Oct 26, 2020

bbbales2 commented Oct 26, 2020

paul-buerkner commented Oct 26, 2020 • edited Loading

bbbales2 commented Oct 26, 2020

andrewgelman commented Dec 17, 2020

bbbales2 commented Dec 17, 2020

bbbales2 commented Dec 17, 2020

bbbales2 commented Sep 26, 2020 •

edited

Loading

enetsee commented Sep 28, 2020 •

edited

Loading

SteveBronder Sep 29, 2020 •

edited

Loading

bbbales2 commented Oct 26, 2020 •

edited

Loading

paul-buerkner commented Oct 26, 2020 •

edited

Loading