Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shift() behaviour with missing periods #1530

Closed
pstoyanov opened this issue Feb 10, 2016 · 6 comments
Closed

shift() behaviour with missing periods #1530

pstoyanov opened this issue Feb 10, 2016 · 6 comments
Assignees
Milestone

Comments

@pstoyanov
Copy link

Hi,
I am not sure whether to formulate this as a question, a suggestion for warning in the docs or a feature request...

The way in which shift() currently works (with irregular intervals) is like this:

> DT <- data.table(time = 1:5, value = c("A", "B", "C", "D", "E"))[!3] # make irregular by dropping one row
> setkey(DT, time)
> DT
   time value
1:    1     A
2:    2     B
3:    4     D
4:    5     E
> DT[, lag_value := shift(x = value, n = 1, fill = NA, type = "lag")]
> DT
   time value lag_value
1:    1     A        NA
2:    2     B         A
3:    4     D         B
4:    5     E         D

My issue is the lagged value for time == 4. I'd expect behaviour like this if I specify rolling explicitly, but not in lagging. The documentation for shift specifically refers to the n argument as "...periods to lead/lag by..", and with n = 1 the value maybe should not silently be lagged 2 periods. lag {stats} also specifies their argument k as "The number of lags (in units of observations)"

What the user (if they tend to think like me, that is) probably meant was:

> DT[.(c(min(time):max(time))), lag_value := shift(x = value, n = 1, fill = NA, type = "lag"), nomatch = NA]
> DT
   time value lag_value
1:    1     A        NA
2:    2     B         A
3:    4     D        NA
4:    5     E         D

or even

> DT[CJ(c(min(time):max(time))), j = .(time, value, lag_value = shift(x = value, n = 1, fill = NA, type = "lag")), nomatch = NA]
   time value lag_value
1:    1     A        NA
2:    2     B         A
3:    3    NA         B
4:    4     D        NA
5:    5     E         D

I understand this is an issue only for ordered, regular, time-like cases but still maybe worth mentioning in the documentation, or adding a warning?
Or modifying the fill = argument (which currently is more like pad_by =) to provide the suggested behaviour above?

Thanks!

@franknarf1
Copy link
Contributor

shift() operates on the vector itself and has no knowledge of other vectors, like your time column. If you are looking for functions that are time-series aware, you'll probably want a specialized package for that, like zoo.

@MichaelChirico
Copy link
Member

what's wrong with the self-join?

@jangorecki
Copy link
Member

or join with full sequence of dates expanding your dataset, if needed do roll join so last known values is provided for the gaps

@pstoyanov
Copy link
Author

If the user is aware that they need to do it, there certainly are many ways to achieve the desired output, with data.table itself or other packages. My biggest issue with shift() is that the documentation explicitly uses the word periods but means positions in the vector due to the automatic and silent rolling. I find that un-intuitive and confusing as "shift x by n periods" should imply that n is constant for all values of x unless rolling is explicitly requested. (I wouldn't mind being overruled by a native speaker). So, at least in the case where the data is keyed by date/time, the user should be warned -- in the documentation or when running -- that if there are 'holes' in the dates, and some data pieces were shifted by irregular intervals. Or provide an argument to explicitly specify whether rolling should be done. Automatically & silently rolling the last observation has the potential to be too dangerous in many cases, and seems inconsistent with the roll=FALSE default in data.table().

My impression was that shift() was intended to make more faster and convenient stuff like DT[, lag_value := DT[.(time - 1)][, value]] or DT[DT[, j = .(time + 1, value)], lag_value_i := i.value]. Both of these implicitly use the default roll=FALSE in the calls to data.table, while shift() operates likeDT[, lag_value_WITH_ROLL := DT[.(time - 1), nomatch = NA, roll = TRUE][, value]].

Looking back at e.g. SO1 and SO2, rolling had to be explicitly set to TRUE if needed.

@franknarf1 I agree that trying to include logic to infer the user's intent in shift() is far from trivial, even if a few general rules can be constructed by looking at the type of the columns in the key or group= -- e.g. if the last one is a date/time...

@jangorecki
Copy link
Member

@pstoyanov that's a fair point on the documentation.
This in fact could work on periods, if shift could map IPeriod class to vector positions.
I've made a while ago IPeriod which is integer based periods data type, manual.
Usage also described in this SO.
Shift could simply take IPeriod integer values and maps to vector position, when values would be less than 100k (which is easy as you can use origin arg) then AFAIK should be additional speedup.

@MichaelChirico
Copy link
Member

I'm actually not sure it's nontrivial to get at what you have in mind. n = periods(x, "day"), where periods is an internal helper function that does any of the suggestions mentioned here should be able to accomplish the right thing.

actually this seems like a pretty good suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants