shift() behaviour with missing periods #1530

pstoyanov · 2016-02-10T00:41:29Z

Hi,
I am not sure whether to formulate this as a question, a suggestion for warning in the docs or a feature request...

The way in which shift() currently works (with irregular intervals) is like this:

> DT <- data.table(time = 1:5, value = c("A", "B", "C", "D", "E"))[!3] # make irregular by dropping one row
> setkey(DT, time)
> DT
   time value
1:    1     A
2:    2     B
3:    4     D
4:    5     E
> DT[, lag_value := shift(x = value, n = 1, fill = NA, type = "lag")]
> DT
   time value lag_value
1:    1     A        NA
2:    2     B         A
3:    4     D         B
4:    5     E         D

My issue is the lagged value for time == 4. I'd expect behaviour like this if I specify rolling explicitly, but not in lagging. The documentation for shift specifically refers to the n argument as "...periods to lead/lag by..", and with n = 1 the value maybe should not silently be lagged 2 periods. lag {stats} also specifies their argument k as "The number of lags (in units of observations)"

What the user (if they tend to think like me, that is) probably meant was:

> DT[.(c(min(time):max(time))), lag_value := shift(x = value, n = 1, fill = NA, type = "lag"), nomatch = NA]
> DT
   time value lag_value
1:    1     A        NA
2:    2     B         A
3:    4     D        NA
4:    5     E         D

or even

> DT[CJ(c(min(time):max(time))), j = .(time, value, lag_value = shift(x = value, n = 1, fill = NA, type = "lag")), nomatch = NA]
   time value lag_value
1:    1     A        NA
2:    2     B         A
3:    3    NA         B
4:    4     D        NA
5:    5     E         D

I understand this is an issue only for ordered, regular, time-like cases but still maybe worth mentioning in the documentation, or adding a warning?
Or modifying the fill = argument (which currently is more like pad_by =) to provide the suggested behaviour above?

Thanks!

The text was updated successfully, but these errors were encountered:

franknarf1 · 2016-02-10T00:58:22Z

shift() operates on the vector itself and has no knowledge of other vectors, like your time column. If you are looking for functions that are time-series aware, you'll probably want a specialized package for that, like zoo.

MichaelChirico · 2016-02-10T02:39:26Z

what's wrong with the self-join?

jangorecki · 2016-02-10T08:52:20Z

or join with full sequence of dates expanding your dataset, if needed do roll join so last known values is provided for the gaps

pstoyanov · 2016-02-10T11:31:02Z

If the user is aware that they need to do it, there certainly are many ways to achieve the desired output, with data.table itself or other packages. My biggest issue with shift() is that the documentation explicitly uses the word periods but means positions in the vector due to the automatic and silent rolling. I find that un-intuitive and confusing as "shift x by n periods" should imply that n is constant for all values of x unless rolling is explicitly requested. (I wouldn't mind being overruled by a native speaker). So, at least in the case where the data is keyed by date/time, the user should be warned -- in the documentation or when running -- that if there are 'holes' in the dates, and some data pieces were shifted by irregular intervals. Or provide an argument to explicitly specify whether rolling should be done. Automatically & silently rolling the last observation has the potential to be too dangerous in many cases, and seems inconsistent with the roll=FALSE default in data.table().

My impression was that shift() was intended to make more faster and convenient stuff like DT[, lag_value := DT[.(time - 1)][, value]] or DT[DT[, j = .(time + 1, value)], lag_value_i := i.value]. Both of these implicitly use the default roll=FALSE in the calls to data.table, while shift() operates likeDT[, lag_value_WITH_ROLL := DT[.(time - 1), nomatch = NA, roll = TRUE][, value]].

Looking back at e.g. SO1 and SO2, rolling had to be explicitly set to TRUE if needed.

@franknarf1 I agree that trying to include logic to infer the user's intent in shift() is far from trivial, even if a few general rules can be constructed by looking at the type of the columns in the key or group= -- e.g. if the last one is a date/time...

jangorecki · 2016-02-10T12:36:38Z

@pstoyanov that's a fair point on the documentation.
This in fact could work on periods, if shift could map IPeriod class to vector positions.
I've made a while ago IPeriod which is integer based periods data type, manual.
Usage also described in this SO.
Shift could simply take IPeriod integer values and maps to vector position, when values would be less than 100k (which is easy as you can use origin arg) then AFAIK should be additional speedup.

MichaelChirico · 2016-02-10T12:41:12Z

I'm actually not sure it's nontrivial to get at what you have in mind. n = periods(x, "day"), where periods is an internal helper function that does any of the suggestions mentioned here should be able to accomplish the right thing.

actually this seems like a pretty good suggestion.

arunsrinivasan added documentation Low labels Feb 17, 2016

arunsrinivasan added this to the v1.9.8 milestone Feb 17, 2016

arunsrinivasan self-assigned this Feb 17, 2016

arunsrinivasan closed this as completed in 272331a Feb 17, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

shift() behaviour with missing periods #1530

shift() behaviour with missing periods #1530

pstoyanov commented Feb 10, 2016

franknarf1 commented Feb 10, 2016

MichaelChirico commented Feb 10, 2016

jangorecki commented Feb 10, 2016

pstoyanov commented Feb 10, 2016

jangorecki commented Feb 10, 2016

MichaelChirico commented Feb 10, 2016

shift() behaviour with missing periods #1530

shift() behaviour with missing periods #1530

Comments

pstoyanov commented Feb 10, 2016

franknarf1 commented Feb 10, 2016

MichaelChirico commented Feb 10, 2016

jangorecki commented Feb 10, 2016

pstoyanov commented Feb 10, 2016

jangorecki commented Feb 10, 2016

MichaelChirico commented Feb 10, 2016