Replace integers with explicit integers (1 -> 1L, etc.) #2573

MichaelChirico · 2018-01-18T03:38:46Z

Part of #2572, tried to make sure I respected "strong typing" (don't force integer on potential numerics).

Command to find these:

#numeric ones (1 -> 1L)
grep '^[^#]*[^a-zA-Z_]1[^L]' R/* > ~/Desktop/all1s.out
#for all integers
grep '^[^#]*[^a-zA-Z_0-9][0-9][^0-9L]' R/* > ~/Desktop/numeric_integers.out

mattdowle · 2018-01-18T05:11:57Z

Very nice!
I'm not sure the status of global small integer constants in base R (i.e. whether altrep got to that or not and if that's been merged to R-devel) but I know it was planned. Either way, this PR is needed to have the best chance to benefit from that.
Interested to see the time of test.data.table() (without Suggests packages, just plain DT tests) before and after. But even if no speedup on that, it's still worth doing.

MichaelChirico · 2018-01-18T06:13:21Z

head and tail GForce optimization failures are legit:

dt = data.table(
  x  = sample(letters, 300, TRUE),
  d1 = as.numeric(sample(-10:10, 300, TRUE)),
  d2 = as.numeric(sample(c(NA, NaN, -10:10), 300, TRUE))
)
# not optimized now
dt[x %in% letters[15:20], head(.SD,1), by=x, verbose=TRUE]
# still optimized
dt[x %in% letters[15:20], head(.SD,1L), by=x, verbose=TRUE]
#idem
dt[x %in% letters[15:20], tail(.SD,1), by=x, verbose=TRUE]
dt[x %in% letters[15:20], tail(.SD,1L), by=x, verbose=TRUE]

Doesn't make sense to require users to run head(.SD, 1L) to benefit from optimization (even if it's best practice); GForce should detect and convert.

Investigating, though I'm in the dark here for a moment. Sometimes strong typing would be nice 😬

MichaelChirico · 2018-01-18T06:15:30Z

Also, should probably turn the lines like:

if (verbose) {cat(round(proc.time()[3L]-last.started.at, 3L),"secs\n");flush.console}

into a function since they're so common -- @mattdowle IIUC this is what the timetaken function is for?

@mattdowle edit : yes, it should be using timetaken there.

MichaelChirico · 2018-01-18T06:20:25Z

R/data.table.R

-                          as.character(q[[1]]) %chin% "[" && is.numeric(q[[3]]) &&
-                            length(q[[3]])==1 && q[[3]]>0 )
+          ans = cond && length(q)==3L && ( as.character(q[[1L]]) %chin% c("head", "tail") &&
+                             (identical(q[[3L]], 1L) || identical(q[[3L]], 1L)) ||


Any reason this can't just be q[[3L]] == 1L?

MichaelChirico · 2018-01-18T06:28:06Z

R/data.table.R

+          ans = cond && length(q) == 3L &&
+            length(q[[3L]]) == 1L && is.numeric(q[[3L]]) && (
+              as.character(q[[1L]]) %chin% c("head", "tail") && q[[3L]] == 1L ||
+                as.character(q[[1L]]) %chin% "["  && q[[3L]] > 0 )


The logic now seems more natural to me (in particular, I see no reason to check is.numeric and length conditions only for the [ case and not for the head/tail case -- perhaps a few tests are in order?), and the code reads better too.

The use of identical may have had NA in mind. Also, for the head / tail case, is.numeric and length were being checked via identical(). Yes more tests would be great. Seems like codecov agrees and is not-passing which is good (confirms it's a good exercise).
The line needs a comment saying what it's supposed to be doing, doesn't it!
Can you remove the added spaces around == please. To my eyes there is now lots of separate things all with space between them and it's harder to see the tree of logic. = is already a different character to letters (and is not valid in symbols) so = is already a visual separator. Removing the spaces around == allows you to budge the pieces together in this case, so that the remaining spaces are more helpful.
~~Isn't head and tail with n>1 now going to pass? That doesn't seem to be the intent of the original line (only n==1L and n==1 were allowed, iiuc).~~ Ok I see the q[[3L]]==1L now.
An extra set of parens around the left and right of the last || would be clearer. There was a paren around the left of that || in the original but not on its right.

codecov-io · 2018-01-18T06:44:52Z

Codecov Report

Merging #2573 into master will increase coverage by 1.45%.
The diff coverage is 94.68%.

@@            Coverage Diff             @@
##           master    #2573      +/-   ##
==========================================
+ Coverage   91.49%   92.94%   +1.45%     
==========================================
  Files          63       61       -2     
  Lines       12229    12107     -122     
==========================================
+ Hits        11189    11253      +64     
+ Misses       1040      854     -186

Impacted Files	Coverage Δ
R/fmelt.R	`100% <ø> (ø)`	⬆️
R/utils.R	`82.6% <ø> (ø)`	⬆️
R/as.data.table.R	`100% <100%> (ø)`	⬆️
R/foverlaps.R	`100% <100%> (+5.69%)`	⬆️
R/getdots.R	`100% <100%> (ø)`	⬆️
R/bmerge.R	`100% <100%> (ø)`	⬆️
R/merge.R	`93.93% <100%> (ø)`	⬆️
R/print.data.table.R	`98.13% <100%> (ø)`	⬆️
R/groupingsets.R	`100% <100%> (ø)`	⬆️
R/IDateTime.R	`85.61% <100%> (+5.47%)`	⬆️
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05c5122...5be3049. Read the comment docs.

MichaelChirico · 2018-01-18T08:09:30Z

@mattdowle Ran this to test speed:

library(data.table)
library(magrittr)
replicate(10, {
  capture.output(suppressMessages(test.data.table())) %>% 
    capture.output %>% 
    grep('completed ok', ., value = TRUE) %>%
    gsub('.*completed ok in ([0-9:]+).*', '\\1', .) %>%
    strsplit(':') %>% extract2(1L) %>% extract(-1L) %>%
    as.numeric %>% (function(x) {x[1L] + x[2L]/60})
}) %>% mean

Average duration from testing 10 times on version currently at master:

# [1] 1.818333

Compared to this branch:

# [1] 1.758333

About a 3% improvement... pretty damn good if you ask me! Certainly more than I expected.

(I do have many of the external packages; perhaps we should add a testDeps argument to test.data.table?)

Loading required package: bit64
Loading required package: bit
Loading required package: knitr
Loading required package: ggplot2
Loading required package: plyr
Loading required package: reshape2
Loading required package: nlme
Loading required package: curl
Loading required package: rmarkdown
Loading required package: parallel

mattdowle · 2018-01-18T08:20:12Z

Nice result. What's the improvement in the min not mean, though (perhaps even better!)
Why not just time test.data.table() with n=10 using microbenchmark? Its runtime is the same as the time it prints at the end isn't it? It's important to have the gc() run by system.time, especially here where the change is something related to many small objects. Otherwise gc() will run a bit more randomly and the time to gc() is included in your replicate. Probably gc() happens so often in the test suite that it makes no difference, but easier to rule that out.
Yes : testDeps added to test.data.table would be great.

MichaelChirico · 2018-01-18T08:25:27Z

Certainly easier to do, but I wasn't sure if there's any overhead in test.data.table() that's skipped by its internal timer

mattdowle · 2018-01-18T08:27:33Z

I'd be surprised if the overhead is more than negligible. Easy to check...

> system.time(test.data.table())
...
All 7596 tests in inst/tests/tests.Rraw completed ok in 00:01:23
   user  system elapsed 
 85.512   0.716  82.933

00:01:23 == 82.933

MichaelChirico · 2018-01-18T08:47:56Z

Hmm, slightly different for me:

All 7661 tests in inst/tests/tests.Rraw completed ok in 00:01:48 on Thu Jan 18 16:46:12 2018
   user  system elapsed 
166.901   7.128 109.743

anyway i'll run with microbenchmark

MichaelChirico · 2018-01-18T09:32:25Z

Not as impressive this time (<1%):

Unit: seconds
            expr      min       lq     mean   median      uq     max neval
 explicitInteger  96.4129  97.7939 101.9369 101.4272 105.417 108.409    10
          master 100.3113 100.9985 102.4299 102.3071 103.081 106.099    10

(cobbled together from microbenchmark runs on separate sessions)

MichaelChirico · 2018-01-22T09:53:41Z

R/data.table.R

-      }
-      stop("problem recycling column ",i,", try a simpler type")
-    # }
-    stop("argument ",i," (nrow ",nrows[i],") cannot be recycled without remainder to match longest nrow (",nr,")")


As near as I can tell this error is superseded by the above warning? Also eliminated the commented branch since it seems to be vestigial.

MichaelChirico · 2018-01-22T10:59:35Z

R/between.R

@@ -22,18 +22,18 @@ inrange <- function(x,lower,upper,incbounds=TRUE) {
  subject = setDT(list(l=lower, u=upper))
  ops = if (incbounds) c(4L, 2L) else c(5L, 3L) # >=,<= and >,<
  verbose = getOption("datatable.verbose")
-  if (verbose) {last.started.at=proc.time()[3];cat("forderv(query) took ... ");flush.console()}


@mattdowle what about two functions, timestart and timetaken? timestart would accept a message as input, cat it, flush.console()s and return the proc.time(); timetaken takes the result of timestart, cats it, and flush.console(), and returns nothing.

(also perhaps this should be done as a separate PR?)

Yes separate PR. Sometimes there are several nested timing blocks in play. Storing the started.at time in a variable and then passing that (or the appropriate one) to timetaken I think I like the balance of using a common function along with the flexibility of using it in different ways depending on the circumstances. i.e. current timetaken() approach, but actually using it here. IIUC.

mattdowle · 2018-01-23T00:39:06Z

On the contrary, I'd say the min and lq figures confirm the 3%-4% improvement. The other figures are affected too much by a varying number of gc()'s or effects like that. Given that the test suite has a large number of small arbitrary tests and we didn't think at all about an appropriate test, it's impressive any improvement shows up at all. Take a particular test that hits those the changed lines, and repeat that one test many times and the improvement could be larger (but I'm not suggested doing that, no need). In R-devel the improvement may be greater by avoiding the small creation at all (which wasn't possible before as a type double).

mattdowle · 2018-01-23T01:00:19Z

Looks like it's a correct fail on test 1187.6 I see locally. I'll leave to you.

MichaelChirico · 2018-01-29T06:52:59Z

R/foverlaps.R

@@ -128,24 +129,15 @@ foverlaps <- function(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), by.
  }
  # nomatch has no effect here, just for passing arguments consistently to `bmerge`
  .Call(Clookup, uy, nrow(y), indices(uy, y, yintervals, nomatch=0L, roll=roll), maxgap, minoverlap, mult, type, verbose)
-  if (maxgap == 0L && minoverlap == 1L) {


@arunsrinivasan check out this diff. axed iintervals as it appears unused. And axed the extra argument checks to simplify so long as the other options are unimplemented

@MichaelChirico good catch regarding iintervals. Seems like it wasn't used right from the first commit of foverlaps.

I understand why you simplified code, but I left it there to serve as a placeholder when I'll come back to work on it... The first if statement would always be run currently. So that's the only cost here, which is fine.. Would it be possible to revert this part to how it was please?

The PR overall looks great! Apologies for not being able to write earlier.

@arunsrinivasan I figured that's why you left it like that... only changed the logic to make testing easier, Codecov was complaining since it's impossible to get to the other branches to test them (all hail the mighty Codecov 🌞 ⛪️ ). I figured it'd be easy enough to revert to the old code when needed (that's what git's for right?)

as a compromise, I could leave the old code in, just commented out?

Why not wrap those lines with # nocov start and # nocov end so as to skip coverage tests? See https://github.com/Rdatatable/data.table/blob/70db3e48738026388c6487e83ee902ad37256ca5/R/c.factor.R

Yes, commented out with a comment pointing to this PR would be okay too.

simply because i didn't know that was possible! I'll add that to onAttach and onLoad as well... thanks!

MichaelChirico · 2018-01-29T08:40:51Z

@mattdowle OK I think I've hammered away enough at the codecov diff miss for now. AFAICT the major remaining miss is from changes to onAttach.R, which I don't think (?) are testable.

mattdowle

Fantastic!
This could help memory usage and gc() time, too (not just the 3-4% you found), by avoiding all those length 1 numeric vectors. Anyway, it puts data.table in the best position to benefit from the global-small-integer optimization on the way in R itself.
And increase of coverage by 0.33% (from 91.48% to 91.81%) is worth it by itself.

I've requested review from @arunsrinivasan re the foverlaps changes. Maybe those removed lines could be captured in an issue to implement or as a comment, if not already.

…f internal-to-R global small-integers

arunsrinivasan

@MichaelChirico overall great PR! I've commented under foverlaps.

Apologies for not replying earlier.

arunsrinivasan · 2018-01-31T19:01:17Z

R/foverlaps.R

@@ -15,7 +15,7 @@ foverlaps <- function(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), by.
  mult = match.arg(mult)
  if (type == "equal")
    stop("type = 'equal' is not implemented yet. But note that this is just the same as a normal data.table join y[x, ...], unless you are also interested in setting 'minoverlap / maxgap' arguments. But those arguments are not implemented yet as well.")
-  if (maxgap > 0L || minoverlap > 1L)
+  if (maxgap != 0L || minoverlap != 1L)


maxgap < 0 and minoverlap < 1 both are checked in lines 6 and 8. What's the purpose of this change?

arunsrinivasan · 2018-01-31T19:03:12Z

R/foverlaps.R

-  setkey(uy)[, `:=`(lookup = list(list(integer(0))), type_lookup = list(list(integer(0))), count=0L, type_count=0L)]
-  if (verbose) {cat(round(proc.time()[3]-last.started.at,3),"secs\n");flush.console}
+  setkey(uy)[, `:=`(lookup = list(list(integer(0L))), type_lookup = list(list(integer(0L))), count=0L, type_count=0L)]
+  if (verbose) {cat(timetaken(last.started.at)); flush.console()}


I'm a bit confused as to how these changes are related to this PR (1 -> 1L like enhancements). But I'm guessing you've discussed these changes with Matt.

@arunsrinivasan the Codecov report was failing pretty miserably on the PR, since it seems to have uncovered a lot of un-tested lines (this one in particular would have come up from 3 becoming 3L, for example

arunsrinivasan · 2018-01-31T19:06:01Z

R/foverlaps.R

@@ -128,24 +129,15 @@ foverlaps <- function(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), by.
  }
  # nomatch has no effect here, just for passing arguments consistently to `bmerge`
  .Call(Clookup, uy, nrow(y), indices(uy, y, yintervals, nomatch=0L, roll=roll), maxgap, minoverlap, mult, type, verbose)
-  if (maxgap == 0L && minoverlap == 1L) {


@MichaelChirico good catch regarding iintervals. Seems like it wasn't used right from the first commit of foverlaps.

arunsrinivasan · 2018-01-31T19:10:15Z

R/foverlaps.R

@@ -128,24 +129,15 @@ foverlaps <- function(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), by.
  }
  # nomatch has no effect here, just for passing arguments consistently to `bmerge`
  .Call(Clookup, uy, nrow(y), indices(uy, y, yintervals, nomatch=0L, roll=roll), maxgap, minoverlap, mult, type, verbose)
-  if (maxgap == 0L && minoverlap == 1L) {


I understand why you simplified code, but I left it there to serve as a placeholder when I'll come back to work on it... The first if statement would always be run currently. So that's the only cost here, which is fine.. Would it be possible to revert this part to how it was please?

The PR overall looks great! Apologies for not being able to write earlier.

Merge branch 'master' of https://github.com/Rdatatable/data.table into explicitIntegers # Conflicts: # R/data.table.R # inst/tests/tests.Rraw

Merge branch 'explicitIntegers' of https://github.com/Rdatatable/data.table into explicitIntegers # Conflicts: # inst/tests/tests.Rraw

mattdowle · 2018-02-02T04:10:10Z

R/test.data.table.R

@@ -1,10 +1,10 @@
-
+# nocov start


Why a nocov start here? Surely test.data.table is tested; it's the main entry point to the test suite!

mattdowle · 2018-02-02T04:12:26Z

R/test.data.table.R

@@ -56,6 +61,7 @@ compactprint <- function(DT, topn=2) {
  invisible()
 }

+# nocov start


Also here. INT() is used in lots of tests. Why exclude it from code cov?

… have been added

… leniency.

…live. Added nocov block instead.

MichaelChirico · 2018-02-06T00:05:12Z

R/foverlaps.R

+    if (verbose) {cat(timetaken(last.started.at));flush.console()}
+    olaps = .Call(Coverlaps, uy, xmatches, mult, type, nomatch, verbose)
+  }
+  # nocov start


… newline in its return value; e.g. at the end of tests.Rraw it's used assuming no newline included

MichaelChirico · 2018-02-06T02:14:58Z

@mattdowle thanks for the detailed review, i learned a lot about codecov here :)

MichaelChirico force-pushed the explicitIntegers branch from b329434 to de4c4ec Compare January 18, 2018 03:53

MichaelChirico commented Jan 18, 2018

View reviewed changes

Replace integers with explicit integers (1 -> 1L, etc.)

ee82f0f

MichaelChirico force-pushed the explicitIntegers branch from de4c4ec to ee82f0f Compare January 18, 2018 06:25

MichaelChirico commented Jan 18, 2018

View reviewed changes

add some tests, comments

6afb0ee

MichaelChirico commented Jan 22, 2018

View reviewed changes

MichaelChirico force-pushed the explicitIntegers branch from 5746f5d to f1bdd23 Compare January 22, 2018 14:48

simplify verbose timing calls (including some undiscovered typos)

4c8162d

MichaelChirico force-pushed the explicitIntegers branch from 9f392fb to 4c8162d Compare January 23, 2018 03:20

Merge branch 'master' into explicitIntegers

e727345

MichaelChirico force-pushed the explicitIntegers branch 6 times, most recently from 97b1f47 to bf9fdf6 Compare January 24, 2018 08:43

more tests

7c73e5b

MichaelChirico force-pushed the explicitIntegers branch from bf9fdf6 to 7c73e5b Compare January 29, 2018 04:48

Merge branch 'master' into explicitIntegers

70e23cc

MichaelChirico commented Jan 29, 2018

View reviewed changes

MichaelChirico force-pushed the explicitIntegers branch from af1670d to 6828e3f Compare January 29, 2018 08:11

more tests (foverlaps, verbosity)

689e737

MichaelChirico force-pushed the explicitIntegers branch from 6828e3f to 689e737 Compare January 30, 2018 06:09

Merge branch 'master' into explicitIntegers

c900500

mattdowle added this to the v1.10.6 milestone Jan 30, 2018

mattdowle approved these changes Jan 30, 2018

View reviewed changes

mattdowle requested a review from arunsrinivasan January 30, 2018 23:37

mattdowle added 2 commits January 30, 2018 16:13

Create SDenv$.GRP and .N with explicit new vector() in anticipation o…

f199e12

…f internal-to-R global small-integers

Merge branch 'master' into explicitIntegers

87af37c

arunsrinivasan reviewed Jan 31, 2018

View reviewed changes

Michael Chirico added 5 commits February 1, 2018 14:53

add some nocov, reinstate some code in comments

b5c482b

merge recent commits

1e4e2cb

Merge branch 'master' of https://github.com/Rdatatable/data.table into explicitIntegers # Conflicts: # R/data.table.R # inst/tests/tests.Rraw

merge conflicts

3afa577

Merge branch 'explicitIntegers' of https://github.com/Rdatatable/data.table into explicitIntegers # Conflicts: # inst/tests/tests.Rraw

more nocov tags

4c449d1

another merge fix?

64b88ff

mattdowle reviewed Feb 5, 2018

View reviewed changes

mattdowle added 3 commits February 5, 2018 12:54

Removed nocov start/end from test.data.table.R to see why those might…

f1976f3

… have been added

Focussed nocov in test.data.table. Removed (unused) factor comparison…

6b793a1

… leniency.

Reverted commenting-out to avoid duplicated 4 lines and to keep if() …

701a9d8

…live. Added nocov block instead.

MichaelChirico commented Feb 6, 2018

View reviewed changes

A few more 0->0L. Also, timetaken() is exported so reverted including…

5be3049

… newline in its return value; e.g. at the end of tests.Rraw it's used assuming no newline included

mattdowle merged commit 3f835d4 into master Feb 6, 2018

mattdowle deleted the explicitIntegers branch February 6, 2018 02:01

MichaelChirico mentioned this pull request Aug 30, 2019

purge integer-as-real usages in R/ #3803

Merged

Replace integers with explicit integers (1 -> 1L, etc.) #2573

Replace integers with explicit integers (1 -> 1L, etc.) #2573

Conversation

MichaelChirico commented Jan 18, 2018

mattdowle commented Jan 18, 2018 • edited Loading

MichaelChirico commented Jan 18, 2018

MichaelChirico commented Jan 18, 2018 • edited by mattdowle Loading

Choose a reason for hiding this comment

MichaelChirico Jan 18, 2018 • edited Loading

Choose a reason for hiding this comment

mattdowle Jan 18, 2018 • edited Loading

Choose a reason for hiding this comment

codecov-io commented Jan 18, 2018 • edited Loading

Codecov Report

MichaelChirico commented Jan 18, 2018 • edited Loading

mattdowle commented Jan 18, 2018 • edited Loading

MichaelChirico commented Jan 18, 2018

mattdowle commented Jan 18, 2018 • edited Loading

MichaelChirico commented Jan 18, 2018

MichaelChirico commented Jan 18, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaelChirico Jan 22, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattdowle commented Jan 23, 2018

mattdowle commented Jan 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaelChirico Feb 1, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaelChirico commented Jan 29, 2018

mattdowle left a comment • edited Loading

Choose a reason for hiding this comment

arunsrinivasan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaelChirico commented Feb 6, 2018

mattdowle commented Jan 18, 2018 •

edited

Loading

MichaelChirico commented Jan 18, 2018 •

edited by mattdowle

Loading

MichaelChirico Jan 18, 2018 •

edited

Loading

mattdowle Jan 18, 2018 •

edited

Loading

codecov-io commented Jan 18, 2018 •

edited

Loading

MichaelChirico commented Jan 18, 2018 •

edited

Loading

mattdowle commented Jan 18, 2018 •

edited

Loading

mattdowle commented Jan 18, 2018 •

edited

Loading

MichaelChirico commented Jan 18, 2018 •

edited

Loading

MichaelChirico Jan 22, 2018 •

edited

Loading

mattdowle commented Jan 23, 2018 •

edited

Loading

MichaelChirico Feb 1, 2018 •

edited

Loading

mattdowle left a comment •

edited

Loading