Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TODO items from tests.Rraw #2572

Closed
28 of 51 tasks
MichaelChirico opened this issue Jan 18, 2018 · 35 comments
Closed
28 of 51 tasks

TODO items from tests.Rraw #2572

MichaelChirico opened this issue Jan 18, 2018 · 35 comments
Assignees

Comments

@MichaelChirico
Copy link
Member

MichaelChirico commented Jan 18, 2018

Items cut out from the end of tests.Rraw when this issue was first created:

  • Tests involving GForce functions needs to be run with optimisation level 1 and 2, so that both functions are tested all the time.
  • Add test for fixed bug report elapsed time even if tests fail #5519 - dcast returned error when a package imported data.table, but dint happen when "depends" on data.table. This is fixed (commit 1263 v1.9.3), but not sure how to add test. [add test from TODO list insuring dcast plays well with being imported #2571]
  • test and highlight in docs that negatives are fine and fast in forderv (ref R wish #15644)
  • bring ?setkey description up to date; Modernize ?setkey; and tackle long-deprecated functions #3399
  • tests of freading classes like Date and the verbose messages there. [Various colClasses enhancements #2545]
  • Test mid read bump of logical to character, rereads original True and False character strings faithfully.
  • add examples of multiple LHS (name and position) and multiple RHS to example(":=")
  • tests on double in add hoc by
  • test on -i that retain key e.g. DT[-4] and DT[-4,sum(v),by=b] should both retain key
  • test on out of bound i subsets e.g. 6:10 when DT has 7 rows, and mixed negative and positive i integer is error.
  • test that ordered subsets when i is unkeyed now retain x's key (using is.sorted(f__))
  • add FAQ that eval() is evaled in calling frame so don't need a, then update SO question of 14 March. See the test using variable name same as column name. Actually, is that true? Need "..J".
  • revisit SO answer. Can there be a better way than the repeated lapply(f,eval,.SD)?
  • change all 1 to 1L internally (done in data.table.R, other .R to do)
  • check the "j is named list could be inefficient" message from verbose than Chris N showed recently to 15 May
  • !make sure explicitly that unnamed lists are being executed by dogroups!
  • Add to warning about a previous copy that class<-, levels<- can also copy whole vector. Any fun<- form basically.
  • use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.
  • add tests on smaller examples with NAs for 'frankv', even though can't compare to base::rank.

TODO items sprinkled throughout the rest of the file:

  • confirm with Matt the behavior of rbindlist w.r.t. naming of input in edge cases -- rbindlist should look for the first non-empty data.table - New changes (from Arun). Explanation below: Even if data.table is empty, as long as there are column names, they should be considered. Ex: What if all data.tables are empty? What'll be the column name then? If there are no names, then the first non-empty set of names will be allocated. I think this is the way to do it...
  • "Ragged" files that should have fill = TRUE generate warnings, not errors; as such, we need another way to trigger a STOP() inside fread.c. options(warn=2) isn't enough.
  • Add tests of NaN behavior in forder (testing internal behavior since can't compare to base::order)
  • Think of and add more tests for rbindlist
  • allow.cartesian is ignored if jsub[1L] has :=; could still warn if i has duplicates
  • Add comment=="#". Ensure only after last field is observed.
  • Add integer64 support for transpose
  • Add a test when fill=FALSE, but blank.lines.skip=TRUE, when the same effect as in tests 1585 should happen
  • Fix and add test for cases like this: a,b,c\n1,2,3\n4,5,6\n7,8,9,6\n1,2,3\n... -- extra column in line 4, but we've only detected 3
  • Add tests for nomatch=NA in non-equi-join suite (tests 1640-ish)... tested, but takes quite some time... so commented for now [split off as Add tests for nomatch=NA in non-equi-join suite #3411]
  • 0.0 written as 0, but TODO Small doubles are serialized incorrectly by fwrite #2398, probably related to the 2 lines after l==0 missing coverage in writeFloat64
  • Add strict.numeric (default FALSE) to all.equal.data.table()? [split off as Add strict.numeric (default FALSE) to all.equal.data.table()? #3410]
  • Begin to deprecate logicalAsInt argument
  • fread ignores first 2 lines of file with tabs after data in several lines  #1416 TODO (trailing tabs on most but not at the beginning and a "-" intended to mean missing but taken as text column name)
  • Begin deprecation of ..x as valid column names

TO DO items sprinkled throughout the rest of the file:

  • the eval() enabled you to use the 'a' in the calling scope, not 'a' in the TESTDT. TO DO: document this.
  • Test that changing colnames keep key in sync. TO DO: will have to do this for secondary keys, too, when implemented.
  • Test shallow copy verbose message from := adding a column, and (TO DO) only when X is NAMED.
  • Used to be warning about invalid .internal.selfref detected and fixed. As from v1.8.3 data.table() returns a NAMED==0 object, and key<- appears not to copy that. But within functions, key<- would still copy. TO DO: add tests....
  • ### ----> Therefore this TO DO may not be necessary here anymore (added by Arun 26th Jan 2014) ---> # TO DO when options(datatable.pedantic=TRUE): test(680.5, rbindlist(l[c(2,6)]), warning="Column 1 of item 2 is type 'double', inconsistent with column 1 of item 1's type ('integer')")
  • Test non ascii characters when passed as character by, Unexpected "Could not find first good line start after jump point 1 when sampling" #2134 TO DO: reinstate. Temporarily removed to pass CRAN's Mac using C locale (R-Forge's Mac is ok)
  • TO DO: add warning about inefficiency when datatable.pedantic=TRUE
  • TO DO: why doesn't last group have x=d, maybe groups=i in dogroups
  • the i groups when no match don't get carried through (would be hard to implement this and very unlikely to be useful. Just break into compound query, if needed to be used in j, to get them to carry through. TO DO: add to FAQ.
  • A is needed otherwise read as double with loss of precision (TO DO: should detect and bump to STR)
  • " protected ok, but \ needs taking off too (TO DO)
  • Takes too long for routine use. TO DO: move to a long running stress test script
  • Selection of columns, copy column to maintain the same as R <= 3.0.2, in Rdevel, for now; otherwise e.g. setkey changes the original columns too. TO DO: could allow shallow copy, perhaps.
  • TO DO: roll to -Inf. Also remove -Inf and test rolling to NaN and NA
  • rounding of milliseconds, workaround, TO DO: fix check as-cran note #5445
  • TO DO, by checking for balanced embedded quotes
  • to do: add tests for :=
@MichaelChirico
Copy link
Member Author

MichaelChirico commented Feb 2, 2018

My idea for testing:

use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.

is to run:

t1 = system.time({
    DT = data.table(a = 1:10)
    for (jj in paste0(seq_len(100))) DT[ , (jj) := 10:1]
})

t2 = system.time({
    DT = data.table(a = 1:10)
    for (jj in paste0(seq_len(100))) set(DT, , jj, 10:1)
})

ratio = t1['elapsed']/t2['elapsed']

Then error if ratio is outside (1, R]. On my machine right now, the ratio is about 8; I guess setting R to 10 would be too tight/potentially machine-dependent. How about choosing R = 50? 25?

Could also make the loop more expensive (absolute time of t1 is currently about .025 on my machine -- very fast, but not sure what the right run time is for the test suite)

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Feb 3, 2018

@mattdowle the following doesn't appear to be true anymore? unless I missed something about what's intended...

test that ordered subsets when i is unkeyed now retain x's key (using is.sorted(f__))

DT1 = data.table(ID = 1:4, v = 4:1, key = 'ID')
DT2 = data.table(ID = 3:6, w = c(3, 4, 1, 2))
key(DT1[DT2])
# NULL
key(DT1[DT2, on = 'ID'])
# NULL
DT1[DT2]
#    ID  v w
# 1:  3  2 3
# 2:  4  1 4
# 3:  5 NA 1
# 4:  6 NA 2

FWIW, I see that the line is.sorted(f__) has been commented out.


Also, I wasn't able to track down any question on SO on any March 14 mentioning eval or with you as a user (user:403310):

add FAQ that eval() is evaled in calling frame so don't need a, then update SO question of 14 March. See the test using variable name same as column name. Actually, is that true? Need "..J".

@MichaelChirico
Copy link
Member Author

@st-pasha you're more familiar with the fread test suite, is this covered by now?

Test mid read bump of logical T/F to character, collapse back to T and F.

@st-pasha
Copy link
Contributor

st-pasha commented Feb 9, 2018

I don't remember if there is such a test specifically, but it's easy to add:

DT = data.table(A=rep("True", 2200), B="FALSE")
DT[111, A:="here be"]
DT[111, B:="dragons"]
fwrite(DT, f<-tempfile())
data.table:::test(9999, fread(f, verbose=TRUE), DT, output="Column 2 (\"B\") bumped from 'bool8' to 'string' due to <<dragons>> on row 110")

@MichaelChirico
Copy link
Member Author

@st-pasha thanks, I wasn't sure if there's an exact row number to pick to ensure re-read.

Tried beefing it up, but fread doesn't take "T" "F" to be bool?

DT = data.table(A=rep("True", 2200), B="FALSE", C="T", D='0')
DT[111, LETTERS[1:4] := .("here", "be", "dra", "gons")]
fwrite(DT, f<-tempfile())
fread(f, verbose=TRUE)

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Feb 9, 2018

@mattdowle this one appears to be obsolete as well?

check the "j is named list could be inefficient" message from verbose than Chris N showed recently to 15 May

grep "named list" R/* comes up blank.

@MichaelChirico
Copy link
Member Author

re:

tests on double in add hoc by

I am making a test along the lines of:

DT = data.table(a = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), v = 1:12,
                jig = 10^(289:300))
DT[ , sum(v), by = .(a_jiggle = a + jig * .Machine$double.xmin)]
#     a_jiggle V1
#  1:        1  6
#  2:        1  4
#  3:        1  5
#  4:        1  6
#  5:        2  7
#  6:        2  8
#  7:        2  9
#  8:        2 10
#  9:        2 11
# 10:        2 12

I seem to recall it being documented somewhere what exactly the internal limits of accuracy are/how to query this for data.table; on my machine, it appears that roughly disturbances less than 1e-17 don't make a difference... not sure the robust/machine-independent way to design this test, however.

@mattdowle mattdowle changed the title Clean-up of TODO comments in tests TODO items from tests.Rraw Feb 22, 2018
@mattdowle mattdowle changed the title TODO items from tests.Rraw TODO items from end of tests.Rraw Feb 22, 2018
@MichaelChirico
Copy link
Member Author

  • use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.

@jangorecki this can't make it to the CRAN test suite, but do we have anywhere we could readily drop in such a test (more like a dashboard, really) in dev?

@MichaelChirico
Copy link
Member Author

test and highlight in docs that negatives are fine and fast in forderv (ref R wish #15644)

?forder already mentions the use of - in both setorder and setorderv. If anything the documentation can be cleaned up -- hardly relevant to be pointing out differences of 1.9.2 vs 1.9.4 (i.e. 4.5-5 years ago)... that set ?setorder could go for a tidying

@MichaelChirico
Copy link
Member Author

check the "j is named list could be inefficient" message from verbose than Chris N showed recently to 15 May

grep "named list" R/* returns nothing; I think this is obsolete

@MichaelChirico
Copy link
Member Author

tests on double in add hoc by

IIUC we're not really supposed to be testing things like numerical accuracy. Limitations on double sorting are spelled out in the documentation IIRC... and anyway we haven't seen any issues raised on this in a long time. So I think this is covered.

@jangorecki
Copy link
Member

jangorecki commented Feb 12, 2019

  • use looped := vs set test in example(":=") or example(setnames) to test overhead in [.data.table is tested to stay low in future.

@jangorecki this can't make it to the CRAN test suite, but do we have anywhere we could readily drop in such a test (more like a dashboard, really) in dev?

There is benchmark.Rraw, or if there is no, there is a plan to have one. This will be proper place for all tests that needs to call system.time(). It should be now easier to add it as summary footer of Rraw script has been moved to test.data.table().

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Feb 13, 2019

FWIW, did the following benchmark of set vs. :=:

library(microbenchmark)
library(data.table)

set.seed(5419681)

BB = 100L
times = CJ(rows = 10^(3:6), cols = c(10, 50, 100, 1000), 
           type = c(':=', 'set'), repetition = 1:BB)

for (ii in seq_len(nrow(times))) {
  gc()
  times[ii, duration := switch(type, ':=' = {
    DT = data.table(ID = seq_len(rows))
    t0 = get_nanotime()
    DT[ , paste0('V', 1:cols) := lapply(integer(cols), function(...) 1L)]
    get_nanotime() - t0
  }, 'set' = {
    DT = data.table(ID = seq_len(rows))
    t0 = get_nanotime()
    for (jj in 1:cols) set(DT, , paste0('V', jj), 1L)
    get_nanotime() - t0
  })]
}

times[ , .(ratio = duration[type == ':=']/duration[type == 'set'],
           avg = mean(duration)), keyby = .(rows, cols, repetition)
       ][ , as.list(c(absolute_seconds = mean(avg)/1e9, summary(ratio))), 
          keyby = .(rows, cols)]
#      rows cols absolute_seconds      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
#  1: 1e+03   10     0.0003131304 3.0605254 3.8471289 4.0469033 4.1023044 4.3067497  6.915743
#  2: 1e+03   50     0.0008448426 1.0162382 1.3855990 1.4398864 1.5730537 1.5222325 11.719836
#  3: 1e+03  100     0.0014121146 0.3493482 1.1065934 1.1422921 1.1502817 1.2021676  1.431746
#  4: 1e+03 1000     0.0182234995 0.2997860 0.4154354 0.4314394 0.4375887 0.4445711  0.769559
#  5: 1e+04   10     0.0006716579 1.5684996 1.8859146 1.9747949 1.9763925 2.0480862  2.625070
#  6: 1e+04   50     0.0024221906 0.7199252 1.0999812 1.1368900 1.1405756 1.1763198  2.045857
#  7: 1e+04  100     0.0049328084 0.4010431 1.0242058 1.0517822 1.0688827 1.0850502  2.420760
#  8: 1e+04 1000     0.0540806265 0.5498318 0.7804711 0.8012923 0.8062797 0.8221343  1.134574
#  9: 1e+05   10     0.0038287573 0.8844346 1.1154718 1.1282488 1.1298522 1.1366939  2.021127
# 10: 1e+05   50     0.0280899392 0.7681287 0.9787483 0.9973578 0.9989867 1.0211156  1.274340
# 11: 1e+05  100     0.0600740518 0.5545713 0.9471064 0.9744283 0.9680503 0.9994554  1.086229
# 12: 1e+05 1000     0.4934547127 0.8932713 0.9409418 0.9509392 0.9609168 0.9675374  1.573699
# 13: 1e+06   10     0.0358802545 0.7430620 0.9720593 0.9927718 0.9938076 1.0076720  1.534889
# 14: 1e+06   50     0.2703836096 0.7398572 0.9533737 0.9785185 0.9942344 1.0259876  1.501244
# 15: 1e+06  100     0.4599221823 0.9430767 0.9883373 1.0063509 1.0110553 1.0279786  1.265928
# 16: 1e+06 1000     3.9212144387 0.8491473 0.9790677 0.9865773 0.9899165 0.9928261  1.220000
cat('\nTotal Time:', times[ , sum(duration)/1e9/60], 'minutes\n')
# Total Time: 17.8525 minutes

@MichaelChirico
Copy link
Member Author

Add to warning about a previous copy that class<-, levels<- can also copy whole vector. Any fun<- form basically.

Not sure where it is the intention to put this. Axing barring that.

@jangorecki
Copy link
Member

why assign rnorm? it costs
good to call gc before each timing start.

@MichaelChirico
Copy link
Member Author

why assign rnorm? it costs

best to just assign TRUE?

will do re:gc

@MichaelChirico
Copy link
Member Author

updated set/:= benchmark, 20x as fast when not using rnorm

@MichaelChirico
Copy link
Member Author

!make sure explicitly that unnamed lists are being executed by dogroups!

honestly no idea what this is about. IINM dogroups only entryway in R is when computing groups and !GForce here.

Pretty trivial to see that there's plenty of coverage of this in tests already (e.g. some places in #3406 where optimization is turned off, among probably hundreds of others).

Alternative interpretation would be that .ok should be FALSE if j is a named list? Though this is obviously not the case (maybe it was at some point?)

Gonna mark this one done

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Feb 16, 2019

tests of freading classes like Date and the verbose messages there.

Marking this complete within this issue since it's covered in a separate PR/issue already

@MichaelChirico MichaelChirico changed the title TODO items from end of tests.Rraw TODO items from tests.Rraw Feb 16, 2019
arunsrinivasan pushed a commit that referenced this issue Feb 16, 2019
* ignore .Rprofile if present

* Closes #3405 -- more careful about preset datatable.optimize

* clean up setting/unsetting of options in testing

* more thorough testing of datatable.optimize levels, part of #2572

* prettyprint.char needs reset
@MichaelChirico
Copy link
Member Author

confirm with Matt the behavior of rbindlist w.r.t. naming of input in edge cases -- rbindlist should look for the first non-empty data.table - New changes (from Arun). Explanation below: Even if data.table is empty, as long as there are column names, they should be considered. Ex: What if all data.tables are empty? What'll be the column name then? If there are no names, then the first non-empty set of names will be allocated. I think this is the way to do it...

@mattdowle any feedback here?

@MichaelChirico
Copy link
Member Author

  • "Ragged" files that should have fill = TRUE generate warnings, not errors; as such, we need another way to trigger a STOP() inside fread.c. options(warn=2) isn't enough.

@mattdowle ditto here... could you split off into a new issue if this is still a concern?

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Feb 17, 2019

# TO DO: TODO: think of and add more tests for rbindlist

This is quite vague and 5 years old... closing as duplicate of the 40+ rbindlist issues outstnding

@MichaelChirico
Copy link
Member Author

allow.cartesian is ignored if jsub[1L] has :=; could still warn if i has duplicates

AFAICT this is subsumed by #3077; closing here

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Feb 17, 2019

Add comment=="#". Ensure only after last field is observed.

Presumably this is just #856; closing here

@MichaelChirico
Copy link
Member Author

Add integer64 support for transpose

This is a full issue; filed as #3408 and closing here

@MichaelChirico
Copy link
Member Author

Add a test when fill=FALSE, but blank.lines.skip=TRUE, when the same effect should happen

Unclear what "same effect" is referring to; I see tests 1584.x that seem to cover this case? @arunsrinivasan this is under you per git blame (admittedly 3+ years ago), any idea?

@MichaelChirico
Copy link
Member Author

Fix and add test for cases like this: a,b,c\n1,2,3\n4,5,6\n7,8,9,6\n1,2,3\n... -- extra column in line 4, but we've only detected 3

This seems to be handled already:

s = "a,b,c
1,2,3
4,5,6
7,8,9,6
1,2,3"

fread(s)
#        a     b     c
#    <int> <int> <int>
# 1:     1     2     3
# 2:     4     5     6

Warning message:
In fread(s) :
Stopped early on line 4. Expected 3 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<7,8,9,6>>

fread(s, fill=TRUE)
#        a     b     c    V4
#    <int> <int> <int> <int>
# 1:     1     2     3    NA
# 2:     4     5     6    NA
# 3:     7     8     9     6
# 4:     1     2     3    NA

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Feb 17, 2019

Add tests for nomatch=NA in non-equi-join suite (tests 1640-ish)... tested, but takes quite some time... so commented for now

@jangorecki this set of tests could be added to the GitLab suite?

Is the TODO covered by the commented tests?

# TODO: add tests for nomatch=NA..
# tested, but takes quite some time.. so commenting for now
# nqjoin_test(x, y, 3L,1643.0)
# nqjoin_test(dt1,dt2,3L,1652.0)

# nqjoin_test(  x,dt2,1L,1644.0) # without NA only in x
# nqjoin_test(  x,dt2,2L,1645.0)
# nqjoin_test(  x,dt2,3L,1646.0)
# nqjoin_test(dt1,  y,1L,1647.0) # without NA only in i
# nqjoin_test(dt1,  y,2L,1648.0)
# nqjoin_test(dt1,  y,3L,1649.0)

@MichaelChirico
Copy link
Member Author

0.0 written as 0, but TODO #2398, probably related to the 2 lines after l==0 missing coverage in writeFloat64

As stated, this is just #2398

@MichaelChirico
Copy link
Member Author

Add strict.numeric (default FALSE) to all.equal.data.table()?

@jangorecki can revisit this and file as a separate issue if still viable? But I guess no because of the difficulty in testing identical for numeric across platforms? Or could include it but discourage it "use at your own risk"?

@MichaelChirico
Copy link
Member Author

Begin to deprecate logicalAsInt argument

This was 1.11.0 (May 2018); can begin the deprecation process with next release?

@MichaelChirico
Copy link
Member Author

#1416 TODO (trailing tabs on most but not at the beginning and a "-" intended to mean missing but taken as text column name)

As stated, this is just #1416

@MichaelChirico
Copy link
Member Author

Begin deprecation of ..x as valid column names

This is also 1.11.0 (May 2018). But see #3321, making things more complicated...

@MichaelChirico
Copy link
Member Author

Same as #678 -- better to handle this with a linter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants