-
Notifications
You must be signed in to change notification settings - Fork 993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Various colClasses enhancements #2545
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2545 +/- ##
==========================================
+ Coverage 97.01% 97.03% +0.02%
==========================================
Files 66 66
Lines 12484 12559 +75
==========================================
+ Hits 12111 12187 +76
+ Misses 373 372 -1
Continue to review full report at Codecov.
|
Hi @HughParsonage. I've invited you to be project member; you'll need to accept the invite before the change will apply. This enables you to create branches in the main repository directly so we can all push to each other's branches. Welcome! |
inst/tests/tests.Rraw
Outdated
test(167, names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)], c("data","plot")) | ||
DF <- data.frame( a=1:5, b=11:50, d=c("A","B","C","D"), f=1:5, grp=1:5 ) | ||
res167 <- names(print(ggplot(DF,aes(b,f))+geom_point()))[c(1,3)] | ||
test(167, names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)], res167) | ||
# The names() is a stronger test that it has actually plotted, but also because test() sees the invisible result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for this change please? The test no longer seems to test anything and repeats the test expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible you hit this unrelated issue with the same test 167 : #2546
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understood the test to mean that using data.table
as opposed to data.frame
does not affect ggplot
s. Note res167
uses data.frame
.
On my computer: with ggplot2 v 2.2.1.9000,
names(print(ggplot(DT,aes(b,f))+geom_point()))[c(1,3)]
# [1] "data" "scales"
not c("data", "plot")
. Indeed, none of the names are "plot"
. Would it be better to use all the names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see now. I didn't spot the difference DF
vs DT
in those two lines. ggplot2
changed its names then and if you upgrade to CRAN version ggplot2 v2.2.1 (released a year ago) it should work. Or, is 2.2.1.9000 a more recent development version of ggplot2. If so, has it changed again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, ggplot2 2.2.1.9000 is latest dev version you have. Good test. Ok they changed again and your solution looks good to me. Maybe just a comment alongside pointing out that DF
is being used to find out what the names are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, 2.2.1.9000
is the current dev version. Looking at the code for ggplot2
, I can't tell what the expected behaviour ought to be:
ggplot.data.frame
returns the list that I see: https://github.com/tidyverse/ggplot2/blob/7b5c185eff0dabef50e838701ab805ae09a1170f/R/plot.r#L92
ggplot_build.ggplot
returns the list that the test originally expected. https://github.com/tidyverse/ggplot2/blob/152910a698121ac4fa771d110507e607d45cfc8f/R/plot-build.r#L99
Note that the test appears to proxy the actual appearance of the plot, and the plot does indeed occur on my machine.
inst/tests/tests.Rraw
Outdated
test(1445, fread("doublequote_newline.csv")[7:10], data.table(A=c(1L,1L,2L,1L), B=c("a","embedded \"\"field\"\"\nwith some embedded new\nlines as well","not this one","a"))) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is passing successfully on AppVeyor Windows tests and on CRAN Windows tests. Something else must be going wrong on your Windows machine, to fix separately? Important to keep the test running on Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you prefer I raise this an issue and revert this change, or relax the test to permit e.g. embedded ""field""\r\nwith some embedded new\r\nlines as well
as well as embedded ""field""\nwith some embedded new\nlines as well
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes : new issue please and revert the change. Why does it pass on AppVeyor and CRAN WIndows but it doesn't pass on your Windows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, I think it's because when I clone the repo, \n
is replaced with \r\n
by git. My bad.
fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/inst/tests/doublequote_newline.csv")[["B"]][8]
#> [1] "embedded \"\"field\"\"\nwith some embedded new\nlines as well"
fread("inst/tests/doublequote_newline.csv")[["B"]][8]
#> [1] "embedded \"\"field\"\"\r\nwith some embedded new\r\nlines as well"
Test failed locally because the test file changed `\n` to `\r\n` when cloned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Thinking out loud...)
The end result is great but going too far I feel. check_colClasses_validity()
has quite a lot of lines to maintain relative to what it does. When I filed the issue, I had in mind just a quick 2 or 3 extra lines in fread! In freadR.c on line 32 are the R names that are matched to colClasses
supplied by user. When there's no match, the idea was at some point to read data as character and then call the class method afterwards at R level, like read.csv does. If there is no class method (e.g. no as.foo
exists at R level, not even defined by user), that's when it should fail, for that reason (standard R error that as.foo()
does not exist). There's 'CLASS' at C level in freadR.c with that in mind but I don't recall how far we went implementing that. Listing the valid and invalid types explicitly in check_colClasses_validity
isn't in the spirit of S3 dispatch. Maybe all we need is the class method dispatched to call, say, as.Date
in the meantime until we implement directly at C level. And when we do implement directly at C level, there will be no change for the user other than experiencing a speedup.
Do you want me to attempt a simpler version using the same approach (most of the complexity is just to construct the phrasing of the warning message), like if (!all(colClasses %chin% c(NA, "logical", "integer", "numeric", "double", "character", "factor"))) {
warning("colClasses contains unsupported values.")
} (with corresponding tests modified). or do you think listing all the valid classes in |
My current thought is that the dispatch to the class method should be implemented rather than this stop-gap. I'm now thinking that would be quicker. I didn't realize that before. |
I thought it would be a case of just calling something like
So that's what
It seems you have to use
Maybe |
I was thinking something like existsMethod("as.numeric")
#> [1] TRUE
existsMethod("as.Date")
#> [1] FALSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome. minor comments:
vapply
takeslogical(1L)
- I'm a big fan of
sprintf
instead ofpaste
for error messaging
Apologies for the enormous diff, but there were a few tricky paths, as evinced by the number of additional tests
I need to update the |
(Also fix spurious movement of })
@HughParsonage Small question: what do the ❓marks next to issues #1426 and #1656 in this PR's description mean? Does it mean those issues remain unresolved? Is it because they are too hard to resolve, or you don't know how to resolve, or out-of-scope for this PR, or you don't know whether they should be implemented at all, or ..? |
Closest meaning would be 'partially resolved': if select is used, colClasses need only correspond to the columns in select (with caveat)Basically resolved: there are combinations of Issue #1656 fread/fwrite base data types directly for efficiencyNot fully resolved: base data types can now be implemented within In terms of why the latter isn't resolved, it's a combination of being too difficult for me and out-of-scope: this PR is meant to implement arbitrary |
…from C level to simplify R level. Not yet passing all tests; wip.
}, | ||
warning = fun <- function(e) { | ||
etype = if (inherits(e,"error")) "error" else "warning" | ||
warning(sprintf("Column '%s' was set by colClasses to be '%s' but fread encountered the following %s:\n\t%s\nso the column has been left as type '%s'", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Construct without sprintf
(I don't think I have write access to Hugh's branch...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind either way, but it's ok with sprintf
isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the cue from WRE Section 1.7:
Try not to split up messages into small pieces. In C error messages use a single format string containing all English words in the messages.
In R error messages do not construct a message withpaste
(such messages will not be translated) but via multiple arguments tostop
orwarning
, or viagettextf
.
I already purged other instances of this construction elsewhere in the code, I can't find the PR right now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it's merged now, follow up PRs are good. I double-checked Hugh is a project member so he can create branches too. Good catch: I'd forgotten about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, working on it 👍
"complex" = as.complex(v), | ||
"raw" = as_raw(v), # Internal implementation | ||
"Date" = as.Date(v), | ||
"POSIXct" = as.POSIXct(v), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way to build customized timezones into the API? Otherwise we should use tz = 'UTC'
by default? Since the same string on different machines will parse differently otherwise...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea: could add tz=
to fread I guess. But shouldn't default be whatever as.POSIXct()
does by default (which is local time iirc) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm basically arguing against R default behavior since I've always found it a bit murky to use inferred timezones --> fread
behaves differently on different machines.
Adding tz
to fread
feels a bit strange, maybe accept tz
as a "class" within colClasses
a la
colClasses = list(POSIXct = 'time', tz = 'UTC')
Or (requires a bit more parsing within our code but I think feels pretty natural):
colClasses = list(POSIXct[UTC] = 'time')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But eventually colClasses
won't be needed to get a POSIXct
(it would be built in as native type). Hence tz=
being argument to fread
. That way, tz=
would apply to all datetime columns in the file without needing to specify them in colClasses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I guess tz
argument could have an API like colClasses
to allow flexibly specifying multiple time zones:
tz = list(UTC = c('start', 'end'), 'Australia/South' = c('start_local', 'end_local'))
and
tz = c(start = 'UTC', end = 'UTC', start_local = 'Australia/South', end_local = 'Australia/South')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it make sense to use tz="UTC"
and just document that, is there any follow up issue to not forget about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my only hesitation is I'm not 100% sure we can do this before implementing the tz argument... e.g. some times only exist in certain time zones, so conversion may fail if we assume UTC but could read on user's machine otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it may fail then tryCatch
, providing more complex structure to colClasses
is good idea but should be added with care (ideally another PR), ideally aligning to API we will use in csvy specification. Using tz="UTC"
seems to be simplest way to ensure some consistency.
\itemize{ | ||
\item{If coercion results in an error or introduces \code{NA}s, the attempt is aborted for that column with warning and the column's type is left unchanged (probably \code{character}).} | ||
\item{Named list of vectors of column names or numbers are supported where the list names are the class names. The \code{list} form makes it easier to set a batch of columns to be a particular class; see examples. When column numbers are used in the `list` form, they refer to the column number in the file, not the column number after \code{select} or \code{drop} has been applied.} | ||
\item{Columns are not demoted to a lower type if this would risk loss of information. You have to coerce such columns afterwards yourself, if you really require data loss.} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Points 1 & 3 have a lot of overlap, maybe just combine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 is more about classes like POSIXct which take as input the character string.
3 is more about numeric data like 3.14 being specified as integer.
Maybe more examples and/or better wording?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. At worst a tiny bit repetitive but maybe that's a good thing as it's pretty crucial.
if (!is.null(names(colClasses))) { # names are column names; convert to list approach | ||
if (!length(colClasses)) { | ||
colClasses=NULL; | ||
} else if (identical(colClasses, "NULL")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the use case for this? It's not mentioned in the manual...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure. Just saw it done in Hugh's branch and liked it. Maybe just to catch beginner errors? There is a warning that it is taken to mean NULL.
I can see that it breaks the case-1-not-different rule. So a 1-column file should have the column dropped (and null data.table returned) when colClasses="NULL"? For consistency. On the other hand beginners might well try "NULL".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe user has a loop through types and all columns should be coerced to that type. for (class in c("integer","double","NULL")) fread(...., colClasses=class)
. But I can't imagine a good reason to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems innocuous enough, just wondering if we should document it if there's a specific use case in mind... @HughParsonage any thoughts?
Closes #491 fread: colClasses does not covert to non-builtin types
Closes #2610 colClasses = POSIXct
Closes #1634 fread doesn't check colClasses to be valid type
Closes #2025 allow
stringsAsFactors
parameter to be a fraction between 0 and 1(no issue number) When
colClasses
includes multiple list elements namedfactor
(e.g.list(factor = 1, factor = 2)
), all elements are respected, not just the first.#1445 has been fixed when
select
is non-monotonic.Not tackled here
#1656 fread/fwrite *base data types* directly for efficiency
#1426 if
select
is used,colClasses
need only correspond to the columns inselect
; done in #3547