optional arg that returns a list of parse symbols fread() used to intuit raw file #2437
related to rant: #1701 |
I think the https://github.com/tidyverse/readr/blob/master/man/spec.Rd the result of
What's still missing is returning an object that contains the parsing pattern used. I think if fread had something similar, that would be even better than readr, for the same reasons the readr developers sometimes recommend fread: https://github.com/tidyverse/readr#datatable-and-fread |
Does running |
good suggestion. it looks like
returns most of the information we want into the console
one thing missing is the 'quotes' for data values, say 'sdsf' in the example above. with that, it looks like structuring the printout into, say, an R list would be the next step |
what do you mean by the quotes?
verbose IIRC also reports to you the "quoting rule" used, i.e. internally
fread determines which of several (I think 7) valid rules for quoting text
is in use.
On Feb 23, 2018 1:39 AM, "statsccpr" <[email protected]> wrote:
good suggestion. it looks like
fread(...,verbose = TRUE)
returns most of the information we want into the console
fread("A,B\n1,2\n3,4\nNA,na\n,\n'sdsf',",verbose=TRUE)
one thing missing is the 'quotes' for data values, say 'sdsf' in the
example above
with that, it looks like structuring the printout into say an R list would
be the next step
|
by quotes, i mean the quoting rule. I do not see the quoting rule in the below printout. if fread internally determines the quoting rule, might as well print it out.
|
The latest version of
|
I guess it can't hurt to put a quick blurb describing the quote rule
numbers, rather than make people chase down the comment in C
…On Feb 23, 2018 4:45 AM, "Pasha Stetsenko" ***@***.***> wrote:
The latest version of fread produces much more verbose output, including
the quoting rule:
> data.table::fread("A,B\n1,2\n3,4\nNA,na\n,\n'sdsf',",verbose=TRUE)
Input contains a \n or is "". Taking this to be text input (not a filename)
[01] Check arguments
Using 8 threads (omp_get_max_threads()=8, nth=8)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
`input` argument is provided rather than a file name, interpreting as raw text to read
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<A,B>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 6 lines of 2 fields using quote rule 0
Detected 2 columns on line 1. This line is either column names or first data row. Line starts as: <<A,B>>
Quote rule picked = 0
fill=false and the most number of columns found is 2
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 2 because (27 bytes from row 1 to eof) / (2 * 27 jump0size) == 0
Type codes (jump 000) : AA Quote rule 0
'header' determined to be true because all columns are type string and a better guess is not possible
All rows were sampled since file is small so we know nrow=5 exactly
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : AA
[10] Allocate memory for the datatable
Allocating 2 column slots (2 - 0 dropped) with 5 rows
[11] Read the data
jumps=[0..1), chunk_size=1048576, total_size=23
Read 5 rows x 2 columns from 27 bytes file in 00:00.001 wall clock time
[12] Finalizing the datatable
Type counts:
2 : string 'A'
=============================
0.000s ( 4%) Memory map 0.000GB file
0.000s ( 44%) sep=',' ncol=2 and header detection
0.000s ( 1%) Column type detection using 5 sample rows
0.000s ( 44%) Allocation of 5 rows x 2 cols (0.000GB) of which 5 (100%) rows used
0.000s ( 7%) Reading 1 chunks (0 swept) of 1.000MB (-2147483648 rows) using 1 threads
+ 0.000s ( 0%) Parse to row-major thread buffers (grown 0 times)
+ 0.000s ( 0%) Transpose
+ 0.000s ( 7%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.001s Total
A B
1: 1 2
2: 3 4
3: NA na
4:
5: 'sdsf'
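
As a rough sketch of the "structure the printout into an R list" idea, the verbose lines above can be picked apart with regular expressions. The function name `parse_fread_log` and the field names are hypothetical, and the patterns only match the wording of the printout quoted in this thread; the log format is not a stable interface and may differ between data.table versions.

```r
# Hypothetical helper: pull a few parsing decisions out of fread's verbose log.
# Assumes the log wording shown in this thread; other versions may word it differently.
parse_fread_log <- function(log_lines) {
  # Return the first capture group of `pattern` from the first matching line, or NA.
  first_match <- function(pattern) {
    m <- regmatches(log_lines, regexec(pattern, log_lines))
    m <- Filter(length, m)  # drop lines that did not match
    if (length(m) == 0L) NA_character_ else m[[1L]][2L]
  }
  list(
    sep        = first_match("sep='(.*)' with [0-9]+ lines"),
    quote_rule = as.integer(first_match("Quote rule picked = ([0-9]+)")),
    ncol       = as.integer(first_match("Detected ([0-9]+) columns on line")),
    header     = first_match("'header' determined to be (true|false)") == "true",
    nrow       = as.integer(first_match("nrow=([0-9]+)"))
  )
}

# A few lines copied from the verbose printout in this thread.
log_lines <- c(
  "sep=',' with 6 lines of 2 fields using quote rule 0",
  "Detected 2 columns on line 1. This line is either column names or first data row. Line starts as: <<A,B>>",
  "Quote rule picked = 0",
  "'header' determined to be true because all columns are type string and a better guess is not possible",
  "All rows were sampled since file is small so we know nrow=5 exactly"
)

spec <- parse_fread_log(log_lines)
str(spec)
```

Scraping the log this way is fragile, which is part of the argument for fread returning these values directly.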
|
I think it would be a terrific feature addition if some kind of argument like fread(return_parse=TRUE) was implemented.
I know the developers have put in a lot of work to 'intuit / guess' these types of delimiter characters in the fread() body, such as
#2436
#2248
#2431
All of these internal parsing decisions would be a benefit if they were optionally returned to the end user outside of fread()
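
Until something like the proposed `return_parse` argument exists (it is only the suggestion made in this issue, not an argument fread is known to have), a stopgap might be a wrapper that returns the data together with the captured verbose log. This is a minimal sketch: `fread_with_log` is a hypothetical name, and it assumes that fread writes its verbose messages to standard output, where `capture.output()` can collect them.

```r
# Hypothetical wrapper: return both the parsed table and fread's verbose log.
# Assumes fread's verbose messages go to stdout so capture.output() sees them.
fread_with_log <- function(input) {
  if (!requireNamespace("data.table", quietly = TRUE)) {
    stop("this sketch needs the data.table package")
  }
  dt <- NULL
  # The assignment is evaluated in this function's frame, so `dt` is filled in
  # while the printed log lines are collected as a character vector.
  log_lines <- utils::capture.output(
    dt <- data.table::fread(input, verbose = TRUE)
  )
  list(data = dt, log = log_lines)
}

# Only run the demo when data.table is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  res <- fread_with_log("A,B\n1,2\n3,4\n")
  print(res$data)
  # Show the log lines that mention the quote rule, if any.
  print(grep("[Qq]uote rule", res$log, value = TRUE))
}
```

The caller still has to parse `res$log`, so this only moves the problem; a structured return value from fread itself would avoid depending on log wording.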
This request is motivated by the rant blogged about here
http://www.johnmyleswhite.com/notebook/2016/09/23/no-juice-for-you-csv-format-it-just-makes-you-more-awful/
Basically, i think a 'csv schema' would be useful for the type of workflow
I mocked up a prototype of a function that writes this basic schema here
https://github.com/mikejacktzen/datzen/blob/master/man_md/scheme.md
and realized that fread() internally tries to detect these types of raw parser characters
edit 10/24 for more use case context
I see this being useful if no pre-existing schema / preamble / yaml for csvy exists.
So the user wishes to create one painlessly using fread()
I do not think many people exhaustively list the parser spec first.
If, like me, people ham-handedly read the raw file into R with fread, we rely on fread to guess the spec (the hard work).
Once the fread-generated spec is there, the user can then write out the schema to whatever format they want: json, yaml, or the preamble for the csvy.
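
To illustrate that last step, here is a minimal sketch of writing a spec out as YAML-style lines using base R only. The `spec` list is hand-built for illustration from the example discussed in this thread (fread does not return such an object), `spec_to_yaml` is a hypothetical name, and the formatter does no escaping, so a separator or column name containing a double quote or backslash would need extra handling.

```r
# Hypothetical helper: turn a flat, named spec list into YAML-style lines.
# No escaping is done; values containing '"' or '\' would need extra care.
spec_to_yaml <- function(spec) {
  stopifnot(is.list(spec), !is.null(names(spec)))
  fmt <- function(x) {
    if (is.character(x)) {
      paste0('"', x, '"')         # quote strings
    } else if (is.logical(x)) {
      tolower(as.character(x))    # TRUE -> true
    } else {
      as.character(x)
    }
  }
  vapply(names(spec), function(key) {
    value <- spec[[key]]
    if (length(value) == 1L) {
      paste0(key, ": ", fmt(value))
    } else {
      # vectors become YAML flow sequences
      paste0(key, ": [", paste(fmt(value), collapse = ", "), "]")
    }
  }, character(1), USE.NAMES = FALSE)
}

# Hand-built, illustrative spec; the field names are assumptions.
spec <- list(
  sep        = ",",
  quote_rule = 0L,
  header     = TRUE,
  col_names  = c("A", "B"),
  col_types  = c("character", "character")
)

yaml_lines <- spec_to_yaml(spec)
writeLines(yaml_lines)
```

The same list could be handed to a JSON or YAML package instead; the point is only that once the spec is an ordinary R list, the output format is the user's choice.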