optional arg that returns a list of parse symbols fread() used to intuit raw file #2437

Open
statsccpr opened this issue Oct 23, 2017 · 8 comments

Comments

@statsccpr

statsccpr commented Oct 23, 2017

I think it would be a terrific feature addition if some kind of argument like fread(return_parse=TRUE) were implemented.

I know the developers have put a lot of work into 'intuiting / guessing' these types of delimiter characters in the fread() body, such as

#2436
#2248
#2431

All of these internal parsing decisions would be a benefit if they were optionally returned to the end user outside of fread().

This request is motivated by the rant blogged about here
http://www.johnmyleswhite.com/notebook/2016/09/23/no-juice-for-you-csv-format-it-just-makes-you-more-awful/

Basically, I think a 'csv schema' would be useful for this type of workflow:

raw csv -> fread -> user transforms data -> schema -> write external transformed data

I mocked up a prototype of a function that writes this basic schema here

https://github.com/mikejacktzen/datzen/blob/master/man_md/scheme.md

and realized that fread() internally tries to detect these types of raw parser characters

edit 10/24 for more use case context

I see this being useful if no pre-existing schema / preamble / yaml for csvy exists, so the user wishes to create one painlessly using fread().

I do not think many people exhaustively list the parser spec first.
If, like me, people ham-handedly read the raw file into R with fread, they rely on fread to guess the spec (the hard work).
Once the fread-generated spec is there, the user can write out the schema to whatever format they want: JSON, YAML, or the preamble for csvy.
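
To make the workflow above concrete, here is a minimal sketch in base R of what such a schema object might look like. The helper name make_schema and its sep and quote arguments are hypothetical and not part of data.table. The caller has to supply them by hand precisely because fread() does not currently hand back the values it detected. A plain data.frame stands in for the fread() result here; a data.table is also a data.frame, so the same helper should work on one.

```r
# Hypothetical helper, not part of data.table: build a minimal schema list.
# `sep` and `quote` are passed in by the caller, since fread() does not
# return the separator or quoting rule it detected.
make_schema <- function(x, sep = ",", quote = "\"") {
  stopifnot(is.data.frame(x))
  list(
    sep     = sep,
    quote   = quote,
    nrow    = nrow(x),
    columns = data.frame(
      name  = names(x),
      # First class of each column, e.g. "integer" or "character"
      class = unname(vapply(x, function(col) class(col)[1L], character(1))),
      stringsAsFactors = FALSE
    )
  )
}

# Stand-in for a table read from a raw csv and then transformed by the user
dat <- data.frame(A = c(1L, 3L, NA), B = c("2", "4", "na"),
                  stringsAsFactors = FALSE)

schema <- make_schema(dat)
print(schema$columns)
```

The resulting list could then be serialized to JSON or YAML with whichever package the user prefers.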

@mattdowle mattdowle changed the title feature request: optional arg that returns a list of parse symbols fread() used to intuit raw file optional arg that returns a list of parse symbols fread() used to intuit raw file Oct 24, 2017
@MichaelChirico
Member

related to rant: #1701

@statsccpr
Author

statsccpr commented Oct 27, 2017

I think the readr::spec() function is a perfect way to go about it

https://github.com/tidyverse/readr/blob/master/man/spec.Rd

The result of dat_in = readr::read_csv('raw_file') carries hidden info about the column types used.
spec_in = readr::spec(dat_in) then exposes those types. Alternatively, use readr::spec_csv('raw_file').

attributes(dat_in) shows that the "spec" attribute, attr(dat_in, "spec"), is the key ingredient.

What's still missing is returning an object that contains the parsing pattern used.

I think if fread had something similar, it would be even better than readr, for the same reasons the readr developers sometimes recommend fread:

https://github.com/tidyverse/readr#datatable-and-fread

@HughParsonage
Member

Does running fread with verbose = TRUE provide the information you need (if possibly not in a structure you'd like)?

@statsccpr
Author

Good suggestion. It looks like

fread(...,verbose = TRUE)

prints most of the information we want to the console

fread("A,B\n1,2\n3,4\nNA,na\n,\n'sdsf',",verbose=TRUE)

One thing missing is the 'quotes' for data values, say 'sdsf' in the example above.

With that, structuring the printout into, say, an R list looks like the next step.

@MichaelChirico
Member

MichaelChirico commented Feb 22, 2018 via email

@statsccpr
Author

By quotes, I mean the quoting rule. I do not see the quoting rule in the printout below. If fread internally determines the quoting rule, it might as well print it out.

> fread("A,B\n1,2\n3,4\nNA,na\n,\n'sdsf',",verbose=TRUE)
Input contains a \n (or is ""). Taking this to be text input (not a filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 2 columns. Longest stretch was from line 1 to line 6
Starting data input on line 1 (either column names or first row of data). First 10 characters: A,B
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 5 (including 0 at the end)
Count of sep: 5
nrow = MIN( nsep [5] / (ncol [2] -1), neol [5] - endblanks [0] ) = 5
Type codes (point  0): 44
Type codes: 44 (after applying colClasses and integer64)
Type codes: 44 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 dropped)
Read 5 rows. Exactly what was estimated and allocated up front
   0.001s ( 17%) Memory map (rerun may be quicker)
   0.001s ( 17%) sep and header detection
   0.001s ( 17%) Count rows (wc -l)
   0.002s ( 33%) Column type detection (100 rows at 10 points)
   0.001s ( 17%) Allocation of 5x2 result (xMB) in RAM
   0.000s (  0%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.006s        Total
        A  B
1:      1  2
2:      3  4
3:     NA na
4:          
5: 'sdsf'   

@st-pasha
Contributor

The latest version of fread produces much more verbose output, including the quoting rule:

> data.table::fread("A,B\n1,2\n3,4\nNA,na\n,\n'sdsf',",verbose=TRUE)
Input contains a \n or is "". Taking this to be text input (not a filename)
[01] Check arguments
  Using 8 threads (omp_get_max_threads()=8, nth=8)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  `input` argument is provided rather than a file name, interpreting as raw text to read
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<A,B>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 6 lines of 2 fields using quote rule 0
  Detected 2 columns on line 1. This line is either column names or first data row. Line starts as: <<A,B>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 2
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 2 because (27 bytes from row 1 to eof) / (2 * 27 jump0size) == 0
  Type codes (jump 000)    : AA  Quote rule 0
  'header' determined to be true because all columns are type string and a better guess is not possible
  All rows were sampled since file is small so we know nrow=5 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : AA
[10] Allocate memory for the datatable
  Allocating 2 column slots (2 - 0 dropped) with 5 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=23
Read 5 rows x 2 columns from 27 bytes file in 00:00.001 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : string    'A'
=============================
   0.000s (  4%) Memory map 0.000GB file
   0.000s ( 44%) sep=',' ncol=2 and header detection
   0.000s (  1%) Column type detection using 5 sample rows
   0.000s ( 44%) Allocation of 5 rows x 2 cols (0.000GB) of which 5 (100%) rows used
   0.000s (  7%) Reading 1 chunks (0 swept) of 1.000MB (-2147483648 rows) using 1 threads
   +    0.000s (  0%) Parse to row-major thread buffers (grown 0 times)
   +    0.000s (  0%) Transpose
   +    0.000s (  7%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.001s        Total
        A  B
1:      1  2
2:      3  4
3:     NA na
4:          
5: 'sdsf'   
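
As a rough illustration of the "structure the printout into an R list" idea raised earlier in the thread, the sketch below pulls the separator, column count, quote rule, and header decision out of a few lines copied from the verbose output above. The helpers extract_first and parse_fread_verbose are hypothetical, not data.table functions. The regular expressions assume the exact log wording shown in this thread, which is not a stable interface and may differ between versions.

```r
# Lines copied verbatim from the verbose output quoted in this thread
verbose_lines <- c(
  "[06] Detect separator, quoting rule, and ncolumns",
  "  Detecting sep ...",
  "  sep=','  with 6 lines of 2 fields using quote rule 0",
  "  Quote rule picked = 0",
  "  'header' determined to be true because all columns are type string and a better guess is not possible"
)

# Hypothetical helper: first capture group of `pattern` from the first
# matching line, or NA if no line matches.
extract_first <- function(lines, pattern) {
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- Filter(function(m) length(m) > 1L, hits)
  if (length(hits) == 0L) NA_character_ else hits[[1L]][2L]
}

# Hypothetical helper: collect the detected parse settings into a list.
parse_fread_verbose <- function(lines) {
  list(
    sep        = extract_first(lines, "sep='(.*)'  with"),
    ncol       = as.integer(extract_first(lines, "lines of ([0-9]+) fields")),
    quote_rule = as.integer(extract_first(lines, "Quote rule picked = ([0-9]+)")),
    header     = extract_first(lines, "'header' determined to be (true|false)") == "true"
  )
}

spec <- parse_fread_verbose(verbose_lines)
str(spec)
```

Scraping log text like this is fragile, which is the case for having fread() return these values directly.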

@MichaelChirico
Member

MichaelChirico commented Feb 23, 2018 via email
