optional arg that returns a list of parse symbols fread() used to intuit raw file #2437

statsccpr · 2017-10-23T21:20:49Z

I think it would be a terrific feature addition if some kind of argument like fread(return_parse=TRUE) was implemented.

I know the developers have put a lot of work into making fread() 'intuit / guess' these types of delimiter characters, such as

#2436
#2248
#2431

All of these internal parsing decisions would be beneficial if they were optionally returned to the end user outside of fread().

This request is motivated by the rant blogged about here
http://www.johnmyleswhite.com/notebook/2016/09/23/no-juice-for-you-csv-format-it-just-makes-you-more-awful/

Basically, I think a 'csv schema' would be useful for this type of workflow:

raw csv -> fread -> user transforms data -> schema -> write external transformed data

I mocked up a prototype of a function that writes this basic schema here

https://github.com/mikejacktzen/datzen/blob/master/man_md/scheme.md

and realized that fread() internally tries to detect these types of raw parser characters

edit 10/24 for more use case context

I see this being useful if no pre-existing schema / preamble / yaml for csvy exists.
So the user wishes to create one painlessly using fread()

I do not think many people exhaustively list the parser spec first.
If, like me, people ham-handedly read the raw file into R with fread, we rely on fread to guess the spec (the hard work).
Once the fread-generated spec is there, the user can then write out the schema to whatever format they want: json, yaml, or the preamble for the csvy.
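
To make the workflow above concrete, here is a minimal sketch of what can already be written out today. The helper names `fread_schema` and `schema_to_yaml` are hypothetical (they are not part of data.table), and the separator has to be passed in by hand because fread() does not return it, which is exactly the gap this request is about.

```r
library(data.table)

# Hypothetical helper (not part of data.table): collect what fread() exposes
# on its return value, i.e. the column names and the classes it settled on.
fread_schema <- function(dt, sep = NA_character_) {
  list(
    sep    = sep,  # supplied by the caller; fread() does not return it
    fields = lapply(names(dt), function(nm) {
      list(name = nm, type = class(dt[[nm]])[1L])
    })
  )
}

# Hypothetical helper: render the schema as YAML-like lines, e.g. for a
# csvy-style preamble. Base R only, so no yaml/json package is assumed.
schema_to_yaml <- function(schema) {
  field_lines <- unlist(lapply(schema$fields, function(f) {
    c(sprintf("  - name: %s", f$name), sprintf("    type: %s", f$type))
  }))
  c(sprintf("sep: '%s'", schema$sep), "fields:", field_lines)
}

dt <- fread("id,score,label\n1,2.5,a\n2,3.5,b\n")
schema <- fread_schema(dt, sep = ",")
writeLines(schema_to_yaml(schema))
```

If fread() returned its parsing decisions, the `sep` argument (and similar entries for the quoting rule or header detection) could be filled in automatically instead of by hand.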

MichaelChirico · 2017-10-24T02:08:35Z

related to rant: #1701

statsccpr · 2017-10-27T17:30:13Z

I think the readr::spec() function is a perfect way to go about it

https://github.com/tidyverse/readr/blob/master/man/spec.Rd

the result of dat_in = readr::read_csv('raw_file') contains hidden info on the column types used.
then spec_in = readr::spec(dat_in) exposes the types. Or alternatively, readr::spec_csv('raw_file')

attributes(dat_in) shows that the 'spec' attribute, i.e. attr(dat_in, 'spec'), is the key ingredient
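
A small sketch of the readr calls mentioned above, assuming readr is installed. The temporary file and its contents are made up for illustration.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("A,B", "1,x", "3,y"), tmp)

# read_csv() prints a column-specification message; silence it here
dat_in <- suppressMessages(read_csv(tmp))

# spec() exposes the column types that read_csv() settled on
spec_in <- spec(dat_in)
print(spec_in)

# The same information travels with the data as its "spec" attribute
print(class(attr(dat_in, "spec")))

# spec_csv() guesses the specification without keeping the data
spec_only <- suppressMessages(spec_csv(tmp))
print(spec_only)
```

Note that this covers the column types only; it does not show the delimiter or quoting decisions, which is the missing piece described next.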

What's still missing is returning an object that contains the parsing pattern used

I think if fread had something similar, that would be even better than readr, for the same reasons the readr guys sometimes recommend fread

https://github.com/tidyverse/readr#datatable-and-fread

HughParsonage · 2018-02-19T12:48:40Z

Does running fread with verbose = TRUE provide the information you need (if possibly not in a structure you'd like)?

statsccpr · 2018-02-22T17:39:10Z

good suggestion. it looks like

fread(...,verbose = TRUE)

prints most of the information we want to the console

fread("A,B\n1,2\n3,4\nNA,na\n,\n'sdsf',",verbose=TRUE)

one thing missing is the 'quotes' for data values, say 'sdsf' in the example above

with that, it looks like structuring the printout into say an R list would be the next step

MichaelChirico · 2018-02-22T17:46:33Z

what do you mean by the quotes? verbose IIRC also reports to you the "quoting rule" used, i.e. internally fread determines which of several (I think 7) valid rules for quoting text is in.

statsccpr · 2018-02-22T17:49:40Z

by quotes, I mean the quoting rule. I do not see the quoting rule in the printout below. If fread internally determines the quoting rule, it might as well print it out.

> fread("A,B\n1,2\n3,4\nNA,na\n,\n'sdsf',",verbose=TRUE)
Input contains a \n (or is ""). Taking this to be text input (not a filename)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 2 columns. Longest stretch was from line 1 to line 6
Starting data input on line 1 (either column names or first row of data). First 10 characters: A,B
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 5 (including 0 at the end)
Count of sep: 5
nrow = MIN( nsep [5] / (ncol [2] -1), neol [5] - endblanks [0] ) = 5
Type codes (point  0): 44
Type codes: 44 (after applying colClasses and integer64)
Type codes: 44 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 dropped)
Read 5 rows. Exactly what was estimated and allocated up front
   0.001s ( 17%) Memory map (rerun may be quicker)
   0.001s ( 17%) sep and header detection
   0.001s ( 17%) Count rows (wc -l)
   0.002s ( 33%) Column type detection (100 rows at 10 points)
   0.001s ( 17%) Allocation of 5x2 result (xMB) in RAM
   0.000s (  0%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.006s        Total
        A  B
1:      1  2
2:      3  4
3:     NA na
4:          
5: 'sdsf'

st-pasha · 2018-02-22T20:45:30Z

The latest version of fread produces much more verbose output, including the quoting rule:

> data.table::fread("A,B\n1,2\n3,4\nNA,na\n,\n'sdsf',",verbose=TRUE)
Input contains a \n or is "". Taking this to be text input (not a filename)
[01] Check arguments
  Using 8 threads (omp_get_max_threads()=8, nth=8)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  `input` argument is provided rather than a file name, interpreting as raw text to read
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<A,B>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep ...
  sep=','  with 6 lines of 2 fields using quote rule 0
  Detected 2 columns on line 1. This line is either column names or first data row. Line starts as: <<A,B>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 2
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 2 because (27 bytes from row 1 to eof) / (2 * 27 jump0size) == 0
  Type codes (jump 000)    : AA  Quote rule 0
  'header' determined to be true because all columns are type string and a better guess is not possible
  All rows were sampled since file is small so we know nrow=5 exactly
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : AA
[10] Allocate memory for the datatable
  Allocating 2 column slots (2 - 0 dropped) with 5 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=23
Read 5 rows x 2 columns from 27 bytes file in 00:00.001 wall clock time
[12] Finalizing the datatable
  Type counts:
         2 : string    'A'
=============================
   0.000s (  4%) Memory map 0.000GB file
   0.000s ( 44%) sep=',' ncol=2 and header detection
   0.000s (  1%) Column type detection using 5 sample rows
   0.000s ( 44%) Allocation of 5 rows x 2 cols (0.000GB) of which 5 (100%) rows used
   0.000s (  7%) Reading 1 chunks (0 swept) of 1.000MB (-2147483648 rows) using 1 threads
   +    0.000s (  0%) Parse to row-major thread buffers (grown 0 times)
   +    0.000s (  0%) Transpose
   +    0.000s (  7%) Waiting
   0.000s (  0%) Rereading 0 columns due to out-of-sample type exceptions
   0.001s        Total
        A  B
1:      1  2
2:      3  4
3:     NA na
4:          
5: 'sdsf'
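
Following up on the earlier idea of structuring the printout into an R list, here is a rough sketch that works on captured verbose lines. The helper `parse_fread_verbose` is hypothetical, and its regular expressions assume the message wording shown in the output above; the wording is not a stable interface, so entries come back as NA whenever a message is missing or phrased differently.

```r
library(data.table)

# Hypothetical helper: pull a few parser decisions out of verbose lines.
# Returns NA for any entry whose message is not found.
parse_fread_verbose <- function(lines) {
  first_match <- function(pattern) {
    hit <- regmatches(lines, regexec(pattern, lines))
    hit <- hit[lengths(hit) > 1L]          # keep lines where the pattern matched
    if (length(hit)) hit[[1L]][2L] else NA_character_
  }
  list(
    sep        = first_match("sep='([^']*)'"),
    quote_rule = as.integer(first_match("Quote rule picked = ([0-9]+)")),
    ncol       = as.integer(first_match("Detected ([0-9]+) columns"))
  )
}

# Lines copied from the verbose output above
sample_lines <- c(
  "  Detecting sep ...",
  "  sep=','  with 6 lines of 2 fields using quote rule 0",
  "  Detected 2 columns on line 1. This line is either column names or first data row. Line starts as: <<A,B>>",
  "  Quote rule picked = 0"
)
str(parse_fread_verbose(sample_lines))

# Live usage, assuming the verbose messages go to standard output so that
# capture.output() can collect them
log_lines <- capture.output(
  dt <- fread("A,B\n1,2\n3,4\nNA,na\n,\n'sdsf',", verbose = TRUE)
)
str(parse_fread_verbose(log_lines))
```

Scraping the log is fragile by design; an argument that returns these values directly, as requested in this issue, would avoid depending on message text.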

MichaelChirico · 2018-02-23T00:35:50Z

I guess it can't hurt to put a quick blurb describing the quote rule numbers, rather than make people chase down the comment in C

mattdowle added the feature request label Oct 24, 2017

mattdowle changed the title ~~feature request: optional arg that returns a list of parse symbols fread() used to intuit raw file~~ optional arg that returns a list of parse symbols fread() used to intuit raw file Oct 24, 2017

statsccpr mentioned this issue Oct 27, 2017

feature request: return the 'final' parse patterns used for read in data (along with already returned col types) tidyverse/readr#728

Closed

jangorecki added the fread label Aug 4, 2019
