Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better quote rule decision for tied options. #2436

Merged
merged 3 commits into from
Oct 24, 2017
Merged

Better quote rule decision for tied options. #2436

merged 3 commits into from
Oct 24, 2017

Conversation

mattdowle
Copy link
Member

@mattdowle mattdowle commented Oct 23, 2017

Closes #2404
Closes #2196
Thanks @franknarf1 for the great file!

It wasn't the jumps per se. The first 100 lines all had quoted fields which all contained a comma. The quoteRule was then bumped to 3 (meaning ignore quoting) because it found 100 consistent lines of 17 columns rather than 16 lines with that quoteRule (17 beat 16). I've changed it to only resolve ties (i.e. different ways to parse the same number of consistent lines) with a different sep and no longer the same sep but different quoteRule. The lower quoteRule (0=double and 1=escape) are the standard ones so those should (and now do) take preference in these tied cases.

That wrong quote rule was then causing the jump problem later. I've lessened the impact of that (it now skips sample jumps where there's a format error even after nextGoodLine) but there's more work to do on that.

The test file grr.csv is added to the test suite with 4 large columns removed to reduce the size from 860KB down to 283KB. It generated CRAN's >5MB warning at the original size. The columns removed are the ones with long strings (>100 chars) without any of the quoting features being tested.

First 16 lines from the test file with quoted fields highlighted :

image

@codecov-io
Copy link

codecov-io commented Oct 23, 2017

Codecov Report

Merging #2436 into master will increase coverage by 0.04%.
The diff coverage is 97.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2436      +/-   ##
==========================================
+ Coverage   91.29%   91.33%   +0.04%     
==========================================
  Files          62       62              
  Lines       12026    12032       +6     
==========================================
+ Hits        10979    10990      +11     
+ Misses       1047     1042       -5
Impacted Files Coverage Δ
src/fread.c 96.2% <97.56%> (+0.45%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d1f3e2...0f7dfc1. Read the comment docs.

@mattdowle mattdowle changed the title Better quote rule decision for tied lines. Also, format error on samp… Better quote rule decision for tied options. Oct 23, 2017
@mattdowle mattdowle requested a review from st-pasha October 23, 2017 22:32
@st-pasha
Copy link
Contributor

Does this also solve #2196 ?

@mattdowle
Copy link
Member Author

@st-pasha Yes. I'll add a test to this PR ...

@mattdowle mattdowle merged commit 6015373 into master Oct 24, 2017
@mattdowle mattdowle deleted the issue_2404 branch October 24, 2017 00:20
@mattdowle mattdowle added this to the v1.10.6 milestone Oct 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fread's jumps causing trouble fread heuristic predicts wrong column count on a valid file
3 participants