Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing to parse quotes #25

Closed
fredrikj83 opened this issue Mar 24, 2020 · 9 comments
Closed

Failing to parse quotes #25

fredrikj83 opened this issue Mar 24, 2020 · 9 comments

Comments

@fredrikj83
Copy link

fredrikj83 commented Mar 24, 2020

I tried the following CSV:

`
"A""B""""C";"A""""B""C";Test

`

But it seems to fail to parse the second field (which should result in A""B"C but ends up with C). I haven't been able to fully understand why, but it seems to be because of the first condition in the while loop.

Any ideas?

@FourierTransformer
Copy link
Owner

So, it does parse the first field correctly? I'm going to bet one of the flags isn't getting cleared properly somewhere. I'll poke around with it a little more later today.

@fredrikj83
Copy link
Author

fredrikj83 commented Mar 25, 2020

Ah, sorry I didn't mention it. But yes, the first field is parsed correctly as A"B""C.

I played around a bit and got it working by making the following changes (although I'm not sure I've understood everything, so any input from you is highly appreciated!

In findClosingQuote I modified the pattern because I saw that when a field had many quotes in a row it would of course skip over all of them until finding a non-quote character. Changing to the other pattern would instead result in recursive calls to the function, and that seemed to work.

i, j = inputString:find('""?', i) -- modified from original pattern: '"+'

That, however, caused issues when starting with many quotes in a field. For example:

"""A""""B";XYZ would end up with first field parsed as A""B.

I was able to get around this by adding a counter in the while loop inside parseString. When two consecutive delimiters are found, do this:

emptyIdentified = true
numEmpty = numEmpty + 1 -- ADDED, keep track of how many pairs we find

and in the next condition:

if emptyIdentified then
  fieldStart = fieldStart - (2 * numEmpty)
  emptyIdentified = false
  numEmpty = 0
end

All my tests so far have been working, but may have missed something. I'm open for alternative solutions. It's very likely that I missed something obvious that could make a simpler solution. Thanks!

@FourierTransformer
Copy link
Owner

I think we'll have to fix the non-luajit version of the findClosingQuote, as that function should "find" the closing quote, and interestingly enough, the luajit version (with the string bytes, instead of the find) actually returns the correct value.

I stripped this down to the bare bones, and tried switching between the two findClosingQuotes and they returned different values.

local sbyte = string.byte
-- vanilla lua closing quote finder
local function vanillaFindClosingQuote(i, inputLength, inputString, quote, doubleQuoteEscape)
    local j, difference
    i, j = inputString:find('"+', i)
    if j == nil then return end
    if i == nil then
	return inputLength-1, doubleQuoteEscape
    end
    difference = j - i
    -- print("difference", difference, "I", i, "J", j)
    if difference >= 1 then doubleQuoteEscape = true end
    if difference == 1 then
	return vanillaFindClosingQuote(j+1, inputLength, inputString, quote, doubleQuoteEscape)
    end
    return j-1, doubleQuoteEscape
end

function luajitFindClosingQuote(i, inputLength, inputString, quote, doubleQuoteEscape)
    local currentChar, nextChar = sbyte(inputString, i), nil
    while i <= inputLength do
	-- print(i)
	nextChar = sbyte(inputString, i+1)

	-- this one deals with " double quotes that are escaped "" within single quotes "
	-- these should be turned into a single quote at the end of the field
	if currentChar == quote and nextChar == quote then
	    doubleQuoteEscape = true
	    i = i + 2
	    currentChar = sbyte(inputString, i)

	-- identifies the escape toggle
	elseif currentChar == quote and nextChar ~= quote then
	    -- print("exiting", i-1)
	    return i-1, doubleQuoteEscape
	else
	    i = i + 1
	    currentChar = nextChar
	end
    end
end

local a = '"A""B""""C"'
local b = '"A""""B""C"'
print("vanilla a:", vanillaFindClosingQuote(2, #a, a, sbyte('"'), false))
print("luajit b:", luajitFindClosingQuote(2, #a, a, sbyte('"'), false))
print("vanilla b:", vanillaFindClosingQuote(2, #b, b, sbyte('"'), false))
print("luajit b:", luajitFindClosingQuote(2, #b, b, sbyte('"'), false))

results in:

vanilla a:	8	true
luajit b:	10	true
vanilla b:	5	true
luajit b:	10	true

and those should all be 10... Feel free to keep working on it, the help is appreciated! I'll poke around some more tomorrow.

@fredrikj83
Copy link
Author

Thanks for taking the time to look into this.

To be honest I forgot about the luajit function since I was only using the other one. But as previously mentioned I got a better result (same as the luajit version in your tests) from findClosingQuote by changing the pattern into ""?.

But only changing that did not work when a field started with quotes. For example """XYZ""""" would result in XYZ"" iirc. That's why I added numEmpty:

local numEmpty = 0 -- <== NEW
if currentChar == quote and nextChar == quote then
  skipChar = 1
  fieldStart = i + 2
  emptyIdentified = true
  numEmpty = numEmpty + 1 -- <== NEW
  -- print("fs+2:", fieldStart)

  -- identifies the escape toggle.
  -- This can only happen if fields have quotes around them
  -- so the current "start" has to be where a quote character is.
elseif currentChar == quote and nextChar ~= quote and fieldStart == i then
  -- print("New Quoted Field", i)
  fieldStart = i + 1

  -- if an empty field was identified before assignment, it means
  -- that this is a quoted field that starts with escaped quotes
  -- ex: """a"""
  if emptyIdentified then
    fieldStart = fieldStart - (2 * numEmpty) -- <== NEW
    emptyIdentified = false
    numEmpty = 0 -- <== NEW
  end

  i, doubleQuoteEscape = M.findClosingQuote(i+1, inputLength, inputString, quote, doubleQuoteEscape)
  skipChar = 1

But honestly this was just something I found by trying out a few things without actually knowing what I was doing :)

@FourierTransformer
Copy link
Owner

So, if we change vanilla quote finder to this:

local function vanillaFindClosingQuote(i, inputLength, inputString, quote, doubleQuoteEscape)
    local j, difference
    i, j = inputString:find('"+', i)
    if j == nil then return end
    if i == nil then
	return inputLength-1, doubleQuoteEscape
    end
    difference = j - i
    -- print("difference", difference, "I", i, "J", j)
    if difference >= 1 then doubleQuoteEscape = true end
    if difference % 2 == 1 then
	return vanillaFindClosingQuote(j+1, inputLength, inputString, quote, doubleQuoteEscape)
    end
    return j-1, doubleQuoteEscape
end

everything works as expected. It handles the use cases for any number of escaped double quotes and all the unit tests pass. I tried doing a difference >= 1, and while it handled your use case, not all the unit tests passed. While explicitly checking for an odd difference (with differece % 2 == 1), everything passes.

with two "" the difference is one, and with four """" the difference is three, and so on, which lets it all work.

I can roll these changes into 1.2.0 (in the parseLineIterator branch) and try to get it out in the next few days.

@fredrikj83
Copy link
Author

Great! I will try this next week, thanks for your quick support.

@FourierTransformer
Copy link
Owner

I've not got the changes in the parseLineIterator branch if you want to give it a try!

@fredrikj83
Copy link
Author

I tried updating only the findLastQuote function with your suggested change. So far it seems to work as expected!

I haven't tried the parseLineIterator branch since I made some local changes in my file that was taken from the last release. Maybe possible to merge the parseLineIterator into that? What else is new compared to the last release? (1.1.5). Unless there's anything major I might as well stick with this and just patch it.

Thanks!

@FourierTransformer
Copy link
Owner

It's the addition of fixed-buffer reading, so you don't have to load the entire file into memory first, a lot of refactoring, and some speed improvements.

Regardless, thanks for letting me know that it seems to be working for you!

FourierTransformer added a commit that referenced this issue Apr 4, 2020
## Features
 * Can now parse files line by line in a fixed-size reading mode
 * Now has an option to ignore quotes when parsing

## Improvements
 * Speed increases in vanilla Lua and LuaJIT (benchmarks updated!)
 * Refactored code for easier maintenance

## Bugfixes
 * Better handling of multiple escaped quotes in vanilla lua (thanks @fredrikj83 #25)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants