csvgrep fails on (giant) csvclean'ed CSV on 2.7 only #617

Closed
jeremybmerrill opened this issue Jun 8, 2016 · 10 comments
@jeremybmerrill

I found a weird issue where csvkit works on Python 3.5 but not 2.7. This is not a problem for my workflow (since 3.5 works fine), but I wanted to report it in case it's useful. I'm trying to grep through a giant (3.1 GB) CSV that contains some cells with internal line breaks (so normal grep won't work right).

I keep getting a "list index out of range" error; the full traceback is below. The error occurs identically on both my original file and the one that I ran through csvclean. The command looks like this: csvgrep -v -c 4 -m "2016-05-12" jeremys_giant_csv_out.csv

When I run the command like that (without directing the output to a file), everything around the last line printed appears to be well-formed -- and indeed, when I separate out the 200 lines into their own file, csvgrep can handle it without problems.

Any idea how to fix the problem? Happy to share the file in private, under frieNDA, if we can figure out a good way to get you 3.1 GB. Is it possible that csvgrep has issues with huge files? Should I split it into pieces (manually checking to make sure that head/tail don't cut off a row in the middle, at an internal line break)? If so, what's a dependably-small-enough size?

Possibly relevant: the file was generated by Ruby's CSV library. I'm running csvkit 0.9.1, installed with pip, on Python 2.7. The same file and the same command work fine on 3.5.

Traceback (most recent call last):
  File "/Users/Jeremy/.pyenv/versions/2.7/bin/csvgrep", line 11, in <module>
    sys.exit(launch_new_instance())
  File "/Users/Jeremy/.pyenv/versions/2.7/lib/python2.7/site-packages/csvkit/utilities/csvgrep.py", line 65, in launch_new_instance
    utility.main()
  File "/Users/Jeremy/.pyenv/versions/2.7/lib/python2.7/site-packages/csvkit/utilities/csvgrep.py", line 60, in main
    for row in filter_reader:
  File "/Users/Jeremy/.pyenv/versions/2.7/lib/python2.7/site-packages/six.py", line 558, in next
    return type(self).__next__(self)
  File "/Users/Jeremy/.pyenv/versions/2.7/lib/python2.7/site-packages/csvkit/grep.py", line 59, in __next__
    if self.test_row(row):
  File "/Users/Jeremy/.pyenv/versions/2.7/lib/python2.7/site-packages/csvkit/grep.py", line 69, in test_row
    if not self.any_match and not test(row[idx]):
IndexError: list index out of range
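
On the question of splitting by hand: a minimal sketch, assuming Python 3 and the standard csv module, of splitting by parsed records rather than by physical lines, so a quoted cell with an internal line break is never cut in half. The function name split_csv and the sample data are hypothetical, and nothing here is part of csvkit.

```python
import csv
import io


def split_csv(source, rows_per_chunk):
    """Yield chunks of parsed rows, each starting with a copy of the header.

    Splitting happens on parsed records, not physical lines, so quoted
    cells that contain line breaks stay intact.
    """
    reader = csv.reader(source)
    header = next(reader)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == rows_per_chunk:
            yield [header] + chunk
            chunk = []
    if chunk:
        yield [header] + chunk


# Hypothetical sample: the first data record spans two physical lines.
sample = io.StringIO('id,note\n1,"line one\nline two"\n2,plain\n3,last\n')
chunks = list(split_csv(sample, 2))
```

Each chunk could then be written out with csv.writer; for a real file, open it with newline='' as the csv module documentation recommends.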
@jpmckinney
Member

Does the same error occur when using the latest code from GitHub? You can install it with:

pip install -e git+git://github.com/onyxfish/csvkit.git@master#egg=csvkit

@jeremybmerrill
Author

jeremybmerrill commented Jun 8, 2016

Yes, it does still occur. pip show csvkit shows 1.0.0; still on 2.7.

@jpmckinney
Member

Is the backtrace different? The backtrace above doesn't match the current code.

@jeremybmerrill
Author

Ah, yeah, sorry:

  File "/Users/jeremy/.pyenv/versions/2.7/bin/csvgrep", line 9, in <module>
    load_entry_point('csvkit', 'console_scripts', 'csvgrep')()
  File "/Users/jeremy/code/my_project_name/src/csvkit/csvkit/utilities/csvgrep.py", line 67, in launch_new_instance
    utility.main()
  File "/Users/jeremy/code/my_project_name/src/csvkit/csvkit/utilities/csvgrep.py", line 61, in main
    for row in filter_reader:
  File "/Users/jeremy/.pyenv/versions/2.7/lib/python2.7/site-packages/six.py", line 558, in next
    return type(self).__next__(self)
  File "/Users/jeremy/code/my_project_name/src/csvkit/csvkit/grep.py", line 60, in __next__
    if self.test_row(row):
  File "/Users/jeremy/code/my_project_name/src/csvkit/csvkit/grep.py", line 67, in test_row
    result = test(row[idx])
IndexError: list index out of range

@jpmckinney
Member

It seems that at some point, a row doesn't have a column with index 4. Can you try with the 617 branch?

pip install --upgrade -e git+git://github.com/onyxfish/csvkit.git@617#egg=csvkit
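
To check that diagnosis against a file, here is a minimal sketch, assuming Python 3 and the standard csv module, that reports every record too short to have a given zero-based column index. The name find_short_rows and the sample data are hypothetical; this is not csvkit code.

```python
import csv
import io


def find_short_rows(source, column_index):
    """Return (record_number, field_count) for records lacking column_index.

    Record numbers are 1-based and count parsed records (header included),
    which can differ from physical line numbers when cells contain newlines.
    """
    short = []
    for record_number, row in enumerate(csv.reader(source), start=1):
        if len(row) <= column_index:
            short.append((record_number, len(row)))
    return short


# Hypothetical sample: the third record has only two fields.
sample = io.StringIO("a,b,c,d,e\n1,2,3,4,5\n1,2\n1,2,3,4,5\n")
short_rows = find_short_rows(sample, 4)
```

Because the reader is consumed lazily, the same function should work on a very large file opened with open(path, newline='') without loading it into memory.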

@jpmckinney jpmckinney added bug and removed question labels Jun 8, 2016
@jeremybmerrill
Author

Yes, I agree that's what's wrong. Is that a thing that csvclean is supposed to fix?

The command seems to fail right away with the 617 branch:

Traceback (most recent call last):
  File "/Users/jeremy/.pyenv/versions/2.7/bin/csvgrep", line 9, in <module>
    load_entry_point('csvkit', 'console_scripts', 'csvgrep')()
  File "/Users/jeremy/code/myproject/src/csvkit/csvkit/utilities/csvgrep.py", line 67, in launch_new_instance
    utility.main()
  File "/Users/jeremy/code/myproject/src/csvkit/csvkit/utilities/csvgrep.py", line 61, in main
    for row in filter_reader:
  File "/Users/jeremy/.pyenv/versions/2.7/lib/python2.7/site-packages/six.py", line 558, in next
    return type(self).__next__(self)
  File "/Users/jeremy/code/myproject/src/csvkit/csvkit/grep.py", line 60, in __next__
    if self.test_row(row):
  File "/Users/jeremy/code/myproject/src/csvkit/csvkit/grep.py", line 67, in test_row
    result = test(row.get(idx))
AttributeError: 'list' object has no attribute 'get'
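
For context on that AttributeError: Python dicts have a .get method with a default, but lists do not. A minimal sketch of an equivalent for lists follows; the helper name get_or_none is hypothetical and is not claimed to be what the 617 branch ended up using.

```python
def get_or_none(row, idx):
    """Return row[idx] if the row is long enough, else None.

    idx is assumed to be a non-negative, zero-based column index.
    """
    if 0 <= idx < len(row):
        return row[idx]
    return None


present = get_or_none(["a", "b", "c", "d", "e"], 4)
missing = get_or_none(["a", "b"], 4)
```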

@jpmckinney
Member

Ah, indeed. I fixed the commit, so we can try again.

@jeremybmerrill
Author

It fails in the same spot, with a different traceback:

Traceback (most recent call last):
  File "/Users/jeremy/.pyenv/versions/2.7/bin/csvgrep", line 9, in <module>
    load_entry_point('csvkit', 'console_scripts', 'csvgrep')()
  File "/Users/jeremy/code/my_project/src/csvkit/csvkit/utilities/csvgrep.py", line 67, in launch_new_instance
    utility.main()
  File "/Users/jeremy/code/my_project/src/csvkit/csvkit/utilities/csvgrep.py", line 61, in main
    for row in filter_reader:
  File "/Users/jeremy/.pyenv/versions/2.7/lib/python2.7/site-packages/six.py", line 558, in next
    return type(self).__next__(self)
  File "/Users/jeremy/code/my_project/src/csvkit/csvkit/grep.py", line 60, in __next__
    if self.test_row(row):
  File "/Users/jeremy/code/my_project/src/csvkit/csvkit/grep.py", line 71, in test_row
    result = test(value)
  File "/Users/jeremy/code/my_project/src/csvkit/csvkit/grep.py", line 122, in <lambda>
    return lambda x: obj in x
TypeError: argument of type 'NoneType' is not iterable

@jpmckinney
Member

Thanks - I've written a test now, so it should finally work.

@jeremybmerrill
Author

It worked! Thanks!
