
CSVSTAT Memory limits? #581

Closed

dartdog opened this issue Mar 5, 2016 · 7 comments

@dartdog commented Mar 5, 2016

Tried to run csvstat on a 1.9 GB file, about 7M rows × 74 columns (mixed and sparse); after a long time it just got "Killed". I'm on an 8 GB machine with Linux 14.04 and Python 3+. Is there a way to estimate the file sizes that can be handled? Or can I get a more informative error?
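A rough back-of-envelope, assuming on the order of 50-100 bytes per parsed Python value (an assumption, not a measured figure): 7M rows × 74 columns is roughly 520 million cells, so holding them all as Python objects needs tens of gigabytes, well past 8 GB of RAM. The bare "Killed" message is the Linux OOM killer terminating the process once memory runs out.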

@anthnyprschka

Same problem here... I tried csvstat on a ~5 GB file on my 4 GB machine, as I'd read that csvkit computes things "line-by-line"?

@jpmckinney (Member)

Not all utilities are line-by-line: calculating statistics requires holding much more of the file in memory.
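For context, a minimal sketch (not csvkit's actual code; the file name and the "amount" column are hypothetical) of why statistics resist streaming: a running mean needs only constant state, but an exact median or a distinct-value count has to keep every value around.

```python
import csv

total = 0.0
count = 0
values = []  # this buffer is what grows without bound on a multi-GB file

with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        x = float(row["amount"])   # hypothetical numeric column
        total += x
        count += 1
        values.append(x)           # needed for an exact median / unique count

mean = total / count               # constant extra memory
values.sort()
median = values[len(values) // 2]  # requires the whole column in memory
unique = len(set(values))          # likewise
```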

@anthnyprschka

Cool, thanks for the quick response 👍 I found a way to work around it.

@jpmckinney (Member)

What was your solution?

@anthnyprschka

To be honest, I just switched to vanilla Python and read the CSV line-by-line there. Looping through 2M lines and 9 GB took a few seconds for simple stats.
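Roughly that kind of workaround, as a sketch (the file name and column index are assumptions, not details from the thread): keep only running aggregates per column, so memory stays constant no matter how large the file is.

```python
import csv

count = 0
total = 0.0
minimum = float("inf")
maximum = float("-inf")

with open("big.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row
    for row in reader:
        if not row[3]:                # the data is sparse; skip empty cells
            continue
        x = float(row[3])             # hypothetical numeric column at index 3
        count += 1
        total += x
        minimum = min(minimum, x)
        maximum = max(maximum, x)

print(count, total / count, minimum, maximum)
```

Nothing is accumulated across rows, so peak memory is just the current row plus a handful of floats, regardless of file size.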

@jpmckinney (Member)

Yeah, a generic tool like csvstat does come with significant performance penalties.

@onyxfish (Collaborator)

Closing. csvstat will never be performant for files this large.
