Integer chunksize value only works with JSONL strings #95

Open
maliksMOJ opened this issue May 4, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@maliksMOJ
Contributor

arrow-pd-parser should support two different value types for the chunksize argument: a string giving a memory size (e.g. 1GB), or an integer giving the number of rows per chunk. However, when an integer is given, the reader only splits the data successfully for a JSONL (line-delimited) file. I was unable to chunk a comma-delimited JSON file.

@maliksMOJ maliksMOJ added the bug Something isn't working label May 4, 2023
@mratford
Contributor

mratford commented May 11, 2023

This issue won't be easily solved, as pandas and awswrangler only support chunking for line-delimited JSON files.
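For reference, the pandas side of this limitation can be reproduced in a few lines. This is a minimal sketch using in-memory sample data (the `records_jsonl` and `records_array` strings are made up for illustration) and calls pandas directly rather than going through arrow-pd-parser, so it only shows the underlying behaviour, not this library's code path.

```python
import io

import pandas as pd

# Hypothetical sample data: the same three records in both layouts.
records_jsonl = '{"id": 1}\n{"id": 2}\n{"id": 3}\n'
records_array = '[{"id": 1}, {"id": 2}, {"id": 3}]'

# Line-delimited input: an integer chunksize returns an iterator of DataFrames.
reader = pd.read_json(io.StringIO(records_jsonl), lines=True, chunksize=2)
chunk_lengths = [len(chunk) for chunk in reader]

# A single JSON array: pandas refuses chunksize unless lines=True.
try:
    pd.read_json(io.StringIO(records_array), chunksize=2)
except ValueError as err:
    array_error = str(err)
else:
    array_error = None

print(chunk_lengths)
print(array_error)
```
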

We could possibly use smart_open.open and readline? It might need some tricky parsing if a JSON record spans several lines.
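One possible shape for that parsing, as a rough sketch only: read the file incrementally and use the standard library's `json.JSONDecoder.raw_decode` to peel one record at a time off a top-level JSON array, so it does not matter where the line breaks fall. The function names `iter_json_array` and `iter_chunks` are hypothetical and not part of arrow-pd-parser. The example reads from an `io.StringIO`; the assumption is that any text-mode file-like object, such as one returned by smart_open.open, could be passed instead.

```python
import io
import json
from typing import Any, Iterable, Iterator, List, TextIO


def iter_json_array(handle: TextIO, read_size: int = 65536) -> Iterator[Any]:
    """Yield the elements of a top-level JSON array without loading it all."""
    decoder = json.JSONDecoder()
    buffer = ""
    started = False  # True once the opening "[" has been consumed
    eof = False
    while True:
        buffer = buffer.lstrip()
        if buffer and not started:
            if buffer[0] != "[":
                raise ValueError("expected a top-level JSON array")
            buffer, started = buffer[1:], True
            continue
        if buffer and buffer[0] == "]":
            return
        if buffer:
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                # Probably a record cut off mid-way: read more unless at EOF.
                if eof:
                    raise
            else:
                rest = buffer[end:].lstrip()
                # Accept the value only once its terminator has been seen, so
                # a value split across two reads is never yielded too early.
                if rest[:1] in (",", "]"):
                    yield value
                    buffer = rest[1:] if rest[0] == "," else rest
                    continue
        if eof:
            raise ValueError("input ended before the array was closed")
        data = handle.read(read_size)
        if data:
            buffer += data
        else:
            eof = True


def iter_chunks(records: Iterable[Any], chunksize: int) -> Iterator[List[Any]]:
    """Group records into lists of at most `chunksize` items."""
    chunk: List[Any] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


# Records deliberately spread over several lines; tiny reads force the
# parser to cope with records that are split across read boundaries.
sample = io.StringIO(
    '[{"id": 1,\n  "name": "a"},\n {"id": 2, "name": "b"},\n {"id": 3, "name": "c"}]'
)
chunks = list(iter_chunks(iter_json_array(sample, read_size=8), chunksize=2))
print([len(chunk) for chunk in chunks])
```

Each chunk is a plain list of records, which could then be handed to something like `pd.DataFrame(chunk)`. The sketch re-attempts decoding whenever a record is incomplete, so a record much larger than `read_size` is re-scanned several times; a real implementation would likely want a proper streaming parser.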
