Integer chunksize value only works with JSONL strings #95

maliksMOJ · 2023-05-04T09:56:36Z

arrow-pd-parser should support two value types for the chunksize variable: a string giving a memory size (e.g. 1GB), or an integer giving the number of rows per chunk. However, when an integer is specified, the reader only successfully splits data from a JSONL (line-delimited) file. I was unable to chunk a comma-delimited JSON file.

mratford · 2023-05-11T12:08:37Z

This issue won't be easily solved, as pandas and awswrangler only support chunking for line-delimited JSON files.

We could possibly use smart_open.open and readline? It might need some tricky parsing if a JSON record spans multiple lines.
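A rough sketch of that idea, not a tested patch for arrow-pd-parser: instead of readline, it reads fixed-size blocks and uses the standard library's `json.JSONDecoder.raw_decode` to pull one record at a time out of a top-level JSON array, so a record that spans several lines is handled. The names `iter_json_array`, `chunk_records` and `read_size` are made up for illustration. It assumes every record is a JSON object, and that the stream is a text-mode file-like object with a `read` method; `io.StringIO` stands in here for whatever `smart_open.open` would return.

```python
import io
import itertools
import json
from typing import Any, Dict, Iterable, Iterator, List, TextIO


def iter_json_array(stream: TextIO, read_size: int = 65536) -> Iterator[Dict[str, Any]]:
    """Yield the objects of a top-level JSON array without loading it all."""
    decoder = json.JSONDecoder()
    buf = ""
    started = False  # True once the opening '[' has been consumed
    while True:
        data = stream.read(read_size)
        eof = data == ""
        buf += data
        pos = 0
        while True:
            # Skip whitespace and the commas that separate records.
            while pos < len(buf) and buf[pos] in " \t\r\n,":
                pos += 1
            if pos == len(buf):
                break
            if not started:
                if buf[pos] != "[":
                    raise ValueError("expected a top-level JSON array")
                started = True
                pos += 1
                continue
            if buf[pos] == "]":
                return
            if buf[pos] != "{":
                # Assumption of this sketch: records are JSON objects.
                raise ValueError("expected each record to be a JSON object")
            try:
                record, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if eof:
                    raise
                break  # record is incomplete; read more data and retry
            yield record
        buf = buf[pos:]  # keep only the unparsed tail
        if eof:
            raise ValueError("input ended before the closing ']'")


def chunk_records(records: Iterable[Any], chunksize: int) -> Iterator[List[Any]]:
    """Group records into lists of at most `chunksize` items."""
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, chunksize))
        if not chunk:
            return
        yield chunk


# A comma-delimited JSON file whose first record spans two lines.
sample = '[\n  {"a": 1,\n   "b": "x"},\n  {"a": 2, "b": "y"},\n  {"a": 3, "b": "z"}\n]'
chunks = list(chunk_records(iter_json_array(io.StringIO(sample), read_size=8), 2))
print([len(c) for c in chunks])  # [2, 1]
```

Each chunk is a list of dicts, so it could be handed to `pandas.DataFrame` to get one frame per chunk. An incomplete record is re-parsed from its start after every read, which is fine for small records but would be slow for very large ones.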

maliksMOJ added the bug Something isn't working label May 4, 2023
