-
Notifications
You must be signed in to change notification settings - Fork 906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add JSON option to prune columns #14996
Add JSON option to prune columns #14996
Conversation
Profiled on GV100 machine. Reading 1 columns out of JSON with 512 columns, 10k rows. unnecesary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. I tried it out again and the time for pulling one item out of 512 went from 17 seconds to 9 seconds. I have not done traces on it yet to see what the next steps would be, but it is a lot better.
Thank you @karthikeyann, this is a great demonstration! When you mention:
What do you mean by "filter 1 row"? |
Sorry. I meant to type "filter 1 column". keys.json content in each line: import cudf
import nvtx
# read all 512 columns
with nvtx.annotate("read_json", color="purple"):
df = cudf.read_json(open("keys.json"), engine="cudf", lines=True)
# read only 1 column
with nvtx.annotate("read_json", color="purple"):
df = cudf.read_json(open("keys.json"), engine="cudf", lines=True, dtype={"key_10": str}, use_dtypes_as_filter=True) |
/ok to test |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor nitpicks, but this LGTM.
It's a little late now to suggest this, but one wonders if "column pruning" might have been an acceptable replacement to "column filter", to avoid potential confusion.
I've learnt a couple of things from reviewing this PR, as per usual with @karthikeyann's PRs.
cpp/tests/io/json_test.cpp
Outdated
{std::map<std::string, cudf::io::schema_element> dtype_schema{ | ||
{"a", {dtype<int32_t>()}}, | ||
}; | ||
in_options.set_dtypes(dtype_schema); | ||
cudf::io::table_with_metadata result = cudf::io::read_json(in_options); | ||
// Make sure we have column "a" | ||
ASSERT_EQ(result.tbl->num_columns(), 1); | ||
ASSERT_EQ(result.metadata.schema_info.size(), 1); | ||
EXPECT_EQ(result.metadata.schema_info[0].name, "a"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the formatting here a little off?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// include only one column
{// schema
{std::map<std::string, cudf::io::schema_element> dtype_schema{
This part of code makes the formatting off.
Consecutive {
(even with comment), makes clang-format think, it's consecutive uniform initialization braces {{
.
I removed the extra {
}
.
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filter to prune changes. This looks good to me.
Co-authored-by: Mike Wilson <[email protected]>
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for entertaining my nits. I always love a good stoptimization and this is a perfect example!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
One question - do we need to include this option in the java code as well?
Yes. @revans2 Should I include the java code changes as well in this PR? |
/merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we please have a docstring addition for the python read_json
? Additionally, perhaps I am dumb, I couldn't understand the C++ docstring for the prune_columns
option.
/ok to test |
Description
Resolves #14951
This adds an option
prune_columns
to json_reader_options (default False)When set to True, the dtypes option is used as filter instead of type inference suggestion. If dtypes (vector of dtypes, map of dtypes or nested schema), is not specified, output is empty dataframe.
Checklist