Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve compatibility when reading timestamps from JSON and CSV sources #4938

Merged
merged 23 commits into from
Mar 24, 2022

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Mar 11, 2022

Signed-off-by: Andy Grove [email protected]

Closes #4863 and closes #123

Improves timestamp support in JSON and CSV to match Spark, by reading from cuDF as strings and then converting to timestamps in the plugin.

There is one follow-on issues:

Status

  • Basic functionality in place
  • Fix regression in CsvScanSuite
  • Improve code documentation
  • Review/update compatibility guide
  • Review tests

@andygrove andygrove added this to the Feb 28 - Mar 18 milestone Mar 11, 2022
@andygrove andygrove self-assigned this Mar 11, 2022
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove andygrove changed the title WIP: Improve compatibility when reading timestamps from JSON and CSV sources Improve compatibility when reading timestamps from JSON and CSV sources Mar 14, 2022
@andygrove andygrove marked this pull request as ready for review March 14, 2022 20:04
@andygrove
Copy link
Contributor Author

build

2 similar comments
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code is okay, but it is really complicated and a lot of assumptions that only a very specific set of formats are allowed. I keep thinking that there might be a simpler way to make it more data driven with look up tables instead of transpiling everything. But then I see we convert the format to both regular expressions and to the CUDF format and I just don't know if the lookup table will actually be small/less error prone or not.


// fix timestamps that have milliseconds but no microseconds
// example ".296" => ".296000"
val placeholder = "@@@"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment about why @@@ is an okay sequence to use here an will never interfere with a real timestamp.

@andygrove
Copy link
Contributor Author

build

revans2
revans2 previously approved these changes Mar 18, 2022
@andygrove
Copy link
Contributor Author

build

1 similar comment
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

@revans2 could you re-approve this one, please. I had to upmerge since your last approval.

@revans2 revans2 merged commit ceb3558 into NVIDIA:branch-22.04 Mar 24, 2022
@sameerz sameerz added the feature request New feature or request label Mar 25, 2022
@andygrove andygrove deleted the json-timestamp branch March 30, 2022 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
3 participants