
Add support for chunked parquet reading [databricks] #6934

Merged: 8 commits merged into NVIDIA:branch-22.12 on Nov 21, 2022

Conversation

@revans2 (Collaborator) commented Oct 27, 2022

This depends on rapidsai/cudf#11961.
This fixes the Parquet part of #4968; I need to file a separate follow-on issue for the ORC part.

In my tests with non-nested values there was no performance difference, and in general the chunked reader was able to avoid memory problems.

I still need to do some performance testing for nested types.

Signed-off-by: Robert (Bobby) Evans <[email protected]>
@revans2 revans2 self-assigned this Oct 27, 2022
@sameerz added the reliability label (Features to improve reliability or bugs that severely impact the reliability of the plugin) on Oct 29, 2022
@ttnghia ttnghia self-requested a review November 3, 2022 13:30
@revans2 (Collaborator, Author) commented Nov 3, 2022

@jlowe I think I have finished all of the rework you requested. Please take another look.

*
* @tparam T what it is that we are wrapping
*/
abstract class GpuDataProducer[T] extends AutoCloseable {
Member:

Suggested change:
- abstract class GpuDataProducer[T] extends AutoCloseable {
+ trait GpuDataProducer[T] extends AutoCloseable {

Collaborator:

Should we also make it sealed?
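
For readers without the full diff, here is a minimal sketch of the producer API under discussion, assuming the suggested trait change above and the foreach body quoted later in this thread; the doc comments and method bodies are illustrative, not the actual source:

/**
 * Produces a sequence of GPU-backed values that the caller must close.
 *
 * @tparam T what it is that we are wrapping
 */
trait GpuDataProducer[T] extends AutoCloseable {
  /** Is there another value to produce? */
  def hasNext: Boolean

  /** Produce the next value; the caller takes ownership and must close it. */
  def next: T

  /** Apply func to every remaining value, iterator-style. */
  def foreach(func: T => Unit): Unit =
    while (hasNext) {
      func(next)
    }
}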

override def next(): ColumnarBatch = throw new NoSuchElementException()
}

class SingleGpuColumnarBatchIterator(var batch: ColumnarBatch)
Member:

Suggested change:
- class SingleGpuColumnarBatchIterator(var batch: ColumnarBatch)
+ class SingleGpuColumnarBatchIterator(private var batch: ColumnarBatch)
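
A plausible shape for this single-batch iterator, given the private var in the suggestion; a sketch assuming it behaves like a standard Spark Iterator[ColumnarBatch] that owns the batch until it is handed out, not the actual implementation:

import org.apache.spark.sql.vectorized.ColumnarBatch

class SingleGpuColumnarBatchIterator(private var batch: ColumnarBatch)
    extends Iterator[ColumnarBatch] with AutoCloseable {
  // The one batch is pending until next() hands ownership to the caller.
  override def hasNext: Boolean = batch != null

  override def next(): ColumnarBatch = {
    if (!hasNext) {
      throw new NoSuchElementException()
    }
    val ret = batch
    batch = null // ownership transferred; close() must not touch it now
    ret
  }

  // Release the batch's GPU memory if nobody ever consumed it.
  override def close(): Unit = {
    if (batch != null) {
      batch.close()
      batch = null
    }
  }
}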

@@ -70,19 +70,29 @@ def read_parquet_sql(data_path):
original_parquet_file_reader_conf = {'spark.rapids.sql.format.parquet.reader.type': 'PERFILE'}
multithreaded_parquet_file_reader_conf = {'spark.rapids.sql.format.parquet.reader.type': 'MULTITHREADED'}
coalesce_parquet_file_reader_conf = {'spark.rapids.sql.format.parquet.reader.type': 'COALESCING'}
coalesce_parquet_file_reader_multithread_filter_chunked_conf = {'spark.rapids.sql.format.parquet.reader.type': 'COALESCING',
Collaborator:

I don't see any tests with different size limits here. Limit values ranging from small to large would exercise this better than only the default 2GB maximum.

Collaborator (Author):

The default limit in these tests is 512 MiB. None of the files we read would produce more than a single batch anyway, and I don't think we have any files larger than a page per row, so even if we tried to turn it on the output would be the same. I have done manual tests, because the file sizes needed do not really lend themselves to this type of test.

@@ -67,6 +67,8 @@ public static class ReadBuilder {
private Configuration conf = null;
private int maxBatchSizeRows = Integer.MAX_VALUE;
private long maxBatchSizeBytes = Integer.MAX_VALUE;
private long targetBatchSizeBytes = Integer.MAX_VALUE;
Collaborator:

I find it hard to differentiate between maxBatchSizeBytes and targetBatchSizeBytes.

Collaborator (Author):

Do you want a comment in here? The long-term plan is to eventually get rid of maxBatchSizeBytes and maxBatchSizeRows, but we can only do that once chunked reads are the only option for Parquet and ORC, and, depending on how things go, for Avro, CSV, and JSON too.

Collaborator:

Yes, please add a comment here to clarify how they are different.
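
Something along these lines, placed above the two fields in ReadBuilder, would capture the distinction; the wording is inferred from this discussion, not taken from the final patch:

// maxBatchSizeBytes (along with maxBatchSizeRows) is the hard cap used by
// the existing, non-chunked read path; the long-term plan is to remove both
// once chunked reading is the only option for Parquet and ORC.
// targetBatchSizeBytes is the size the chunked reader aims for when deciding
// where to split its output into batches.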

Comment on lines +55 to +57
while (hasNext) {
func(next)
}
Collaborator (@ttnghia) commented Nov 4, 2022:

In cudf we had an issue when next was not called before hasNext for an empty input file. In such cases the valid output should be a table having 0 rows but with all of the (0-row) columns in the file schema. We can only get such an output table if we call next before hasNext:

do {
  func(next)
} while (hasNext)

I'm not sure whether such an output table is also desired in Spark.
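
To make the concern concrete, here is a sketch against the hypothetical GpuDataProducer trait above: with an empty file, the while-form never invokes func, so the caller never sees the 0-row table carrying the schema, while the do-form invokes it exactly once.

// Java-iterator style: hasNext is false immediately for an empty file,
// so func never receives the 0-row table with the schema.
def drainJavaStyle[T](p: GpuDataProducer[T])(func: T => Unit): Unit =
  while (p.hasNext) {
    func(p.next)
  }

// The call order cudf needed at the time: next runs at least once, so the
// 0-row table comes through even when nothing remains to read after it.
def drainCudfStyle[T](p: GpuDataProducer[T])(func: T => Unit): Unit = {
  do {
    func(p.next)
  } while (p.hasNext)
}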

Collaborator (Author):

Wow, that is totally not the order I would expect the APIs to need to be called in. It is the opposite of all Java iterators, but OK. I will need to change this in a very different place, because GpuDataProducer is abstract and says that it should operate like an iterator.

Collaborator (Author):

Please make sure the JNI API is clearly documented, because, like I said, this is totally the opposite of Java semantics.

Member:

Is documenting this weird behavior the right answer? This seems like behavior that should be fixed in cudf. The entire point of hasNext is to be a predicate function that lets the calling code know it is valid to call next, as the method name implies. Needing to special-case the first call is bizarre and error-prone, IMO.

Collaborator (@ttnghia) commented Nov 4, 2022:

This is not just a specific issue with cudf, but a general logic issue. In the case of an empty file, what should hasNext return? It should always return false, since there is no data in the file (except metadata) to read. But what you expect is a table with empty columns, not a table with nothing, right? So if you check hasNext and see false, then next will never be called, and you never get the table with empty columns.

On the other hand, if hasNext returns true for an empty file, then it will always return true, and you can't know when to stop.

Collaborator (@ttnghia) commented Nov 4, 2022:

Talked with Bobby and decided to work around this in cudf: hasNext (cudf Java JNI) will return true at least once (it always returns true the first time).
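
The agreed behavior can be pictured as a thin wrapper; this is a sketch of the semantics only, with a made-up class name, not the actual cudf JNI code:

// Guarantees hasNext reports true at least once, so an ordinary
// while (hasNext) { next } loop still receives the 0-row, full-schema
// table that cudf produces for an empty file.
class HasNextAtLeastOnce[T](underlying: GpuDataProducer[T])
    extends GpuDataProducer[T] {
  private var firstCall = true

  override def hasNext: Boolean =
    if (firstCall) {
      firstCall = false
      true // report true the first time, even for an empty file
    } else {
      underlying.hasNext
    }

  override def next: T = underlying.next

  override def close(): Unit = underlying.close()
}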

Member (@jlowe) left a review:

Looks good to me, just a minor question on the single iterators.

Member (@jlowe) left a review:

lgtm, assuming iterator issues get worked out in the cudf Java bindings.

@revans2 (Collaborator, Author) commented Nov 18, 2022:

> lgtm, assuming iterator issues get worked out in the cudf Java bindings.

Yes, the latest patch has the iterator issues fixed on the Java side. Hopefully it gets merged soon.

@revans2 revans2 marked this pull request as ready for review November 21, 2022 15:53
@revans2 (Collaborator, Author) commented Nov 21, 2022:

build

1 similar comment:
@revans2 (Collaborator, Author) commented Nov 21, 2022:

build

@revans2 revans2 merged commit 74d5f1e into NVIDIA:branch-22.12 Nov 21, 2022
@revans2 revans2 deleted the read_chunked_parquet branch November 21, 2022 20:46
Labels
reliability: Features to improve reliability or bugs that severely impact the reliability of the plugin
Development

Successfully merging this pull request may close these issues:

[FEA] Split batches from parquet that are too large, and try to guess better before decompressing
4 participants