-
Notifications
You must be signed in to change notification settings - Fork 906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
avro reader integration tests #7156
Conversation
rerun tests |
1 similar comment
rerun tests |
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7156 +/- ##
==============================================
Coverage ? 82.22%
==============================================
Files ? 100
Lines ? 16969
Branches ? 0
==============================================
Hits ? 13953
Misses ? 3016
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff. Got some questions/suggestions.
python/cudf/cudf/tests/test_avro_reader_fastavro_integration.py
Outdated
Show resolved
Hide resolved
python/cudf/cudf/tests/test_avro_reader_fastavro_integration.py
Outdated
Show resolved
Hide resolved
python/cudf/cudf/tests/test_avro_reader_fastavro_integration.py
Outdated
Show resolved
Hide resolved
records = [ | ||
{"prop": avro_val}, | ||
{"prop": None}, | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the dataframe shape (1,2)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expected
and actual
are the same shape. I don't know what shape that should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also have some tests with a large number of rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can test a large number of values,. It would be nice to have a test data generator. I see we're generating random values for fuzz testing. Are we able to do that in a deterministic manner so it can be also be used for unit tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC the data generator optionally takes a seed value; that the output is deterministic for each seed. CC @galipremsagar for pointer to the generator + sample use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are discussing having large rows, I'd recommend staying in <30 rows range to not slow down things in pytests by a lot as that would slow down in gpu CI too. If there is a bug that only reproduces for a large column scenarion then we can widen the test coverage for large columns, else I think fuzz tests should take care of large rows testing. For using the dataset generator, here is how we can use it:
>>> import cudf
>>> from cudf.tests.dataset_generator import rand_dataframe
>>> rand_dataframe(dtypes_meta=[{"dtype": "int64", "null_frequency": 0.4, "cardinality": 10}], 100, seed=2)
File "<stdin>", line 1
SyntaxError: positional argument follows keyword argument
>>> rand_dataframe(dtypes_meta=[{"dtype": "int64", "null_frequency": 0.4, "cardinality": 10}], rows=100, seed=2)
pyarrow.Table
0: int64
>>> cudf.DataFrame.from_arrow(rand_dataframe(dtypes_meta=[{"dtype": "int64", "null_frequency": 0.4, "cardinality": 10}], rows=100, seed=2))
0
0 -1468954783236838137
1 <NA>
2 2200161065918338095
3 -1193091257902529461
4 -5448271019629827509
.. ...
95 <NA>
96 2200161065918338095
97 -8745117541724490168
98 <NA>
99 -4301277553722975852
[100 rows x 1 columns]
Alternatively, There is also an existing API that also returns deterministic data with the same seed values that is widely used across our pytests:
https://github.com/rapidsai/cudf/blob/branch-0.18/python/cudf/cudf/datasets.py#L60
This is much simpler to use and fits the use-case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rather just change this test to be a list of values(cudf_val
be length 5/10) instead of 1 value?
Looks like the PR is failing due to mypy style checks unrelated to these changes. Can we ignore that? |
Fix incoming: #7279 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes based on the two unresolved comments.
rerun tests |
python/cudf/cudf/tests/test_avro_reader_fastavro_integration.py
Outdated
Show resolved
Hide resolved
Retargeted to branch-0.19. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better to add some PR description?
@gpucibot merge |
Added some avro reader integration tests for fastavro. These cover type detection, single-value parsing, and null value parsing, but do not cover parsing multiple values.